fix: remove docs #9

Merged
KylinMountain merged 47 commits into main from bugfix/compile on Apr 11, 2026
@rejojer (Member) commented Apr 9, 2026

Summary

  • Unified summary frontmatter to doc_type + full_text, removed redundant fields (sources, brief, source_doc, doc_id)
  • Concept pages now link to summaries instead of raw PDF filenames
  • Fixed duplicate frontmatter bug in both summary and concept pages (prompt fix + strip fallback)
  • Improved query agent: use full_text field, restrict get_page_content to pageindex docs, add self-talk, concise answers
  • Fixed image path mismatch in pageindex JSON content
  • Removed page marker comments from short doc source markdown
  • Fixed warning suppression (markitdown overrides filters at import time)
  • Improved init prompts with explicit defaults, American English spelling
  • Various output formatting fixes (tool call spacing, step name colons, unicode ellipsis)

Test plan

  • openkb init shows correct prompts with defaults
  • openkb add short doc: clean single frontmatter, no page markers in source
  • openkb add long doc: correct image paths in JSON content
  • openkb query on short doc: reads source via read_file, no get_page_content
  • openkb query on long doc: uses get_page_content with targeted page ranges
  • No PyPDF2 deprecation warning during any operation

KylinMountain and others added 30 commits April 9, 2026 00:21
Add _CONCEPTS_PLAN_USER (create/update/related JSON structure) and
_CONCEPT_UPDATE_USER templates; add TestParseConceptsPlan tests.
- Restore markitdown[all] extras for docx/pptx/xlsx support
- Sanitize concept names to prevent path traversal in compiler
- Add path traversal guard in copy_relative_images
- Fix _write_concept duplicate append when frontmatter lacks sources key
- Remove dead write_wiki_files function
- Fix watcher thread race in _schedule_flush
- Warn when unimplemented --fix flag is used in lint command
- Harden CI publish workflow with environment gate and SHA-pinned actions
- Fix test_indexer to actually assert IndexConfig flag values
- Fix test_converter to test correct PDF code path (pymupdf, not markitdown)
- Use str.find() instead of str.index() in frontmatter parsing to avoid ValueError
- Add _backlink_summary: ensures summary pages link to all related concepts
- Add _backlink_concepts: ensures concept pages link back to source summaries
- _update_index auto-creates index.md if missing
- Both merge into existing sections instead of duplicating
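The path-traversal fixes listed above (concept-name sanitization in the compiler and the guard in copy_relative_images) are not shown in this conversation. A minimal sketch of the general technique, assuming hypothetical helper names `sanitize_concept_name` and `safe_join` that are not taken from the repository, might look like this:

```python
import re
from pathlib import Path


def sanitize_concept_name(name: str) -> str:
    """Hypothetical helper: turn an LLM-supplied concept name into a safe file stem.

    Every run of characters other than word characters (Unicode-aware),
    hyphens, and spaces collapses to a single hyphen, so path separators
    and ".." sequences cannot survive.
    """
    cleaned = re.sub(r"[^\w\- ]+", "-", name).strip("-. ")
    return cleaned or "untitled"


def safe_join(root: Path, relative: str) -> Path:
    """Hypothetical guard: resolve `relative` under `root` or raise ValueError."""
    resolved_root = root.resolve()
    candidate = (resolved_root / relative).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not candidate.is_relative_to(resolved_root):
        raise ValueError(f"path escapes wiki root: {relative!r}")
    return candidate
```

Sanitizing the name and checking the resolved path are complementary: the first keeps file names predictable, the second still catches anything that slips through, such as symlinks.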
Adds parse_pages() to expand page specs like "1-3,7" into sorted
deduplicated int lists, and get_page_content() to read per-page JSON
(sources/{doc}.json) and format output with optional image paths.
Includes path-traversal guard consistent with existing tools.
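The page-spec expansion described in this commit can be illustrated with a short sketch. This is not the project's actual parse_pages() implementation, only one plausible way to turn a spec like "1-3,7" into a sorted, deduplicated list of ints; how the real function handles reversed ranges or malformed input is an assumption here.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-3,7" into a sorted, deduplicated list."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty segments such as a trailing comma
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                # Assumption: a reversed range like "5-3" is treated as "3-5".
                start, end = end, start
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

For example, `parse_pages("1-3,7")` yields `[1, 2, 3, 7]`. Non-numeric segments raise `ValueError` from `int()` in this sketch.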
Replace _SUMMARY_USER, _CONCEPT_PAGE_USER, and _CONCEPT_UPDATE_USER to
request a JSON object with "brief" (one-line summary) and "content" (full
Markdown). Add TestParseBriefContent to tests/test_compiler.py.
Replace markdown source generation with per-page JSON from PageIndex
get_page_content; remove render_source_md, _render_nodes_source,
_relocate_images, and _IMG_REF_RE. Image relocation is now done inline
per page. Update tests to assert .json output and mock get_page_content.
…or all docs

Remove _pageindex_retrieve_impl and the pageindex_retrieve tool; add
get_page_content_tool that uses the local JSON-based page store for all
long documents. Update instructions and schema description accordingly.
… indexer

- Default model changed from gpt-5.4 to gpt-5.4-mini
- Indexer get_page_content no longer uses hardcoded 9999 fallback
- Infers page_count from structure end_index when doc lacks page_count field
- Added debug logging for doc keys and page_count diagnosis
…e backlink for short docs

- index.md entries now show (short) or (pageindex) type marker
- Query agent prompt updated: guides agent to read sources for detail
- Removed list_files tool from query agent (index.md is sufficient)
- Short doc summaries now have source_doc frontmatter linking to sources/
- Reverted list_wiki_files to only list .md files
- Fixed tests for model name change and agent tool count
Replace sources/brief/source_doc/doc_id/source fields with two
consistent fields: doc_type (short|pageindex) and full_text pointing
to the actual source content under sources/.
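To make the unified frontmatter concrete, here is a small sketch that renders the two fields named in this commit. The function name `build_summary_frontmatter` and the example path are hypothetical; only the field names doc_type and full_text and the values short and pageindex come from the commit message.

```python
VALID_DOC_TYPES = ("short", "pageindex")


def build_summary_frontmatter(doc_type: str, full_text: str) -> str:
    """Hypothetical helper: render the two-field summary frontmatter block."""
    if doc_type not in VALID_DOC_TYPES:
        raise ValueError(f"doc_type must be one of {VALID_DOC_TYPES}, got {doc_type!r}")
    return f"---\ndoc_type: {doc_type}\nfull_text: {full_text}\n---\n"


# The path below is illustrative only; the real layout under sources/ may differ.
print(build_summary_frontmatter("short", "sources/example.md"))
```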
Concept pages now reference summaries/{doc}.md instead of raw PDF
filenames. Also strips frontmatter from LLM content during concept
updates to prevent duplicate YAML blocks. Removes unused
_find_source_filename.
Add hint that summaries may omit details. Update search strategy to
reference the full_text frontmatter field instead of hardcoded paths.
rejojer and others added 14 commits April 10, 2026 04:35
Remove frontmatter format from schema to avoid LLM copying it.
Add strip as fallback in _write_summary and _write_concept create path.
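The strip fallback is not shown in the conversation. A minimal sketch, assuming a hypothetical helper named `strip_frontmatter`, could remove a leading YAML block from LLM output before the writer adds its own. It uses str.find(), which returns -1 when the delimiter is missing, rather than str.index(), which raises ValueError, in line with the earlier frontmatter-parsing fix.

```python
def strip_frontmatter(text: str) -> str:
    """Hypothetical fallback: drop a leading '---' ... '---' block if present."""
    stripped = text.lstrip()
    if not stripped.startswith("---"):
        return text  # no frontmatter: leave the content untouched
    # Look for the closing delimiter on its own line, after the opening one.
    end = stripped.find("\n---", 3)
    if end == -1:
        return text  # unterminated block: safer to keep the text as-is
    # Skip to the end of the closing delimiter line.
    newline = stripped.find("\n", end + 1)
    if newline == -1:
        return ""  # the document was frontmatter only
    return stripped[newline + 1:].lstrip("\n")
```

This sketch treats any line beginning with `---` as the closing delimiter, which is looser than a real YAML parser but adequate as a last-resort cleanup.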
Replace PageIndex get_page_content with pymupdf-based convert_pdf_to_pages
for long doc JSON generation. All image paths now use sources/images/ prefix
relative to wiki root. Removes dependency on PageIndex for source content.
Query agent can now view images referenced in source documents via
get_image tool, which returns ToolOutputImage for the LLM to inspect.
Prompt updated to use images when questions involve figures or visuals.
Accept all origin/dev changes including: image support in query agent,
robust JSON parsing with json_repair, unicode concept name support,
section-based index operations, cloud/local page extraction fallback.
@KylinMountain KylinMountain changed the base branch from dev to main April 11, 2026 02:53
@KylinMountain KylinMountain changed the title from "fix: compile pipeline, query agent, and frontmatter improvements" to "fix: remove docs" Apr 11, 2026
@KylinMountain KylinMountain merged commit 726336a into main Apr 11, 2026