Knowledge Base
Upload documents, ingest web content, and manage your document library — the foundation of your AI-powered search.
Required roles: editor, manager, superadmin
Upload Files
Upload documents that will be automatically processed, chunked, and indexed for both semantic (vector) and keyword (BM25) search.
How to Upload
- Navigate to Knowledge Base from the sidebar.
- In the Upload File card, either drag and drop a file onto the upload area, or click Browse to select a file.
- The file will be uploaded, processed, chunked, and indexed automatically.
- A progress indicator shows the processing status.
- Once complete, the document appears in the Document Library below.
Supported File Types & Limits
| Property | Value |
|---|---|
| Supported formats | .pdf, .txt |
| Maximum file size | 50 MB |
| Text encoding (TXT) | UTF-8, UTF-16, Latin-1, CP1252 (auto-detected) |
- PDF files must contain extractable text (scanned images without OCR will result in empty content).
- Only .pdf and .txt extensions are accepted.
- Files larger than 50 MB will be rejected with an error message.
How Chunking Works
When a file is uploaded, the text content is split into smaller pieces called chunks. Each chunk is then:
- Embedded into a 1024-dimensional vector using Cohere's multilingual model → stored in ChromaDB
- Indexed for full-text keyword search → stored in SQLite FTS5
(Diagram: each chunk flows in parallel into Embedding, producing 1024-dim vectors stored in ChromaDB, and FTS5 Indexing, producing BM25 keyword entries stored in SQLite.)
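The keyword half of this pipeline can be demonstrated with SQLite's FTS5 module directly. The table and column names below are illustrative, not BABEH's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table holding one row per chunk.
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(doc_id, content)")
conn.executemany(
    "INSERT INTO chunks (doc_id, content) VALUES (?, ?)",
    [
        ("doc1", "Vector embeddings enable semantic search."),
        ("doc1", "BM25 ranks documents by keyword relevance."),
        ("doc2", "Chunk overlap prevents information loss at boundaries."),
    ],
)
# bm25() returns a rank (lower = more relevant) for MATCH queries.
rows = conn.execute(
    "SELECT doc_id, content FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("keyword",),
).fetchall()
```

Only the chunk containing the token "keyword" matches, ranked by BM25 relevance.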
| Setting | Default | Range | Description |
|---|---|---|---|
| Chunk Size | 500 characters | 100 – 10,000 | Number of characters per chunk. Larger chunks provide more context but may reduce precision. |
| Chunk Overlap | 50 characters | 0 – 5,000 | Characters shared between consecutive chunks to prevent information loss at boundaries. |
Chunk size and overlap are configured in Settings. Changing these values only affects newly uploaded documents. Use Re-index to re-chunk existing documents with new settings.
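Character-based chunking with overlap can be sketched as a sliding window. This is a simplified stand-in; BABEH's actual splitter may differ (e.g. it may respect sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks.

    Consecutive chunks share `overlap` characters so that information
    at chunk boundaries is not lost.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults (500/50), each new chunk starts 450 characters after the previous one.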
Ingest URLs
Scrape and ingest content from web pages. BABEH automatically extracts the main article content, strips navigation/ads, and processes it the same way as uploaded files.
How to Ingest URLs
- Navigate to Knowledge Base from the sidebar.
- In the Ingest URLs card, paste one or more URLs into the text area — one URL per line.
- Click Ingest. Each URL is processed sequentially with a progress bar.
- Successfully ingested pages appear in the Document Library with type "URL".
BABEH uses trafilatura as the primary content extractor with a BeautifulSoup4 fallback. It intelligently extracts the main article content while filtering out navigation bars, footers, ads, and other boilerplate.
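The boilerplate-stripping idea can be approximated with a stdlib-only sketch. This is a crude stand-in for trafilatura/BeautifulSoup4, shown only to illustrate the filtering step:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Crude boilerplate filter: drop text inside nav/footer/etc. elements."""
    SKIP = {"nav", "footer", "header", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Real extractors like trafilatura use far more sophisticated heuristics (link density, element scoring), but the principle is the same: keep article text, discard navigation and ads.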
Document Library
The Document Library is a searchable, sortable table of all uploaded files and ingested URLs.
Table Fields
| Column | Description |
|---|---|
| Filename | Original filename or URL title |
| Type | File type badge: PDF, TXT, or URL |
| Chunks | Number of text chunks created from this document |
| Size | Original file size (in KB or MB) |
| Upload Date | When the document was uploaded or ingested |
| Actions | Edit, Re-index, and Delete buttons |
Filtering & Sorting
- Search — Type in the search box to filter documents by filename.
- Type filter — Switch between All, Files, or URLs.
- Sort — Click column headers to sort by chunks, size, or date.
- Pagination — Choose 10, 25, or 50 items per page.
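The table controls above amount to a filter-sort-paginate query. A minimal sketch, where the field names ('filename', 'doc_type', 'chunks', 'upload_date') are assumptions about BABEH's record shape:

```python
def query_library(docs, search="", type_filter="All",
                  sort_key="upload_date", descending=True,
                  page=1, per_page=10):
    """Filter, sort, and paginate document library rows."""
    rows = [d for d in docs if search.lower() in d["filename"].lower()]
    if type_filter == "Files":
        rows = [d for d in rows if d["doc_type"] in ("PDF", "TXT")]
    elif type_filter == "URLs":
        rows = [d for d in rows if d["doc_type"] == "URL"]
    rows.sort(key=lambda d: d[sort_key], reverse=descending)
    start = (page - 1) * per_page
    return rows[start:start + per_page]
```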
Document Actions
Edit
Opens the document in an edit modal where you can:
- Rename the document (filename field)
- Edit content directly in a text editor — changes trigger automatic re-chunking and re-embedding
- Use KBC AI Improvement to get AI-powered rewriting suggestions (see below)
Re-index
Re-processes the document using the current chunk settings. Useful when you've changed chunk size or overlap in Settings and want existing documents to use the new values.
- Single document — Click the Re-index button on a specific row.
- All documents — Use the "Re-index All" button at the top of the library. Progress streams via real-time SSE updates.
When your current chunk settings differ from those used when a document was ingested, a yellow warning banner appears suggesting you re-index. This ensures all documents use consistent chunking for optimal search quality.
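The staleness check behind that banner reduces to comparing each document's stored chunk settings with the current Settings values. A sketch with illustrative key names:

```python
def needs_reindex(doc_settings: dict, current_settings: dict) -> bool:
    """True when a document was chunked with settings that differ from
    the current ones (this is what triggers the warning banner)."""
    return (doc_settings.get("chunk_size") != current_settings["chunk_size"]
            or doc_settings.get("chunk_overlap") != current_settings["chunk_overlap"])
```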
Delete
Permanently removes the document from all stores:
- Chunks removed from ChromaDB (vector embeddings)
- Chunks removed from SQLite FTS5 (keyword index)
- Original file removed from disk
- Metadata removed from SQLite documents table
Deleting a document cannot be undone; to restore the content you must re-upload the original file.
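The four removal steps can be sketched in one function. The table names and schema are illustrative, and the ChromaDB call is shown only as a comment since it requires a live client:

```python
import sqlite3
from pathlib import Path

def delete_document(conn: sqlite3.Connection, doc_id: str, file_path: Path) -> None:
    """Remove a document from every store (illustrative schema)."""
    # 1. Vector store, e.g.: collection.delete(where={"doc_id": doc_id})
    # 2. Keyword index and metadata in SQLite:
    conn.execute("DELETE FROM chunks_fts WHERE doc_id = ?", (doc_id,))
    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()
    # 3. Original file on disk:
    file_path.unlink(missing_ok=True)
```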
KBC Content Improvement (AI)
The KBC Content Improvement feature uses AI to analyze your document content and suggest rewrites that improve search retrieval quality.
How to Use
- Open a document via the Edit button.
- In the edit modal, find the KBC Improvement panel.
- Click "Generate Suggestion" — the AI analyzes the content and proposes improvements.
- Review the suggestion. Click "Use Suggestion" to apply it to the editor, or "Use & Save" to apply and save immediately.
(Diagram: the editor sends the content to the LLM at temperature 0.3; the LLM returns improved content; the editor displays a suggestion preview; on "Use & Save", the updated content is saved, then re-chunked and re-embedded.)
The AI rewrites content to be more structured, keyword-rich, and retrieval-friendly — meaning your search engine will find more relevant results. It uses a low temperature (0.3) to stay faithful to the original content while improving clarity.
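The request might be assembled roughly as below. Only the temperature (0.3) comes from the docs; the prompt wording, model name, and payload shape are assumptions for illustration:

```python
def build_improvement_request(content: str, model: str = "example-model") -> dict:
    """Assemble a chat-completion payload for a KBC-style rewrite.

    Hypothetical sketch: prompt text and model name are placeholders.
    """
    return {
        "model": model,
        "temperature": 0.3,  # low temperature keeps the rewrite faithful
        "messages": [
            {"role": "system",
             "content": ("Rewrite the document to be structured, keyword-rich, "
                         "and retrieval-friendly while preserving its meaning.")},
            {"role": "user", "content": content},
        ],
    }
```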
Collection Info
Below the Document Library, a Collection Info section shows VectorDB statistics:
- Total vectors — Number of embeddings stored in ChromaDB
- Collection name — The ChromaDB collection identifier
- Embedding model — cohere.embed-multilingual-v3 (1024 dimensions)