How Search Works

Understand the hybrid search pipeline — from query decomposition through vector and keyword search to LLM-generated answers with source citations.

The Full Search Pipeline

When a user submits a query, it passes through a multi-stage pipeline that combines several search strategies to improve relevance. This is the most important diagram to understand:

flowchart TB
    A["User Query<br/>'harga honda veloz vs toyota avanza'"] --> B[Query Decomposition]
    B -->|"Multi-topic detected"| B1["Sub-query 1: 'harga honda veloz'"]
    B -->|"Multi-topic detected"| B2["Sub-query 2: 'harga toyota avanza'"]
    B1 --> C[Keyword Detection]
    B2 --> C
    C -->|"'harga' → pricing keyword"| D[Pricing Database Search]
    C -->|"Always"| E["Vector Search<br/>Semantic / ChromaDB"]
    C -->|"Always"| F["BM25 Search<br/>Keyword / FTS5"]
    D --> G["Hybrid Fusion<br/>Weighted Combination"]
    E --> G
    F --> G
    G --> H["Relevance Threshold<br/>Filter low scores"]
    H --> I["Top-K Selection<br/>(default: 5)"]
    I --> J["LLM<br/>Bedrock or Gemini"]
    J --> K["Response<br/>with [1][2] Citations"]
    style A fill:#3B82F6,color:#fff
    style D fill:#F59E0B,color:#fff
    style E fill:#8B5CF6,color:#fff
    style F fill:#F59E0B,color:#fff
    style G fill:#6366F1,color:#fff
    style J fill:#22C55E,color:#fff
    style K fill:#22C55E,color:#fff

Search Methods

BABEH supports three search methods, configurable in Settings:

| Method | How It Works | Best For |
| --- | --- | --- |
| Hybrid (default) | Combines Vector + BM25 results with weighted fusion (default: 50% each) | Most queries; balances semantic understanding with keyword precision |
| Vector Only | Pure semantic search via ChromaDB cosine similarity using Cohere multilingual embeddings | Conceptual queries, paraphrased questions, cross-language search |
| BM25 Only | Pure keyword search via SQLite FTS5 full-text indexing | Exact term matching, product codes, technical terms |

In vector search, the user's query is converted to a 1024-dimensional vector using the cohere.embed-multilingual-v3 model, then compared against all stored document chunks using cosine similarity in ChromaDB. This finds results that are semantically similar even when they use different words.

Multilingual Understanding

The Cohere multilingual model understands 100+ languages. A query in Indonesian can match content in English, and vice versa.
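To make the cosine similarity comparison concrete, here is a minimal pure-Python sketch. It uses toy 3-dimensional vectors in place of the 1024-dimensional Cohere embeddings, and the function names `cosine_similarity` and `rank_chunks` are illustrative assumptions, not part of BABEH or ChromaDB.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]]):
    """Return (chunk_id, similarity) pairs, best match first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
ranking = rank_chunks(
    [1.0, 0.0, 0.0],
    {"chunk-a": [2.0, 0.0, 0.0], "chunk-b": [0.0, 1.0, 0.0]},
)
```

Because cosine similarity ignores vector length, `chunk-a` ranks first even though its vector is twice as long as the query vector.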

BM25 search uses SQLite's FTS5 (Full-Text Search version 5) engine with the BM25 ranking algorithm. Scores are based on term frequency (how often the keyword appears in a chunk) and inverse document frequency (how rare the keyword is across all chunks).
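FTS5 ranking can be tried directly with Python's built-in `sqlite3` module, provided the underlying SQLite build has FTS5 enabled. The sketch below is not BABEH's schema: the table name `chunks`, the column `content`, and the helper `bm25_search` are assumptions for illustration. FTS5's `bm25()` function returns lower values for better matches, so the sketch negates it to get a higher-is-better score.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names; requires an SQLite build with FTS5.
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks(content) VALUES (?)",
    [
        ("harga Honda Civic terbaru",),
        ("spesifikasi mesin Toyota Corolla",),
        ("promo dan harga Toyota Avanza",),
    ],
)


def bm25_search(conn: sqlite3.Connection, query: str, limit: int = 5):
    """Return (content, score) rows, best match first."""
    rows = conn.execute(
        # bm25() is lower-is-better in FTS5, so negate it for a descending sort.
        "SELECT content, -bm25(chunks) AS score FROM chunks "
        "WHERE chunks MATCH ? ORDER BY score DESC LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


results = bm25_search(conn, "harga")
```

Raw BM25 scores are unbounded and depend on the corpus, so they would need some form of normalization before being compared with cosine similarities.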

Hybrid Fusion

In hybrid mode, results from both search methods are combined using weighted scores:

final_score = (vector_weight × vector_score) + (bm25_weight × bm25_score)

Default weights: Vector 50% + BM25 50% (vector_weight = bm25_weight = 0.5). Both weights can be changed in the system configuration.
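The weighted combination can be sketched in a few lines of Python. This is an illustration under two assumptions that the text above does not state: both score sets are already normalized to the 0.0 to 1.0 range, and a chunk found by only one method scores 0 for the other. The function name `fuse_scores` is hypothetical.

```python
def fuse_scores(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    vector_weight: float = 0.5,
    bm25_weight: float = 0.5,
) -> list[tuple[str, float]]:
    """Apply final_score = vector_weight * vector_score + bm25_weight * bm25_score."""
    chunk_ids = set(vector_scores) | set(bm25_scores)
    fused = {
        # Assumption: a chunk missing from one result set contributes 0 there.
        cid: vector_weight * vector_scores.get(cid, 0.0)
        + bm25_weight * bm25_scores.get(cid, 0.0)
        for cid in chunk_ids
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


ranked = fuse_scores({"a": 0.8, "b": 0.4}, {"a": 0.2, "c": 0.9})
```

With the default weights, chunk `a` scores 0.5 × 0.8 + 0.5 × 0.2 = 0.5, chunk `c` scores 0.45 from BM25 alone, and chunk `b` scores 0.2 from vector search alone.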

Query Decomposition

BABEH automatically detects multi-topic queries — questions that compare or ask about multiple items simultaneously. These are split into sub-queries for better retrieval.

Detection Patterns

| Pattern | Language | Example |
| --- | --- | --- |
| A vs B | EN / ID | "Honda Civic vs Toyota Corolla" |
| A atau B | ID | "Civic atau Corolla?" |
| A or B | EN | "Civic or Corolla?" |
| A dan B | ID | "fitur Civic dan Corolla" |
| A and B | EN | "features of Civic and Corolla" |
| perbedaan A dan B | ID | "perbedaan Civic dan Corolla" |
| difference between A and B | EN | "difference between Civic and Corolla" |

When a multi-topic query is detected, each sub-topic is searched independently, and results are merged for more diverse and comprehensive coverage.
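A rough way to picture the detection patterns is a regular-expression split. The sketch below is a simplified assumption, not BABEH's actual decomposition logic, and the names `SEPARATOR`, `PREFIX`, and `decompose` are made up for illustration.

```python
import re

# Separators from the detection patterns; the surrounding whitespace keeps
# "or" from matching inside words such as "Corolla".
SEPARATOR = re.compile(r"\s+(?:vs\.?|atau|or|dan|and)\s+", re.IGNORECASE)
# Leading phrases that introduce a comparison.
PREFIX = re.compile(r"^(?:perbedaan|difference between)\s+", re.IGNORECASE)


def decompose(query: str) -> list[str]:
    """Split a multi-topic query into sub-queries; other queries pass through."""
    text = PREFIX.sub("", query.strip().rstrip("?"))
    parts = [part.strip() for part in SEPARATOR.split(text) if part.strip()]
    return parts if len(parts) > 1 else [query.strip()]


sub_queries = decompose("Honda Civic vs Toyota Corolla")
```

This naive split does not copy shared words into every sub-query: "fitur Civic dan Corolla" becomes "fitur Civic" and "Corolla", whereas the pipeline diagram shows "harga" carried into both sub-queries. It would also over-split any phrase that merely contains "and" or "dan", so a real implementation needs more context than this sketch uses.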

Automatic Database Detection

BABEH automatically scans the query for keywords that indicate the user is asking about pricing or specifications. When detected, the relevant database is searched in addition to the document knowledge base.

Pricing Keywords

If any of these words appear in the query, the Pricing Database is searched:

harga, price, berapa, biaya, cost, tarif, pricing,
promo, diskon, discount, budget, kisaran, sekitar

Specification Keywords

If any of these words appear, the Product Specifications database is searched:

spesifikasi, spec, fitur, feature, transmisi, transmission,
mesin, engine, cc, tenaga, hp, hybrid, electric, listrik,
bensin, petrol, 4wd, awd, fwd, sunroof, carplay,
kamera, kamera 360, 360 camera, wireless, kursi,
kapasitas, seat, cooling seat, heated seat, cruise control,
lane assist, android auto, keyless, push start, rear camera,
electric seat, punya, ada fitur, apakah ada

Combined Detection

A query like "harga dan spesifikasi Honda Civic" triggers both pricing and spec detection, searching all three sources (documents + pricing + specs) simultaneously.
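The keyword routing can be sketched as follows. The keyword lists are copied from the sections above, but the matching rule (case-insensitive, whole words or phrases) and the names `detect_sources`, `PRICING_RE`, and `SPEC_RE` are assumptions; the actual matcher may behave differently, for example by matching substrings.

```python
import re

PRICING_KEYWORDS = [
    "harga", "price", "berapa", "biaya", "cost", "tarif", "pricing",
    "promo", "diskon", "discount", "budget", "kisaran", "sekitar",
]
SPEC_KEYWORDS = [
    "spesifikasi", "spec", "fitur", "feature", "transmisi", "transmission",
    "mesin", "engine", "cc", "tenaga", "hp", "hybrid", "electric", "listrik",
    "bensin", "petrol", "4wd", "awd", "fwd", "sunroof", "carplay",
    "kamera", "kamera 360", "360 camera", "wireless", "kursi",
    "kapasitas", "seat", "cooling seat", "heated seat", "cruise control",
    "lane assist", "android auto", "keyless", "push start", "rear camera",
    "electric seat", "punya", "ada fitur", "apakah ada",
]


def _compile(keywords: list[str]) -> re.Pattern:
    # Longest phrases first; \b keeps "cc" or "hp" from matching inside words.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


PRICING_RE = _compile(PRICING_KEYWORDS)
SPEC_RE = _compile(SPEC_KEYWORDS)


def detect_sources(query: str) -> list[str]:
    """Documents are always searched; databases are added on keyword hits."""
    sources = ["documents"]
    if PRICING_RE.search(query):
        sources.append("pricing")
    if SPEC_RE.search(query):
        sources.append("specs")
    return sources


sources = detect_sources("harga dan spesifikasi Honda Civic")
```

Whole-word matching is a deliberate choice in this sketch: without it, the keyword "cc" would fire on a query such as "Honda Accord".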

Relevance Threshold & Scoring

After search results are collected, they are filtered by a relevance threshold to ensure only meaningful results reach the LLM.

| Setting | Default | Range | Effect |
| --- | --- | --- | --- |
| Relevance Threshold | 0.25 | 0.0 – 1.0 | Results below this score are filtered out before reaching the LLM |
| Top K | 5 | 1 – 50 | Maximum number of results sent to the LLM as context |
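The two settings combine into a simple filter-then-truncate step. The sketch below assumes results arrive as `(chunk_id, score)` pairs; that shape and the function name `select_context` are illustrative only.

```python
def select_context(
    results: list[tuple[str, float]],
    threshold: float = 0.25,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Drop results scoring below the threshold, then keep the top_k best."""
    kept = [result for result in results if result[1] >= threshold]
    kept.sort(key=lambda result: result[1], reverse=True)
    return kept[:top_k]


candidates = [("a", 0.9), ("b", 0.2), ("c", 0.25), ("d", 0.6)]
context = select_context(candidates, top_k=2)
```

Here `b` is removed by the threshold, and `c` survives the threshold but is cut by `top_k=2`, leaving `a` and `d`.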

Score Color Coding

In the Search Debug tool, scores are color-coded:

| Color | Score Range | Meaning |
| --- | --- | --- |
| Green | ≥ 0.5 | High relevance; strong match |
| Yellow | ≥ 0.3 and < 0.5 | Moderate relevance; decent match |
| Red | < 0.3 | Low relevance; may be filtered out |
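Expressed as code, the color bands reduce to two comparisons. The function name `score_color` is hypothetical; the cut-offs are the ones listed above.

```python
def score_color(score: float) -> str:
    """Map a relevance score to its Search Debug color band."""
    if score >= 0.5:
        return "green"
    if score >= 0.3:
        return "yellow"
    return "red"


colors = [score_color(s) for s in (0.72, 0.31, 0.27)]
```

With the default relevance threshold of 0.25, a result scoring 0.27 is shown in red but still passes the filter, which is why red only means the result may be filtered out.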

Citation System

BABEH uses a numbered citation system to ensure AI-generated answers are traceable back to their source documents.

How Citations Work

sequenceDiagram
    participant Search as Search Engine
    participant LLM as LLM (Bedrock/Gemini)
    participant User as End User
    Search->>LLM: Send top-K results as numbered sources:<br/>[1] Document A - chunk content...<br/>[2] Document B - chunk content...
    LLM->>LLM: Generate answer using sources
    LLM-->>User: "The Honda Civic has 150HP [1] and...<br/>costs around 500 juta [2]"
    Note right of User: Citations [1], [2] link<br/>back to source documents

The response includes a citations JSON object that maps each reference number to its source document, enabling the frontend to display clickable source links.
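The numbering step and the citation mapping might look roughly like this. The exact JSON shape is not documented here, so every field name below (`document_id`, `title`, `content`) and both function names are assumptions made for illustration.

```python
import json
import re


def build_sources(results: list[dict]) -> tuple[str, dict]:
    """Number the top-K results; return the prompt block and a citation map."""
    prompt_lines = []
    citations = {}
    for number, result in enumerate(results, start=1):
        prompt_lines.append(f"[{number}] {result['title']} - {result['content']}")
        # Hypothetical mapping shape: reference number -> source document.
        citations[str(number)] = {
            "document_id": result["document_id"],
            "title": result["title"],
        }
    return "\n".join(prompt_lines), citations


def cited_numbers(answer: str) -> list[int]:
    """Extract the reference numbers, such as [1], that an answer uses."""
    return [int(n) for n in re.findall(r"\[(\d+)\]", answer)]


prompt_block, citations = build_sources([
    {"document_id": "doc-a", "title": "Document A", "content": "150HP engine"},
    {"document_id": "doc-b", "title": "Document B", "content": "500 juta"},
])
citations_json = json.dumps(citations)
used = cited_numbers("The Honda Civic has 150HP [1] and costs around 500 juta [2]")
```

A frontend could then look up each number in `used` in the citation map to render the clickable source links.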

No Information Response

When no relevant results are found (all scores below threshold, or no matching documents), the system returns a default response:

"Maaf, saya tidak memiliki informasi tersebut."
(Sorry, I don't have that information.)

This prevents the AI from hallucinating answers when no relevant context is available.

Streaming Responses

BABEH supports Server-Sent Events (SSE) streaming for real-time, word-by-word response delivery. The stream includes these event types:

| Event | Description |
| --- | --- |
| search_complete | Search phase finished, results available |
| llm_chunk | A piece of the LLM's response text |
| citations | Source citation mapping |
| metadata | Processing time, model used, token counts |
| done | Stream complete |
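A client consuming this stream could be sketched as below. It assumes the event type travels in the standard SSE `event:` field and that each `data:` payload is JSON. The payload fields shown (`results`, `text`) are invented for illustration, since the real payloads are not specified here.

```python
import json


def parse_sse(stream_text: str):
    """Yield (event, data) pairs from raw SSE text with event:/data: fields."""
    event, data_lines = "message", []
    for line in stream_text.splitlines():
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip())


# Hypothetical stream; the payload fields are assumptions.
sample = (
    'event: search_complete\ndata: {"results": 3}\n\n'
    'event: llm_chunk\ndata: {"text": "The Honda "}\n\n'
    'event: llm_chunk\ndata: {"text": "Civic [1]"}\n\n'
    'event: done\ndata: {}\n\n'
)
events = list(parse_sse(sample))
answer = "".join(data["text"] for name, data in events if name == "llm_chunk")
```

Concatenating the `llm_chunk` payloads in arrival order rebuilds the full answer text, while the other event types can update the UI independently.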