Search Strategy: Hybrid Semantic + Keyword
IntentForge v2 employs a hybrid search strategy that combines the traditional precision of keyword-based search with the contextual understanding of semantic (vector) search. This document explains why we use this approach and how it is implemented.
Why Hybrid Search?
Keyword search and semantic search have complementary strengths and weaknesses:
| Feature | Keyword Search (BM25) | Semantic Search (Vector) |
|---|---|---|
| Strengths | Exact matches, technical terms, specific IDs, rare words. | Synonyms, related concepts, loose topical ("vibe") matching, multilingual queries. |
| Weaknesses | Misses synonyms ("car" vs. "automobile"), ignores context. | Can score loosely related content as relevant; struggles with exact technical strings. |
By combining them, IntentForge provides results that are both conceptually relevant and technically precise.
Implementation Details
1. Vectorization (Semantic)
We use either BGE-Small or a BERT model, exported to ONNX, to embed queries and documents as 384-dimensional (BGE-Small) or 768-dimensional (BERT) vectors.
- ONNX Runtime: Used for high-performance, hardware-accelerated inference in Rust.
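- Binary Quantization: We compress each vector to one bit per dimension. Relative to 32-bit floats this is a 32x reduction in vector storage (8x relative to 8-bit integer vectors). Similarity then becomes a Hamming distance over packed bits, which accelerates similarity calculations by up to 10x with minimal loss in accuracy.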
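As a rough illustration of binary quantization, the sketch below keeps only the sign of each dimension and packs the bits into `u64` words, then compares two packed vectors by Hamming distance. The function names and the "positive means 1" rule are assumptions made for this example; they are not taken from the IntentForge codebase.

```rust
/// Pack the sign of each dimension into one bit (1 if the value is positive).
/// Hypothetical helper for illustration, not IntentForge's actual API.
pub fn binary_quantize(vector: &[f32]) -> Vec<u64> {
    // One u64 word holds 64 dimensions; round up for a partial last word.
    let mut packed = vec![0u64; (vector.len() + 63) / 64];
    for (i, &value) in vector.iter().enumerate() {
        if value > 0.0 {
            packed[i / 64] |= 1u64 << (i % 64);
        }
    }
    packed
}

/// Hamming distance between two packed vectors: XOR, then count set bits.
pub fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same packed length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

With this layout a 384-dimensional vector occupies six `u64` words (48 bytes), compared with 1,536 bytes as 32-bit floats. A smaller Hamming distance means the two original vectors agree in sign on more dimensions.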
2. BM25 (Keyword)
Meilisearch provides a highly optimized BM25 implementation for fast keyword lookups.
3. Weighted Fusion
Results from both engines are merged using a weighted fusion algorithm. By default, we use a 0.7 (Semantic) / 0.3 (Keyword) ratio.
- Adaptive Ratio: The engine automatically adjusts this ratio based on the query type. For example, a Navigational query (e.g., "site:github.com rust") will favor keyword matches, while an Informational query (e.g., "how does async work in rust") will favor semantic matches.
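The fusion step can be sketched as a weighted sum over per-document scores. This is a minimal sketch under stated assumptions: both engines are assumed to return scores already normalized to [0, 1] and keyed by document ID, the `QueryType` enum and function names are hypothetical, and the navigational weight of 0.3 is an illustrative value (only the 0.7 / 0.3 default comes from the text above).

```rust
use std::collections::HashMap;

/// Query categories named in the text; the enum itself is hypothetical.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum QueryType {
    Navigational,
    Informational,
}

/// Semantic weight per query type. 0.7 is the documented default;
/// the navigational value is an illustrative assumption.
pub fn semantic_weight_for(query_type: QueryType) -> f32 {
    match query_type {
        QueryType::Navigational => 0.3,
        QueryType::Informational => 0.7,
    }
}

/// Merge two score maps (assumed normalized to [0, 1]) into one ranked list.
/// A document missing from one engine contributes 0.0 for that engine.
pub fn weighted_fusion(
    semantic: &HashMap<String, f32>,
    keyword: &HashMap<String, f32>,
    semantic_weight: f32,
) -> Vec<(String, f32)> {
    let keyword_weight = 1.0 - semantic_weight;
    let mut fused: HashMap<String, f32> = HashMap::new();
    for (id, score) in semantic {
        *fused.entry(id.clone()).or_insert(0.0) += semantic_weight * score;
    }
    for (id, score) in keyword {
        *fused.entry(id.clone()).or_insert(0.0) += keyword_weight * score;
    }
    let mut ranked: Vec<(String, f32)> = fused.into_iter().collect();
    // Highest fused score first; ties broken by ID for deterministic output.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

Passing `semantic_weight_for(QueryType::Navigational)` instead of the informational weight shifts the same candidate set toward documents with strong keyword scores, which is the effect the adaptive ratio is meant to produce.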
Re-ranking with Cross-Encoders
For high-precision results, we implement a two-stage ranking process:
- Retrieval Stage: Hybrid search retrieves the top 40-50 candidates.
- Reranking Stage: A more powerful Cross-Encoder model scores the query together with each candidate description. This stage is slower than retrieval, which is why it runs only on the short candidate list, but it is far more accurate at determining the final order of results.
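The two-stage flow might look like the sketch below. The Cross-Encoder is represented by a caller-supplied scoring function, since the model interface is not described here; `rerank`, `score_pair`, and the toy `term_overlap` scorer are all hypothetical names used only to make the example self-contained.

```rust
/// Stage 2: rescore retrieved candidates with a slower, more accurate scorer
/// and keep the best `top_k`. `score_pair` stands in for the Cross-Encoder;
/// its signature is an assumption for illustration.
pub fn rerank<F>(
    query: &str,
    candidates: &[String],
    top_k: usize,
    score_pair: F,
) -> Vec<(String, f32)>
where
    F: Fn(&str, &str) -> f32,
{
    let mut scored: Vec<(String, f32)> = candidates
        .iter()
        .map(|description| (description.clone(), score_pair(query, description.as_str())))
        .collect();
    // Highest score first; the sort is stable, so ties keep retrieval order.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k);
    scored
}

/// Toy stand-in scorer: fraction of query terms found in the description.
/// A real Cross-Encoder would run a model over the (query, description) pair.
pub fn term_overlap(query: &str, description: &str) -> f32 {
    let description = description.to_lowercase();
    let terms: Vec<String> = query.split_whitespace().map(|t| t.to_lowercase()).collect();
    if terms.is_empty() {
        return 0.0;
    }
    let hits = terms.iter().filter(|t| description.contains(t.as_str())).count();
    hits as f32 / terms.len() as f32
}
```

In practice `candidates` would be the 40-50 results from the retrieval stage, and the closure would wrap model inference. Keeping the scorer as a parameter makes the ordering logic testable without a model.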
SimHash Deduplication
To keep search results varied, we use SimHash to detect near-duplicate content across different web sources. Documents whose SimHash fingerprints differ in only a few bits (a low Hamming distance) are treated as duplicates, and only the highest-quality version is presented to the user.
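A minimal SimHash sketch follows, assuming whitespace tokenization, equal token weights, a 64-bit fingerprint, and the standard library's `DefaultHasher` as the token hash. These choices, the function names, and any distance threshold passed to `is_near_duplicate` are assumptions for illustration; the production tokenizer, hash function, and threshold are not specified in this document.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit SimHash over lowercased, whitespace-separated tokens.
/// Each token votes +1 or -1 on every bit position; the sign of the
/// total decides the corresponding fingerprint bit.
pub fn simhash(text: &str) -> u64 {
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        // DefaultHasher::new() uses fixed keys, so results repeat within a
        // build, but the algorithm is not guaranteed stable across Rust versions.
        let mut hasher = DefaultHasher::new();
        token.to_lowercase().hash(&mut hasher);
        let token_hash = hasher.finish();
        for (bit, count) in counts.iter_mut().enumerate() {
            if (token_hash >> bit) & 1 == 1 {
                *count += 1;
            } else {
                *count -= 1;
            }
        }
    }
    let mut fingerprint = 0u64;
    for (bit, &count) in counts.iter().enumerate() {
        if count > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Two fingerprints are near-duplicates if at most `max_distance` bits differ.
pub fn is_near_duplicate(a: u64, b: u64, max_distance: u32) -> bool {
    (a ^ b).count_ones() <= max_distance
}
```

Documents that share most of their tokens tend to produce fingerprints that differ in few bits, so a small `max_distance` groups them together. Fingerprints that need to be persisted would call for a hash with a stable, documented output rather than `DefaultHasher`.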