Binary Quantization for Vector Search: 32× Compression Without the Accuracy Trade-off
Vector search is memory-hungry. A million documents, each a 384-dimensional float32 vector, require ~1.5 GB of raw storage alone. Scale to a web-sized index and you're looking at terabytes.
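A quick sanity check of that figure:

```python
# 1M documents x 384 dimensions x 4 bytes per float32
n_docs, dims, bytes_per_float = 1_000_000, 384, 4
raw_gb = n_docs * dims * bytes_per_float / 1e9
print(f"{raw_gb} GB")  # 1.536 GB
```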
Product quantization and IVF indexes help, but they're complex to implement and slow to query. Binary quantization is simpler, faster, and — with the right implementation — surprisingly accurate.
What Binary Quantization Does
Binary quantization maps each float vector to a compact binary code. Instead of storing float32[384] (1,536 bytes), we store uint8[48] (48 bytes). That's a 32× reduction in storage.
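As a concrete baseline, the simplest binary encoder thresholds each dimension at zero and packs the resulting bits into bytes. The learned quantizer described later replaces the fixed threshold, but the storage math is identical. A minimal NumPy sketch:

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Sign-threshold each dimension, then pack 8 bits per byte."""
    bits = (vectors > 0).astype(np.uint8)  # 1 where the component is positive
    return np.packbits(bits, axis=-1)      # 384 bits -> 48 bytes per vector

vecs = np.random.randn(1000, 384).astype(np.float32)  # float32[384]: 1,536 bytes each
codes = binarize(vecs)                                # uint8[48]:      48 bytes each
print(vecs.nbytes // codes.nbytes)  # 32
```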
For retrieval, we use Hamming distance instead of cosine similarity. Hamming distance is just an XOR followed by a popcount, each a single CPU instruction on modern chips. The result is faster queries at a fraction of the memory footprint.
The Accuracy Problem
Standard binary quantization loses ~15-20% retrieval accuracy on most benchmarks. That's unacceptable for a production search engine.
We solve this with asymmetric binary quantization:
- Separate codebooks for the query encoder and the document encoder
- Supervised training on click-through data to learn which dimensions matter for retrieval
- Dimensional weighting — important dimensions get higher weight in the binary encoding
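A sketch of how those three pieces could fit together. The names, the random placeholder values, and the weighted-Hamming scoring below are illustrative only, not the production implementation; real thresholds and weights would come from the supervised click-through training described above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for learned artifacts (placeholders, not trained values):
doc_thresholds = rng.normal(0.0, 0.01, 384)    # document-side codebook
query_thresholds = rng.normal(0.0, 0.01, 384)  # separate query-side codebook
dim_weights = rng.uniform(0.5, 2.0, 384)       # per-dimension importance

def encode_doc(v: np.ndarray) -> np.ndarray:
    return v > doc_thresholds

def encode_query(v: np.ndarray) -> np.ndarray:
    return v > query_thresholds

def weighted_hamming(q_bits: np.ndarray, d_bits: np.ndarray) -> float:
    # A mismatch on an important dimension costs more than on a noisy one.
    return float(dim_weights[q_bits != d_bits].sum())
```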
Results on Our Dataset
| Method | Storage | P95 Latency | NDCG@10 |
|---|---|---|---|
| Full float (384-dim) | 1,536 bytes/doc | 180ms | 0.847 |
| Product quantization | 64 bytes/doc | 95ms | 0.801 |
| Our binary quantization | 48 bytes/doc | 38ms | 0.819 |
We achieve 32× compression over full float vectors while giving up only 3.3% NDCG@10. Query latency drops nearly 5× because Hamming distance is hardware-accelerated.
Implementation Details
The encoding pipeline:
```python
# Train the quantizer on labeled query-document pairs
codebook = train_asymmetric_quantizer(positive_pairs, negative_pairs)

# Encode documents (offline, batch)
doc_codes = codebook.encode_documents(all_documents)

# Encode queries (online, per-request)
query_code = codebook.encode_query(raw_query)

# Retrieve using Hamming distance
candidates = hmm_search(query_code, doc_codes, top_k=100)
```
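The `hmm_search` step is not shown in the pipeline above; a brute-force version over packed codes could look like the following sketch (`hamming_top_k` is a hypothetical name, not the actual implementation):

```python
import numpy as np

def hamming_top_k(query_code, doc_codes, top_k=100):
    """Exhaustive Hamming search over packed uint8 codes."""
    diff = np.bitwise_xor(doc_codes, query_code)     # (n_docs, n_bytes)
    dists = np.unpackbits(diff, axis=1).sum(axis=1)  # popcount per document
    order = np.argsort(dists)[:top_k]                # nearest codes first
    return order, dists[order]

rng = np.random.default_rng(1)
doc_codes = rng.integers(0, 256, size=(500, 48), dtype=np.uint8)
query_code = doc_codes[42].copy()  # plant an exact match at index 42
idx, d = hamming_top_k(query_code, doc_codes, top_k=5)
print(idx[0], d[0])  # 42 0
```

A production version would replace `np.unpackbits` with a native popcount, but the XOR-then-count structure is the same.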
The quantizer is trained once on historical click data and deployed as a static artifact. Query encoding runs on CPU with ONNX Runtime — no GPU required.
Why This Matters for Privacy-First Search
Running a search engine on commodity hardware means smaller data centers, fewer physical resources, and lower operational costs. This makes privacy-first search economically viable even for small teams.
IntentForge runs its full index on a single $20/month VPS because of compression techniques like this.
Future Work
We're exploring:
- Learned binary codes via differentiable relaxation
- Multi-scale quantization for hierarchical retrieval
- GPU-accelerated Hamming for real-time reranking
All experiments are documented in our research notes at oxiverse.com/research.
Related Content
Building RAVANA v2: A Proto-Homeostatic Cognitive Architecture
How RAVANA v2 implements a five-layer GRACE control system with identity clamps for bounded AGI development.
IntentForge: How We Built a Privacy-First Search Engine on Tor
A technical deep dive into IntentForge's architecture — Tor-routed meta-search, intent-first matching, and binary quantized vectors.