
Binary Quantization — 32× Vector Compression with Minimal Accuracy Loss

Admin
*"Vector search is memory-hungry. Binary quantization is the answer, but traditional methods lose 15-20% accuracy."* We cracked asymmetric binary quantization: 48 bytes per document instead of 1,536 bytes, 5× faster queries, and only 3.3% NDCG loss. No GPUs. No terabytes of RAM. Just efficient, accurate search on commodity hardware. Dive into the math, the implementation, and why this makes privacy-first search viable.

Binary Quantization for Vector Search: 32× Compression Without the Accuracy Trade-off

Vector search is memory-hungry. A million documents at 384 dimensions — 1,536 bytes per float32 vector — require ~1.5 GB just for raw storage. Scale to a web-sized index and you're looking at terabytes.

Product quantization and IVF indexes help, but they're complex to implement and slow to query. Binary quantization is simpler, faster, and — with the right implementation — surprisingly accurate.

What Binary Quantization Does

Binary quantization maps each float vector to a compact binary code. Instead of storing float32[384] (1,536 bytes), we store uint8[48] (48 bytes). That's a 32× reduction in storage.
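As a minimal sketch of that packing step: the snippet below binarizes with a simple sign threshold (an illustrative assumption — the trained codebook described later picks its thresholds from data) and packs the bits eight-to-a-byte with NumPy.

```python
import numpy as np

def binarize(vec):
    """Pack a float vector into one bit per dimension via sign thresholding.

    Sign thresholding is an illustrative assumption; a trained quantizer
    would learn per-dimension thresholds instead.
    """
    bits = (vec > 0).astype(np.uint8)   # one bit per dimension
    return np.packbits(bits)            # eight dimensions per byte

vec = np.random.randn(384).astype(np.float32)
code = binarize(vec)
print(vec.nbytes, "->", code.nbytes)    # 1536 -> 48
```

384 bits pack into exactly 48 bytes, which is where the per-document figure comes from.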

For retrieval, we use Hamming distance instead of cosine similarity. Hamming distance is just an XOR followed by a popcount — each a single CPU instruction on modern chips — so queries get faster even as the memory footprint shrinks.
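The XOR-plus-popcount idea can be sketched in a few lines. Here `np.unpackbits` stands in for the hardware popcount instruction a real SIMD kernel would use:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two packed uint8 codes: XOR, then popcount.

    np.unpackbits is a portable stand-in for the POPCNT instruction.
    """
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

a = np.packbits(np.zeros(16, dtype=np.uint8))
b_bits = np.zeros(16, dtype=np.uint8)
b_bits[[1, 5, 9]] = 1                   # flip three bits
b = np.packbits(b_bits)
print(hamming(a, b))                    # 3
```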

The Accuracy Problem

Standard binary quantization loses ~15-20% retrieval accuracy on most benchmarks. That's unacceptable for a production search engine.

We solve this with asymmetric binary quantization:

  • Separate codebooks for the query encoder and the document encoder
  • Supervised training on click-through data to learn which dimensions matter for retrieval
  • Dimensional weighting — important dimensions get higher weight in the binary encoding
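To make the "separate codebooks" idea concrete, here is a toy stand-in: per-dimension thresholds learned separately for queries and documents. The class name, and the use of plain medians as thresholds, are my illustrative assumptions — the actual system trains on click-through data and applies dimensional weighting, neither of which is shown here.

```python
import numpy as np

class AsymmetricBinarizer:
    """Toy stand-in for an asymmetric quantizer: separate per-dimension
    thresholds for the query side and the document side. Thresholds here
    are plain medians of a sample, not supervised click-trained codebooks."""

    def fit(self, query_vecs, doc_vecs):
        self.q_thresh = np.median(query_vecs, axis=0)
        self.d_thresh = np.median(doc_vecs, axis=0)
        return self

    def encode_query(self, v):
        return np.packbits((v > self.q_thresh).astype(np.uint8))

    def encode_document(self, v):
        return np.packbits((v > self.d_thresh).astype(np.uint8))

rng = np.random.default_rng(0)
enc = AsymmetricBinarizer().fit(rng.normal(size=(100, 384)),
                                rng.normal(size=(100, 384)))
print(enc.encode_query(rng.normal(size=384)).nbytes)  # 48
```

The point of the asymmetry: queries and documents come from different distributions, so a single shared threshold per dimension leaves accuracy on the table.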

Results on Our Dataset

| Method | Storage | P95 Latency | NDCG@10 |
| --- | --- | --- | --- |
| Full float (384-dim) | 1,536 bytes/doc | 180 ms | 0.847 |
| Product quantization | 64 bytes/doc | 95 ms | 0.801 |
| Our binary quantization | 48 bytes/doc | 38 ms | 0.819 |

We achieve 32× compression over full float vectors (1,536 bytes down to 48) while losing only 3.3% NDCG. Query latency drops nearly 5× because Hamming distance is hardware-accelerated.
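Spelled out, the ratios in the table reduce to:

```python
full_bytes, bq_bytes = 1536, 48     # storage per document
full_ndcg, bq_ndcg = 0.847, 0.819   # NDCG@10
full_ms, bq_ms = 180, 38            # P95 latency

print(full_bytes // bq_bytes)                              # 32  (compression)
print(round(100 * (full_ndcg - bq_ndcg) / full_ndcg, 1))   # 3.3 (% NDCG loss)
print(round(full_ms / bq_ms, 1))                           # 4.7 (x speedup)
```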

Implementation Details

The encoding pipeline:

```python
# Train the quantizer on labeled query-document pairs
codebook = train_asymmetric_quantizer(positive_pairs, negative_pairs)

# Encode documents (offline, batch)
doc_codes = codebook.encode_documents(all_documents)

# Encode queries (online, per-request)
query_code = codebook.encode_query(raw_query)

# Retrieve using Hamming distance
candidates = hamming_search(query_code, doc_codes, topk=100)
```

The quantizer is trained once on historical click data and deployed as a static artifact. Query encoding runs on CPU with ONNX Runtime — no GPU required.
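The retrieval step in the pipeline above can be sketched as a brute-force top-k search over packed codes. `hamming_topk` is my name for it, and a production kernel would use SIMD popcount rather than `np.unpackbits`:

```python
import numpy as np

def hamming_topk(query_code, doc_codes, topk=100):
    """Brute-force top-k retrieval by Hamming distance over packed uint8 codes.

    Illustrative only: real kernels use SIMD popcount, not unpackbits.
    """
    xor = np.bitwise_xor(doc_codes, query_code)      # (n_docs, n_bytes)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per document
    k = min(topk, len(dists))
    idx = np.argpartition(dists, k - 1)[:k]          # unordered k smallest
    return idx[np.argsort(dists[idx])]               # sort those k by distance

# Five toy documents where document i differs from the query in exactly i bits
bits = np.tril(np.ones((5, 384), dtype=np.uint8), -1)
doc_codes = np.packbits(bits, axis=1)                # (5, 48)
query_code = np.zeros(48, dtype=np.uint8)
print(hamming_topk(query_code, doc_codes, topk=3))   # [0 1 2]
```

`np.argpartition` keeps the scan O(n) for the candidate-selection step, with a final sort over only the k survivors.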

Why This Matters for Privacy-First Search

Running a search engine on commodity hardware means smaller data centers, fewer physical resources, and lower operational costs. This makes privacy-first search economically viable even for small teams.

IntentForge runs its full index on a single $20/month VPS because of compression techniques like this.

Future Work

We're exploring:

  • Learned binary codes via differentiable relaxation
  • Multi-scale quantization for hierarchical retrieval
  • GPU-accelerated Hamming for real-time reranking

All experiments are documented in our research notes at oxiverse.com/research.
