
Binary Quantization — 32× Vector Compression with Minimal Accuracy Loss

Admin
*"Vector search is memory-hungry. Binary quantization is the answer, but traditional methods lose 15-20% accuracy."* We cracked asymmetric binary quantization: 48 bytes per document instead of 1,536 bytes, 5× faster queries, and only 3.3% NDCG loss. No GPUs. No terabytes of RAM. Just efficient, accurate search on commodity hardware. Dive into the math, the implementation, and why this makes privacy-first search viable.

Binary Quantization for Vector Search: 32× Compression Without the Accuracy Trade-off

Vector search is memory-hungry. A million documents at 384 dimensions — 1,536 bytes per float32 vector — require ~1.5 GB just for raw storage. Scale to a web-sized index and you're looking at terabytes.

Product quantization and IVF indexes help, but they're complex to implement and slow to query. Binary quantization is simpler, faster, and — with the right implementation — surprisingly accurate.

What Binary Quantization Does

Binary quantization maps each float vector to a compact binary code. Instead of storing float32[384] (1,536 bytes), we store uint8[48] (48 bytes). That's a 32× reduction in storage.
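As a minimal sketch of that packing step: the snippet below binarizes with a simple sign threshold (an illustrative assumption — the trained codebook described later picks its thresholds from data) and packs the bits eight-to-a-byte with NumPy.

```python
import numpy as np

def binarize(vec):
    """Pack a float vector into one bit per dimension via sign thresholding.

    Sign thresholding is an illustrative assumption; a trained quantizer
    would learn per-dimension thresholds instead.
    """
    bits = (vec > 0).astype(np.uint8)   # one bit per dimension
    return np.packbits(bits)            # eight dimensions per byte

vec = np.random.randn(384).astype(np.float32)
code = binarize(vec)
print(vec.nbytes, "->", code.nbytes)    # 1536 -> 48
```

384 bits pack into exactly 48 bytes, which is where the per-document figure comes from.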

For retrieval, we use Hamming distance instead of cosine similarity. Hamming distance is just an XOR followed by a popcount — each a single CPU instruction on modern chips — so queries get faster even as the memory footprint shrinks.
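The XOR-plus-popcount idea can be sketched in a few lines. Here `np.unpackbits` stands in for the hardware popcount instruction a real SIMD kernel would use:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two packed uint8 codes: XOR, then popcount.

    np.unpackbits is a portable stand-in for the POPCNT instruction.
    """
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

a = np.packbits(np.zeros(16, dtype=np.uint8))
b_bits = np.zeros(16, dtype=np.uint8)
b_bits[[1, 5, 9]] = 1                   # flip three bits
b = np.packbits(b_bits)
print(hamming(a, b))                    # 3
```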

The Accuracy Problem

Standard binary quantization loses ~15-20% retrieval accuracy on most benchmarks. That's unacceptable for a production search engine.

We solve this with asymmetric binary quantization:

  • Separate codebooks for the query encoder and the document encoder
  • Supervised training on click-through data to learn which dimensions matter for retrieval
  • Dimensional weighting — important dimensions get higher weight in the binary encoding
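To make the "separate codebooks" idea concrete, here is a toy stand-in: per-dimension thresholds learned separately for queries and documents. The class name, and the use of plain medians as thresholds, are my illustrative assumptions — the actual system trains on click-through data and applies dimensional weighting, neither of which is shown here.

```python
import numpy as np

class AsymmetricBinarizer:
    """Toy stand-in for an asymmetric quantizer: separate per-dimension
    thresholds for the query side and the document side. Thresholds here
    are plain medians of a sample, not supervised click-trained codebooks."""

    def fit(self, query_vecs, doc_vecs):
        self.q_thresh = np.median(query_vecs, axis=0)
        self.d_thresh = np.median(doc_vecs, axis=0)
        return self

    def encode_query(self, v):
        return np.packbits((v > self.q_thresh).astype(np.uint8))

    def encode_document(self, v):
        return np.packbits((v > self.d_thresh).astype(np.uint8))

rng = np.random.default_rng(0)
enc = AsymmetricBinarizer().fit(rng.normal(size=(100, 384)),
                                rng.normal(size=(100, 384)))
print(enc.encode_query(rng.normal(size=384)).nbytes)  # 48
```

The point of the asymmetry: queries and documents come from different distributions, so a single shared threshold per dimension leaves accuracy on the table.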

Results on Our Dataset

| Method | Storage | P95 Latency | NDCG@10 |
| --- | --- | --- | --- |
| Full float (384-dim) | 1,536 bytes/doc | 180 ms | 0.847 |
| Product quantization | 64 bytes/doc | 95 ms | 0.801 |
| Our binary quantization | 48 bytes/doc | 38 ms | 0.819 |

We achieve 32× compression over full float vectors (1,536 bytes down to 48) while losing only 3.3% NDCG. Query latency drops nearly 5× because Hamming distance is hardware-accelerated.
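Spelled out, the ratios in the table reduce to:

```python
full_bytes, bq_bytes = 1536, 48     # storage per document
full_ndcg, bq_ndcg = 0.847, 0.819   # NDCG@10
full_ms, bq_ms = 180, 38            # P95 latency

print(full_bytes // bq_bytes)                              # 32  (compression)
print(round(100 * (full_ndcg - bq_ndcg) / full_ndcg, 1))   # 3.3 (% NDCG loss)
print(round(full_ms / bq_ms, 1))                           # 4.7 (x speedup)
```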

Implementation Details

The encoding pipeline:

```python
# Train the quantizer on labeled query-document pairs
codebook = train_asymmetric_quantizer(positive_pairs, negative_pairs)

# Encode documents (offline, batch)
doc_codes = codebook.encode_documents(all_documents)

# Encode queries (online, per-request)
query_code = codebook.encode_query(raw_query)

# Retrieve using Hamming distance
candidates = hamming_search(query_code, doc_codes, topk=100)
```

The quantizer is trained once on historical click data and deployed as a static artifact. Query encoding runs on CPU with ONNX Runtime — no GPU required.
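The retrieval step in the pipeline above can be sketched as a brute-force top-k search over packed codes. `hamming_topk` is my name for it, and a production kernel would use SIMD popcount rather than `np.unpackbits`:

```python
import numpy as np

def hamming_topk(query_code, doc_codes, topk=100):
    """Brute-force top-k retrieval by Hamming distance over packed uint8 codes.

    Illustrative only: real kernels use SIMD popcount, not unpackbits.
    """
    xor = np.bitwise_xor(doc_codes, query_code)      # (n_docs, n_bytes)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per document
    k = min(topk, len(dists))
    idx = np.argpartition(dists, k - 1)[:k]          # unordered k smallest
    return idx[np.argsort(dists[idx])]               # sort those k by distance

# Five toy documents where document i differs from the query in exactly i bits
bits = np.tril(np.ones((5, 384), dtype=np.uint8), -1)
doc_codes = np.packbits(bits, axis=1)                # (5, 48)
query_code = np.zeros(48, dtype=np.uint8)
print(hamming_topk(query_code, doc_codes, topk=3))   # [0 1 2]
```

`np.argpartition` keeps the scan O(n) for the candidate-selection step, with a final sort over only the k survivors.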

Why This Matters for Privacy-First Search

Running a search engine on commodity hardware means smaller data centers, fewer physical resources, and lower operational costs. This makes privacy-first search economically viable even for small teams.

IntentForge runs its full index on a single $20/month VPS because of compression techniques like this.

Future Work

We're exploring:

  • Learned binary codes via differentiable relaxation
  • Multi-scale quantization for hierarchical retrieval
  • GPU-accelerated Hamming for real-time reranking

All experiments are documented in our research notes at oxiverse.com/research.
