IntentForge v2
Intent-First Discovery Engine — a high-performance, privacy-respecting discovery engine that identifies and ranks content based on user intent signals. It features autonomous background discovery, query expansion, two-stage ONNX re-ranking, intent-based anti-signals (commercial/spam filtering), and Tor-routed meta-search for unblocked access to the entire web. The index self-expands via a parallel fan-out self-improvement pipeline that searches all query variations concurrently, then batch-embeds and batch-indexes the results for ~3-10s convergence.
Key Features
- Intent-First Ranking — Semantic alignment scoring between queries and documents, not just keyword matching
- Tor-Routed Meta-Search — All search providers route through Tor with Snowflake/obfs4 bridges, bypassing IP blocks and CAPTCHAs
- Self-Improving Index — Autonomous background enrichment achieves 8/15 avg quality per query (4 perfect 15/15, 7 near-perfect 10-14/15)
- Binary Quantized Vectors — 8× compression (384→48 bytes/doc) with minimal accuracy loss
- Autonomous Discovery — Sitemaps, link following, RSS feeds, and Common Crawl delta ingestion
- Sub-50ms P95 Latency — Hybrid search with two-stage re-ranking
- Anti-Signals — Filters commercial/spam content via dedicated scoring
- Intent-Gated Crawling — Fast pre-filtering reduces noise by 30-50% before fetching
- Hybrid Extraction — Rust-native extraction (90% of pages) with Trafilatura fallback
- Tiered Storage — High-intent docs get full indexing; others are snippet-only
- Relative Candidate Ranking — Scores all unseen URLs by RRF + alignment + source weight, selecting top candidates
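The binary-quantized vectors above can be illustrated with a small sketch: each of the 384 float dimensions is reduced to its sign bit, so a vector packs into 384 / 8 = 48 bytes, and candidates can be compared with cheap Hamming distance. This is a hypothetical helper for illustration, not the project's actual implementation.

```rust
/// Quantize a float embedding to one sign bit per dimension.
/// 384 dims -> 384 bits -> 48 bytes. (Illustrative sketch only.)
fn binary_quantize(embedding: &[f32]) -> Vec<u8> {
    let mut packed = vec![0u8; (embedding.len() + 7) / 8];
    for (i, &x) in embedding.iter().enumerate() {
        if x > 0.0 {
            packed[i / 8] |= 1 << (i % 8);
        }
    }
    packed
}

/// Hamming distance between two packed vectors: a cheap stand-in
/// for cosine distance at search time.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let v: Vec<f32> = (0..384).map(|i| if i % 3 == 0 { 1.0 } else { -1.0 }).collect();
    let q = binary_quantize(&v);
    assert_eq!(q.len(), 48); // 384 dims compress to 48 bytes
    assert_eq!(hamming(&q, &q), 0); // identical vectors match exactly
    println!("{} bytes per document", q.len());
}
```

A full 384-dim f32 vector at 4 bytes per dimension occupies 1,536 bytes, so sign-bit packing is what makes the 48-byte-per-doc footprint possible.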
Quick Stats
| Metric | Value |
|---|---|
| Search Providers | 8 direct + SearXNG (70+ engines via Tor) |
| Vector Size | 48 bytes/doc (binary quantized) |
| Query Latency (P95) | <50ms target |
| Indexing Throughput | ~30k pages/hr (Starter tier) |
| RSS Sources | 150+ verified feeds across 20+ categories |
| Self-Improvement | Parallel fan-out, batch embed/index, ~3-10s convergence, 8/15 avg quality |
| Tor Coverage | All providers routed through Snowflake/obfs4 bridges |
Hardware Tiers
| Tier | Specs | Capacity |
|---|---|---|
| Starter (Dev) | 4 vCPU, 16 GB RAM, 100 GB NVMe | ~30k pages/hr, <5M docs, <60ms queries |
| Production | 8-16 vCPU, 32-64 GB RAM, 200-500 GB NVMe | 10M+ docs, global traffic, <50ms queries |
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ IntentForge v2 │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Crawler │◄────►│ Discovery │◄────►│ Indexer │ │
│ │(Batch + Rob) │ │ (Sitemaps) │ │ (Meilisearch)│ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ Rust Native │ │ ONNX MiniLM │ │ Redis Queue │ │
│ │ Extractor │◄─────►│ (Inference) │◄─────►│ (Priority) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ │ │
│ │ Trafilatura │ │ FastIntent │ │ │
│ │ (Fallback) │ │ Scorer │ │ │
│ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ ┌───────────────────────────▼──────────────────────▼──────┐ │
│ │ Common Crawl Delta (monthly, CDX + HEAD verify) │ │
│ │ Quality scoring → Dedup → Enqueue → State persist │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼──────────────────────────┐ │
│ │ Meta-Search Enrichment Pipeline (Tor-Routed) │ │
│ │ │ │
│ │ Direct Providers (via Tor): SearXNG (via Tor): │ │
│ │ • DuckDuckGo, Bing • Google, DDG, Brave │ │
│ │ • GitHub, ArXiv, Reddit • Bing, Wikipedia │ │
│ │ • GDELT, Invidious • StackOverflow, PyPI │ │
│ │ • Websurfx • NPM, HackerNews │ │
│ │ │ │
│ │ All traffic routed through Tor Snowflake/obfs4 bridges│ │
│ │ Exit nodes appear as residential IPs (no blocking) │ │
│ │ │ │
│ │ Self-Improvement: parallel fan-out, relative ranking, │ │
│ │ quality-gated indexing, partial improvements │ │
│ └────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────▼──────────┐ │
│ │ HTTP API (axum) │ │
│ │ Port 9100 │ │
│ │ /search /crawl │ │
│ │ /health /metrics │ │
│ │ /discovery/status │ │
│ └─────────┬──────────┘ │
└──────────────────────────────┼──────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┬───────────────────────┐
▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────┐
│ Meilisearch │ │ Redis │ │ Query Layer│ │ SearXNG │
│ v1.13+ │ │ Stack 7.2 │ │ (FastAPI)│ │ :8080 │
│ ( + BQ) │ │(+BloomFilter│ │ /search │ │(70+ eng) │
└─────────────┘ └─────────────┘ └─────────────┘ └──────────┘
Quick Start
Prerequisites
- Rust 1.80+ (edition 2021)
- Docker & Docker Compose
- Bash (Linux/macOS) or PowerShell (Windows)
1. Clone and Setup
git clone https://github.com/oxiverse-labs/intentforge.git
cd intentforge
# Copy environment template
cp .env.example .env
# Edit .env — only MEILI_MASTER_KEY is required for local dev
2. Start Infrastructure
Option A: Local Development (Recommended for Contributors)
Starts only the services you need — no Traefik, no Watchtower, no auto-deployment. The Rust API runs locally via cargo.
# Linux/macOS
chmod +x scripts/dev.sh
./scripts/dev.sh up
# Windows
scripts\dev.bat up
# Then build and run the API locally
cargo run --features tor
Option B: Production (Cloud VMs)
Includes Traefik (SSL), Watchtower (auto-updates), and GHCR image pulls. Requires DOMAIN, EMAIL, and GITHUB_OWNER_LOWER in .env.
docker compose up -d
Services comparison:
| Service | Local Dev | Production |
|---|---|---|
| Meilisearch | ✅ Port 7700 | ✅ Internal |
| Redis | ✅ Port 6379 | ✅ Internal |
| Trafilatura | ✅ Built from source | ✅ GHCR image |
| Query Layer | ✅ Built from source | ✅ GHCR image |
| YouTube Unified | ✅ Built from source | ✅ GHCR image |
| Tor Proxy | ✅ | ✅ |
| Whoogle / LibreX / SearXNG | ✅ | ✅ |
| Traefik (SSL) | ❌ Not needed | ✅ Auto SSL |
| Watchtower | ❌ Not needed | ✅ Auto updates |
| IntentForge API | ✅ cargo run | ✅ GHCR image |
Docker Containers
IntentForge uses a microservices architecture orchestrated by Docker Compose. Each container serves a specific purpose:
Core Infrastructure
| Container | Image | Port | Purpose | Health Check |
|---|---|---|---|---|
| meilisearch | getmeili/meilisearch:v1.13 | 7700 | Search index with binary quantization | HTTP /health |
| redis | redis/redis-stack:7.2.0-v9 | 6379 | Priority queue, cache, Bloom filters | redis-cli ping |
Microservices (Built from Source)
| Container | Build Context | Port | Purpose | Dependencies |
|---|---|---|---|---|
| trafilatura | ./services/trafilatura | 8080 | Content extraction from HTML | None |
| query-layer | ./services/query-layer | 8000 | Semantic search with ONNX re-ranking | Meilisearch |
| youtube-unified | ./services/youtube-unified | 8085 | Unified YouTube search API | None |
Tor & Meta-Search
| Container | Image | Port | Purpose | Dependencies |
|---|---|---|---|---|
| tor | alpine:latest (installs Tor) | 9050 | SOCKS5 proxy for anonymous routing | None |
| whoogle | benbusby/whoogle-search:latest | 5000 | Privacy-preserving Google metasearch | Tor |
| librex | librex/librex:latest | 8080 | PHP-based metasearch engine | None |
| searxng | searxng/searxng:latest | 8082 | Meta-search aggregating 70+ engines | Tor |
Optional Monitoring (--profile monitoring)
| Container | Image | Port | Purpose |
|---|---|---|---|
| redis-exporter | oliver006/redis_exporter:v1.58.0 | — | Redis metrics for Prometheus |
| prometheus | prom/prometheus:v2.55.0 | 9090 | Metrics collection and storage |
| grafana | grafana/grafana:11.4.0 | 3000 | Visualization dashboard |
3. Initialize Meilisearch Index
# Linux/macOS
chmod +x scripts/init_meilisearch.sh
./scripts/init_meilisearch.sh
# Windows
scripts\init_meilisearch.bat
4. Build and Run (Local Dev)
# Build in release mode with Tor support
cargo build --release --features tor
# Run the application
cargo run --features tor
5. Verify Setup
# Check API health
curl http://localhost:9100/health
# Run a test search
curl "http://localhost:9100/search?q=rust+programming"
# Check discovery status
curl http://localhost:9100/discovery/status
6. Run Tests
# Unit and integration tests
cargo test
# Code quality
cargo fmt && cargo clippy --features tor
# Load testing (requires infrastructure running)
locust -f tests/load_test.py --host=http://localhost:9100
Configuration
config.yaml
Key configuration sections:
| Section | Key | Default | Description |
|---|---|---|---|
| meilisearch | url | http://localhost:7700 | Meilisearch endpoint |
| | binary_quantization | true | Enable 8× vector compression |
| | semantic_ratio | 0.7 | Hybrid search weight |
| crawler | rate_limit | 10 | Requests per second |
| | max_concurrent | 8 | Parallel fetches |
| | respect_robots | true | Honor robots.txt |
| | static_first | true | Try static HTML before JS |
| | js_whitelist | ["producthunt.com", "news.ycombinator.com"] | Domains requiring JS |
| discovery | interval_secs | 300 | Discovery cycle interval |
| | max_queue_size | 50000 | Redis queue threshold |
| | follow_links | true | Enable link following |
| | link_threshold | 0.6 | Quality score threshold for links |
| discovery.common_crawl_delta | enabled | true | Enable monthly CC delta ingestion |
| | crawl_id | CC-MAIN-2025-08 | CC crawl ID to query |
| | domains | [github.com, ...] | Domains to discover |
| extraction | trafilatura_url | http://localhost:8080 | Trafilatura microservice URL |
| | batch_size | 16 | Extraction batch size |
| inference | model_path | models/all-MiniLM-L6-v2.onnx | ONNX model path |
| | embedding_dim | 384 | Embedding dimension |
| | fast_intent_threshold | 0.6 | Fast scorer threshold |
| self_improvement | enabled | true | Enable gap-filling pipeline |
| | min_gap_score | 0.5 | Queries below this trigger enrichment |
| | max_results_per_query | 30 | Max results per external query |
| | max_search_rounds | 1 | Max rounds (1 = parallel fan-out) |
| | max_parallel_gaps | 3 | Concurrent gap processing tasks |
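Putting the defaults above together, a config.yaml might look like the following. This is an illustrative fragment reconstructed from the table, not a verbatim copy of the shipped file:

```yaml
meilisearch:
  url: "http://localhost:7700"
  binary_quantization: true
  semantic_ratio: 0.7

crawler:
  rate_limit: 10
  max_concurrent: 8
  respect_robots: true
  static_first: true

self_improvement:
  enabled: true
  min_gap_score: 0.5
  max_results_per_query: 30
  max_search_rounds: 1
  max_parallel_gaps: 3
```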
sources.yaml
Configure RSS feeds and domains for autonomous discovery. Includes 150+ verified sources across 20+ categories:
sources:
  # AI & Machine Learning
  - name: "ArXiv AI"
    url: "http://arxiv.org/rss/cs.AI"
    category: "ai"
    priority: 10
  - name: "Hugging Face Blog"
    url: "https://huggingface.co/blog/feed.xml"
    category: "ai"
    priority: 9
  # Technology & Programming
  - name: "Hacker News"
    url: "https://hnrss.org/frontpage"
    category: "tech"
    priority: 10
  - name: "GitHub Blog"
    url: "https://github.blog/feed/"
    category: "devtools"
    priority: 9

domains:
  - name: "GitHub"
    base_url: "https://github.com"
    category: "devtools"
    priority: 10
  - name: "ArXiv"
    base_url: "https://arxiv.org"
    category: "ai"
    priority: 10
Tor & Privacy
All search traffic is routed through the Tor network using Snowflake/obfs4 bridges:
- Snowflake bridges — Routes traffic through volunteer WebRTC peers, making exit IPs appear as regular residential/consumer IPs instead of known Tor exit nodes
- obfs4 bridges — Proven reliable since 2015, obfuscates traffic as random noise
- Automatic circuit rotation — Circuits are rotated every 5 minutes for fresh exit IPs
- Fallback mode — If Snowflake is unavailable, standard Tor with diverse exit nodes is used
This bypasses services that block Tor exit nodes (Google, Bing, etc.) and provides privacy for all search queries.
Meta-Search Enrichment
When a search query returns no results, fewer than 5 results, or an average relevance below 0.3, the system automatically queues it for background enrichment:
- Direct providers (DuckDuckGo, Bing, GitHub, ArXiv, Reddit, GDELT, Invidious) are queried in parallel via Tor with provider-specific timeouts (SearXNG: 8s, DDG/Bing/GDELT: 4s, others: 5-6s)
- SearXNG aggregates 70+ engines through Tor (socks5h:// for DNS-through-Tor) in a single call
- Results are scored using relative ranking (RRF × 0.4 + alignment × 0.3 + source weight × 0.3)
- Top 15 candidates are crawled, embedded via ONNX, and indexed into Meilisearch behind a quality gate (6s global timeout, 15-result early-exit target)
- Self-improvement uses parallel fan-out (all variations searched concurrently), batch embedding, and batch indexing for ~3-10s convergence
- Future queries for the same topic return enriched results
This creates a self-healing index that grows automatically based on user demand, achieving an average of 8/15 quality documents per query.
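The relative-ranking step above (RRF × 0.4 + alignment × 0.3 + source weight × 0.3, keep the top candidates) can be sketched as follows. The struct and field names are assumptions for illustration, not the project's real schema; only the weights come from the text.

```rust
/// A search-result candidate with its three ranking signals.
/// (Field names are illustrative assumptions, not the real schema.)
struct Candidate {
    url: String,
    rrf: f32,           // reciprocal-rank-fusion score across providers
    alignment: f32,     // semantic alignment with the query intent
    source_weight: f32, // trust weight of the originating provider
}

/// Composite score as described in the enrichment pipeline:
/// RRF x 0.4 + alignment x 0.3 + source weight x 0.3.
fn relative_score(c: &Candidate) -> f32 {
    c.rrf * 0.4 + c.alignment * 0.3 + c.source_weight * 0.3
}

/// Rank all unseen candidates and keep the top `k` for crawling.
fn top_candidates(mut cands: Vec<Candidate>, k: usize) -> Vec<Candidate> {
    cands.sort_by(|a, b| {
        relative_score(b)
            .partial_cmp(&relative_score(a))
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    cands.truncate(k);
    cands
}

fn main() {
    let cands = vec![
        Candidate { url: "a".into(), rrf: 0.9, alignment: 0.8, source_weight: 0.5 },
        Candidate { url: "b".into(), rrf: 0.2, alignment: 0.3, source_weight: 0.9 },
    ];
    // "a" wins: 0.75 vs 0.44 under the weighted sum.
    let top = top_candidates(cands, 1);
    assert_eq!(top[0].url, "a");
}
```

Because the score is relative rather than thresholded, even a weak result set yields a usable ordering, which is what lets the pipeline select the top 15 for crawling regardless of absolute quality.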
Core Optimizations
| Area | Optimization | Gain |
|---|---|---|
| Discovery | Autonomous Link Following + Enhanced Discovery | Infinite index expansion |
| Meta-Search | Tor-routed providers + SearXNG (70+ engines), 6s global timeout, provider-specific timeouts | Unblocked access, 3× quality |
| Self-Improvement | Parallel fan-out + batch embed/index + relative ranking + single-pass processing | ~3-10s convergence, 8/15 avg quality |
| Tor Routing | Snowflake/obfs4 bridges + diverse exit nodes | Bypasses IP blocks |
| Vectors | Binary quantization (384 → 48 bytes) | 8× smaller |
| Search | Hybrid (semantic 0.7 + keyword 0.3) + two-stage rerank | 2× relevance |
| Query Expansion | Synonym expansion + intent auto-filters | Broader semantic recall |
| Anti-Signals | commercial_score + spam_score filtering | Noise reduction |
| Intent-Gated Crawling | FastIntentScorer pre-filters URLs before fetching | 30-50% noise reduction |
| Hybrid Extraction | Rust-native (scraper) + Trafilatura fallback | 5× throughput |
| Tiered Storage | High-intent full index, low-intent snippet-only | 40% memory savings |
| Batching | Parallel fetch & inference blocks | 5× throughput |
| High-Throughput Indexing | Cached index handle + fire-and-forget Meilisearch updates | 10× indexing speed |
| Concurrent Discovery | Async channels (mpsc) + JoinSet for parallel search paths | 3× faster search enrichment |
| Dedup | SimHash + RedisBloom (BF.EXISTS/BF.ADD) | <500 MB @ 10M docs |
| Inference | ONNX Runtime (ort crate in Rust) | 4× faster than Python |
| Search Caching | Redis-based SearchResponse caching with Zstd compression | Sub-50ms repeated queries |
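The SimHash side of the dedup row above can be sketched with std's `DefaultHasher` as a toy token hash. This is an illustrative stand-in, not the hashing the project actually uses: near-duplicate documents produce 64-bit fingerprints with a small Hamming distance, which pairs naturally with a Bloom-filter membership check before indexing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit SimHash over whitespace tokens. Each token hash votes +1/-1
/// per bit position; the sign of the total sets the fingerprint bit.
/// (Toy sketch using std's DefaultHasher, not the project's hasher.)
fn simhash(text: &str) -> u64 {
    let mut weights = [0i32; 64];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.hash(&mut h);
        let hv = h.finish();
        for (bit, w) in weights.iter_mut().enumerate() {
            if hv >> bit & 1 == 1 { *w += 1 } else { *w -= 1 }
        }
    }
    weights
        .iter()
        .enumerate()
        .fold(0u64, |acc, (bit, &w)| if w > 0 { acc | 1 << bit } else { acc })
}

fn main() {
    let a = simhash("rust async runtime tokio scheduler");
    let b = simhash("rust async runtime tokio executor");
    // Deterministic: the same text always yields the same fingerprint,
    // so fingerprints can be stored and compared across crawl cycles.
    assert_eq!(a, simhash("rust async runtime tokio scheduler"));
    // Near-duplicates typically differ in only a few of the 64 bits.
    println!("hamming(a, b) = {}", (a ^ b).count_ones());
}
```

In the pipeline described above, the fingerprint would be checked against RedisBloom (`BF.EXISTS` before enqueue, `BF.ADD` after indexing) so duplicate URLs and near-duplicate pages are dropped in constant memory.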
HTTP API (Port 9100)
| Method | Endpoint | Description |
|---|---|---|
| GET | /search?q=<query> | Hybrid search with query expansion, reranking, and filters |
| GET | /news?q=<query> | News aggregation from 5+ sources |
| GET | /images?q=<query> | Image search (index + Pixabay/Pexels) |
| GET | /videos?q=<query> | Video search across 8 sources |
| GET | /crawl?url=<url> | Crawl a single URL |
| POST | /crawl/batch | Batch crawl multiple URLs |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /discovery/status | Queue size, last cycle, configured domains |
| POST | /discovery/enqueue | Manually enqueue a URL |
| POST | /admin/reindex-scores | Re-index documents to fix legacy scores |
See API_REFERENCE.md for full endpoint documentation.
Project Structure
intentforge/
├── src/ # Rust core
│ ├── api/ # HTTP API (Axum)
│ ├── crawler/ # Web crawler with robots.txt
│ ├── indexer/ # Meilisearch indexing
│ ├── inference/ # ONNX embedding & reranking
│ ├── discovery/ # Autonomous discovery service
│ ├── meta_search/ # Meta-search framework with Tor routing
│ │ ├── aggregator.rs # Provider orchestration + RRF scoring
│ │ ├── tor.rs # Tor daemon management + bridges
│ │ ├── proxy_manager.rs # Public proxy scraping + validation
│ │ └── providers/ # Individual search providers (DDG, Bing, etc.)
│ ├── common_crawl/ # Common Crawl delta ingestion
│ ├── redis_store/ # Redis client with Bloom filters
│ ├── self_improvement/ # Gap-filling and auto-enrichment
│ ├── trending/ # RSS feed monitoring
│ ├── sources/ # Source configuration loader
│ ├── video_discovery/ # Video search (YouTube, Piped, etc.)
│ ├── image_search/ # Image search (Pixabay, Pexels)
│ ├── intent_classifier/ # Intent classification
│ ├── domain_manager/ # Domain-aware scoring
│ ├── query_expansion/ # Query rewriting and expansion
│ ├── anti_detection/ # Anti-detection headers
│ ├── anti_detection_client/ # Anti-detection HTTP client
│ ├── cloudflare_bypass/ # Cloudflare bypass
│ ├── image_indexer/ # Image indexing
│ └── metrics/ # Prometheus metrics
├── services/
│ ├── query_layer/ # Python FastAPI semantic search + ranking
│ ├── trafilatura/ # Python content extraction
│ ├── searxng/ # Self-hosted meta-search (70+ engines)
│ └── youtube-unified/ # YouTube search aggregation
├── scripts/ # Helper scripts
├── tests/ # Load and accuracy tests
├── docs/ # Documentation
├── config.yaml # Application configuration
├── sources.yaml # RSS/discovery sources
└── docker-compose.yml # Docker orchestration
Technology Stack
| Component | Technology |
|---|---|
| Core API | Rust 1.80+ (Axum, Tokio, Reqwest) |
| Semantic Search | Python FastAPI + ONNX Runtime |
| Index | Meilisearch v1.13+ with binary quantization |
| Queue/Cache | Redis Stack 7.2 + RedisBloom |
| Content Extraction | Trafilatura (Python) + Rust scraper (native) |
| Meta-Search | SearXNG (70+ engines) + 8 direct providers (all Tor-routed) |
| Tor Routing | Tor daemon + Snowflake/obfs4 bridges (bypasses exit node blocking) |
| Video Discovery | YouTube Unified, Piped, Invidious, Internet Archive, Vimeo |
| Image Search | Pixabay, Pexels, SearXNG (multiple engines) |
| Embeddings | all-MiniLM-L6-v2 (ONNX, 384-dim float → 48-byte binary) |
| Rust ONNX | ort crate for embedding inference |
| Monitoring | Prometheus + Grafana |
| Containerization | Docker + Docker Compose |
Self-Improvement Results
The self-improvement pipeline runs autonomously in the background for queries with low-quality results:
| Quality Tier | Queries | Count |
|---|---|---|
| Perfect 15/15 | simhash, ONNX, kubernetes, backpacking | 4 |
| Near Perfect 10-14 | rust tokio (14), rust lifetime (14), PostgreSQL (12), bonsai (11), distributed consensus (10), WebAssembly SIMD (10), sourdough pizza (10) | 7 |
| Partial 1-9 | zero knowledge (7), EMI PCB (5), meilisearch (4), japanese knife (3), react server (3), terraform (1), reinforcement learning (1) | 7 |
| Failed 0/15 | sourdough starter, home espresso | 2 |
Average quality: 8.0/15 across all 20 test queries.
License
This project is licensed under the Intent Engine Community License (IECL) v1.0 - see the LICENSE file for details.
Key Points:
- ✅ Free for Non-Commercial Purposes (personal, educational, academic, internal evaluation)
- ❌ Commercial use requires separate Commercial License
- 📧 Contact: anony45.omnipresent@proton.me for Commercial Licensing
Non-Commercial Purposes include:
- Personal use
- Educational purposes
- Academic research
- Internal evaluation
- Open research experimentation
Commercial Use (requires separate license):
- Selling the Software
- Offering as a hosted service (SaaS)
- Integrating into paid products
- Commercial consulting or client work
- Any revenue-generating activity
Built with ❤️ by Likhith Sai Seemala