IntentForge v2

Intent-First Discovery Engine: a high-performance, privacy-respecting discovery engine that identifies and ranks content based on user intent signals. It features autonomous background discovery, query expansion, two-stage ONNX re-ranking, intent-based anti-signals (commercial/spam filtering), and Tor-routed meta-search that reaches providers which would otherwise block direct requests. The index expands itself through a parallel fan-out self-improvement pipeline that searches all query variations concurrently, then batch-embeds and batch-indexes the results, converging in roughly 3-10s.

Key Features

  • Intent-First Ranking — Semantic alignment scoring between queries and documents, not just keyword matching
  • Tor-Routed Meta-Search — All search providers route through Tor with Snowflake/obfs4 bridges, bypassing IP blocks and CAPTCHAs
  • Self-Improving Index — Autonomous background enrichment achieves 8/15 avg quality per query (4 perfect 15/15, 7 near-perfect 10-14/15)
  • Binary Quantized Vectors — 384-dimensional embeddings stored at 1 bit per dimension (48 bytes/doc, 8× smaller than one byte per dimension) with minimal accuracy loss
  • Autonomous Discovery — Sitemaps, link following, RSS feeds, and Common Crawl delta ingestion
  • Sub-50ms P95 Latency — Hybrid search with two-stage re-ranking (target; <60ms on the Starter tier)
  • Anti-Signals — Filters commercial/spam content via dedicated scoring
  • Intent-Gated Crawling — Fast pre-filtering reduces noise by 30-50% before fetching
  • Hybrid Extraction — Rust-native extraction (90% of pages) with Trafilatura fallback
  • Tiered Storage — High-intent docs get full indexing; others are snippet-only
  • Relative Candidate Ranking — Scores all unseen URLs by RRF + alignment + source weight, selecting top candidates
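
To make the 48-byte figure concrete: a 384-dimensional embedding stored at one bit per dimension takes 384 / 8 = 48 bytes. The sketch below shows one common way to do this, sign-based quantization compared by Hamming distance. It is an illustration only. The `quantize` and `hamming` helpers are hypothetical names, and the thresholding rule (bit set when the component is positive) is an assumption, not necessarily the scheme IntentForge or Meilisearch uses.

```rust
/// Number of dimensions produced by all-MiniLM-L6-v2.
const DIM: usize = 384;
/// One bit per dimension, packed eight to a byte: 384 / 8 = 48.
const PACKED_LEN: usize = DIM / 8;

/// Sign-based binary quantization (assumed scheme): bit i is set when v[i] > 0.0.
fn quantize(v: &[f32; DIM]) -> [u8; PACKED_LEN] {
    let mut out = [0u8; PACKED_LEN];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

/// Hamming distance between two packed vectors: the number of differing bits.
fn hamming(a: &[u8; PACKED_LEN], b: &[u8; PACKED_LEN]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Two embeddings that disagree in sign on two dimensions end up at Hamming distance 2. A binary index compares that distance instead of computing a float dot product.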

Quick Stats

| Metric | Value |
| --- | --- |
| Search Providers | 8 direct + SearXNG (70+ engines via Tor) |
| Vector Size | 48 bytes/doc (binary quantized) |
| Query Latency (P95) | <50ms target |
| Indexing Throughput | ~30k pages/hr (Starter tier) |
| RSS Sources | 150+ verified feeds across 20+ categories |
| Self-Improvement | Parallel fan-out, batch embed/index, ~3-10s convergence, 8/15 avg quality |
| Tor Coverage | All providers routed through Snowflake/obfs4 bridges |

Hardware Tiers

| Tier | Specs | Capacity |
| --- | --- | --- |
| Starter (Dev) | 4 vCPU, 16 GB RAM, 100 GB NVMe | ~30k pages/hr, <5M docs, <60ms queries |
| Production | 8-16 vCPU, 32-64 GB RAM, 200-500 GB NVMe | 10M+ docs, global traffic, <50ms queries |
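
As a rough sizing aid, the figures in the tables above can be combined into back-of-the-envelope estimates. The sketch below is only arithmetic on the stated numbers (48 bytes per vector, ~30k pages/hr). The function names are hypothetical, and the estimate ignores document text, index structures, and Redis overhead.

```rust
/// Bytes needed to hold `docs` binary-quantized vectors of `bytes_per_doc` bytes each.
fn vector_bytes(docs: u64, bytes_per_doc: u64) -> u64 {
    docs * bytes_per_doc
}

/// Hours needed to index `docs` pages at a sustained `pages_per_hour`.
fn hours_to_index(docs: u64, pages_per_hour: u64) -> f64 {
    docs as f64 / pages_per_hour as f64
}
```

For example, `vector_bytes(10_000_000, 48)` is 480,000,000 bytes (about 480 MB) for the vectors of a 10M-document index. `hours_to_index(5_000_000, 30_000)` is about 167 hours to fill the Starter tier at its stated throughput.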

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                            IntentForge v2                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐               │
│  │   Crawler    │◄────►│  Discovery   │◄────►│   Indexer    │               │
│  │(Batch + Rob) │      │  (Sitemaps)  │      │ (Meilisearch)│               │
│  └──────┬───────┘      └──────┬───────┘      └──────┬───────┘               │
│         │                     │                     │                       │
│  ┌──────▼──────┐       ┌──────▼──────┐       ┌──────▼──────┐                │
│  │ Rust Native │       │ ONNX MiniLM │       │ Redis Queue │                │
│  │ Extractor   │◄─────►│ (Inference) │◄─────►│ (Priority)  │                │
│  └──────┬──────┘       └──────┬──────┘       └──────┬──────┘                │
│         │                     │                     │                       │
│  ┌──────▼──────┐       ┌──────▼──────┐              │                       │
│  │ Trafilatura │       │ FastIntent  │              │                       │
│  │ (Fallback)  │       │  Scorer     │              │                       │
│  └─────────────┘       └──────┬──────┘              │                       │
│                               │                     │                       │
│  ┌────────────────────────────▼─────────────────────▼──────┐                │
│  │  Common Crawl Delta (monthly, CDX + HEAD verify)        │                │
│  │  Quality scoring → Dedup → Enqueue → State persist      │                │
│  └────────────────────────────┬────────────────────────────┘                │
│                               │                                             │
│  ┌────────────────────────────▼────────────────────────────┐                │
│  │  Meta-Search Enrichment Pipeline (Tor-Routed)           │                │
│  │                                                         │                │
│  │  Direct Providers (via Tor):   SearXNG (via Tor):       │                │
│  │  • DuckDuckGo, Bing            • Google, DDG, Brave     │                │
│  │  • GitHub, ArXiv, Reddit       • Bing, Wikipedia        │                │
│  │  • GDELT, Invidious            • StackOverflow, PyPI    │                │
│  │  • Websurfx                    • NPM, HackerNews        │                │
│  │                                                         │                │
│  │  All traffic routed through Tor Snowflake/obfs4 bridges │                │
│  │  Exit nodes appear as residential IPs (no blocking)     │                │
│  │                                                         │                │
│  │  Self-Improvement: parallel fan-out, relative ranking,  │                │
│  │  quality-gated indexing, partial improvements           │                │
│  └────────────────────────────┬────────────────────────────┘                │
│                               │                                             │
│                     ┌─────────▼──────────┐                                  │
│                     │  HTTP API (axum)   │                                  │
│                     │  Port 9100         │                                  │
│                     │  /search /crawl    │                                  │
│                     │  /health /metrics  │                                  │
│                     │  /discovery/status │                                  │
│                     └─────────┬──────────┘                                  │
└───────────────────────────────┼─────────────────────────────────────────────┘
                                │
          ┌──────────────────┬──┴───────────────┬──────────────────┐
          ▼                  ▼                  ▼                  ▼
  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
  │  Meilisearch  │  │     Redis     │  │  Query Layer  │  │    SearXNG    │
  │    v1.13+     │  │   Stack 7.2   │  │   (FastAPI)   │  │     :8080     │
  │    (+ BQ)     │  │(+ BloomFilter)│  │    /search    │  │   (70+ eng)   │
  └───────────────┘  └───────────────┘  └───────────────┘  └───────────────┘

Quick Start

Prerequisites

  • Rust 1.80+ (edition 2021)
  • Docker & Docker Compose
  • Bash (Linux/macOS) or PowerShell (Windows)

1. Clone and Setup

git clone https://github.com/oxiverse-labs/intentforge.git
cd intentforge

# Copy environment template
cp .env.example .env

# Edit .env — only MEILI_MASTER_KEY is required for local dev

2. Start Infrastructure

Option A: Local Development (Recommended for Contributors)

Starts only the services you need — no Traefik, no Watchtower, no auto-deployment. The Rust API runs locally via cargo.

# Linux/macOS
chmod +x scripts/dev.sh
./scripts/dev.sh up

# Windows
scripts\dev.bat up

# Then build and run the API locally
cargo run --features tor

Option B: Production (Cloud VMs)

Includes Traefik (SSL), Watchtower (auto-updates), and GHCR image pulls. Requires DOMAIN, EMAIL, and GITHUB_OWNER_LOWER in .env.

docker compose up -d

Services comparison:

| Service | Local Dev | Production |
| --- | --- | --- |
| Meilisearch | ✅ Port 7700 | ✅ Internal |
| Redis | ✅ Port 6379 | ✅ Internal |
| Trafilatura | ✅ Built from source | ✅ GHCR image |
| Query Layer | ✅ Built from source | ✅ GHCR image |
| YouTube Unified | ✅ Built from source | ✅ GHCR image |
| Tor Proxy | | |
| Whoogle / LibreX / SearXNG | | |
| Traefik (SSL) | ❌ Not needed | ✅ Auto SSL |
| Watchtower | ❌ Not needed | ✅ Auto updates |
| IntentForge API | cargo run | ✅ GHCR image |

Docker Containers

IntentForge uses a microservices architecture orchestrated by Docker Compose. Each container serves a specific purpose:

Core Infrastructure

| Container | Image | Port | Purpose | Health Check |
| --- | --- | --- | --- | --- |
| meilisearch | getmeili/meilisearch:v1.13 | 7700 | Search index with binary quantization | HTTP /health |
| redis | redis/redis-stack:7.2.0-v9 | 6379 | Priority queue, cache, Bloom filters | redis-cli ping |

Microservices (Built from Source)

| Container | Build Context | Port | Purpose | Dependencies |
| --- | --- | --- | --- | --- |
| trafilatura | ./services/trafilatura | 8080 | Content extraction from HTML | None |
| query-layer | ./services/query-layer | 8000 | Semantic search with ONNX re-ranking | Meilisearch |
| youtube-unified | ./services/youtube-unified | 8085 | Unified YouTube search API | None |

Tor & Meta-Search

| Container | Image | Port | Purpose | Dependencies |
| --- | --- | --- | --- | --- |
| tor | alpine:latest (installs Tor) | 9050 | SOCKS5 proxy for anonymous routing | None |
| whoogle | benbusby/whoogle-search:latest | 5000 | Privacy-preserving Google metasearch | Tor |
| librex | librex/librex:latest | 8080 | PHP-based metasearch engine | None |
| searxng | searxng/searxng:latest | 8082 | Meta-search aggregating 70+ engines | Tor |

Optional Monitoring (--profile monitoring)

| Container | Image | Port | Purpose |
| --- | --- | --- | --- |
| redis-exporter | oliver006/redis_exporter:v1.58.0 | | Redis metrics for Prometheus |
| prometheus | prom/prometheus:v2.55.0 | 9090 | Metrics collection and storage |
| grafana | grafana/grafana:11.4.0 | 3000 | Visualization dashboard |

3. Initialize Meilisearch Index

# Linux/macOS
chmod +x scripts/init_meilisearch.sh
./scripts/init_meilisearch.sh

# Windows
scripts\init_meilisearch.bat

4. Build and Run (Local Dev)

# Build in release mode with Tor support
cargo build --release --features tor

# Run the release build (without --release, cargo would compile and run a separate debug build)
cargo run --release --features tor

5. Verify Setup

# Check API health
curl http://localhost:9100/health

# Run a test search
curl "http://localhost:9100/search?q=rust+programming"

# Check discovery status
curl http://localhost:9100/discovery/status

6. Run Tests

# Unit and integration tests
cargo test

# Code quality
cargo fmt && cargo clippy --features tor

# Load testing (requires infrastructure running)
locust -f tests/load_test.py --host=http://localhost:9100

Configuration

config.yaml

Key configuration sections:

| Section | Key | Default | Description |
| --- | --- | --- | --- |
| meilisearch | url | http://localhost:7700 | Meilisearch endpoint |
| meilisearch | binary_quantization | true | Enable 8× vector compression |
| meilisearch | semantic_ratio | 0.7 | Hybrid search weight |
| crawler | rate_limit | 10 | Requests per second |
| crawler | max_concurrent | 8 | Parallel fetches |
| crawler | respect_robots | true | Honor robots.txt |
| crawler | static_first | true | Try static HTML before JS |
| crawler | js_whitelist | ["producthunt.com", "news.ycombinator.com"] | Domains requiring JS |
| discovery | interval_secs | 300 | Discovery cycle interval |
| discovery | max_queue_size | 50000 | Redis queue threshold |
| discovery | follow_links | true | Enable link following |
| discovery | link_threshold | 0.6 | Quality score threshold for links |
| discovery.common_crawl_delta | enabled | true | Enable monthly CC delta ingestion |
| discovery.common_crawl_delta | crawl_id | CC-MAIN-2025-08 | CC crawl ID to query |
| discovery.common_crawl_delta | domains | [github.com, ...] | Domains to discover |
| extraction | trafilatura_url | http://localhost:8080 | Trafilatura microservice URL |
| extraction | batch_size | 16 | Extraction batch size |
| inference | model_path | models/all-MiniLM-L6-v2.onnx | ONNX model path |
| inference | embedding_dim | 384 | Embedding dimension |
| inference | fast_intent_threshold | 0.6 | Fast scorer threshold |
| self_improvement | enabled | true | Enable gap-filling pipeline |
| self_improvement | min_gap_score | 0.5 | Queries below this trigger enrichment |
| self_improvement | max_results_per_query | 30 | Max results per external query |
| self_improvement | max_search_rounds | 1 | Max rounds (1 = parallel fan-out) |
| self_improvement | max_parallel_gaps | 3 | Concurrent gap processing tasks |
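
For readers wiring these settings into code, the `self_improvement` section might map onto a plain struct like the one below. This is a minimal sketch: the struct, its `Default` values (copied from the table above), and the `should_enrich` helper are hypothetical and may not match how IntentForge actually loads config.yaml.

```rust
/// Hypothetical mirror of the `self_improvement` section of config.yaml.
#[derive(Debug, Clone, PartialEq)]
struct SelfImprovementConfig {
    enabled: bool,
    min_gap_score: f32,
    max_results_per_query: usize,
    max_search_rounds: usize,
    max_parallel_gaps: usize,
}

impl Default for SelfImprovementConfig {
    /// Defaults taken from the configuration table.
    fn default() -> Self {
        Self {
            enabled: true,
            min_gap_score: 0.5,
            max_results_per_query: 30,
            max_search_rounds: 1,
            max_parallel_gaps: 3,
        }
    }
}

impl SelfImprovementConfig {
    /// True when the pipeline is enabled and the query's gap score is below the threshold.
    fn should_enrich(&self, gap_score: f32) -> bool {
        self.enabled && gap_score < self.min_gap_score
    }
}
```

With the defaults, a query scoring 0.2 would be queued for enrichment, while one scoring exactly 0.5 would not.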

sources.yaml

Configure RSS feeds and domains for autonomous discovery. Includes 150+ verified sources across 20+ categories:

sources:
  # AI & Machine Learning
  - name: "ArXiv AI"
    url: "http://arxiv.org/rss/cs.AI"
    category: "ai"
    priority: 10
  - name: "Hugging Face Blog"
    url: "https://huggingface.co/blog/feed.xml"
    category: "ai"
    priority: 9
  
  # Technology & Programming
  - name: "Hacker News"
    url: "https://hnrss.org/frontpage"
    category: "tech"
    priority: 10
  - name: "GitHub Blog"
    url: "https://github.blog/feed/"
    category: "devtools"
    priority: 9

domains:
  - name: "GitHub"
    base_url: "https://github.com"
    category: "devtools"
    priority: 10
  - name: "ArXiv"
    base_url: "https://arxiv.org"
    category: "ai"
    priority: 10

Tor & Privacy

All search traffic is routed through the Tor network using Snowflake/obfs4 bridges:

  • Snowflake bridges — Routes traffic into Tor through volunteer WebRTC peers, so the connection looks like ordinary residential/consumer traffic instead of a connection to a known Tor relay
  • obfs4 bridges — Proven reliable since 2015, obfuscates traffic as random noise
  • Automatic circuit rotation — Circuits are rotated every 5 minutes for fresh exit IPs
  • Fallback mode — If Snowflake is unavailable, standard Tor with diverse exit nodes is used

This is intended to get around services that block Tor traffic (Google, Bing, etc.) and provides privacy for all search queries.
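
The 5-minute rotation rule above could be tracked with a small timer like the following. This is a sketch under assumptions: `CircuitClock` and `rotate_if_due` are hypothetical names, and the code only decides when a rotation is due. Requesting the new circuit (for example through Tor's control port with the NEWNYM signal) is left out.

```rust
use std::time::{Duration, Instant};

/// Rotation interval stated in this section: 5 minutes.
const ROTATE_EVERY: Duration = Duration::from_secs(5 * 60);

/// Tracks when the current circuit was built (hypothetical helper).
struct CircuitClock {
    built_at: Instant,
}

impl CircuitClock {
    fn new(now: Instant) -> Self {
        Self { built_at: now }
    }

    /// Returns true and resets the clock when the circuit is at least 5 minutes old.
    fn rotate_if_due(&mut self, now: Instant) -> bool {
        if now.duration_since(self.built_at) >= ROTATE_EVERY {
            self.built_at = now;
            true
        } else {
            false
        }
    }
}
```

Passing the current time in as an argument, instead of calling `Instant::now()` inside the method, keeps the rule easy to test.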

Meta-Search Enrichment

When a search query returns fewer than 5 results (including none) or an average relevance below 0.3, the system automatically queues it for background enrichment:

  1. Direct providers (DuckDuckGo, Bing, GitHub, ArXiv, Reddit, GDELT, Invidious, Websurfx) are queried in parallel via Tor with provider-specific timeouts (SearXNG: 8s, DDG/Bing/GDELT: 4s, others: 5-6s)
  2. SearXNG aggregates 70+ engines through Tor (socks5h:// for DNS-through-Tor) in a single call
  3. Results are scored using relative ranking (RRF × 0.4 + alignment × 0.3 + source weight × 0.3)
  4. Top 15 candidates are crawled, embedded via ONNX, and indexed into Meilisearch if they pass the quality gate (6s global timeout, 15-result early-exit target)
  5. Self-improvement uses parallel fan-out (all variations searched concurrently), batch embedding, and batch indexing for ~3-10s convergence
  6. Future queries for the same topic return enriched results

This creates a self-healing index that grows automatically based on user demand, achieving an average of 8/15 quality documents per query.
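
The trigger condition and the relative-ranking step described above can be sketched as follows. This is illustrative only. `needs_enrichment`, `Candidate`, and `top_candidates` are hypothetical names, and the code assumes each of the three signals (RRF, alignment, source weight) has already been normalized to the range [0, 1], which the text does not state.

```rust
/// Trigger rule: fewer than 5 results, or mean relevance below 0.3.
fn needs_enrichment(relevances: &[f32]) -> bool {
    if relevances.len() < 5 {
        return true; // also covers the empty case
    }
    let mean = relevances.iter().sum::<f32>() / relevances.len() as f32;
    mean < 0.3
}

/// A candidate URL with the three signals named in step 3 (assumed to lie in [0, 1]).
#[derive(Debug, Clone)]
struct Candidate {
    url: String,
    rrf: f32,
    alignment: f32,
    source_weight: f32,
}

impl Candidate {
    /// Weighted sum from step 3: RRF × 0.4 + alignment × 0.3 + source weight × 0.3.
    fn score(&self) -> f32 {
        0.4 * self.rrf + 0.3 * self.alignment + 0.3 * self.source_weight
    }
}

/// Sort candidates by descending score and keep the best `k` (15 in step 4).
fn top_candidates(mut cands: Vec<Candidate>, k: usize) -> Vec<Candidate> {
    cands.sort_by(|a, b| b.score().total_cmp(&a.score()));
    cands.truncate(k);
    cands
}
```

Because the ranking is relative, a candidate is kept for scoring higher than the others in the same batch, not for clearing an absolute threshold. The quality gate in step 4 is a separate check that this sketch does not model.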

Core Optimizations

| Area | Optimization | Gain |
| --- | --- | --- |
| Discovery | Autonomous Link Following + Enhanced Discovery | Infinite index expansion |
| Meta-Search | Tor-routed providers + SearXNG (70+ engines), 6s global timeout, provider-specific timeouts | Unblocked access, 3× quality |
| Self-Improvement | Parallel fan-out + batch embed/index + relative ranking + single-pass processing | ~3-10s convergence, 8/15 avg quality |
| Tor Routing | Snowflake/obfs4 bridges + diverse exit nodes | Bypasses IP blocks |
| Vectors | Binary quantization (384 → 48 bytes) | 8× smaller |
| Search | Hybrid (semantic 0.7 + keyword 0.3) + two-stage rerank | 2× relevance |
| Query Expansion | Synonym expansion + intent auto-filters | Broader semantic recall |
| Anti-Signals | commercial_score + spam_score filtering | Noise reduction |
| Intent-Gated Crawling | FastIntentScorer pre-filters URLs before fetching | 30-50% noise reduction |
| Hybrid Extraction | Rust-native (scraper) + Trafilatura fallback | 5× throughput |
| Tiered Storage | High-intent full index, low-intent snippet-only | 40% memory savings |
| Batching | Parallel fetch & inference blocks | 5× throughput |
| High-Throughput Indexing | Cached index handle + fire-and-forget Meilisearch updates | 10× indexing speed |
| Concurrent Discovery | Async channels (mpsc) + JoinSet for parallel search paths | 3× faster search enrichment |
| Dedup | SimHash + RedisBloom (BF.EXISTS/BF.ADD) | <500 MB @ 10M docs |
| Inference | ONNX Runtime (ort crate in Rust) | 4× faster than Python |
| Search Caching | Redis-based SearchResponse caching with Zstd compression | Sub-50ms repeated queries |
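
As an illustration of the Dedup row, the sketch below computes a 64-bit SimHash over whitespace-separated tokens and compares two fingerprints by Hamming distance. It is a textbook, unweighted variant written for this README. The tokenization, the choice of `DefaultHasher`, and the `max_bits` threshold are assumptions, not IntentForge's actual implementation, and the RedisBloom membership check is not shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit SimHash over lowercased, whitespace-separated tokens (unweighted).
fn simhash(text: &str) -> u64 {
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.to_lowercase().hash(&mut h);
        let bits = h.finish();
        // Each token votes +1 or -1 on every bit position.
        for (i, c) in counts.iter_mut().enumerate() {
            if (bits >> i) & 1 == 1 {
                *c += 1;
            } else {
                *c -= 1;
            }
        }
    }
    // A fingerprint bit is set when the votes for that position are positive.
    counts
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &c)| if c > 0 { acc | (1u64 << i) } else { acc })
}

/// Two texts count as near-duplicates when their fingerprints differ in at most `max_bits` bits.
fn is_near_duplicate(a: &str, b: &str, max_bits: u32) -> bool {
    (simhash(a) ^ simhash(b)).count_ones() <= max_bits
}
```

Since every token votes independently, the fingerprint does not depend on token order or letter case, and pages that share most of their tokens tend to land within a few bits of each other.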

HTTP API (Port 9100)

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /search?q=<query> | Hybrid search with query expansion, reranking, and filters |
| GET | /news?q=<query> | News aggregation from 5+ sources |
| GET | /images?q=<query> | Image search (index + Pixabay/Pexels) |
| GET | /videos?q=<query> | Video search across 8 sources |
| GET | /crawl?url=<url> | Crawl a single URL |
| POST | /crawl/batch | Batch crawl multiple URLs |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /discovery/status | Queue size, last cycle, configured domains |
| POST | /discovery/enqueue | Manually enqueue a URL |
| POST | /admin/reindex-scores | Re-index documents to fix legacy scores |

See API_REFERENCE.md for full endpoint documentation.
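
When calling the search endpoint from code, the query has to be URL-encoded first. The helper below is a small std-only sketch that builds the same kind of URL used in the curl example under Verify Setup. `encode_query` and `search_url` are hypothetical helpers, and the base URL assumes a local instance on port 9100.

```rust
/// Base URL for a local instance; port 9100 is the API port documented above.
const BASE: &str = "http://localhost:9100";

/// Percent-encode a value for a URL query string (spaces become '+').
fn encode_query(q: &str) -> String {
    let mut out = String::new();
    for b in q.bytes() {
        match b {
            // Unreserved characters pass through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            b' ' => out.push('+'),
            // Everything else, including each byte of multi-byte UTF-8, becomes %XX.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build the GET /search URL for a query string.
fn search_url(q: &str) -> String {
    format!("{}/search?q={}", BASE, encode_query(q))
}
```

`search_url("rust programming")` yields `http://localhost:9100/search?q=rust+programming`, matching the earlier curl command.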

Project Structure

intentforge/
├── src/                          # Rust core
│   ├── api/                      # HTTP API (Axum)
│   ├── crawler/                  # Web crawler with robots.txt
│   ├── indexer/                  # Meilisearch indexing
│   ├── inference/                # ONNX embedding & reranking
│   ├── discovery/                # Autonomous discovery service
│   ├── meta_search/              # Meta-search framework with Tor routing
│   │   ├── aggregator.rs         # Provider orchestration + RRF scoring
│   │   ├── tor.rs                # Tor daemon management + bridges
│   │   ├── proxy_manager.rs      # Public proxy scraping + validation
│   │   └── providers/            # Individual search providers (DDG, Bing, etc.)
│   ├── common_crawl/             # Common Crawl delta ingestion
│   ├── redis_store/              # Redis client with Bloom filters
│   ├── self_improvement/         # Gap-filling and auto-enrichment
│   ├── trending/                 # RSS feed monitoring
│   ├── sources/                  # Source configuration loader
│   ├── video_discovery/          # Video search (YouTube, Piped, etc.)
│   ├── image_search/             # Image search (Pixabay, Pexels)
│   ├── intent_classifier/        # Intent classification
│   ├── domain_manager/           # Domain-aware scoring
│   ├── query_expansion/          # Query rewriting and expansion
│   ├── anti_detection/           # Anti-detection headers
│   ├── anti_detection_client/    # Anti-detection HTTP client
│   ├── cloudflare_bypass/        # Cloudflare bypass
│   ├── image_indexer/            # Image indexing
│   └── metrics/                  # Prometheus metrics
├── services/
│   ├── query_layer/              # Python FastAPI semantic search + ranking
│   ├── trafilatura/              # Python content extraction
│   ├── searxng/                  # Self-hosted meta-search (70+ engines)
│   └── youtube-unified/          # YouTube search aggregation
├── scripts/                      # Helper scripts
├── tests/                        # Load and accuracy tests
├── docs/                         # Documentation
├── config.yaml                   # Application configuration
├── sources.yaml                  # RSS/discovery sources
└── docker-compose.yml            # Docker orchestration

Technology Stack

| Component | Technology |
| --- | --- |
| Core API | Rust 1.80+ (Axum, Tokio, Reqwest) |
| Semantic Search | Python FastAPI + ONNX Runtime |
| Index | Meilisearch v1.13+ with binary quantization |
| Queue/Cache | Redis Stack 7.2 + RedisBloom |
| Content Extraction | Trafilatura (Python) + Rust scraper (native) |
| Meta-Search | SearXNG (70+ engines) + 8 direct providers (all Tor-routed) |
| Tor Routing | Tor daemon + Snowflake/obfs4 bridges (bypasses exit node blocking) |
| Video Discovery | YouTube Unified, Piped, Invidious, Internet Archive, Vimeo |
| Image Search | Pixabay, Pexels, SearXNG (multiple engines) |
| Embeddings | all-MiniLM-L6-v2 (ONNX, 384-dim → 48-byte binary) |
| Rust ONNX | ort crate for embedding inference |
| Monitoring | Prometheus + Grafana |
| Containerization | Docker + Docker Compose |

Self-Improvement Results

The self-improvement pipeline runs autonomously in the background for queries with low-quality results:

| Quality Tier | Queries | Count |
| --- | --- | --- |
| Perfect 15/15 | simhash, ONNX, kubernetes, backpacking | 4 |
| Near Perfect 10-14 | rust tokio (14), rust lifetime (14), PostgreSQL (12), bonsai (11), distributed consensus (10), WebAssembly SIMD (10), sourdough pizza (10) | 7 |
| Partial 1-9 | zero knowledge (7), EMI PCB (5), meilisearch (4), japanese knife (3), react server (3), terraform (1), reinforcement learning (1) | 7 |
| Failed 0/15 | sourdough starter, home espresso | 2 |

Average quality: 8.25/15 across all 20 test queries (165 of a possible 300 points).
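
The average can be checked directly from the table. The sketch below encodes each score together with the number of queries that achieved it and recomputes the totals. `SCORES` and `summary` are illustrative names, not part of the codebase.

```rust
/// (documents indexed out of 15, number of queries with that score), read off the results table.
const SCORES: &[(u32, u32)] = &[
    (15, 4), // simhash, ONNX, kubernetes, backpacking
    (14, 2), // rust tokio, rust lifetime
    (12, 1), // PostgreSQL
    (11, 1), // bonsai
    (10, 3), // distributed consensus, WebAssembly SIMD, sourdough pizza
    (7, 1),  // zero knowledge
    (5, 1),  // EMI PCB
    (4, 1),  // meilisearch
    (3, 2),  // japanese knife, react server
    (1, 2),  // terraform, reinforcement learning
    (0, 2),  // sourdough starter, home espresso
];

/// Returns (number of queries, total points, mean points per query).
fn summary() -> (u32, u32, f64) {
    let queries: u32 = SCORES.iter().map(|&(_, n)| n).sum();
    let total: u32 = SCORES.iter().map(|&(s, n)| s * n).sum();
    (queries, total, total as f64 / queries as f64)
}
```

With the table's figures this gives 20 queries, 165 points, and a mean of 8.25, which rounds to the 8/15 quoted elsewhere in this document.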

License

This project is licensed under the Intent Engine Community License (IECL) v1.0 - see the LICENSE file for details.

Key Points:

  • ✅ Free for Non-Commercial Purposes (personal, educational, academic, internal evaluation)
  • ❌ Commercial use requires separate Commercial License
  • 📧 Contact: anony45.omnipresent@proton.me for Commercial Licensing

Non-Commercial Purposes include:

  • Personal use
  • Educational purposes
  • Academic research
  • Internal evaluation
  • Open research experimentation

Commercial Use (requires separate license):

  • Selling the Software
  • Offering as a hosted service (SaaS)
  • Integrating into paid products
  • Commercial consulting or client work
  • Any revenue-generating activity

Built with ❤️ by Likhith Sai Seemala