IntentForge v2

Privacy-first, intent-driven search engine with zero-trust architecture.

IntentForge is a high-performance discovery platform that combines BM25 keyword matching with ONNX-powered semantic search to deliver intent-aligned results. Built with privacy at its core: no tracking, no data harvesting, no corporate dependencies.


Current Capabilities

πŸ” Core Search

  • Hybrid Search: Combines BM25 keyword matching with semantic vector search (ONNX embeddings)
  • Intent Classification: Detects Navigational, Informational, Transactional, and Exploratory queries with adaptive semantic ratios
  • Multi-tier Caching: L1 in-memory cache + Redis for sub-millisecond repeated queries
  • Meta-search Aggregation: Simultaneously queries 9+ providers (Brave, Google, Wikipedia, Arxiv, Reddit, GitHub, Hacker News, Medium, DuckDuckGo)
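
The interaction between intent classification and the adaptive semantic ratio can be sketched as follows. This is a minimal illustration, not the project's classifier: the cue-word sets, the ratio values, and the names `classify_intent` and `blend` are all assumptions, and both input scores are assumed to be normalized to the range 0 to 1.

```python
from enum import Enum


class Intent(Enum):
    NAVIGATIONAL = "navigational"
    INFORMATIONAL = "informational"
    TRANSACTIONAL = "transactional"
    EXPLORATORY = "exploratory"


# Hypothetical ratios: 0.0 is pure keyword (BM25), 1.0 is pure semantic.
SEMANTIC_RATIO = {
    Intent.NAVIGATIONAL: 0.2,
    Intent.TRANSACTIONAL: 0.4,
    Intent.INFORMATIONAL: 0.6,
    Intent.EXPLORATORY: 0.8,
}

# Hypothetical cue words; a real classifier would use far richer signals.
TRANSACTIONAL_CUES = {"buy", "price", "download", "order", "cheap"}
INFORMATIONAL_CUES = {"how", "what", "why", "when", "who"}


def classify_intent(query: str) -> Intent:
    """Toy rule-based intent detection over whitespace tokens."""
    tokens = query.lower().split()
    if any(t in TRANSACTIONAL_CUES for t in tokens):
        return Intent.TRANSACTIONAL
    # A token with an interior dot (e.g. "github.com") looks like a domain.
    if any("." in t.strip(".") for t in tokens):
        return Intent.NAVIGATIONAL
    if tokens and tokens[0] in INFORMATIONAL_CUES:
        return Intent.INFORMATIONAL
    return Intent.EXPLORATORY


def blend(bm25_score: float, semantic_score: float, intent: Intent) -> float:
    """Linear blend of keyword and semantic scores, weighted by intent."""
    ratio = SEMANTIC_RATIO[intent]
    return (1.0 - ratio) * bm25_score + ratio * semantic_score


if __name__ == "__main__":
    intent = classify_intent("how does bm25 work")
    print(intent.value, blend(0.9, 0.4, intent))
```

Under these assumed ratios, navigational queries lean on exact keyword matches, while exploratory queries lean on embeddings.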

πŸ–ΌοΈ Image Search

  • Zero-bandwidth indexing: Extracts context from HTML metadata (alt, title, captions, surrounding text) without downloading images
  • Perceptual hashing: dHash (64-bit) for deduplication
  • Visual fingerprints: ThumbHash for compact (~20-30 byte) thumbnail representation
  • Quality gates: Filters tracking pixels, SVGs, and 1x1 placeholders
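
For readers unfamiliar with dHash, the sketch below shows the standard difference-hash idea in pure Python. It assumes the image has already been decoded and downscaled to a grayscale grid of 8 rows by 9 columns; the function names are illustrative and not taken from the IntentForge codebase.

```python
def dhash(gray: list[list[int]]) -> int:
    """Compute a 64-bit difference hash from an 8-row by 9-column grid.

    Each bit records whether a pixel is brighter than its right-hand
    neighbor, giving 8 comparisons per row across 8 rows.
    """
    if len(gray) != 8 or any(len(row) != 9 for row in gray):
        raise ValueError("expected 8 rows of 9 grayscale values")
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two images are usually treated as near-duplicates when the Hamming distance between their hashes falls below a small threshold. The exact cutoff is a tuning choice and is not specified here.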

📹 Video Search

  • Multi-source discovery: YouTube Unified, Piped, Invidious, Dailymotion, Vimeo, Internet Archive
  • Intent-first scoring: Ranks by intent match, relevance, and quality signals
  • Privacy-friendly frontends: Prefers Piped/Invidious over direct YouTube API

πŸ›‘οΈ Privacy & Security

  • Anti-detection: TLS fingerprinting, randomized timing, viewport spoofing to bypass AI-generated content detection
  • Tor integration: Optional route through Tor for anonymity
  • No tracking: No cookies, no user profiling, no data retention
  • Cloudflare bypass: Automated CAPTCHA solving for accessibility

📚 Content Extraction

  • Trafilatura integration: Boilerplate removal, readable text extraction
  • Multi-format support: Articles, documentation, technical content
  • RSS firehose: 80+ curated sources for continuous content discovery
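
As a rough idea of what one firehose step involves, the following sketch pulls item titles and links out of an RSS 2.0 document using only the Python standard library. The function name and the returned dictionary shape are assumptions for illustration; the actual firehose implementation is not shown in this README.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text: str) -> list[dict[str, str]]:
    """Return title/link pairs for every <item> that has a link."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # items without a link cannot be crawled
            items.append({"title": title, "link": link})
    return items
```

The discovered links would then be handed to the extraction step. Because feeds are untrusted input, a hardened parser such as `defusedxml` is worth considering in place of the standard-library one.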

🚀 Performance

  • Cross-encoder reranking: ms-marco-MiniLM-L6-v2 for precision reordering
  • Adaptive semantic ratio: Query-type-aware blend of keyword vs semantic search
  • Self-improvement: Automatic gap analysis triggers background crawling for weak results
  • Common Crawl integration: Massive URL discovery from web archives
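
The self-improvement loop can be pictured as a simple check on result quality. The sketch below is a guess at the general shape only: the thresholds, the in-process queue, and the names `is_weak` and `record_gap` are hypothetical, and scores are assumed to lie between 0 and 1.

```python
from collections import deque

MIN_RESULTS = 5      # hypothetical: fewer hits than this counts as a gap
MIN_TOP_SCORE = 0.5  # hypothetical: best hit below this counts as a gap

# Queries waiting for a background crawl (stand-in for a real job queue).
crawl_queue = deque()


def is_weak(scores: list[float]) -> bool:
    """A result set is weak if it is too small or its best score is low."""
    return len(scores) < MIN_RESULTS or max(scores) < MIN_TOP_SCORE


def record_gap(query: str, scores: list[float]) -> bool:
    """Queue the query for crawling when its results are weak."""
    if not is_weak(scores):
        return False
    if query not in crawl_queue:  # avoid queuing the same query twice
        crawl_queue.append(query)
    return True
```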

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     IntentForge Core (Rust)                     │
├──────────────┬──────────────┬───────────────┬───────────────────┤
│ Intent       │   Ranking    │  Discovery    │  Anti-Detection   │
│ Classifier   │   Engine     │  (Firehose)   │  + Cloudflare     │
├──────────────┴──────────────┴───────────────┴───────────────────┤
│                   Meilisearch Index (Hybrid)                    │
├─────────────────────────────────────────────────────────────────┤
│                 Query Layer (Python + FastAPI)                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  ONNX Embed  │  │ Cross-Encoder│  │  Multi-tier Cache    │   │
│  │  (all-MiniLM)│  │  Reranker    │  │  L1 + Redis          │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│                     Meta-Search API (Rust)                      │
│    Brave · Google · Wikipedia · Arxiv · Reddit · GitHub · HN    │
├─────────────────────────────────────────────────────────────────┤
│                    Video Services (Node.js)                     │
│               YouTube Unified · Piped · Invidious               │
└─────────────────────────────────────────────────────────────────┘

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /search | GET | Hybrid intent-first search |
| /news | GET | News aggregation with geo-targeting |
| /images | GET | Semantic image discovery |
| /videos | GET | Intent-weighted video search |
| /meta | GET | Meta-search across all providers |
| /crawl | GET | Single URL content extraction |
| /crawl/batch | POST | Batch URL extraction |
| /health | GET | Service health check |
| /metrics | GET | Prometheus metrics |
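
A client call against `/search` might look like the sketch below. The port comes from the Getting Started section; the host, the query parameter names `q` and `limit`, and the JSON response format are assumptions, since this README does not document request or response shapes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Port 9100 is from Getting Started; localhost is an assumption.
BASE_URL = "http://localhost:9100"


def build_search_url(query: str, limit: int = 10, base_url: str = BASE_URL) -> str:
    """Build a /search URL. The parameter names q and limit are hypothetical."""
    params = urlencode({"q": query, "limit": limit})
    return f"{base_url}/search?{params}"


def search(query: str, limit: int = 10):
    """Fetch results, assuming the endpoint responds with JSON."""
    with urlopen(build_search_url(query, limit), timeout=10) as resp:
        return json.load(resp)
```

Check the service's actual parameter names before relying on this; only the endpoint path and HTTP method are taken from the table above.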

Stack

| Component | Technology |
|---|---|
| Core Engine | Rust (Edition 2021) |
| Semantic Search | ONNX Runtime (all-MiniLM-L6-v2) |
| Cross-Encoder | ms-marco-MiniLM-L6-v2 |
| Search Index | Meilisearch |
| Cache | Redis (Tier-2) + In-memory LRU (Tier-1) |
| Query Layer | Python + FastAPI |
| Meta-Search | Rust (9+ providers) |
| Video Discovery | Node.js (YouTube Unified) |
| Content Extraction | Trafilatura (Python) |
| Orchestration | Docker Compose |
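
The two cache tiers listed above typically cooperate as a read-through pair: look in the in-memory LRU first, fall back to the shared store, and promote hits back into memory. The sketch below illustrates that pattern with a plain dictionary standing in for Redis. The class name and capacity default are assumptions, not the project's implementation.

```python
from collections import OrderedDict
from typing import MutableMapping, Optional


class TieredCache:
    """Tier-1 in-memory LRU in front of a Tier-2 key-value store."""

    def __init__(self, l2: MutableMapping[str, str], l1_capacity: int = 1024):
        self._l1: "OrderedDict[str, str]" = OrderedDict()
        self._l2 = l2  # stand-in for Redis in this sketch
        self._capacity = l1_capacity

    def get(self, key: str) -> Optional[str]:
        if key in self._l1:
            self._l1.move_to_end(key)  # mark as most recently used
            return self._l1[key]
        value = self._l2.get(key)
        if value is not None:
            self._put_l1(key, value)  # promote Tier-2 hit into Tier-1
        return value

    def set(self, key: str, value: str) -> None:
        self._l2[key] = value  # write through to Tier-2
        self._put_l1(key, value)

    def _put_l1(self, key: str, value: str) -> None:
        self._l1[key] = value
        self._l1.move_to_end(key)
        while len(self._l1) > self._capacity:
            self._l1.popitem(last=False)  # evict least recently used
```

With a real Redis client, the dictionary operations on `_l2` would become network calls, and entries would normally carry an expiry so that stale results age out.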

Privacy Principles

  1. Zero data retention: No search logs, no analytics, no cookies
  2. No corporate dependencies: All sources are open or privacy-respecting
  3. Encrypted transport: HTTPS everywhere, Tor optional
  4. No AI content detection fingerprinting: Built-in bypass
  5. Open source: Full transparency on all components

Getting Started

# Full release build
cargo build --release

# Run API server (port 9100)
cargo run --release

# Docker full stack
docker-compose -f docker-compose.dev.yml up -d --build

Project Status

IntentForge v2 is operational and actively developed. The core search, image indexing, and video discovery systems are functional. Ongoing work focuses on latency optimization, expanded source coverage, and enhanced personalization.

See docs/ROADMAP.md for planned improvements.