Investigation Report & Long-Term Solution Plan - May 14, 2026

1. Executive Summary

Core search quality has been restored by fixing the ONNX model stubs, but several functional gaps remain in the IntentForge v2 stack. They stem primarily from external providers blocking requests, architectural limitations, and incomplete API specifications. This document outlines the root causes and a plan for long-term resolution.


2. Issues & Root Causes

2.1 /news → 0 Results

  • Symptoms: Queries to /news return empty lists.
  • Investigation:
    • GoogleNewsProvider connects directly, and Google likely blocks these requests because they come from data-center IP ranges.
    • DuckDuckGoNewsProvider and BingNewsProvider route through Tor, whose exit nodes are aggressively blocked by the respective news endpoints.
  • Root Cause: The providers use inconsistent and easily detectable routing strategies (direct or Tor) for real-time news aggregation.

2.2 /videos → 0 Results / CORS Issue

  • Symptoms: Video searches return 0 results; user reports CORS errors.
  • Investigation:
    • youtube-unified (Node.js) lacks CORS middleware.
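    • The main Rust API (Axum) also has no CORS configuration.
    • yt-dlp updates are failing because the auto-update.sh script is reported as "not found" in the Docker container. The likely cause is CRLF line endings: the shebang line then names an interpreter path ending in a carriage return, which does not exist.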
  • Root Cause: Missing cross-origin resource sharing configuration and broken background update mechanisms.

2.3 Non-English Support → 0 Results

  • Symptoms: Queries in languages other than English return no results.
  • Root Cause: The system uses all-MiniLM-L6-v2, an embedding model trained on English data. The discovery pipeline likely filters out non-English content or fails to index it effectively.

2.4 Missing Pagination Metadata

  • Symptoms: SearchResponse does not include page or total_pages.
  • Root Cause: The API response structure was never updated to support full pagination metadata.
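
To make the missing fields concrete, the sketch below derives pagination metadata from a hit count. It is an illustration only: PageInfo, build_page_info, page_offset, and the total_hits input are hypothetical names, and the real SearchResponse may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical pagination metadata for a search response."""
    page: int         # 1-based page number requested by the client
    limit: int        # maximum number of results per page
    total_hits: int   # total matches reported by the backend (assumed available)
    total_pages: int  # derived from total_hits and limit


def build_page_info(page: int, limit: int, total_hits: int) -> PageInfo:
    """Validate the inputs and compute total_pages with integer ceiling division."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if limit < 1:
        raise ValueError("limit must be >= 1")
    if total_hits < 0:
        raise ValueError("total_hits must be >= 0")
    total_pages = (total_hits + limit - 1) // limit
    return PageInfo(page=page, limit=limit, total_hits=total_hits, total_pages=total_pages)


def page_offset(page: int, limit: int) -> int:
    """Offset of the first result on a 1-based page."""
    return (page - 1) * limit
```

With 95 hits and a limit of 10, total_pages is 10; an empty result set yields 0 pages rather than an error.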

2.5 Spam Query Handling

  • Symptoms: Spam-like queries return 0 results.
  • Root Cause: The query rewriter applies a hardcoded threshold (spam_score < 0.5) that cannot be tuned per deployment.
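
A minimal sketch of reading the threshold from configuration instead of hardcoding it. Several things here are assumptions: the spam_filter.threshold key, the SpamFilterConfig and passes_spam_filter names, and the reading that a query is accepted only while spam_score stays below the threshold. The settings argument stands for an already parsed config file.

```python
from dataclasses import dataclass

# Value currently hardcoded in the query rewriter, per this report.
DEFAULT_SPAM_THRESHOLD = 0.5


@dataclass(frozen=True)
class SpamFilterConfig:
    """Hypothetical config section holding the tunable threshold."""
    threshold: float = DEFAULT_SPAM_THRESHOLD

    def __post_init__(self) -> None:
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("spam threshold must be between 0.0 and 1.0")


def load_spam_config(settings: dict) -> SpamFilterConfig:
    """Build the config from a parsed settings dict, falling back to the default.

    The key layout (spam_filter -> threshold) is an assumption for illustration.
    """
    section = settings.get("spam_filter") or {}
    return SpamFilterConfig(threshold=float(section.get("threshold", DEFAULT_SPAM_THRESHOLD)))


def passes_spam_filter(spam_score: float, config: SpamFilterConfig) -> bool:
    """Assumed semantics: a query is accepted only if its score is below the threshold."""
    return spam_score < config.threshold
```

Raising the threshold makes the filter more permissive: a query scored 0.6 is rejected at the default 0.5 but accepted at 0.8.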

3. Long-Term Solution Plan

Phase 1: Proxy & Infrastructure Standardization

  1. Middle-Route Implementation: Update all news providers to use build_middle_route_client (VPN/CF Worker/Public Proxy) instead of Tor or direct connections. This maintains privacy while avoiding Tor exit-node blocks.
  2. Docker Fixes: Convert services/youtube-unified/auto-update.sh to LF line endings and ensure it runs correctly in the Alpine container.
  3. CORS Support:
    • Add the cors package to youtube-unified and enable it.
    • Add a tower-http CorsLayer to the Axum router in src/api/mod.rs so the frontend can call the API cross-origin.
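
The line-ending fix in item 2 can be sketched as a small normalization step. This is one possible approach, not the project's actual tooling; the function names are hypothetical, and the same result can be had with a tool such as dos2unix.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True if the content contains Windows-style CRLF line endings."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving all other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def normalize_script(path: Path) -> bool:
    """Rewrite the file with LF endings; return True if it was changed."""
    original = path.read_bytes()
    fixed = to_lf(original)
    if fixed == original:
        return False
    path.write_bytes(fixed)
    return True
```

Running normalize_script(Path("services/youtube-unified/auto-update.sh")) before the image build would apply the conversion once. To keep the file from regressing on Windows checkouts, a .gitattributes rule such as `*.sh text eol=lf` is a common complement.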

Phase 2: Multilingual Capabilities

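  1. Model Upgrade: Switch the default embedding model to paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions) or BGE-M3 (1024 dimensions). Either choice requires re-embedding existing content, because vectors produced by different models are not comparable. BGE-M3 additionally changes the vector size from the current 384 to 1024.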
  2. Indexing Updates: Modify scripts/ensure_models.py to handle multilingual models and ensure they are correctly loaded by the ONNX inference engine.
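
The consequences of the model switch in item 1 can be sketched as a small planning helper. The dimension values are the published output sizes of these models. MigrationPlan and plan_model_switch are hypothetical names, and the assumption is that the vector store is configured with a fixed dimension.

```python
from dataclasses import dataclass

# Dense embedding sizes of the models named in this plan.
EMBEDDING_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "paraphrase-multilingual-MiniLM-L12-v2": 384,
    "BGE-M3": 1024,
}


@dataclass(frozen=True)
class MigrationPlan:
    """Hypothetical summary of what a model switch requires."""
    reembed: bool        # existing documents must be embedded again
    resize_index: bool   # the vector index must be recreated with a new dimension


def plan_model_switch(current: str, target: str) -> MigrationPlan:
    """Decide what a switch from `current` to `target` entails.

    Vectors from different models live in different spaces, so any change
    of model requires re-embedding, even when the dimension is unchanged.
    Raises KeyError for models missing from EMBEDDING_DIMS.
    """
    if current == target:
        return MigrationPlan(reembed=False, resize_index=False)
    resize = EMBEDDING_DIMS[current] != EMBEDDING_DIMS[target]
    return MigrationPlan(reembed=True, resize_index=resize)
```

Under these assumptions, moving to paraphrase-multilingual-MiniLM-L12-v2 needs a re-embed only, while moving to BGE-M3 also needs the index rebuilt at 1024 dimensions.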

Phase 3: API & Quality Improvements

  1. Pagination:
    • Update SearchResponse struct to include page, limit, and total_pages.
    • Implement pagination logic in the Meilisearch aggregator and meta-search providers.
  2. Configurable Spam Filtering: Move the spam_score threshold to config.yaml to allow site administrators to tune the aggressiveness of the filter.
  3. RSS Robustness: Enhance GoogleNewsProvider with multiple fallback RSS feeds and a more resilient parsing strategy.
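
The fallback idea in item 3 can be sketched as follows. This is an illustrative pattern rather than the provider's actual code: the function names are hypothetical, the fetch callable is injected so any HTTP client (including one built on the middle route) could be plugged in, and the feed URLs are left to the caller.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List


def parse_rss_items(xml_text: str) -> List[Dict[str, str]]:
    """Extract title/link pairs from RSS text, tolerating malformed input.

    Returns an empty list if the XML cannot be parsed, and skips items
    that lack a title or a link instead of failing the whole feed.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            items.append({"title": title, "link": link})
    return items


def first_working_feed(
    feed_urls: Iterable[str],
    fetch: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Try each feed in order and return the items of the first usable one."""
    for url in feed_urls:
        try:
            body = fetch(url)
        except Exception:
            # Network error or block: move on to the next feed.
            continue
        items = parse_rss_items(body)
        if items:
            return items
    return []
```

Returning an empty list when every feed fails keeps the caller's contract simple, though a production version would likely also log which feeds were skipped and why.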

4. Immediate Next Steps (Proposed)

  1. Fix line endings in auto-update.sh and rebuild youtube-unified.
  2. Add CORS headers to both the Rust and Node.js APIs.
  3. Update scripts/ensure_models.py to check model file sizes so that stub files are detected and cannot block the real models.
  4. Refactor GoogleNewsProvider to use the middle route.
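
The file-size check in step 3 could look roughly like this. It is a sketch under assumptions: the 1 MB floor is a placeholder rather than a measured value, and classify_model_size and check_model are hypothetical names, not functions that exist in scripts/ensure_models.py today.

```python
from pathlib import Path
from typing import Optional

# Placeholder floor: real ONNX embedding models are far larger than a stub file.
MIN_MODEL_BYTES = 1_000_000


def classify_model_size(size_bytes: Optional[int], min_bytes: int = MIN_MODEL_BYTES) -> str:
    """Classify a model file by size: 'missing', 'stub', or 'ok'."""
    if size_bytes is None:
        return "missing"
    if size_bytes < min_bytes:
        return "stub"
    return "ok"


def check_model(path: Path, min_bytes: int = MIN_MODEL_BYTES) -> str:
    """Stat the file on disk and classify it; a non-file path counts as missing."""
    size = path.stat().st_size if path.is_file() else None
    return classify_model_size(size, min_bytes)
```

A caller would re-download any model whose status is not "ok", so an undersized placeholder no longer satisfies a plain existence check.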