Crawl4AI SearXNG
A Docker-based MCP server (served over HTTP/SSE) that provides web search, crawling, and intelligent RAG capabilities.
Web Crawling, Search and RAG Capabilities for AI Agents and AI Coding Assistants
Forked from https://github.com/coleam00/mcp-crawl4ai-rag, with added SearXNG integration and batch scraping and processing capabilities.
A self-contained Docker solution that combines the Model Context Protocol (MCP), Crawl4AI, SearXNG, and Supabase to provide AI agents and coding assistants with complete web search, crawling, and RAG capabilities.
🚀 Complete Stack in One Command: Deploy everything with docker compose up -d - no Python setup, no dependencies, no external services required.
Unlike traditional scraping (such as Firecrawl) that dumps raw content and overwhelms LLM context windows, this solution uses intelligent RAG (Retrieval Augmented Generation) to:
Flexible Output Options:
This Docker-based MCP server provides a complete web intelligence stack that enables AI agents to:
Advanced RAG Strategies Available:
See the Configuration section below for details on how to enable and configure these strategies.
The server provides essential web crawling and search tools:
- scrape_urls: Scrape one or more URLs and store their content in the vector database. Supports both single URLs and lists of URLs for batch processing.
- smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
- get_available_sources: Get a list of all available sources (domains) in the database
- perform_rag_query: Search for relevant content using semantic search with optional source filtering
- search: Comprehensive web search tool that integrates SearXNG search with automated scraping and RAG processing. Performs a complete workflow: (1) searches SearXNG with the provided query, (2) extracts URLs from the search results, (3) automatically scrapes all found URLs using the existing scraping infrastructure, (4) stores content in the vector database, and (5) returns either RAG-processed results organized by URL or raw markdown content. Key parameters: query (search terms), return_raw_markdown (bypasses RAG for raw content), num_results (search result limit), batch_size (database operation batching), max_concurrent (parallel scraping sessions). Ideal for research workflows, competitive analysis, and content discovery.
- search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.

Knowledge graph tools (requires USE_KNOWLEDGE_GRAPH=true, see below):

- parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships for hallucination detection
- check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports, method calls, and class usage against the knowledge graph
- query_knowledge_graph: Explore and query the Neo4j knowledge graph with commands like repos, classes, methods, and custom Cypher queries

Required:
Optional:
This is a Docker-only solution - no Python environment setup required!
Clone this repository:
```bash
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
```
Configure environment:
```bash
cp .env.example .env
# Edit .env with your API keys (see Configuration section below)
```
Deploy the complete stack:
docker compose up -d
That's it! Your complete search, crawl, and RAG stack is now running:
The Docker Compose stack includes:
Before running the server, you need to set up the database with the pgvector extension:
Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary)
Create a new query and paste the contents of crawled_pages.sql
Run the query to create the necessary tables and functions
To enable AI hallucination detection and repository analysis features, you need to set up Neo4j.
Note: The knowledge graph functionality works fully with Docker and supports all features.
Option 1: Local AI Package (Recommended)
The easiest way to get Neo4j running is with the Local AI Package:
Clone and start Neo4j:
```bash
git clone https://github.com/coleam00/local-ai-packaged.git
cd local-ai-packaged
# Follow repository instructions to start Neo4j with Docker Compose
```
Connection details for Docker:
- bolt://host.docker.internal:7687 (for Docker containers)
- bolt://localhost:7687 (for local access)
- Username: neo4j

Option 2: Neo4j Docker
Run Neo4j directly with Docker:
```bash
docker run -d \
  --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/your-password \
  neo4j:latest
```
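Which bolt URI to use depends on where the connecting process runs: inside another container, `localhost` refers to that container itself, so the host's Neo4j must be reached via host.docker.internal. A small sketch of that choice, with an optional connectivity check using the official neo4j Python driver (the password is the one set via NEO4J_AUTH above):

```python
# Sketch: pick the right bolt URI depending on where the caller runs.
def bolt_uri(from_docker_container: bool) -> str:
    # Inside a container, "localhost" is the container itself, so the
    # host machine's Neo4j must be reached via host.docker.internal.
    host = "host.docker.internal" if from_docker_container else "localhost"
    return f"bolt://{host}:7687"

print(bolt_uri(True))   # bolt://host.docker.internal:7687
print(bolt_uri(False))  # bolt://localhost:7687

# Connectivity check (requires `pip install neo4j` and a running instance):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(bolt_uri(False), auth=("neo4j", "your-password"))
# driver.verify_connectivity()
```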
Option 3: Neo4j Desktop
Use Neo4j Desktop for a local GUI-based installation:
- bolt://host.docker.internal:7687 (for Docker containers)
- bolt://localhost:7687 (for local access)
- Username: neo4j

Configure the Docker stack by editing your .env file (copy from .env.example):
```bash
# ========================================
# MCP SERVER CONFIGURATION
# ========================================
TRANSPORT=sse
HOST=0.0.0.0
PORT=8051

# ========================================
# INTEGRATED SEARXNG CONFIGURATION
# ========================================
# Pre-configured for Docker Compose - SearXNG runs internally
SEARXNG_URL=http://searxng:8080
SEARXNG_USER_AGENT=MCP-Crawl4AI-RAG-Server/1.0
SEARXNG_DEFAULT_ENGINES=google,bing,duckduckgo
SEARXNG_TIMEOUT=30

# Optional: Custom domain for production HTTPS
SEARXNG_HOSTNAME=http://localhost
# [email protected]  # For Let's Encrypt

# ========================================
# AI SERVICES CONFIGURATION
# ========================================
# Required: OpenAI API for embeddings
OPENAI_API_KEY=your_openai_api_key

# LLM for summaries and contextual embeddings
MODEL_CHOICE=gpt-4.1-nano-2025-04-14

# Required: Supabase for vector database
SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_KEY=your_supabase_service_key

# ========================================
# RAG ENHANCEMENT STRATEGIES
# ========================================
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=false
USE_AGENTIC_RAG=false
USE_RERANKING=false
USE_KNOWLEDGE_GRAPH=false

# Optional: Neo4j for knowledge graph (if USE_KNOWLEDGE_GRAPH=true)
# Use host.docker.internal:7687 for Docker Desktop on Windows/Mac
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_neo4j_password
```
🔍 SearXNG Integration: The stack includes a pre-configured SearXNG instance that runs automatically. No external setup required!
🐳 Docker Networking: The default configuration uses Docker internal networking (http://searxng:8080) which works out of the box.
🔐 Production Setup: For production, set SEARXNG_HOSTNAME to your domain and SEARXNG_TLS to your email for automatic HTTPS.
The Crawl4AI RAG MCP server supports five RAG strategies that can be enabled independently:
When enabled, this strategy enhances each chunk's embedding with additional context from the entire document. The system passes both the full document and the specific chunk to an LLM (configured via MODEL_CHOICE) to generate enriched context that gets embedded alongside the chunk content.
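The pattern can be sketched as follows. The prompt wording below is illustrative only; the actual prompt used by the server may differ:

```python
# Sketch of contextual-embedding input construction (illustrative only;
# the server's actual prompt may differ). The LLM's answer is prepended
# to the chunk text before the embedding is computed.
def build_contextual_input(full_document: str, chunk: str, max_doc_chars: int = 4000) -> str:
    """Build an LLM prompt that situates `chunk` within `full_document`."""
    return (
        "<document>\n"
        f"{full_document[:max_doc_chars]}\n"
        "</document>\n"
        "Here is the chunk we want to situate within the whole document:\n"
        "<chunk>\n"
        f"{chunk}\n"
        "</chunk>\n"
        "Give a short context to situate this chunk within the overall "
        "document for the purposes of improving search retrieval."
    )

prompt = build_contextual_input(
    "Full documentation page about the scrape_urls tool ...",
    "scrape_urls stores content in the vector database ...",
)
print(prompt)
```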
Combines traditional keyword search with semantic vector search to provide more comprehensive results. The system performs both searches in parallel and intelligently merges results, prioritizing documents that appear in both result sets.
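The merge step can be sketched like this (a simplified illustration of the "prioritize documents in both result sets" rule, not the server's exact implementation):

```python
# Sketch of hybrid-search merging: results that appear in BOTH the
# semantic and keyword result lists are promoted to the top, then the
# remaining results follow in their original order (illustrative only).
def merge_hybrid(semantic_ids, keyword_ids, limit=10):
    keyword_set = set(keyword_ids)
    in_both = [d for d in semantic_ids if d in keyword_set]
    both_set = set(in_both)
    seen, rest = set(), []
    for d in semantic_ids + keyword_ids:
        if d not in both_set and d not in seen:
            seen.add(d)
            rest.append(d)
    return (in_both + rest)[:limit]

print(merge_hybrid(["a", "b", "c"], ["c", "d"]))  # ['c', 'a', 'b', 'd']
```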
Enables specialized code example extraction and storage. When crawling documentation, the system identifies code blocks (≥300 characters), extracts them with surrounding context, generates summaries, and stores them in a separate vector database table specifically designed for code search.
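The length filter described above can be sketched as follows (a simplified illustration; the server's extraction logic may differ in detail):

```python
import re

# Sketch: pull fenced code blocks out of markdown and keep only those
# with >= 300 characters, along with some preceding context.
MIN_CODE_LEN = 300
FENCE = "`" * 3  # triple backtick, built indirectly so it nests safely here

PATTERN = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)

def extract_code_examples(markdown: str, context_chars: int = 200):
    examples = []
    for m in PATTERN.finditer(markdown):
        code = m.group(1)
        if len(code) >= MIN_CODE_LEN:
            start = max(0, m.start() - context_chars)
            examples.append({
                "code": code,
                "context_before": markdown[start:m.start()],
            })
    return examples

# A 360-character code block passes the filter; short blocks are skipped.
doc = "Intro text.\n" + FENCE + "python\n" + "x = 1\n" * 60 + FENCE + "\n"
print(len(extract_code_examples(doc)))  # 1
```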
This strategy adds a search_code_examples tool that AI agents can use to find specific code implementations.

Reranking applies a cross-encoder to search results after initial retrieval: a lightweight model (cross-encoder/ms-marco-MiniLM-L-6-v2) scores each result against the original query, and results are then reordered by relevance.
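The reordering step can be sketched as below. A stub word-overlap scorer keeps the example self-contained; a real deployment would instead call sentence_transformers.CrossEncoder with the model named above and use its predicted relevance scores:

```python
# Sketch of the rerank step: score each (query, result) pair, then sort
# results by score, highest first.
def rerank(query: str, results: list, score_fn) -> list:
    scored = [(score_fn(query, doc), doc) for doc in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

# Stub scorer: word overlap between query and document. A real
# cross-encoder jointly encodes the pair and outputs a relevance score.
def overlap_score(query: str, doc: str) -> int:
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = [
    "docker compose basics",
    "vector search with pgvector",
    "docker compose up and logs",
]
print(rerank("docker compose logs", docs, overlap_score))
# ['docker compose up and logs', 'docker compose basics', 'vector search with pgvector']
```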
Enables AI hallucination detection and repository analysis using Neo4j knowledge graphs. When enabled, the system can parse GitHub repositories into a graph database and validate AI-generated code against real repository structures. Fully compatible with Docker - all functionality works within the containerized environment.
This enables three tools: parse_github_repository for indexing codebases, check_ai_script_hallucinations for validating AI-generated code, and query_knowledge_graph for exploring indexed repositories.

Usage with MCP Tools:
You can tell the AI coding assistant to add a Python GitHub repository to the knowledge graph:
"Add https://github.com/pydantic/pydantic-ai.git to the knowledge graph"
Make sure the repo URL ends with .git.
You can also have the AI coding assistant check for hallucinations with scripts it creates using the MCP check_ai_script_hallucinations tool.
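Hallucination checking rests on an AST pass over the script. A minimal sketch of the kind of analysis involved, using Python's standard ast module to collect the attribute-style method calls that would then be validated against the knowledge graph (this is an illustration, not the server's actual analyzer):

```python
import ast

# Collect attribute-style calls (obj.method(...)) from a script so each
# method name could later be checked against the knowledge graph.
def extract_method_calls(source: str) -> list:
    calls = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            calls.append(node.func.attr)
    return calls

script = """
import pydantic
m = pydantic.BaseModel()
m.model_dump()
m.totally_made_up_method()
"""
print(extract_method_calls(script))
```

A validator would then look up each collected name in Neo4j and flag `totally_made_up_method` as absent from the indexed repository.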
For general documentation RAG:
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=true
For AI coding assistant with code examples:
USE_CONTEXTUAL_EMBEDDINGS=true
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true
USE_KNOWLEDGE_GRAPH=false
For AI coding assistant with hallucination detection:
USE_CONTEXTUAL_EMBEDDINGS=true
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true
USE_KNOWLEDGE_GRAPH=true
For fast, basic RAG:
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=false
USE_KNOWLEDGE_GRAPH=false
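Some of these flags depend on other settings (for example, USE_KNOWLEDGE_GRAPH needs the Neo4j credentials). A small illustrative sanity check one could run over a .env file; the helper and its rules are a sketch, not part of the server:

```python
# Illustrative .env sanity check (not part of the server): parse simple
# KEY=value lines and enforce cross-flag requirements.
def check_env(env_text: str) -> list:
    env = {}
    for line in env_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()

    problems = []
    if env.get("USE_KNOWLEDGE_GRAPH") == "true":
        for key in ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"):
            if not env.get(key):
                problems.append(f"USE_KNOWLEDGE_GRAPH=true but {key} is missing")
    if env.get("USE_CONTEXTUAL_EMBEDDINGS") == "true" and not env.get("MODEL_CHOICE"):
        problems.append("USE_CONTEXTUAL_EMBEDDINGS=true but MODEL_CHOICE is missing")
    return problems

# Missing NEO4J_USER and NEO4J_PASSWORD -> two problems reported.
sample = "USE_KNOWLEDGE_GRAPH=true\nNEO4J_URI=bolt://localhost:7687\n"
print(check_env(sample))
```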
The complete stack is managed through Docker Compose:
docker compose up -d
```bash
# All services
docker compose logs -f

# Specific service
docker compose logs -f mcp-crawl4ai
docker compose logs -f searxng
```
docker compose down
```bash
# Restart all
docker compose restart

# Restart specific service
docker compose restart mcp-crawl4ai
```
The MCP server will be available at http://localhost:8051 for SSE connections.
After starting the Docker stack with docker compose up -d, your MCP server will be available for integration.
The Docker stack runs with SSE transport by default. Connect using:
Claude Desktop/Windsurf:
```json
{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}
```
Windsurf (alternative syntax):
```json
{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "serverUrl": "http://localhost:8051/sse"
    }
  }
}
```
Claude Code CLI:
claude mcp add-json crawl4ai-rag '{"type":"http","url":"http://localhost:8051/sse"}' --scope user
- Local connections: http://localhost:8051/sse
- From other Docker containers: http://host.docker.internal:8051/sse
- Remote clients: replace localhost with your server's IP address

For production use with custom domains:
Update your .env:
```bash
SEARXNG_HOSTNAME=https://yourdomain.com
SEARXNG_TLS=[email protected]
```
Access via HTTPS:
https://yourdomain.com:8051/sse
Verify the server is running:
curl http://localhost:8051/health
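Beyond the health check, note that every MCP client ultimately issues JSON-RPC tools/call requests over the SSE session. A sketch of the request body for the search tool, with parameter names taken from the tool list earlier in this README and the envelope following the MCP JSON-RPC format:

```python
import json

# Sketch of an MCP "tools/call" request body for the `search` tool.
# Argument names (query, return_raw_markdown, num_results) come from the
# tool description above; the envelope follows the MCP JSON-RPC format.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search",
        "arguments": {
            "query": "crawl4ai documentation",
            "return_raw_markdown": False,
            "num_results": 5,
        },
    },
}

payload = json.dumps(request)
print(payload)
```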
The knowledge graph system stores repository code structure in Neo4j with the following components:
Core components (in the knowledge_graphs/ folder):

- parse_repo_into_neo4j.py: Clones and analyzes GitHub repositories, extracting Python classes, methods, functions, and imports into Neo4j nodes and relationships
- ai_script_analyzer.py: Parses Python scripts using AST to extract imports, class instantiations, method calls, and function usage
- knowledge_graph_validator.py: Validates AI-generated code against the knowledge graph to detect hallucinations (non-existent methods, incorrect parameters, etc.)
- hallucination_reporter.py: Generates comprehensive reports about detected hallucinations with confidence scores and recommendations
- query_knowledge_graph.py: Interactive CLI tool for exploring the knowledge graph (functionality now integrated into the MCP tools)

The Neo4j database stores code structure as:
Nodes:
- Repository: GitHub repositories
- File: Python files within repositories
- Class: Python classes with methods and attributes
- Method: Class methods with parameter information
- Function: Standalone functions
- Attribute: Class attributes

Relationships:
- Repository -[:CONTAINS]-> File
- File -[:DEFINES]-> Class
- File -[:DEFINES]-> Function
- Class -[:HAS_METHOD]-> Method
- Class -[:HAS_ATTRIBUTE]-> Attribute

Typical workflow:

- Use the parse_github_repository tool to clone and analyze open-source repositories
- Use the check_ai_script_hallucinations tool to validate AI-generated Python scripts
- Use the query_knowledge_graph tool to explore available repositories, classes, and methods

Container won't start:
```bash
# Check logs for specific errors
docker compose logs mcp-crawl4ai

# Verify configuration is valid
docker compose config

# Restart problematic services
docker compose restart mcp-crawl4ai
```
SearXNG not accessible:
```bash
# Check if SearXNG is running
docker compose logs searxng

# Verify internal networking
docker compose exec mcp-crawl4ai curl http://searxng:8080
```
Port conflicts:
```bash
# Check what's using the ports
netstat -tulpn | grep -E ":(8051|8080)"
```

```yaml
# Change ports in docker-compose.yml if needed
ports:
  - "8052:8051"  # Changed from 8051:8051
```
Environment variables not loading:
- Ensure the .env file is in the same directory as docker-compose.yml
- Avoid spaces around = in the .env file

API connection failures:
- Verify OPENAI_API_KEY is valid and has credits
- Verify SUPABASE_URL and SUPABASE_SERVICE_KEY are correct

```bash
docker compose exec mcp-crawl4ai curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models
```
Neo4j connection issues:
- Use host.docker.internal:7687 instead of localhost:7687 when Neo4j runs on the host

Memory usage:
```bash
# Monitor resource usage
docker stats
```

```yaml
# Adjust memory limits in docker-compose.yml
deploy:
  resources:
    limits:
      memory: 2G
```
Disk space:
```bash
# Clean up Docker
docker system prune -a

# Check volume usage
docker volume ls
```
- Check the logs: docker compose logs -f
- Validate the configuration: docker compose config
- Test connectivity with the curl commands shown above
- Full reset: docker compose down -v && docker compose up -d

This Docker stack provides a foundation for building more complex MCP servers:
- Modify the code in src/ and rebuild: docker compose build mcp-crawl4ai
- Add new MCP tools in src/crawl4ai_mcp.py with @mcp.tool() decorators
- Customize search engines in searxng/settings.yml and restart
- Extend docker-compose.yml with additional containers