# DataPizza MCP Server

A Model Context Protocol (MCP) server that provides intelligent access to datapizza-ai documentation through vector similarity search and retrieval-augmented generation (RAG).
This MCP server enables AI assistants and applications to query the comprehensive datapizza-ai documentation using natural language queries. It indexes documentation from the datapizza-ai repository and provides contextual, relevant responses through a RAG (Retrieval-Augmented Generation) pipeline.
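The query path relies on vector similarity: documentation chunks and the incoming query are both embedded, and the chunks closest to the query vector are returned as context. A minimal sketch of that ranking step, with toy 3-dimensional vectors standing in for the real 1536-dimensional OpenAI embeddings (in the actual server, Qdrant performs this search server-side):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=5):
    """Return the texts of the k chunks whose vectors are closest to the query.
    `chunks` is a list of (text, vector) pairs."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy corpus with made-up embedding vectors.
docs = [
    ("agents guide", [0.9, 0.1, 0.0]),
    ("install notes", [0.0, 0.2, 0.9]),
    ("embedding docs", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], docs, k=2))  # → ['agents guide', 'embedding docs']
```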
The server consists of four main components:

- `config.py` — configuration management
- `server.py` — MCP server implementation
- `indexer.py` — documentation indexing
- `retriever.py` — RAG retrieval engine

## Installation

```bash
git clone https://github.com/datapizza-labs/mcp_server_datapizza.git
cd datapizza-mcp-server
pip install -e ".[dev]"
```
## Configuration

Create a `.env` file in the `datapizza-mcp-server` directory with the following variables:

```env
# Required Configuration
OPENAI_API_KEY=your_openai_api_key_here
QDRANT_URL=your_qdrant_cloud_url
QDRANT_API_KEY=your_qdrant_api_key

# Optional Configuration
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSIONS=1536
COLLECTION_NAME=datapizza_docs
MAX_RESULTS=5
CHUNK_SIZE=1024
CHUNK_OVERLAP=200
LOG_LEVEL=INFO
```
### Required

| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key for generating embeddings |
| `QDRANT_URL` | Qdrant Cloud instance URL |
| `QDRANT_API_KEY` | Qdrant Cloud API key |
### Optional

| Variable | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL` | `text-embedding-3-small` | OpenAI embedding model |
| `EMBEDDING_DIMENSIONS` | `1536` | Embedding vector dimensions |
| `COLLECTION_NAME` | `datapizza_docs` | Qdrant collection name |
| `MAX_RESULTS` | `5` | Maximum search results returned |
| `CHUNK_SIZE` | `1024` | Document chunk size for indexing |
| `CHUNK_OVERLAP` | `200` | Overlap between document chunks |
| `LOG_LEVEL` | `INFO` | Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
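Internally, `config.py` presumably reads these variables, failing fast on missing required keys and falling back to the defaults above for the optional ones. A hedged sketch of that pattern (the names `Settings` and `load_settings` are illustrative, not the actual implementation):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # Hypothetical settings container mirroring the variables documented above.
    openai_api_key: str
    qdrant_url: str
    qdrant_api_key: str
    embedding_model: str = "text-embedding-3-small"
    embedding_dimensions: int = 1536
    collection_name: str = "datapizza_docs"
    max_results: int = 5
    chunk_size: int = 1024
    chunk_overlap: int = 200
    log_level: str = "INFO"

def load_settings() -> Settings:
    """Build Settings from the environment; required keys raise if absent."""
    def required(name: str) -> str:
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        return value

    return Settings(
        openai_api_key=required("OPENAI_API_KEY"),
        qdrant_url=required("QDRANT_URL"),
        qdrant_api_key=required("QDRANT_API_KEY"),
        embedding_model=os.environ.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        embedding_dimensions=int(os.environ.get("EMBEDDING_DIMENSIONS", "1536")),
        collection_name=os.environ.get("COLLECTION_NAME", "datapizza_docs"),
        max_results=int(os.environ.get("MAX_RESULTS", "5")),
        chunk_size=int(os.environ.get("CHUNK_SIZE", "1024")),
        chunk_overlap=int(os.environ.get("CHUNK_OVERLAP", "200")),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )
```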
## Indexing

Before using the server, index the datapizza-ai documentation:

```bash
python -m datapizza_mcp.indexer
```

To force re-indexing (clears existing data):

```bash
python -m datapizza_mcp.indexer --force
```
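Indexing splits each document into overlapping chunks governed by `CHUNK_SIZE` and `CHUNK_OVERLAP`. A minimal sliding-window sketch of that split (the actual indexer may chunk differently, e.g. on token or section boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    starts chunk_overlap characters before the previous one ended."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 2500-character document with the default settings yields three chunks.
pieces = chunk_text("a" * 2500, chunk_size=1024, chunk_overlap=200)
print(len(pieces), [len(p) for p in pieces])  # → 3 [1024, 1024, 852]
```

The overlap keeps sentences that straddle a chunk boundary fully present in at least one chunk, at the cost of indexing some text twice.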
## Running the Server

```bash
python -m datapizza_mcp.server
```

Or use the provided Windows batch script:

```bash
../run_datapizza.bat
```
The server exposes a query_datapizza tool that can be called by MCP clients:
```python
# Example query
result = await client.call_tool("query_datapizza", {
    "query": "how to create an agent with OpenAI",
    "max_results": 5
})
```
### Tools

- `query_datapizza`: Search datapizza-ai documentation
  - `query` (string): Natural language search query
  - `max_results` (int, optional): Maximum number of results (default: 5)

### Resources

- `datapizza://status`: System status and configuration information

## Development

```bash
# Format code
black src/

# Lint code
ruff check src/
ruff check src/ --fix  # Auto-fix issues

# Type checking
mypy src/

# Run tests
pytest
```
## Project Structure

```
datapizza-mcp-server/
├── src/datapizza_mcp/
│   ├── __init__.py      # Package exports
│   ├── config.py        # Configuration management
│   ├── server.py        # MCP server implementation
│   ├── indexer.py       # Documentation indexing
│   └── retriever.py     # RAG retrieval engine
├── pyproject.toml       # Package configuration
├── .env                 # Environment variables
└── README.md            # This file
```
## Troubleshooting

**Authentication Errors**

- Verify that `OPENAI_API_KEY` is set correctly
- Verify your Qdrant credentials (`QDRANT_URL` and `QDRANT_API_KEY`)

**Empty Search Results**

- Make sure the documentation has been indexed: `python -m datapizza_mcp.indexer`
- Check the `datapizza://status` resource

**Connection Issues**
Enable debug logging by setting LOG_LEVEL=DEBUG in your .env file.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Support

For issues and questions, please open an issue on the GitHub repository.