# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a vLLM proxy REST API that solves the limitation of vLLM only being able to load one model at a time per process. The proxy acts as a daemon that manages multiple vLLM instances and routes requests to the appropriate instance.

## Key Architecture Decisions

- **Main entry point**: `src/main.py`
- **Package manager**: uv (not pip or poetry)
- **Python version**: 3.13
- **Configuration**: `.env` file for main configuration
- **Source organization**: All source files go in the `src/` directory
- **Endpoint structure**: Endpoints are organized as separate modules
- **Data persistence**: Models saved to `data/models.json` (configurable via `DATA_DIR`)

## Development Commands

```bash
# Install dependencies
uv sync

# Run the application from the project root
uv run python src/main.py

# Run on a different port
APP_PORT=8081 uv run python src/main.py

# Add a new dependency
uv add <package>

# Add a development dependency
uv add --dev <package>
```

## API Endpoints

### Model Management

- `GET /models` - List all models with full details
- `POST /models` - Create a new model
- `GET /models/{model_id}` - Get model details
- `PUT /models/{model_id}` - Update model
- `DELETE /models/{model_id}` - Delete model

### OpenAI v1 Compatible - Implemented

- `GET /v1/models` - List models in OpenAI format
- `GET /v1/models/{model_id}` - Get specific model in OpenAI format

### OpenAI v1 Compatible - Placeholders (TODO)

- `POST /v1/chat/completions` - Chat completions (supports streaming via `stream` parameter)
- `POST /v1/completions` - Text completions (supports streaming via `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings

### OpenAI v1 Compatible - Not Applicable

- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation

### Utility

- `GET /` - API info and endpoints
- `GET /health` - Health check
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation

## Project Structure

```
src/
├── main.py                # FastAPI application entry point
├── models/
│   └── model.py           # Model dataclass with vLLM configurations
├── services/
│   ├── model_manager.py   # Model lifecycle management
│   └── persistence.py     # JSON file persistence
├── endpoints/
│   ├── models.py          # Model CRUD endpoints
│   └── v1/                # OpenAI v1 compatible endpoints
│       ├── models.py      # Models listing
│       ├── chat.py        # Chat completions
│       ├── completions.py # Text completions
│       ├── embeddings.py  # Embeddings generation
│       └── misc.py        # Other v1 endpoints
└── data/                  # Persisted models (auto-created)
    └── models.json
```
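For orientation, here is a minimal sketch of what the `Model` dataclass in `src/models/model.py` might look like, combining the fields listed under Model Configuration Fields and Important Notes below. The bookkeeping field names (`id`, `port`, `status`, `created_at`) and all default values are assumptions for illustration, not the actual implementation.

```python
# src/models/model.py -- hypothetical sketch; the real dataclass may differ.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid


@dataclass
class Model:
    # Required vLLM launch parameter
    model: str                            # HuggingFace model ID, local path, or URL

    # vLLM engine parameters (defaults here are assumptions, not the project's)
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # Pipeline parallelism
    max_model_len: Optional[int] = None   # Maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: Optional[str] = None    # awq, gptq, etc.
    trust_remote_code: bool = False       # Allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # Maximum concurrent sequences

    # Proxy bookkeeping (field names are assumptions)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    port: Optional[int] = None            # Allocated per model, starting from 8001
    status: str = "loading"               # loading, ready, error, unloading
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```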
## Implementation Status

### ✅ Completed

- [x] FastAPI application setup with CORS
- [x] Model dataclass with vLLM parameters
- [x] Model management endpoints (CRUD)
- [x] OpenAI v1 compatible `/v1/models` endpoint
- [x] Model persistence to JSON file
- [x] Port allocation for models
- [x] Environment variable configuration
- [x] All OpenAI v1 endpoint placeholders with proper request/response models
- [x] Streaming support structure (parameter-based, not separate endpoints)
- [x] Swagger/ReDoc API documentation

### 🚧 High Priority TODO

- [ ] vLLM process spawning and management (see the sketches at the end of this file)
- [ ] Implement actual chat completions logic (`/v1/chat/completions`)
- [ ] Implement actual text completions logic (`/v1/completions`)
- [ ] Server-Sent Events (SSE) streaming for both endpoints
- [ ] Request proxying to the appropriate vLLM instance
- [ ] Model health monitoring and status updates
- [ ] Process cleanup on model deletion
- [ ] Automatic model loading on startup (spawn vLLM processes)

### 🔄 Medium Priority TODO

- [ ] Embeddings endpoint implementation (`/v1/embeddings`)
- [ ] Load balancing for models with multiple instances
- [ ] Model configuration validation
- [ ] Error recovery and retry logic
- [ ] Graceful shutdown handling

### 📊 Low Priority TODO

- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring endpoints
- [ ] Content moderation endpoint
- [ ] Fine-tuning management (if applicable)

## Model Configuration Fields

The Model dataclass includes the core vLLM engine parameters:

- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences

## Important Notes

- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port starting from 8001
- The server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks the lifecycle: loading, ready, error, unloading
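## Implementation Sketches

The sketches below are reference points for the High Priority TODO items, not the actual implementation. First, per-model process spawning: each model would get its own vLLM OpenAI-compatible server on its allocated port. This assumes vLLM's `vllm.entrypoints.openai.api_server` module and its standard CLI flags, plus the hypothetical `Model` fields sketched earlier; the real `ModelManager` API may look different.

```python
# Hypothetical sketch for src/services/model_manager.py -- spawn one vLLM
# OpenAI-compatible server per model on its allocated port.
import subprocess
import sys


def spawn_vllm(model) -> subprocess.Popen:
    """Start a vLLM instance for `model` (a Model as sketched above)."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model.model,
        "--port", str(model.port),
        "--tensor-parallel-size", str(model.tensor_parallel_size),
        "--dtype", model.dtype,
        "--gpu-memory-utilization", str(model.gpu_memory_utilization),
        "--max-num-seqs", str(model.max_num_seqs),
    ]
    if model.max_model_len is not None:
        cmd += ["--max-model-len", str(model.max_model_len)]
    if model.quantization is not None:
        cmd += ["--quantization", model.quantization]
    if model.trust_remote_code:
        cmd.append("--trust-remote-code")
    # Keep the Popen handle so the manager can monitor health and
    # terminate the process when the model is deleted.
    return subprocess.Popen(cmd)
```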
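Second, request proxying: once an instance is ready, the placeholder `/v1/chat/completions` endpoint could forward the request body to the instance serving the requested model. The use of `httpx`, the in-memory `MODELS` registry, and the function names are assumptions; SSE streaming is deliberately left out here since it is tracked as a separate TODO.

```python
# Hypothetical sketch for src/endpoints/v1/chat.py -- forward a chat
# completion request to the vLLM instance that serves the requested model.
import httpx
from fastapi import APIRouter, HTTPException, Request

router = APIRouter()

# Hypothetical in-memory registry populated by the model manager: name -> Model
MODELS: dict = {}


@router.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    model = MODELS.get(body.get("model"))
    if model is None or model.status != "ready":
        raise HTTPException(status_code=404, detail="Model not loaded")
    async with httpx.AsyncClient(timeout=None) as client:
        # Non-streaming pass-through to the per-model vLLM instance.
        resp = await client.post(
            f"http://localhost:{model.port}/v1/chat/completions", json=body
        )
    return resp.json()
```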