# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a vLLM proxy REST API that solves the limitation of vLLM only being able to load one model at a time per process. The proxy acts as a daemon that manages multiple vLLM instances and routes requests to the appropriate instance.
## Key Architecture Decisions

- Main entry point: `src/main.py`
- Package manager: uv (not pip or poetry)
- Python version: 3.13
- Configuration: `.env` file for main configuration (see the sketch after this list)
- Source organization: All source files go in the `src/` directory
- Endpoint structure: Endpoints are organized as separate modules
- Data persistence: Models saved to `data/models.json` (configurable via `DATA_DIR`)
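
The environment variables referenced in this file are `APP_PORT` and `DATA_DIR`. A minimal sketch of how they would be read, assuming `os.getenv` with the documented defaults (the actual loading code in `src/main.py` may use python-dotenv or pydantic settings instead):

```python
# Sketch only: configuration loading with the defaults documented in this file.
import os
from pathlib import Path

APP_PORT = int(os.getenv("APP_PORT", "8000"))   # proxy listens on 8000 by default
DATA_DIR = Path(os.getenv("DATA_DIR", "data"))  # directory holding models.json
MODELS_FILE = DATA_DIR / "models.json"          # persisted model registry
```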
## Development Commands

```bash
# Install dependencies
uv sync

# Run the application from project root
uv run python src/main.py

# Run on a different port
APP_PORT=8081 uv run python src/main.py

# Add a new dependency
uv add <package-name>

# Add a development dependency
uv add --dev <package-name>
```
## API Endpoints

### Model Management

- `GET /models` - List all models with full details
- `POST /models` - Create a new model
- `GET /models/{model_id}` - Get model details
- `PUT /models/{model_id}` - Update a model
- `DELETE /models/{model_id}` - Delete a model
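
A hypothetical client-side example of registering a model with `httpx`. The request body fields mirror the Model Configuration Fields section below, but the exact schema is defined by the dataclass in `src/models/model.py` and may differ:

```python
# Illustrative only: register a model with the management API.
import httpx

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # HuggingFace model ID (example value)
    "tensor_parallel_size": 1,
    "max_model_len": 8192,
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
}

with httpx.Client(base_url="http://localhost:8000") as client:
    resp = client.post("/models", json=payload)
    resp.raise_for_status()
    print(resp.json())  # should include the allocated port and lifecycle status
```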
### OpenAI v1 Compatible - Implemented

- `GET /v1/models` - List models in OpenAI format
- `GET /v1/models/{model_id}` - Get a specific model in OpenAI format
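
Because these routes follow the OpenAI API shape, the official `openai` Python SDK can be pointed at the proxy. A usage sketch (authentication is not implemented yet, so the API key is a placeholder):

```python
from openai import OpenAI

# Point the SDK at the proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for model in client.models.list():
    print(model.id)
```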
### OpenAI v1 Compatible - Placeholders (TODO)

- `POST /v1/chat/completions` - Chat completions (supports streaming via the `stream` parameter)
- `POST /v1/completions` - Text completions (supports streaming via the `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings
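
These endpoints are placeholders, so the example below shows intended usage once they are implemented. Streaming is requested via the `stream` parameter rather than a separate endpoint; the model name is a placeholder for whatever was registered via `POST /models`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example; use a model registered with the proxy
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # the proxy is expected to relay SSE chunks from the vLLM instance
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```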
### OpenAI v1 Compatible - Not Applicable

- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation
### Utility

- `GET /` - API info and endpoints
- `GET /health` - Health check
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation
## Project Structure

```
src/
├── main.py                 # FastAPI application entry point
├── models/
│   └── model.py            # Model dataclass with vLLM configurations
├── services/
│   ├── model_manager.py    # Model lifecycle management
│   └── persistence.py      # JSON file persistence
├── endpoints/
│   ├── models.py           # Model CRUD endpoints
│   └── v1/                 # OpenAI v1 compatible endpoints
│       ├── models.py       # Models listing
│       ├── chat.py         # Chat completions
│       ├── completions.py  # Text completions
│       ├── embeddings.py   # Embeddings generation
│       └── misc.py         # Other v1 endpoints
└── data/                   # Persisted models (auto-created)
    └── models.json
```
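
`services/persistence.py` handles the JSON persistence; a minimal sketch of the approach, assuming plain `json` serialization to `data/models.json` (the actual module may differ):

```python
# Sketch only: persist the model registry to data/models.json and restore it on startup.
import json
from pathlib import Path

MODELS_FILE = Path("data/models.json")

def save_models(models: dict[str, dict]) -> None:
    MODELS_FILE.parent.mkdir(parents=True, exist_ok=True)  # data/ is auto-created
    MODELS_FILE.write_text(json.dumps(models, indent=2, default=str))  # default=str for datetimes

def load_models() -> dict[str, dict]:
    return json.loads(MODELS_FILE.read_text()) if MODELS_FILE.exists() else {}
```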
## Implementation Status

### ✅ Completed
- FastAPI application setup with CORS
- Model dataclass with vLLM parameters
- Model management endpoints (CRUD)
- OpenAI v1 compatible `/v1/models` endpoint
- Model persistence to JSON file
- Port allocation for models
- Environment variable configuration
- All OpenAI v1 endpoint placeholders with proper request/response models
- Streaming support structure (parameter-based, not separate endpoints)
- Swagger/ReDoc API documentation
### 🚧 High Priority TODO

- vLLM process spawning and management (see the sketch after this list)
- Implement actual chat completions logic (`/v1/chat/completions`)
- Implement actual text completions logic (`/v1/completions`)
- Server-Sent Events (SSE) streaming for both endpoints
- Request proxying to appropriate vLLM instance
- Model health monitoring and status updates
- Process cleanup on model deletion
- Automatic model loading on startup (spawn vLLM processes)
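
A minimal sketch of the process-spawning piece, assuming vLLM's bundled OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server`); flag names should be verified against the installed vLLM version, and the real `model_manager.py` also needs status tracking, health monitoring, and cleanup on deletion:

```python
# Sketch only: spawn one vLLM OpenAI-compatible server per registered model.
import subprocess
import sys

def spawn_vllm(model: str, port: int, tensor_parallel_size: int = 1) -> subprocess.Popen:
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),  # unique per-model port allocated by the proxy (8001+)
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    # Keep the Popen handle so the manager can monitor health and terminate on deletion.
    return subprocess.Popen(cmd)

# The proxy would then forward /v1/* requests for this model to http://localhost:<port>/v1/...
```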
### 🔄 Medium Priority TODO

- Embeddings endpoint implementation (`/v1/embeddings`)
- Load balancing for models with multiple instances
- Model configuration validation
- Error recovery and retry logic
- Graceful shutdown handling
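
For the graceful shutdown item above, a sketch using FastAPI's lifespan hook, assuming spawned vLLM processes are tracked in a registry (names here are illustrative, not the actual code):

```python
# Sketch only: terminate spawned vLLM processes when the proxy shuts down.
import subprocess
from contextlib import asynccontextmanager
from fastapi import FastAPI

# Hypothetical registry of spawned processes keyed by model ID;
# in the real code this would live in services/model_manager.py.
running_processes: dict[str, subprocess.Popen] = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # application serves requests
    for proc in running_processes.values():
        proc.terminate()
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()  # force-kill anything that did not exit in time

app = FastAPI(lifespan=lifespan)
```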
### 📊 Low Priority TODO
- Authentication/API keys
- Rate limiting
- Metrics and monitoring endpoints
- Content moderation endpoint
- Fine-tuning management (if applicable)
## Model Configuration Fields

The Model dataclass exposes the key vLLM launch parameters (a dataclass sketch follows this list):

- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences
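
An illustrative sketch of what `src/models/model.py` might contain; field names follow the list above, while the defaults and the bookkeeping fields (`port`, `status`, `created_at`) are assumptions, not the actual code:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Model:
    model: str                            # HuggingFace model ID, local path, or URL
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # pipeline parallelism
    max_model_len: int | None = None      # maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: str | None = None       # awq, gptq, etc.
    trust_remote_code: bool = False       # allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # maximum concurrent sequences
    port: int | None = None               # allocated by the proxy, starting at 8001
    status: str = "loading"               # loading, ready, error, unloading
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```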
## Important Notes

- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port starting from 8001 (see the sketch after this list)
- Server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks the lifecycle: loading, ready, error, unloading
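
A sketch of the port-allocation rule described above, assuming the lowest unused port at or above 8001 is handed out (the real allocator in the model manager may also verify the port is free at the OS level):

```python
# Sketch only: hand out the lowest unused port starting at 8001.
def allocate_port(used_ports: set[int], start: int = 8001) -> int:
    port = start
    while port in used_ports:
        port += 1
    used_ports.add(port)
    return port

used: set[int] = set()
print(allocate_port(used), allocate_port(used), allocate_port(used))  # 8001 8002 8003
```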