# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a vLLM proxy REST API that works around vLLM's limitation of loading only one model per process. The proxy runs as a daemon that manages multiple vLLM instances and routes each request to the appropriate instance.

## Key Architecture Decisions

- **Main entry point**: `src/main.py`
- **Package manager**: uv (not pip or poetry)
- **Python version**: 3.13
- **Configuration**: `.env` file for main configuration
- **Source organization**: All source files go in the `src/` directory
- **Endpoint structure**: Endpoints are organized as separate modules
- **Data persistence**: Models are saved to `data/models.json` (configurable via `DATA_DIR`)

## Development Commands

```bash
# Install dependencies
uv sync

# Run the application from the project root
uv run python src/main.py

# Run on a different port
APP_PORT=8081 uv run python src/main.py

# Add a new dependency
uv add <package-name>

# Add a development dependency
uv add --dev <package-name>
```

## API Endpoints

### Model Management

- `GET /models` - List all models with full details
- `POST /models` - Create a new model (see the sketch below)
- `GET /models/{model_id}` - Get model details
- `PUT /models/{model_id}` - Update a model
- `DELETE /models/{model_id}` - Delete a model
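
A minimal sketch of registering and listing a model from Python, assuming an `httpx` client; the model name and the subset of request-body fields (drawn from Model Configuration Fields below) are illustrative:

```python
import httpx

# Register a model with the proxy (assumes the proxy on its default port 8000).
resp = httpx.post(
    "http://localhost:8000/models",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",  # illustrative model ID
        "tensor_parallel_size": 1,
        "max_model_len": 8192,
        "dtype": "auto",
        "gpu_memory_utilization": 0.9,
    },
)
print(resp.json())

# List all registered models with full details.
print(httpx.get("http://localhost:8000/models").json())
```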

### OpenAI v1 Compatible - Implemented

- `GET /v1/models` - List models in OpenAI format (see the sketch below)
- `GET /v1/models/{model_id}` - Get a specific model in OpenAI format
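
Because these endpoints follow the OpenAI format, any OpenAI-compatible client can exercise them. A minimal sketch with the official `openai` package; the `api_key` value is a placeholder, since the proxy does not yet enforce authentication:

```python
from openai import OpenAI

# Point the standard OpenAI client at the proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# List models through the implemented /v1/models endpoint.
for m in client.models.list():
    print(m.id)
```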

### OpenAI v1 Compatible - Placeholders (TODO)

- `POST /v1/chat/completions` - Chat completions (supports streaming via the `stream` parameter; see the sketch below)
- `POST /v1/completions` - Text completions (supports streaming via the `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings
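
Once implemented, streaming is selected per request rather than via separate endpoints. A sketch of the intended client-side usage, again with the `openai` package and an illustrative model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Request a streamed chat completion; `stream=True` switches the response
# to Server-Sent Events instead of a single JSON body.
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # illustrative model ID
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```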

### OpenAI v1 Compatible - Not Applicable

- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation

### Utility

- `GET /` - API info and endpoint listing
- `GET /health` - Health check
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation

## Project Structure

```
src/
├── main.py                  # FastAPI application entry point
├── models/
│   └── model.py             # Model dataclass with vLLM configurations
├── services/
│   ├── model_manager.py     # Model lifecycle management
│   └── persistence.py       # JSON file persistence
├── endpoints/
│   ├── models.py            # Model CRUD endpoints
│   └── v1/                  # OpenAI v1 compatible endpoints
│       ├── models.py        # Models listing
│       ├── chat.py          # Chat completions
│       ├── completions.py   # Text completions
│       ├── embeddings.py    # Embeddings generation
│       └── misc.py          # Other v1 endpoints
└── data/                    # Persisted models (auto-created)
    └── models.json
```

## Implementation Status

### ✅ Completed

- [x] FastAPI application setup with CORS
- [x] Model dataclass with vLLM parameters
- [x] Model management endpoints (CRUD)
- [x] OpenAI v1 compatible `/v1/models` endpoint
- [x] Model persistence to JSON file
- [x] Port allocation for models
- [x] Environment variable configuration
- [x] All OpenAI v1 endpoint placeholders with proper request/response models
- [x] Streaming support structure (parameter-based, not separate endpoints)
- [x] Swagger/ReDoc API documentation

### 🚧 High Priority TODO

- [ ] vLLM process spawning and management
- [ ] Implement the actual chat completions logic (`/v1/chat/completions`)
- [ ] Implement the actual text completions logic (`/v1/completions`)
- [ ] Server-Sent Events (SSE) streaming for both endpoints
- [ ] Request proxying to the appropriate vLLM instance (see the sketch below)
- [ ] Model health monitoring and status updates
- [ ] Process cleanup on model deletion
- [ ] Automatic model loading on startup (spawn vLLM processes)
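
The completions, streaming, and proxying items above share one core mechanism: forwarding a request to the vLLM instance that owns the requested model. A minimal sketch of that flow, assuming `httpx` and a hypothetical `model_manager` lookup that returns a model record with its allocated `port`:

```python
import httpx
from fastapi import APIRouter, Request
from fastapi.responses import JSONResponse, StreamingResponse

router = APIRouter()

@router.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Hypothetical helper: find the vLLM instance serving the requested model.
    model = model_manager.get_by_name(body["model"])
    upstream = f"http://127.0.0.1:{model.port}/v1/chat/completions"

    if body.get("stream"):
        async def sse_passthrough():
            # Re-emit the upstream Server-Sent Events bytes unchanged.
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", upstream, json=body) as resp:
                    async for chunk in resp.aiter_bytes():
                        yield chunk
        return StreamingResponse(sse_passthrough(), media_type="text/event-stream")

    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(upstream, json=body)
    return JSONResponse(resp.json(), status_code=resp.status_code)
```

The `/v1/completions` endpoint would follow the same shape with a different upstream path.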

### 🔄 Medium Priority TODO

- [ ] Embeddings endpoint implementation (`/v1/embeddings`)
- [ ] Load balancing for models with multiple instances
- [ ] Model configuration validation
- [ ] Error recovery and retry logic
- [ ] Graceful shutdown handling

### 📊 Low Priority TODO

- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring endpoints
- [ ] Content moderation endpoint
- [ ] Fine-tuning management (if applicable)

## Model Configuration Fields

The Model dataclass exposes the core vLLM parameters (sketched below):

- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences
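
An illustrative shape for the dataclass; the actual definition lives in `src/models/model.py` and likely carries extra bookkeeping fields (id, allocated port, status), and the defaults shown here are assumptions mirroring vLLM's own:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    model: str                            # HuggingFace ID, local path, or URL
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # pipeline parallelism
    max_model_len: Optional[int] = None   # maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: Optional[str] = None    # awq, gptq, etc.
    trust_remote_code: bool = False       # allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # maximum concurrent sequences
```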

## Important Notes

- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port, starting from 8001 (see the spawn sketch below)
- The server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks the lifecycle: loading, ready, error, unloading
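
The per-model port is what ties the configuration fields to the spawning TODO: each vLLM instance is launched on its model's allocated port. A sketch of how the fields could map onto vLLM's `vllm serve` CLI flags; flag names should be verified against the installed vLLM version, and `Model` follows the illustrative dataclass above:

```python
import subprocess

def spawn_vllm(m: Model, port: int) -> subprocess.Popen:
    # Build a vLLM OpenAI-compatible server command from the model's config.
    cmd = [
        "vllm", "serve", m.model,
        "--port", str(port),
        "--tensor-parallel-size", str(m.tensor_parallel_size),
        "--pipeline-parallel-size", str(m.pipeline_parallel_size),
        "--gpu-memory-utilization", str(m.gpu_memory_utilization),
        "--max-num-seqs", str(m.max_num_seqs),
        "--dtype", m.dtype,
    ]
    if m.max_model_len is not None:
        cmd += ["--max-model-len", str(m.max_model_len)]
    if m.quantization is not None:
        cmd += ["--quantization", m.quantization]
    if m.trust_remote_code:
        cmd.append("--trust-remote-code")
    return subprocess.Popen(cmd)
```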