# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a vLLM proxy REST API that works around vLLM's limitation of loading only one model per process. The proxy runs as a daemon that manages multiple vLLM instances and routes each request to the appropriate instance.

## Key Architecture Decisions

- **Main entry point**: `src/main.py`
- **Package manager**: uv (not pip or poetry)
- **Python version**: 3.13
- **Configuration**: `.env` file for main configuration
- **Source organization**: All source files go in the `src/` directory
- **Endpoint structure**: Endpoints are organized as separate modules
- **Data persistence**: Models are saved to `data/models.json` (configurable via `DATA_DIR`)

## Development Commands

```bash
# Install dependencies
uv sync

# Run the application from the project root
uv run python src/main.py

# Run on a different port
APP_PORT=8081 uv run python src/main.py

# Add a new dependency
uv add <package-name>

# Add a development dependency
uv add --dev <package-name>
```

## API Endpoints

### Model Management

- `GET /models` - List all models with full details
- `POST /models` - Create a new model (see the sketch below)
- `GET /models/{model_id}` - Get model details
- `PUT /models/{model_id}` - Update a model
- `DELETE /models/{model_id}` - Delete a model
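
A minimal sketch of registering and listing a model from Python, assuming an `httpx` client; the model name and the subset of request-body fields (drawn from Model Configuration Fields below) are illustrative:

```python
import httpx

# Register a model with the proxy (assumes the proxy on its default port 8000).
resp = httpx.post(
    "http://localhost:8000/models",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",  # illustrative model ID
        "tensor_parallel_size": 1,
        "max_model_len": 8192,
        "dtype": "auto",
        "gpu_memory_utilization": 0.9,
    },
)
print(resp.json())

# List all registered models with full details.
print(httpx.get("http://localhost:8000/models").json())
```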

### OpenAI v1 Compatible - Implemented

- `GET /v1/models` - List models in OpenAI format (see the sketch below)
- `GET /v1/models/{model_id}` - Get a specific model in OpenAI format
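
Because these endpoints follow the OpenAI format, any OpenAI-compatible client can exercise them. A minimal sketch with the official `openai` package; the `api_key` value is a placeholder, since the proxy does not yet enforce authentication:

```python
from openai import OpenAI

# Point the standard OpenAI client at the proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# List models through the implemented /v1/models endpoint.
for m in client.models.list():
    print(m.id)
```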

### OpenAI v1 Compatible - Placeholders (TODO)

- `POST /v1/chat/completions` - Chat completions (supports streaming via the `stream` parameter; see the sketch below)
- `POST /v1/completions` - Text completions (supports streaming via the `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings
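
Once implemented, streaming is selected per request rather than via separate endpoints. A sketch of the intended client-side usage, again with the `openai` package and an illustrative model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Request a streamed chat completion; `stream=True` switches the response
# to Server-Sent Events instead of a single JSON body.
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # illustrative model ID
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```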

### OpenAI v1 Compatible - Not Applicable

- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation

### Utility

- `GET /` - API info and endpoint listing
- `GET /health` - Health check
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation

## Project Structure

```
src/
├── main.py                  # FastAPI application entry point
├── models/
│   └── model.py             # Model dataclass with vLLM configurations
├── services/
│   ├── model_manager.py     # Model lifecycle management
│   └── persistence.py       # JSON file persistence
├── endpoints/
│   ├── models.py            # Model CRUD endpoints
│   └── v1/                  # OpenAI v1 compatible endpoints
│       ├── models.py        # Models listing
│       ├── chat.py          # Chat completions
│       ├── completions.py   # Text completions
│       ├── embeddings.py    # Embeddings generation
│       └── misc.py          # Other v1 endpoints
└── data/                    # Persisted models (auto-created)
    └── models.json
```

## Implementation Status

### ✅ Completed

- [x] FastAPI application setup with CORS
- [x] Model dataclass with vLLM parameters
- [x] Model management endpoints (CRUD)
- [x] OpenAI v1 compatible `/v1/models` endpoint
- [x] Model persistence to JSON file
- [x] Port allocation for models
- [x] Environment variable configuration
- [x] All OpenAI v1 endpoint placeholders with proper request/response models
- [x] Streaming support structure (parameter-based, not separate endpoints)
- [x] Swagger/ReDoc API documentation

### 🚧 High Priority TODO

- [ ] vLLM process spawning and management
- [ ] Implement the actual chat completions logic (`/v1/chat/completions`)
- [ ] Implement the actual text completions logic (`/v1/completions`)
- [ ] Server-Sent Events (SSE) streaming for both endpoints
- [ ] Request proxying to the appropriate vLLM instance (see the sketch below)
- [ ] Model health monitoring and status updates
- [ ] Process cleanup on model deletion
- [ ] Automatic model loading on startup (spawn vLLM processes)
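
The completions, streaming, and proxying items above share one core mechanism: forwarding a request to the vLLM instance that owns the requested model. A minimal sketch of that flow, assuming `httpx` and a hypothetical `model_manager` lookup that returns a model record with its allocated `port`:

```python
import httpx
from fastapi import APIRouter, Request
from fastapi.responses import JSONResponse, StreamingResponse

router = APIRouter()

@router.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Hypothetical helper: find the vLLM instance serving the requested model.
    model = model_manager.get_by_name(body["model"])
    upstream = f"http://127.0.0.1:{model.port}/v1/chat/completions"

    if body.get("stream"):
        async def sse_passthrough():
            # Re-emit the upstream Server-Sent Events bytes unchanged.
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", upstream, json=body) as resp:
                    async for chunk in resp.aiter_bytes():
                        yield chunk
        return StreamingResponse(sse_passthrough(), media_type="text/event-stream")

    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(upstream, json=body)
    return JSONResponse(resp.json(), status_code=resp.status_code)
```

The `/v1/completions` endpoint would follow the same shape with a different upstream path.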

### 🔄 Medium Priority TODO

- [ ] Embeddings endpoint implementation (`/v1/embeddings`)
- [ ] Load balancing for models with multiple instances
- [ ] Model configuration validation
- [ ] Error recovery and retry logic
- [ ] Graceful shutdown handling

### 📊 Low Priority TODO

- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring endpoints
- [ ] Content moderation endpoint
- [ ] Fine-tuning management (if applicable)

## Model Configuration Fields

The Model dataclass exposes the core vLLM parameters (sketched below):

- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences
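
An illustrative shape for the dataclass; the actual definition lives in `src/models/model.py` and likely carries extra bookkeeping fields (id, allocated port, status), and the defaults shown here are assumptions mirroring vLLM's own:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    model: str                            # HuggingFace ID, local path, or URL
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # pipeline parallelism
    max_model_len: Optional[int] = None   # maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: Optional[str] = None    # awq, gptq, etc.
    trust_remote_code: bool = False       # allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # maximum concurrent sequences
```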

## Important Notes

- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port, starting from 8001 (see the spawn sketch below)
- The server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks the lifecycle: loading, ready, error, unloading
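
The per-model port is what ties the configuration fields to the spawning TODO: each vLLM instance is launched on its model's allocated port. A sketch of how the fields could map onto vLLM's `vllm serve` CLI flags; flag names should be verified against the installed vLLM version, and `Model` follows the illustrative dataclass above:

```python
import subprocess

def spawn_vllm(m: Model, port: int) -> subprocess.Popen:
    # Build a vLLM OpenAI-compatible server command from the model's config.
    cmd = [
        "vllm", "serve", m.model,
        "--port", str(port),
        "--tensor-parallel-size", str(m.tensor_parallel_size),
        "--pipeline-parallel-size", str(m.pipeline_parallel_size),
        "--gpu-memory-utilization", str(m.gpu_memory_utilization),
        "--max-num-seqs", str(m.max_num_seqs),
        "--dtype", m.dtype,
    ]
    if m.max_model_len is not None:
        cmd += ["--max-model-len", str(m.max_model_len)]
    if m.quantization is not None:
        cmd += ["--quantization", m.quantization]
    if m.trust_remote_code:
        cmd.append("--trust-remote-code")
    return subprocess.Popen(cmd)
```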