CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a REST API proxy for vLLM that works around vLLM's limitation of serving only one model per process. The proxy runs as a daemon that manages multiple vLLM instances and routes each request to the instance serving the requested model.

Key Architecture Decisions

  • Main entry point: src/main.py
  • Package manager: uv (not pip or poetry)
  • Python version: 3.13
  • Configuration: .env file for main configuration
  • Source organization: All source files go in src/ directory
  • Endpoint structure: Endpoints are organized as separate modules
  • Data persistence: Models saved to data/models.json (configurable via DATA_DIR)
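
A minimal example .env (illustrative values; only APP_PORT, the proxy's listen port, and DATA_DIR, the directory holding models.json, are referenced in this file — the full variable set may differ):

# .env
APP_PORT=8000
DATA_DIR=data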

Development Commands

# Install dependencies
uv sync

# Run the application from project root
uv run python src/main.py

# Run on different port
APP_PORT=8081 uv run python src/main.py

# Add a new dependency
uv add <package-name>

# Add a development dependency
uv add --dev <package-name>

API Endpoints

Model Management

  • GET /models - List all models with full details
  • POST /models - Create a new model (example request after this list)
  • GET /models/{model_id} - Get model details
  • PUT /models/{model_id} - Update model
  • DELETE /models/{model_id} - Delete model
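
A minimal model-creation request using only the Python standard library (the payload fields mirror the Model Configuration Fields section below; the exact request schema is defined in src/models/model.py and may differ):

import json, urllib.request

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",  # hypothetical HuggingFace model ID
    "tensor_parallel_size": 1,
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
}
req = urllib.request.Request(
    "http://localhost:8000/models",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))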

OpenAI v1 Compatible - Implemented

  • GET /v1/models - List models in OpenAI format
  • GET /v1/models/{model_id} - Get specific model in OpenAI format
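
For reference, the OpenAI models-list shape looks roughly like this (field values are illustrative; the proxy's exact output is defined in src/endpoints/v1/models.py):

{
  "object": "list",
  "data": [
    {"id": "my-model", "object": "model", "created": 1700000000, "owned_by": "vllm-proxy"}
  ]
}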

OpenAI v1 Compatible - Placeholders (TODO)

  • POST /v1/chat/completions - Chat completions (supports streaming via stream parameter)
  • POST /v1/completions - Text completions (supports streaming via stream parameter)
  • POST /v1/embeddings - Generate embeddings
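
Once these placeholders are implemented, a client call to chat completions might look like this (standard OpenAI request shape; the model name is hypothetical and must match a model registered via POST /models):

import json, urllib.request

payload = {
    "model": "my-model",                                  # hypothetical registered model name
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,                                       # SSE streaming via the stream parameter
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    for line in resp:                                     # SSE frames arrive as "data: {...}" lines
        print(line.decode().rstrip())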

OpenAI v1 Compatible - Not Applicable

  • /v1/images/* - Image generation (vLLM is text-only)
  • /v1/audio/* - Audio endpoints (vLLM is text-only)
  • /v1/assistants - Assistants API (beta feature)
  • /v1/fine_tuning/* - Fine-tuning management
  • /v1/files - File management
  • /v1/moderations - Content moderation

Utility

  • GET / - API info and endpoints
  • GET /health - Health check
  • GET /docs - Swagger UI documentation
  • GET /redoc - ReDoc documentation

Project Structure

src/
├── main.py                 # FastAPI application entry point
├── models/
│   └── model.py           # Model dataclass with vLLM configurations
├── services/
│   ├── model_manager.py   # Model lifecycle management
│   └── persistence.py     # JSON file persistence
├── endpoints/
│   ├── models.py          # Model CRUD endpoints
│   └── v1/                # OpenAI v1 compatible endpoints
│       ├── models.py      # Models listing
│       ├── chat.py        # Chat completions
│       ├── completions.py # Text completions
│       ├── embeddings.py  # Embeddings generation
│       └── misc.py        # Other v1 endpoints
└── data/                  # Persisted models (auto-created)
    └── models.json

Implementation Status

✅ Completed

  • FastAPI application setup with CORS
  • Model dataclass with vLLM parameters
  • Model management endpoints (CRUD)
  • OpenAI v1 compatible /v1/models endpoint
  • Model persistence to JSON file
  • Port allocation for models
  • Environment variable configuration
  • All OpenAI v1 endpoint placeholders with proper request/response models
  • Streaming support structure (parameter-based, not separate endpoints)
  • Swagger/ReDoc API documentation

🚧 High Priority TODO

  • vLLM process spawning and management (see the sketches after this list)
  • Implement actual chat completions logic (/v1/chat/completions)
  • Implement actual text completions logic (/v1/completions)
  • Server-Sent Events (SSE) streaming for both endpoints
  • Request proxying to appropriate vLLM instance
  • Model health monitoring and status updates
  • Process cleanup on model deletion
  • Automatic model loading on startup (spawn vLLM processes)
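
A minimal sketch of process spawning, assuming vLLM's bundled OpenAI-compatible server entrypoint is used and its CLI flags mirror the Model Configuration Fields below (the real manager would also keep the Popen handle, monitor health, and update model status):

import subprocess, sys

def spawn_vllm(model: str, port: int, tensor_parallel_size: int = 1,
               gpu_memory_utilization: float = 0.9) -> subprocess.Popen:
    # Launch one vLLM OpenAI-compatible server on its own port
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    return subprocess.Popen(cmd)

And a sketch of request proxying with SSE pass-through, assuming httpx is added as a dependency (uv add httpx); since each vLLM instance already speaks the OpenAI protocol, the proxy can relay request and response bodies unchanged:

import httpx
from fastapi.responses import StreamingResponse

async def proxy_chat_completions(body: dict, upstream_port: int) -> StreamingResponse:
    # Forward the request to the vLLM instance that owns the requested model
    url = f"http://127.0.0.1:{upstream_port}/v1/chat/completions"

    async def relay():
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", url, json=body) as upstream:
                async for chunk in upstream.aiter_bytes():
                    yield chunk  # pass SSE frames (or the full JSON body) through unchanged

    media_type = "text/event-stream" if body.get("stream") else "application/json"
    return StreamingResponse(relay(), media_type=media_type)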

🔄 Medium Priority TODO

  • Embeddings endpoint implementation (/v1/embeddings)
  • Load balancing for models with multiple instances
  • Model configuration validation
  • Error recovery and retry logic
  • Graceful shutdown handling
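
A minimal sketch of graceful shutdown using FastAPI's lifespan hook, assuming the model manager keeps a registry of spawned processes (the registry name here is hypothetical):

import subprocess
from contextlib import asynccontextmanager
from fastapi import FastAPI

running: dict[str, subprocess.Popen] = {}  # model_id -> vLLM process (hypothetical registry)

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # application serves requests here
    # On shutdown, terminate every spawned vLLM process and wait briefly
    for proc in running.values():
        proc.terminate()
    for proc in running.values():
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()

app = FastAPI(lifespan=lifespan)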

📊 Low Priority TODO

  • Authentication/API keys
  • Rate limiting
  • Metrics and monitoring endpoints
  • Content moderation endpoint
  • Fine-tuning management (if applicable)

Model Configuration Fields

The Model dataclass includes the main vLLM parameters:

  • model: HuggingFace model ID, local path, or URL
  • tensor_parallel_size: GPU parallelism
  • pipeline_parallel_size: Pipeline parallelism
  • max_model_len: Maximum sequence length
  • dtype: Data type (auto, float16, bfloat16, float32)
  • quantization: Quantization method (awq, gptq, etc.)
  • trust_remote_code: Allow remote code execution
  • gpu_memory_utilization: GPU memory fraction (0-1)
  • max_num_seqs: Maximum concurrent sequences
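
A sketch of what that dataclass might look like, with illustrative defaults (the authoritative definition lives in src/models/model.py):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    model: str                            # HuggingFace model ID, local path, or URL
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # pipeline parallelism
    max_model_len: Optional[int] = None   # maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: Optional[str] = None    # awq, gptq, ...
    trust_remote_code: bool = False       # allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # maximum concurrent sequences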

Important Notes

  • Models persist across server restarts in data/models.json
  • Each model is allocated a unique port starting from 8001
  • Server runs on port 8000 by default (configurable via APP_PORT)
  • All datetime objects are timezone-aware (UTC)
  • Model status tracks lifecycle: loading, ready, error, unloading
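
Two of these notes in sketch form — the status lifecycle as an enum and sequential port allocation starting at 8001 (names are hypothetical; the actual logic lives in src/services/model_manager.py):

from enum import Enum

class ModelStatus(str, Enum):
    LOADING = "loading"
    READY = "ready"
    ERROR = "error"
    UNLOADING = "unloading"

def next_free_port(used_ports: set[int], start: int = 8001) -> int:
    # Return the lowest unused port at or above `start`
    port = start
    while port in used_ports:
        port += 1
    return port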