# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a vLLM proxy REST API that solves the limitation of vLLM only being able to load one model at a time per process. The proxy acts as a daemon that manages multiple vLLM instances and routes requests to the appropriate instance.
## Key Architecture Decisions

- Main entry point: `src/main.py`
- Package manager: uv (not pip or poetry)
- Python version: 3.13
- Configuration: `.env` file for main configuration (see the sketch after this list)
- Source organization: All source files go in the `src/` directory
- Endpoint structure: Endpoints are organized as separate modules
- Data persistence: Models saved to `data/models.json` (configurable via `DATA_DIR`)
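
The environment variables referenced in this file are `APP_PORT` and `DATA_DIR`. A minimal sketch of how they would be read, assuming `os.getenv` with the documented defaults (the actual loading code in `src/main.py` may use python-dotenv or pydantic settings instead):

```python
# Sketch only: configuration loading with the defaults documented in this file.
import os
from pathlib import Path

APP_PORT = int(os.getenv("APP_PORT", "8000"))   # proxy listens on 8000 by default
DATA_DIR = Path(os.getenv("DATA_DIR", "data"))  # directory holding models.json
MODELS_FILE = DATA_DIR / "models.json"          # persisted model registry
```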
## Development Commands

```bash
# Install dependencies
uv sync

# Run the application from project root
uv run python src/main.py

# Run on a different port
APP_PORT=8081 uv run python src/main.py

# Add a new dependency
uv add <package-name>

# Add a development dependency
uv add --dev <package-name>
```
## API Endpoints

### Model Management

- `GET /models` - List all models with full details
- `POST /models` - Create a new model
- `GET /models/{model_id}` - Get model details
- `PUT /models/{model_id}` - Update a model
- `DELETE /models/{model_id}` - Delete a model
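
A hypothetical client-side example of registering a model with `httpx`. The request body fields mirror the Model Configuration Fields section below, but the exact schema is defined by the dataclass in `src/models/model.py` and may differ:

```python
# Illustrative only: register a model with the management API.
import httpx

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # HuggingFace model ID (example value)
    "tensor_parallel_size": 1,
    "max_model_len": 8192,
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
}

with httpx.Client(base_url="http://localhost:8000") as client:
    resp = client.post("/models", json=payload)
    resp.raise_for_status()
    print(resp.json())  # should include the allocated port and lifecycle status
```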
### OpenAI v1 Compatible - Implemented

- `GET /v1/models` - List models in OpenAI format
- `GET /v1/models/{model_id}` - Get a specific model in OpenAI format
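
Because these routes follow the OpenAI API shape, the official `openai` Python SDK can be pointed at the proxy. A usage sketch (authentication is not implemented yet, so the API key is a placeholder):

```python
from openai import OpenAI

# Point the SDK at the proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for model in client.models.list():
    print(model.id)
```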
### OpenAI v1 Compatible - Placeholders (TODO)

- `POST /v1/chat/completions` - Chat completions (supports streaming via the `stream` parameter)
- `POST /v1/completions` - Text completions (supports streaming via the `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings
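
These endpoints are placeholders, so the example below shows intended usage once they are implemented. Streaming is requested via the `stream` parameter rather than a separate endpoint; the model name is a placeholder for whatever was registered via `POST /models`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example; use a model registered with the proxy
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # the proxy is expected to relay SSE chunks from the vLLM instance
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```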
### OpenAI v1 Compatible - Not Applicable

- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation
### Utility

- `GET /` - API info and endpoints
- `GET /health` - Health check
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation
## Project Structure

```
src/
├── main.py                 # FastAPI application entry point
├── models/
│   └── model.py            # Model dataclass with vLLM configurations
├── services/
│   ├── model_manager.py    # Model lifecycle management
│   └── persistence.py      # JSON file persistence
├── endpoints/
│   ├── models.py           # Model CRUD endpoints
│   └── v1/                 # OpenAI v1 compatible endpoints
│       ├── models.py       # Models listing
│       ├── chat.py         # Chat completions
│       ├── completions.py  # Text completions
│       ├── embeddings.py   # Embeddings generation
│       └── misc.py         # Other v1 endpoints
└── data/                   # Persisted models (auto-created)
    └── models.json
```
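
`services/persistence.py` handles the JSON persistence; a minimal sketch of the approach, assuming plain `json` serialization to `data/models.json` (the actual module may differ):

```python
# Sketch only: persist the model registry to data/models.json and restore it on startup.
import json
from pathlib import Path

MODELS_FILE = Path("data/models.json")

def save_models(models: dict[str, dict]) -> None:
    MODELS_FILE.parent.mkdir(parents=True, exist_ok=True)  # data/ is auto-created
    MODELS_FILE.write_text(json.dumps(models, indent=2, default=str))  # default=str for datetimes

def load_models() -> dict[str, dict]:
    return json.loads(MODELS_FILE.read_text()) if MODELS_FILE.exists() else {}
```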
## Implementation Status

### ✅ Completed
- FastAPI application setup with CORS
- Model dataclass with vLLM parameters
- Model management endpoints (CRUD)
- OpenAI v1 compatible `/v1/models` endpoint
- Model persistence to JSON file
- Port allocation for models
- Environment variable configuration
- All OpenAI v1 endpoint placeholders with proper request/response models
- Streaming support structure (parameter-based, not separate endpoints)
- Swagger/ReDoc API documentation
### 🚧 High Priority TODO

- vLLM process spawning and management (see the sketch after this list)
- Implement actual chat completions logic (`/v1/chat/completions`)
- Implement actual text completions logic (`/v1/completions`)
- Server-Sent Events (SSE) streaming for both endpoints
- Request proxying to appropriate vLLM instance
- Model health monitoring and status updates
- Process cleanup on model deletion
- Automatic model loading on startup (spawn vLLM processes)
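
A minimal sketch of the process-spawning piece, assuming vLLM's bundled OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server`); flag names should be verified against the installed vLLM version, and the real `model_manager.py` also needs status tracking, health monitoring, and cleanup on deletion:

```python
# Sketch only: spawn one vLLM OpenAI-compatible server per registered model.
import subprocess
import sys

def spawn_vllm(model: str, port: int, tensor_parallel_size: int = 1) -> subprocess.Popen:
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),  # unique per-model port allocated by the proxy (8001+)
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    # Keep the Popen handle so the manager can monitor health and terminate on deletion.
    return subprocess.Popen(cmd)

# The proxy would then forward /v1/* requests for this model to http://localhost:<port>/v1/...
```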
### 🔄 Medium Priority TODO

- Embeddings endpoint implementation (`/v1/embeddings`)
- Load balancing for models with multiple instances
- Model configuration validation
- Error recovery and retry logic
- Graceful shutdown handling
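
For the graceful shutdown item above, a sketch using FastAPI's lifespan hook, assuming spawned vLLM processes are tracked in a registry (names here are illustrative, not the actual code):

```python
# Sketch only: terminate spawned vLLM processes when the proxy shuts down.
import subprocess
from contextlib import asynccontextmanager
from fastapi import FastAPI

# Hypothetical registry of spawned processes keyed by model ID;
# in the real code this would live in services/model_manager.py.
running_processes: dict[str, subprocess.Popen] = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # application serves requests
    for proc in running_processes.values():
        proc.terminate()
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()  # force-kill anything that did not exit in time

app = FastAPI(lifespan=lifespan)
```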
### 📊 Low Priority TODO
- Authentication/API keys
- Rate limiting
- Metrics and monitoring endpoints
- Content moderation endpoint
- Fine-tuning management (if applicable)
## Model Configuration Fields

The Model dataclass exposes the key vLLM launch parameters (a dataclass sketch follows this list):

- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences
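
An illustrative sketch of what `src/models/model.py` might contain; field names follow the list above, while the defaults and the bookkeeping fields (`port`, `status`, `created_at`) are assumptions, not the actual code:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Model:
    model: str                            # HuggingFace model ID, local path, or URL
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # pipeline parallelism
    max_model_len: int | None = None      # maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: str | None = None       # awq, gptq, etc.
    trust_remote_code: bool = False       # allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # maximum concurrent sequences
    port: int | None = None               # allocated by the proxy, starting at 8001
    status: str = "loading"               # loading, ready, error, unloading
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```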
## Important Notes

- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port starting from 8001 (see the sketch after this list)
- Server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks the lifecycle: loading, ready, error, unloading
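
A sketch of the port-allocation rule described above, assuming the lowest unused port at or above 8001 is handed out (the real allocator in the model manager may also verify the port is free at the OS level):

```python
# Sketch only: hand out the lowest unused port starting at 8001.
def allocate_port(used_ports: set[int], start: int = 8001) -> int:
    port = start
    while port in used_ports:
        port += 1
    used_ports.add(port)
    return port

used: set[int] = set()
print(allocate_port(used), allocate_port(used), allocate_port(used))  # 8001 8002 8003
```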