first steps

2025-09-09 06:48:51 +02:00
parent 6e2b456dbb
commit 838f1c737e
21 changed files with 1411 additions and 0 deletions

CLAUDE.md (new file, 149 lines)

@@ -0,0 +1,149 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a vLLM proxy REST API that works around vLLM's limitation of serving only one model per process. The proxy runs as a daemon that manages multiple vLLM instances and routes each incoming request to the appropriate instance.
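The routing idea, as a minimal sketch (httpx-based forwarding with a hypothetical model-to-port registry; not the project's actual code):

```python
# Minimal sketch of the proxy's routing idea (hypothetical names, not the
# project's actual code): each model runs in its own vLLM process on its own
# port, and the proxy forwards OpenAI-style requests to the matching instance.
import httpx

# Hypothetical registry: model name -> port of the vLLM instance serving it.
MODEL_PORTS: dict[str, int] = {"example-model-a": 8001, "example-model-b": 8002}

async def forward_chat_completion(payload: dict) -> dict:
    """Route a /v1/chat/completions request to the instance serving payload['model']."""
    port = MODEL_PORTS[payload["model"]]
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"http://127.0.0.1:{port}/v1/chat/completions", json=payload, timeout=None
        )
        resp.raise_for_status()
        return resp.json()
```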
## Key Architecture Decisions
- **Main entry point**: `src/main.py`
- **Package manager**: uv (not pip or poetry)
- **Python version**: 3.13
- **Configuration**: runtime settings are read from a `.env` file (see the sketch after this list)
- **Source organization**: All source files go in `src/` directory
- **Endpoint structure**: Endpoints are organized as separate modules
- **Data persistence**: Models saved to `data/models.json` (configurable via `DATA_DIR`)
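A sketch of how the `.env`-driven settings might be loaded (assumes `python-dotenv`; the variable names `APP_PORT` and `DATA_DIR` come from this file, the defaults are illustrative):

```python
# Sketch of .env-driven configuration (assumes python-dotenv; APP_PORT and
# DATA_DIR are the variables mentioned in this file, defaults are illustrative).
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

APP_PORT = int(os.getenv("APP_PORT", "8000"))    # proxy listen port
DATA_DIR = Path(os.getenv("DATA_DIR", "data"))   # where models.json is persisted
MODELS_FILE = DATA_DIR / "models.json"
```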
## Development Commands
```bash
# Install dependencies
uv sync
# Run the application from project root
uv run python src/main.py
# Run on different port
APP_PORT=8081 uv run python src/main.py
# Add a new dependency
uv add <package-name>
# Add a development dependency
uv add --dev <package-name>
```
## API Endpoints
### Model Management
- `GET /models` - List all models with full details
- `POST /models` - Create a new model (usage sketch after the endpoint lists)
- `GET /models/{model_id}` - Get model details
- `PUT /models/{model_id}` - Update model
- `DELETE /models/{model_id}` - Delete model
### OpenAI v1 Compatible - Implemented
- `GET /v1/models` - List models in OpenAI format
- `GET /v1/models/{model_id}` - Get specific model in OpenAI format
### OpenAI v1 Compatible - Placeholders (TODO)
- `POST /v1/chat/completions` - Chat completions (supports streaming via `stream` parameter)
- `POST /v1/completions` - Text completions (supports streaming via `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings
### OpenAI v1 Compatible - Not Applicable
- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation
### Utility
- `GET /` - API info and endpoints
- `GET /health` - Health check
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation
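Illustrative client calls against the endpoints above (the request body and the `id` field in the response are assumptions based on the rest of this file, not a confirmed schema):

```python
# Illustrative calls against the proxy (request/response fields are assumptions).
import httpx

BASE = "http://localhost:8000"

# Register a model; field names follow the "Model Configuration Fields" section below.
created = httpx.post(f"{BASE}/models", json={
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
}).json()

# List models in OpenAI-compatible format.
print(httpx.get(f"{BASE}/v1/models").json())

# Inspect and remove the model again (assumes the create response carries an "id").
model_id = created["id"]
print(httpx.get(f"{BASE}/models/{model_id}").json())
httpx.delete(f"{BASE}/models/{model_id}")
```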
## Project Structure
```
src/
├── main.py # FastAPI application entry point
├── models/
│ └── model.py # Model dataclass with vLLM configurations
├── services/
│ ├── model_manager.py # Model lifecycle management
│ └── persistence.py # JSON file persistence
├── endpoints/
│ ├── models.py # Model CRUD endpoints
│ └── v1/ # OpenAI v1 compatible endpoints
│ ├── models.py # Models listing
│ ├── chat.py # Chat completions
│ ├── completions.py # Text completions
│ ├── embeddings.py # Embeddings generation
│ └── misc.py # Other v1 endpoints
└── data/ # Persisted models (auto-created)
└── models.json
```
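A minimal sketch of what `services/persistence.py` could look like (illustrative only; the actual module may differ):

```python
# Sketch of JSON persistence for the model registry (illustrative; the real
# services/persistence.py may differ). The data directory is auto-created.
import json
from pathlib import Path

MODELS_FILE = Path("data/models.json")  # configurable via DATA_DIR in the real app

def save_models(models: list[dict]) -> None:
    MODELS_FILE.parent.mkdir(parents=True, exist_ok=True)
    MODELS_FILE.write_text(json.dumps(models, indent=2, default=str))

def load_models() -> list[dict]:
    return json.loads(MODELS_FILE.read_text()) if MODELS_FILE.exists() else []
```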
## Implementation Status
### ✅ Completed
- [x] FastAPI application setup with CORS
- [x] Model dataclass with vLLM parameters
- [x] Model management endpoints (CRUD)
- [x] OpenAI v1 compatible `/v1/models` endpoint
- [x] Model persistence to JSON file
- [x] Port allocation for models
- [x] Environment variable configuration
- [x] All OpenAI v1 endpoint placeholders with proper request/response models
- [x] Streaming support structure (parameter-based, not separate endpoints)
- [x] Swagger/ReDoc API documentation
### 🚧 High Priority TODO
- [ ] vLLM process spawning and management (see the sketch after this list)
- [ ] Implement actual chat completions logic (`/v1/chat/completions`)
- [ ] Implement actual text completions logic (`/v1/completions`)
- [ ] Server-Sent Events (SSE) streaming for both endpoints
- [ ] Request proxying to appropriate vLLM instance
- [ ] Model health monitoring and status updates
- [ ] Process cleanup on model deletion
- [ ] Automatic model loading on startup (spawn vLLM processes)
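One possible shape for the process-spawning item above (a sketch, assuming the `vllm serve` CLI is available; not the project's implementation):

```python
# Sketch of spawning one vLLM OpenAI-compatible server per model (assumes the
# `vllm serve` CLI; not the project's implementation). Keep the Popen handle so
# the process can be health-checked and terminated when the model is deleted.
import subprocess

def spawn_vllm(model: str, port: int, gpu_memory_utilization: float = 0.9) -> subprocess.Popen:
    cmd = [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    return subprocess.Popen(cmd)
```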
### 🔄 Medium Priority TODO
- [ ] Embeddings endpoint implementation (`/v1/embeddings`)
- [ ] Load balancing for models with multiple instances
- [ ] Model configuration validation
- [ ] Error recovery and retry logic
- [ ] Graceful shutdown handling
### 📊 Low Priority TODO
- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring endpoints
- [ ] Content moderation endpoint
- [ ] Fine-tuning management (if applicable)
## Model Configuration Fields
The Model dataclass captures the core vLLM engine parameters (sketched after this list):
- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences
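A sketch of the dataclass these fields imply (defaults are assumptions, not the project's actual values):

```python
# Sketch of the Model dataclass implied by the fields above (defaults are assumptions).
from dataclasses import dataclass

@dataclass
class Model:
    model: str                            # HuggingFace model ID, local path, or URL
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # pipeline parallelism
    max_model_len: int | None = None      # maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: str | None = None       # awq, gptq, ...
    trust_remote_code: bool = False       # allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # maximum concurrent sequences
```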
## Important Notes
- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port starting from 8001 (see the sketch below)
- Server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks lifecycle: loading, ready, error, unloading
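A sketch of the port allocation rule described above (ports start at 8001; the logic is illustrative):

```python
# Sketch of allocating a unique port per model, starting at 8001 (illustrative).
def allocate_port(used_ports: set[int], base: int = 8001) -> int:
    port = base
    while port in used_ports:
        port += 1
    used_ports.add(port)
    return port
```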