first steps

2025-09-09 06:48:51 +02:00
parent 6e2b456dbb
commit 838f1c737e
21 changed files with 1411 additions and 0 deletions

CLAUDE.md (new file, 149 lines)

@@ -0,0 +1,149 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a vLLM proxy REST API that works around vLLM's limitation of serving only one model per process. The proxy runs as a daemon that manages multiple vLLM instances and routes each incoming request to the appropriate instance.
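The routing idea, as a minimal sketch (httpx-based forwarding with a hypothetical model-to-port registry; not the project's actual code):

```python
# Minimal sketch of the proxy's routing idea (hypothetical names, not the
# project's actual code): each model runs in its own vLLM process on its own
# port, and the proxy forwards OpenAI-style requests to the matching instance.
import httpx

# Hypothetical registry: model name -> port of the vLLM instance serving it.
MODEL_PORTS: dict[str, int] = {"example-model-a": 8001, "example-model-b": 8002}

async def forward_chat_completion(payload: dict) -> dict:
    """Route a /v1/chat/completions request to the instance serving payload['model']."""
    port = MODEL_PORTS[payload["model"]]
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"http://127.0.0.1:{port}/v1/chat/completions", json=payload, timeout=None
        )
        resp.raise_for_status()
        return resp.json()
```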
## Key Architecture Decisions
- **Main entry point**: `src/main.py`
- **Package manager**: uv (not pip or poetry)
- **Python version**: 3.13
- **Configuration**: runtime settings are read from a `.env` file (see the sketch after this list)
- **Source organization**: All source files go in `src/` directory
- **Endpoint structure**: Endpoints are organized as separate modules
- **Data persistence**: Models saved to `data/models.json` (configurable via `DATA_DIR`)
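A sketch of how the `.env`-driven settings might be loaded (assumes `python-dotenv`; the variable names `APP_PORT` and `DATA_DIR` come from this file, the defaults are illustrative):

```python
# Sketch of .env-driven configuration (assumes python-dotenv; APP_PORT and
# DATA_DIR are the variables mentioned in this file, defaults are illustrative).
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

APP_PORT = int(os.getenv("APP_PORT", "8000"))    # proxy listen port
DATA_DIR = Path(os.getenv("DATA_DIR", "data"))   # where models.json is persisted
MODELS_FILE = DATA_DIR / "models.json"
```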
## Development Commands
```bash
# Install dependencies
uv sync
# Run the application from project root
uv run python src/main.py
# Run on different port
APP_PORT=8081 uv run python src/main.py
# Add a new dependency
uv add <package-name>
# Add a development dependency
uv add --dev <package-name>
```
## API Endpoints
### Model Management
- `GET /models` - List all models with full details
- `POST /models` - Create a new model (usage sketch after the endpoint lists)
- `GET /models/{model_id}` - Get model details
- `PUT /models/{model_id}` - Update model
- `DELETE /models/{model_id}` - Delete model
### OpenAI v1 Compatible - Implemented
- `GET /v1/models` - List models in OpenAI format
- `GET /v1/models/{model_id}` - Get specific model in OpenAI format
### OpenAI v1 Compatible - Placeholders (TODO)
- `POST /v1/chat/completions` - Chat completions (supports streaming via `stream` parameter)
- `POST /v1/completions` - Text completions (supports streaming via `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings
### OpenAI v1 Compatible - Not Applicable
- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation
### Utility
- `GET /` - API info and endpoints
- `GET /health` - Health check
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation
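Illustrative client calls against the endpoints above (the request body and the `id` field in the response are assumptions based on the rest of this file, not a confirmed schema):

```python
# Illustrative calls against the proxy (request/response fields are assumptions).
import httpx

BASE = "http://localhost:8000"

# Register a model; field names follow the "Model Configuration Fields" section below.
created = httpx.post(f"{BASE}/models", json={
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
}).json()

# List models in OpenAI-compatible format.
print(httpx.get(f"{BASE}/v1/models").json())

# Inspect and remove the model again (assumes the create response carries an "id").
model_id = created["id"]
print(httpx.get(f"{BASE}/models/{model_id}").json())
httpx.delete(f"{BASE}/models/{model_id}")
```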
## Project Structure
```
src/
├── main.py # FastAPI application entry point
├── models/
│ └── model.py # Model dataclass with vLLM configurations
├── services/
│ ├── model_manager.py # Model lifecycle management
│ └── persistence.py # JSON file persistence
├── endpoints/
│ ├── models.py # Model CRUD endpoints
│ └── v1/ # OpenAI v1 compatible endpoints
│ ├── models.py # Models listing
│ ├── chat.py # Chat completions
│ ├── completions.py # Text completions
│ ├── embeddings.py # Embeddings generation
│ └── misc.py # Other v1 endpoints
└── data/ # Persisted models (auto-created)
└── models.json
```
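A minimal sketch of what `services/persistence.py` could look like (illustrative only; the actual module may differ):

```python
# Sketch of JSON persistence for the model registry (illustrative; the real
# services/persistence.py may differ). The data directory is auto-created.
import json
from pathlib import Path

MODELS_FILE = Path("data/models.json")  # configurable via DATA_DIR in the real app

def save_models(models: list[dict]) -> None:
    MODELS_FILE.parent.mkdir(parents=True, exist_ok=True)
    MODELS_FILE.write_text(json.dumps(models, indent=2, default=str))

def load_models() -> list[dict]:
    return json.loads(MODELS_FILE.read_text()) if MODELS_FILE.exists() else []
```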
## Implementation Status
### ✅ Completed
- [x] FastAPI application setup with CORS
- [x] Model dataclass with vLLM parameters
- [x] Model management endpoints (CRUD)
- [x] OpenAI v1 compatible `/v1/models` endpoint
- [x] Model persistence to JSON file
- [x] Port allocation for models
- [x] Environment variable configuration
- [x] All OpenAI v1 endpoint placeholders with proper request/response models
- [x] Streaming support structure (parameter-based, not separate endpoints)
- [x] Swagger/ReDoc API documentation
### 🚧 High Priority TODO
- [ ] vLLM process spawning and management (see the sketch after this list)
- [ ] Implement actual chat completions logic (`/v1/chat/completions`)
- [ ] Implement actual text completions logic (`/v1/completions`)
- [ ] Server-Sent Events (SSE) streaming for both endpoints
- [ ] Request proxying to appropriate vLLM instance
- [ ] Model health monitoring and status updates
- [ ] Process cleanup on model deletion
- [ ] Automatic model loading on startup (spawn vLLM processes)
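One possible shape for the process-spawning item above (a sketch, assuming the `vllm serve` CLI is available; not the project's implementation):

```python
# Sketch of spawning one vLLM OpenAI-compatible server per model (assumes the
# `vllm serve` CLI; not the project's implementation). Keep the Popen handle so
# the process can be health-checked and terminated when the model is deleted.
import subprocess

def spawn_vllm(model: str, port: int, gpu_memory_utilization: float = 0.9) -> subprocess.Popen:
    cmd = [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    return subprocess.Popen(cmd)
```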
### 🔄 Medium Priority TODO
- [ ] Embeddings endpoint implementation (`/v1/embeddings`)
- [ ] Load balancing for models with multiple instances
- [ ] Model configuration validation
- [ ] Error recovery and retry logic
- [ ] Graceful shutdown handling
### 📊 Low Priority TODO
- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring endpoints
- [ ] Content moderation endpoint
- [ ] Fine-tuning management (if applicable)
## Model Configuration Fields
The Model dataclass captures the core vLLM engine parameters (sketched after this list):
- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences
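A sketch of the dataclass these fields imply (defaults are assumptions, not the project's actual values):

```python
# Sketch of the Model dataclass implied by the fields above (defaults are assumptions).
from dataclasses import dataclass

@dataclass
class Model:
    model: str                            # HuggingFace model ID, local path, or URL
    tensor_parallel_size: int = 1         # GPU parallelism
    pipeline_parallel_size: int = 1       # pipeline parallelism
    max_model_len: int | None = None      # maximum sequence length
    dtype: str = "auto"                   # auto, float16, bfloat16, float32
    quantization: str | None = None       # awq, gptq, ...
    trust_remote_code: bool = False       # allow remote code execution
    gpu_memory_utilization: float = 0.9   # GPU memory fraction (0-1)
    max_num_seqs: int = 256               # maximum concurrent sequences
```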
## Important Notes
- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port starting from 8001 (see the sketch below)
- Server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks lifecycle: loading, ready, error, unloading
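A sketch of the port allocation rule described above (ports start at 8001; the logic is illustrative):

```python
# Sketch of allocating a unique port per model, starting at 8001 (illustrative).
def allocate_port(used_ports: set[int], base: int = 8001) -> int:
    port = base
    while port in used_ports:
        port += 1
    used_ports.add(port)
    return port
```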