# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a vLLM proxy REST API that works around vLLM's limitation of loading only one model at a time per process. The proxy acts as a daemon that manages multiple vLLM instances and routes each request to the appropriate instance.
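
The intended usage flow, as a rough sketch (the exact request/response schemas live in the endpoint modules; the JSON body below is illustrative, not authoritative, with field names taken from "Model Configuration Fields" further down):

```bash
# Register a model with the proxy (POST /models); the model name and
# settings here are placeholder values
curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "gpu_memory_utilization": 0.9}'

# Confirm the model shows up in OpenAI-compatible form
curl http://localhost:8000/v1/models
```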

## Key Architecture Decisions

- **Main entry point**: `src/main.py`
- **Package manager**: uv (not pip or poetry)
- **Python version**: 3.13
- **Configuration**: `.env` file for main configuration (see the sketch after this list)
- **Source organization**: All source files go in `src/` directory
- **Endpoint structure**: Endpoints are organized as separate modules
- **Data persistence**: Models saved to `data/models.json` (configurable via `DATA_DIR`)
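
A minimal `.env` sketch, assuming the app reads it at startup. `APP_PORT` and `DATA_DIR` are the two variables this file documents; anything else would live in the endpoint/service modules:

```bash
# .env - main configuration
APP_PORT=8000   # proxy listen port (default 8000)
DATA_DIR=data   # directory where models.json is persisted
```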

## Development Commands

```bash
# Install dependencies
uv sync

# Run the application from the project root
uv run python src/main.py

# Run on a different port
APP_PORT=8081 uv run python src/main.py

# Add a new dependency
uv add <package-name>

# Add a development dependency
uv add --dev <package-name>
```

## API Endpoints

### Model Management
- `GET /models` - List all models with full details
- `POST /models` - Create a new model
- `GET /models/{model_id}` - Get model details (id-based routes are sketched after this list)
- `PUT /models/{model_id}` - Update model
- `DELETE /models/{model_id}` - Delete model
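
A hedged sketch of the id-based routes. `abc123` is a placeholder id and the PUT body is illustrative; the real schema comes from the Model dataclass:

```bash
# Inspect, update, and remove a registered model
curl http://localhost:8000/models/abc123
curl -X PUT http://localhost:8000/models/abc123 \
  -H "Content-Type: application/json" \
  -d '{"max_model_len": 4096}'
curl -X DELETE http://localhost:8000/models/abc123
```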

### OpenAI v1 Compatible - Implemented
- `GET /v1/models` - List models in OpenAI format
- `GET /v1/models/{model_id}` - Get specific model in OpenAI format

### OpenAI v1 Compatible - Placeholders (TODO)
- `POST /v1/chat/completions` - Chat completions (supports streaming via `stream` parameter; see the example after this list)
- `POST /v1/completions` - Text completions (supports streaming via `stream` parameter)
- `POST /v1/embeddings` - Generate embeddings
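
Once implemented, a streaming request would look roughly like this. The payload shape follows the OpenAI chat API; whether the proxy mirrors it exactly is still TODO:

```bash
# -N disables curl's buffering; with "stream": true the endpoint is
# expected to emit Server-Sent Events (see the SSE item in the TODO list)
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": true
      }'
```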

### OpenAI v1 Compatible - Not Applicable
- `/v1/images/*` - Image generation (vLLM is text-only)
- `/v1/audio/*` - Audio endpoints (vLLM is text-only)
- `/v1/assistants` - Assistants API (beta feature)
- `/v1/fine_tuning/*` - Fine-tuning management
- `/v1/files` - File management
- `/v1/moderations` - Content moderation

### Utility
- `GET /` - API info and endpoints
- `GET /health` - Health check (smoke tests after this list)
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation
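
For quick smoke tests (both endpoints are already implemented):

```bash
curl http://localhost:8000/health   # health check
curl http://localhost:8000/         # API info and endpoint list
```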

## Project Structure

```
src/
├── main.py                # FastAPI application entry point
├── models/
│   └── model.py           # Model dataclass with vLLM configurations
├── services/
│   ├── model_manager.py   # Model lifecycle management
│   └── persistence.py     # JSON file persistence
├── endpoints/
│   ├── models.py          # Model CRUD endpoints
│   └── v1/                # OpenAI v1 compatible endpoints
│       ├── models.py      # Models listing
│       ├── chat.py        # Chat completions
│       ├── completions.py # Text completions
│       ├── embeddings.py  # Embeddings generation
│       └── misc.py        # Other v1 endpoints
└── data/                  # Persisted models (auto-created)
    └── models.json
```

## Implementation Status

### ✅ Completed
- [x] FastAPI application setup with CORS
- [x] Model dataclass with vLLM parameters
- [x] Model management endpoints (CRUD)
- [x] OpenAI v1 compatible `/v1/models` endpoint
- [x] Model persistence to JSON file
- [x] Port allocation for models
- [x] Environment variable configuration
- [x] All OpenAI v1 endpoint placeholders with proper request/response models
- [x] Streaming support structure (parameter-based, not separate endpoints)
- [x] Swagger/ReDoc API documentation

### 🚧 High Priority TODO
- [ ] vLLM process spawning and management
- [ ] Implement actual chat completions logic (`/v1/chat/completions`)
- [ ] Implement actual text completions logic (`/v1/completions`)
- [ ] Server-Sent Events (SSE) streaming for both endpoints (wire format sketched after this list)
- [ ] Request proxying to appropriate vLLM instance
- [ ] Model health monitoring and status updates
- [ ] Process cleanup on model deletion
- [ ] Automatic model loading on startup (spawn vLLM processes)
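
For the SSE item above: the OpenAI streaming convention, which a compatible implementation would presumably follow, frames each chunk as a `data:` line separated by blank lines and terminates with a `[DONE]` sentinel, roughly:

```
data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hel"}}]}

data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "lo"}}]}

data: [DONE]
```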

### 🔄 Medium Priority TODO
- [ ] Embeddings endpoint implementation (`/v1/embeddings`)
- [ ] Load balancing for models with multiple instances
- [ ] Model configuration validation
- [ ] Error recovery and retry logic
- [ ] Graceful shutdown handling

### 📊 Low Priority TODO
- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring endpoints
- [ ] Content moderation endpoint
- [ ] Fine-tuning management (if applicable)

## Model Configuration Fields

The Model dataclass includes all vLLM parameters (see the flag-mapping sketch after this list):
- `model`: HuggingFace model ID, local path, or URL
- `tensor_parallel_size`: GPU parallelism
- `pipeline_parallel_size`: Pipeline parallelism
- `max_model_len`: Maximum sequence length
- `dtype`: Data type (auto, float16, bfloat16, float32)
- `quantization`: Quantization method (awq, gptq, etc.)
- `trust_remote_code`: Allow remote code execution
- `gpu_memory_utilization`: GPU memory fraction (0-1)
- `max_num_seqs`: Maximum concurrent sequences
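
As a hedged illustration of how these fields map onto vLLM's OpenAI-compatible server: the actual spawn logic is still TODO, the flag names are vLLM's, and the values are made up:

```bash
# The kind of command the proxy would spawn for one model on its
# allocated port (8001+); each field above maps to a CLI flag
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8001 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --dtype auto \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 256
```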

## Important Notes

- Models persist across server restarts in `data/models.json`
- Each model is allocated a unique port starting from 8001
- Server runs on port 8000 by default (configurable via `APP_PORT`)
- All datetime objects are timezone-aware (UTC)
- Model status tracks lifecycle: loading, ready, error, unloading