# vLLM-Proxy

A REST API proxy that manages multiple vLLM instances, working around vLLM's limitation of serving only one model per process. The daemon exposes OpenAI-compatible endpoints and routes each request to the correct model instance in the background.

## Features

- 🚀 **Multiple Model Management**: Run multiple vLLM models simultaneously
- 🔄 **OpenAI Compatible**: Drop-in replacement for OpenAI API v1 endpoints
- 💾 **Persistent Configuration**: Models persist across server restarts
- 🎯 **Automatic Routing**: Requests are automatically routed to the correct model instance
- 📊 **RESTful API**: Full CRUD operations for model management
- ⚡ **Fast & Async**: Built with FastAPI for high performance

## Quick Start

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) package manager
- CUDA-capable GPU (for running vLLM models)

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy

# Install dependencies
uv sync
```

### Running the Server

```bash
# Start the proxy server
uv run python src/main.py

# Or run on a different port
APP_PORT=8081 uv run python src/main.py
```

The server starts on `http://localhost:8000` by default.

## API Usage

### Model Management

#### Create a Model

```bash
curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096
  }'
```

#### List Models

```bash
# Full details (admin view)
curl http://localhost:8000/models

# OpenAI-compatible format
curl http://localhost:8000/v1/models
```

#### Update a Model

```bash
curl -X PUT http://localhost:8000/models/{model_id} \
  -H "Content-Type: application/json" \
  -d '{
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.8
  }'
```

#### Delete a Model

```bash
curl -X DELETE http://localhost:8000/models/{model_id}
```

### Chat Completions (Placeholder - TODO)

```bash
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
```

### Text Completions (Placeholder - TODO)

```bash
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "stream": true
  }'
```

### Embeddings (Placeholder - TODO)

```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-model",
    "input": "The food was delicious and the waiter was friendly."
  }'
```
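### Using the OpenAI Python SDK

Because the `/v1` endpoints mirror the OpenAI API, existing OpenAI clients should work against the proxy once the completion endpoints are implemented. The snippet below is a minimal sketch using the official `openai` Python package; the base URL, model name, and placeholder API key are assumptions for illustration, since the proxy does not enforce authentication yet.

```python
# Hypothetical client usage once /v1/chat/completions is implemented.
# Assumes the proxy runs on localhost:8000 and a model named "llama-3.2"
# was created via POST /models.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # placeholder; authentication is still on the roadmap
)

response = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```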
## Configuration

### Environment Variables

Create a `.env` file in the `src` directory:

```env
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000

# Data directory for persistence
DATA_DIR=./data

# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
```

### Model Parameters

When creating a model, you can configure all vLLM parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `model` | HuggingFace model ID, local path, or URL | Required |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | 1 |
| `pipeline_parallel_size` | Number of GPUs for pipeline parallelism | 1 |
| `max_model_len` | Maximum sequence length | Auto |
| `dtype` | Data type (auto, float16, bfloat16, float32) | auto |
| `quantization` | Quantization method (awq, gptq, etc.) | None |
| `trust_remote_code` | Allow remote code execution | false |
| `gpu_memory_utilization` | GPU memory fraction to use (0-1) | 0.9 |
| `max_num_seqs` | Maximum concurrent sequences | 256 |

## Architecture

```
vLLM-Proxy
│
├── API Layer (FastAPI)
│   ├── /v1/* endpoints (OpenAI compatible)
│   └── /models/* endpoints (Management)
│
├── Model Manager
│   ├── Lifecycle management
│   ├── Port allocation
│   └── Persistence layer
│
└── vLLM Instances (Coming Soon)
    ├── Model A (port 8001)
    ├── Model B (port 8002)
    └── Model C (port 8003)
```

## API Documentation

Once the server is running, you can access the interactive API documentation at:

- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`

## Development

### Project Structure

```
src/
├── main.py                # FastAPI application
├── models/                # Data models
│   └── model.py           # Model dataclass with vLLM configurations
├── services/              # Business logic
│   ├── model_manager.py   # Model lifecycle management
│   └── persistence.py     # JSON file persistence
├── endpoints/             # API endpoints
│   ├── models.py          # Model CRUD operations
│   └── v1/                # OpenAI v1 compatible endpoints
│       ├── models.py      # Models listing
│       ├── chat.py        # Chat completions (placeholder)
│       ├── completions.py # Text completions (placeholder)
│       ├── embeddings.py  # Embeddings (placeholder)
│       └── misc.py        # Other v1 endpoints
└── data/                  # Persistent storage
    └── models.json        # Saved model configurations
```

### Adding Dependencies

```bash
# Add a runtime dependency
uv add package-name

# Add a development dependency
uv add --dev package-name
```

## Roadmap

### ✅ Completed

- [x] Model CRUD operations
- [x] OpenAI v1/models endpoint
- [x] Model persistence
- [x] All OpenAI v1 endpoint placeholders
- [x] Streaming support structure
- [x] Interactive API documentation

### 🚧 High Priority

- [ ] vLLM process management
- [ ] Chat completions implementation
- [ ] Text completions implementation
- [ ] Server-Sent Events streaming
- [ ] Request proxying to vLLM instances

### 🔄 Medium Priority

- [ ] Embeddings endpoint
- [ ] Model health monitoring
- [ ] Load balancing
- [ ] Error recovery

### 📊 Low Priority

- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring
- [ ] Content moderation

## License

MIT

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.