vLLM-Proxy
A REST API proxy that manages multiple vLLM instances, working around vLLM's limitation of loading only one model per process. The daemon exposes OpenAI-compatible endpoints and routes each request to the correct model instance in the background.
Features
- 🚀 Multiple Model Management: Run multiple vLLM models simultaneously
- 🔄 OpenAI Compatible: Drop-in replacement for OpenAI API v1 endpoints
- 💾 Persistent Configuration: Models persist across server restarts
- 🎯 Automatic Routing: Requests are automatically routed to the correct model instance
- 📊 RESTful API: Full CRUD operations for model management
- ⚡ Fast & Async: Built with FastAPI for high performance
Quick Start
Prerequisites
- Python 3.13+
- uv package manager
- CUDA-capable GPU (for running vLLM models)
Installation
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy
# Install dependencies
uv sync
Running the Server
# Start the proxy server
uv run python src/main.py
# Or run on a different port
APP_PORT=8081 uv run python src/main.py
The server will start on http://localhost:8000 by default.
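Because the proxy is OpenAI compatible, implemented endpoints such as /v1/models can also be queried with the official openai Python client. A minimal sketch, assuming the openai package is installed and no API key is enforced (authentication is still on the roadmap):
# list_models.py — query the proxy's OpenAI-compatible /v1/models endpoint
from openai import OpenAI
# Point the client at the proxy instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model in client.models.list().data:
    print(model.id)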
API Usage
Model Management
Create a Model
curl -X POST http://localhost:8000/models \
-H "Content-Type: application/json" \
-d '{
"name": "llama-3.2",
"model": "meta-llama/Llama-3.2-1B-Instruct",
"dtype": "float16",
"max_model_len": 4096
}'
List Models
# Full details (admin view)
curl http://localhost:8000/models
# OpenAI compatible format
curl http://localhost:8000/v1/models
Update a Model
curl -X PUT http://localhost:8000/models/{model_id} \
-H "Content-Type: application/json" \
-d '{
"max_model_len": 8192,
"gpu_memory_utilization": 0.8
}'
Delete a Model
curl -X DELETE http://localhost:8000/models/{model_id}
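The same management calls can be scripted from Python with the requests library. This is an illustrative sketch rather than project code, and it assumes the create response includes the new model's id, which the update and delete endpoints take in the URL path:
# manage_models.py — illustrative CRUD round-trip against the management API
import requests
BASE = "http://localhost:8000"
# Create a model (assumes the response body contains an "id" field)
created = requests.post(f"{BASE}/models", json={
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096,
}).json()
model_id = created["id"]
# Update it, then remove it again
requests.put(f"{BASE}/models/{model_id}", json={"max_model_len": 8192})
requests.delete(f"{BASE}/models/{model_id}")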
Chat Completions (Placeholder - TODO)
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": true
}'
Text Completions (Placeholder - TODO)
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.7
}'
# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"stream": true
}'
Embeddings (Placeholder - TODO)
curl -X POST http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-model",
"input": "The food was delicious and the waiter was friendly."
}'
Configuration
Environment Variables
Create a .env file in the src directory:
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000
# Data directory for persistence
DATA_DIR=./data
# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
Model Parameters
When creating a model, you can configure all vLLM parameters:
| Parameter | Description | Default |
|---|---|---|
| model | HuggingFace model ID, local path, or URL | Required |
| tensor_parallel_size | Number of GPUs for tensor parallelism | 1 |
| pipeline_parallel_size | Number of GPUs for pipeline parallelism | 1 |
| max_model_len | Maximum sequence length | Auto |
| dtype | Data type (auto, float16, bfloat16, float32) | auto |
| quantization | Quantization method (awq, gptq, etc.) | None |
| trust_remote_code | Allow remote code execution | false |
| gpu_memory_utilization | GPU memory fraction to use (0-1) | 0.9 |
| max_num_seqs | Maximum concurrent sequences | 256 |
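As an example, a create request combining several of the parameters above might look like this (an illustrative requests sketch; the model name and quantization choice are placeholders, and field names are assumed to match the table):
# create_quantized_model.py — example payload combining several vLLM parameters
import requests
payload = {
    "name": "my-awq-model",
    "model": "your-org/your-awq-model",   # placeholder HuggingFace model ID
    "dtype": "auto",
    "quantization": "awq",                # assumes an AWQ-quantized checkpoint
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.8,
    "max_num_seqs": 128,
}
requests.post("http://localhost:8000/models", json=payload)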
Architecture
vLLM-Proxy
│
├── API Layer (FastAPI)
│ ├── /v1/* endpoints (OpenAI compatible)
│ └── /models/* endpoints (Management)
│
├── Model Manager
│ ├── Lifecycle management
│ ├── Port allocation
│ └── Persistence layer
│
└── vLLM Instances (Coming Soon)
├── Model A (port 8001)
├── Model B (port 8002)
└── Model C (port 8003)
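Request proxying to the per-model vLLM instances is still on the roadmap, but the routing idea the diagram implies is simple: look up the port allocated to the requested model and forward the request body unchanged. A minimal sketch of that idea (not the project's actual implementation), assuming the httpx async client:
# routing sketch — map a model name to its vLLM instance and forward the request
import httpx
# Hypothetical allocation kept by the Model Manager (model name -> local port)
MODEL_PORTS = {"llama-3.2": 8001, "mistral-7b": 8002}
async def forward_chat(payload: dict) -> dict:
    port = MODEL_PORTS[payload["model"]]
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"http://localhost:{port}/v1/chat/completions", json=payload
        )
        resp.raise_for_status()
        return resp.json()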
API Documentation
Once the server is running, you can access the interactive API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Development
Project Structure
src/
├── main.py # FastAPI application
├── models/ # Data models
│ └── model.py # Model dataclass with vLLM configurations
├── services/ # Business logic
│ ├── model_manager.py # Model lifecycle management
│ └── persistence.py # JSON file persistence
├── endpoints/ # API endpoints
│ ├── models.py # Model CRUD operations
│ └── v1/ # OpenAI v1 compatible endpoints
│ ├── models.py # Models listing
│ ├── chat.py # Chat completions (placeholder)
│ ├── completions.py # Text completions (placeholder)
│ ├── embeddings.py # Embeddings (placeholder)
│ └── misc.py # Other v1 endpoints
└── data/ # Persistent storage
└── models.json # Saved model configurations
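Model configurations are persisted to models.json by the persistence layer and reloaded on restart. The exact schema is defined in src/models/model.py; purely for illustration, and assuming the default DATA_DIR of ./data, reading the saved file back could look like this:
# load_saved_models.py — illustrative read of the persisted configuration
import json
from pathlib import Path
models_file = Path("data") / "models.json"
if models_file.exists():
    saved = json.loads(models_file.read_text())
    print(f"{len(saved)} model configuration(s) on disk")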
Adding Dependencies
# Add a runtime dependency
uv add package-name
# Add a development dependency
uv add --dev package-name
Roadmap
✅ Completed
- Model CRUD operations
- OpenAI v1/models endpoint
- Model persistence
- All OpenAI v1 endpoint placeholders
- Streaming support structure
- Interactive API documentation
🚧 High Priority
- vLLM process management
- Chat completions implementation
- Text completions implementation
- Server-Sent Events streaming
- Request proxying to vLLM instances
🔄 Medium Priority
- Embeddings endpoint
- Model health monitoring
- Load balancing
- Error recovery
📊 Low Priority
- Authentication/API keys
- Rate limiting
- Metrics and monitoring
- Content moderation
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.