# vLLM-Proxy

A REST API proxy that manages multiple vLLM instances, working around vLLM's limitation of loading only one model per process. The daemon provides OpenAI-compatible endpoints while managing multiple model instances in the background.

## Features

- 🚀 **Multiple Model Management**: Run multiple vLLM models simultaneously
- 🔄 **OpenAI Compatible**: Drop-in replacement for OpenAI API v1 endpoints
- 💾 **Persistent Configuration**: Models persist across server restarts
- 🎯 **Automatic Routing**: Requests are automatically routed to the correct model instance
- 📊 **RESTful API**: Full CRUD operations for model management
- ⚡ **Fast & Async**: Built with FastAPI for high performance

## Quick Start

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) package manager
- CUDA-capable GPU (for running vLLM models)

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy

# Install dependencies
uv sync
```

### Running the Server

```bash
# Start the proxy server
uv run python src/main.py

# Or run on a different port
APP_PORT=8081 uv run python src/main.py
```

The server will start on `http://localhost:8000` by default.
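
To confirm the proxy is up, you can query the OpenAI-compatible model listing endpoint. A quick check from Python using only the standard library (adjust host and port if you changed them):

```python
import json
import urllib.request

# List the models currently registered with the proxy (OpenAI-compatible endpoint)
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```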

## API Usage

### Model Management

#### Create a Model

```bash
curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096
  }'
```
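
The same request can be issued from Python. A minimal sketch using the third-party `requests` library (not a project dependency; the field values mirror the curl example above):

```python
import requests

# Register a new model instance with the proxy
payload = {
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096,
}
resp = requests.post("http://localhost:8000/models", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```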

#### List Models

```bash
# Full details (admin view)
curl http://localhost:8000/models

# OpenAI compatible format
curl http://localhost:8000/v1/models
```

#### Update a Model

```bash
curl -X PUT http://localhost:8000/models/{model_id} \
  -H "Content-Type: application/json" \
  -d '{
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.8
  }'
```

#### Delete a Model

```bash
curl -X DELETE http://localhost:8000/models/{model_id}
```

### Chat Completions (Placeholder - TODO)

```bash
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
```
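
Since the proxy aims to be a drop-in replacement for the OpenAI API, the official `openai` Python client should also work against these endpoints once they are implemented. A hedged sketch (the client is not a project dependency, and the placeholder API key assumes no authentication is configured):

```python
from openai import OpenAI

# Point the official OpenAI client at the proxy's /v1 endpoints
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming chat completion
response = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming chat completion (once SSE streaming is implemented)
stream = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```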

### Text Completions (Placeholder - TODO)

```bash
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "stream": true
  }'
```

### Embeddings (Placeholder - TODO)

```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-model",
    "input": "The food was delicious and the waiter was friendly."
  }'
```
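
Once implemented, embeddings should likewise be reachable through the `openai` client (hedged sketch; the model name is a placeholder for whatever embedding model you register):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Request an embedding vector from a registered embedding model (placeholder name)
result = client.embeddings.create(
    model="text-embedding-model",
    input="The food was delicious and the waiter was friendly.",
)
print(len(result.data[0].embedding))  # dimensionality of the returned vector
```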

## Configuration

### Environment Variables

Create a `.env` file in the `src` directory:

```env
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000

# Data directory for persistence
DATA_DIR=./data

# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
```

### Model Parameters

When creating a model, you can configure all vLLM parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `model` | HuggingFace model ID, local path, or URL | Required |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | 1 |
| `pipeline_parallel_size` | Number of GPUs for pipeline parallelism | 1 |
| `max_model_len` | Maximum sequence length | Auto |
| `dtype` | Data type (`auto`, `float16`, `bfloat16`, `float32`) | `auto` |
| `quantization` | Quantization method (awq, gptq, etc.) | None |
| `trust_remote_code` | Allow remote code execution | `false` |
| `gpu_memory_utilization` | Fraction of GPU memory to use (0-1) | 0.9 |
| `max_num_seqs` | Maximum concurrent sequences | 256 |
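
These parameters go in the same JSON body as the create-model request shown earlier. A purely illustrative payload for a quantized model split across two GPUs (all names and values below are hypothetical placeholders):

```python
# Hypothetical create-model payload combining several of the parameters above
payload = {
    "name": "my-awq-model",               # placeholder instance name
    "model": "your-org/your-awq-model",   # placeholder: an AWQ-quantized checkpoint
    "tensor_parallel_size": 2,            # shard across two GPUs
    "quantization": "awq",
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.8,
}
# POST this dict as JSON to http://localhost:8000/models (see "Create a Model" above)
```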

## Architecture

```
vLLM-Proxy
│
├── API Layer (FastAPI)
│   ├── /v1/* endpoints (OpenAI compatible)
│   └── /models/* endpoints (Management)
│
├── Model Manager
│   ├── Lifecycle management
│   ├── Port allocation
│   └── Persistence layer
│
└── vLLM Instances (Coming Soon)
    ├── Model A (port 8001)
    ├── Model B (port 8002)
    └── Model C (port 8003)
```
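
To make the routing idea concrete, here is a heavily simplified sketch (not the actual implementation) of how a request could be forwarded to the vLLM instance backing the requested model, assuming a hypothetical name-to-port registry maintained by the Model Manager:

```python
import urllib.request

# Hypothetical registry kept by the Model Manager: model name -> local vLLM port
MODEL_PORTS = {"llama-3.2": 8001, "mistral-7b": 8002}

def route(model_name: str, path: str, body: bytes) -> bytes:
    """Forward an OpenAI-style JSON request to the instance serving `model_name`."""
    port = MODEL_PORTS[model_name]          # unknown model -> KeyError (would map to a 404)
    url = f"http://127.0.0.1:{port}{path}"  # e.g. path = "/v1/chat/completions"
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```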

## API Documentation

Once the server is running, you can access the interactive API documentation at:

- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`

## Development

### Project Structure

```
src/
├── main.py                  # FastAPI application
├── models/                  # Data models
│   └── model.py             # Model dataclass with vLLM configurations
├── services/                # Business logic
│   ├── model_manager.py     # Model lifecycle management
│   └── persistence.py       # JSON file persistence
├── endpoints/               # API endpoints
│   ├── models.py            # Model CRUD operations
│   └── v1/                  # OpenAI v1 compatible endpoints
│       ├── models.py        # Models listing
│       ├── chat.py          # Chat completions (placeholder)
│       ├── completions.py   # Text completions (placeholder)
│       ├── embeddings.py    # Embeddings (placeholder)
│       └── misc.py          # Other v1 endpoints
└── data/                    # Persistent storage
    └── models.json          # Saved model configurations
```

### Adding Dependencies

```bash
# Add a runtime dependency
uv add package-name

# Add a development dependency
uv add --dev package-name
```

## Roadmap

### ✅ Completed

- [x] Model CRUD operations
- [x] OpenAI v1/models endpoint
- [x] Model persistence
- [x] All OpenAI v1 endpoint placeholders
- [x] Streaming support structure
- [x] Interactive API documentation

### 🚧 High Priority

- [ ] vLLM process management
- [ ] Chat completions implementation
- [ ] Text completions implementation
- [ ] Server-Sent Events streaming
- [ ] Request proxying to vLLM instances

### 🔄 Medium Priority

- [ ] Embeddings endpoint
- [ ] Model health monitoring
- [ ] Load balancing
- [ ] Error recovery

### 📊 Low Priority

- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring
- [ ] Content moderation

## License

MIT

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.