first steps

2025-09-09 06:48:51 +02:00
parent 6e2b456dbb
commit 838f1c737e
21 changed files with 1411 additions and 0 deletions

README.md

# vLLM-Proxy
A REST API proxy that manages multiple vLLM instances, working around the limitation that a single vLLM process can serve only one model at a time. The daemon exposes OpenAI-compatible endpoints while managing the model instances in the background.
## Features
- 🚀 **Multiple Model Management**: Run multiple vLLM models simultaneously
- 🔄 **OpenAI Compatible**: Drop-in replacement for OpenAI API v1 endpoints
- 💾 **Persistent Configuration**: Models persist across server restarts
- 🎯 **Automatic Routing**: Requests are automatically routed to the correct model instance
- 📊 **RESTful API**: Full CRUD operations for model management
- ⚡ **Fast & Async**: Built with FastAPI for high performance
## Quick Start
### Prerequisites
- Python 3.13+
- [uv](https://github.com/astral-sh/uv) package manager
- CUDA-capable GPU (for running vLLM models)
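A quick way to sanity-check these prerequisites from a shell:
```bash
# Check prerequisite versions
python3 --version   # should report 3.13 or newer
uv --version
nvidia-smi          # confirms a CUDA-capable GPU is visible
```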
### Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy
# Install dependencies
uv sync
```
### Running the Server
```bash
# Start the proxy server
uv run python src/main.py
# Or run on a different port
APP_PORT=8081 uv run python src/main.py
```
The server will start on `http://localhost:8000` by default.
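Once it is up, a quick way to confirm the proxy is responding is to hit the OpenAI-compatible model listing (described in detail below):
```bash
# Should return an OpenAI-style model list (likely empty until you create a model)
curl http://localhost:8000/v1/models
```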
## API Usage
### Model Management
#### Create a Model
```bash
curl -X POST http://localhost:8000/models \
-H "Content-Type: application/json" \
-d '{
"name": "llama-3.2",
"model": "meta-llama/Llama-3.2-1B-Instruct",
"dtype": "float16",
"max_model_len": 4096
}'
```
#### List Models
```bash
# Full details (admin view)
curl http://localhost:8000/models
# OpenAI compatible format
curl http://localhost:8000/v1/models
```
#### Update a Model
```bash
curl -X PUT http://localhost:8000/models/{model_id} \
-H "Content-Type: application/json" \
-d '{
"max_model_len": 8192,
"gpu_memory_utilization": 0.8
}'
```
#### Delete a Model
```bash
curl -X DELETE http://localhost:8000/models/{model_id}
```
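The `{model_id}` placeholder in the update and delete calls comes from the full listing above. As a sketch, assuming the admin listing returns entries with `id` and `name` fields (not confirmed here), `jq` can capture the id for you:
```bash
# Hypothetical helper: look up the id of the model named "llama-3.2"
# (assumes the /models response is a JSON array with "id" and "name" fields)
MODEL_ID=$(curl -s http://localhost:8000/models \
  | jq -r '.[] | select(.name == "llama-3.2") | .id')

# Reuse it for the delete call
curl -X DELETE "http://localhost:8000/models/${MODEL_ID}"
```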
### Chat Completions (Placeholder - TODO)
```bash
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": true
}'
```
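Once streaming lands (see Roadmap), responses are planned to arrive as Server-Sent Events. curl buffers output by default, so the `-N`/`--no-buffer` flag helps show chunks as they are produced. A minimal sketch:
```bash
# -N disables curl's output buffering so SSE chunks print immediately
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```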
### Text Completions (Placeholder - TODO)
```bash
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.7
}'
# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"stream": true
}'
```
### Embeddings (Placeholder - TODO)
```bash
curl -X POST http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-model",
"input": "The food was delicious and the waiter was friendly."
}'
```
## Configuration
### Environment Variables
Create a `.env` file in the `src` directory:
```env
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000
# Data directory for persistence
DATA_DIR=./data
# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
```
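As with `APP_PORT` in the quick-start example, these settings can presumably also be passed inline at launch instead of through `.env` (the data directory below is purely illustrative):
```bash
# Override the port and data directory for a single run
APP_PORT=8080 DATA_DIR=/var/lib/vllm-proxy uv run python src/main.py
```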
### Model Parameters
When creating a model, you can set the standard vLLM engine parameters, including:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `model` | HuggingFace model ID, local path, or URL | Required |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | 1 |
| `pipeline_parallel_size` | Number of GPUs for pipeline parallelism | 1 |
| `max_model_len` | Maximum sequence length | Auto |
| `dtype` | Data type (auto, float16, bfloat16, float32) | auto |
| `quantization` | Quantization method (awq, gptq, etc.) | None |
| `trust_remote_code` | Allow execution of custom code shipped with the model repository | false |
| `gpu_memory_utilization` | GPU memory fraction to use (0-1) | 0.9 |
| `max_num_seqs` | Maximum concurrent sequences | 256 |
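As a sketch, a create request that exercises a few more of these parameters might look like this (the name and values are illustrative, not recommendations):
```bash
curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2-long",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "bfloat16",
    "max_model_len": 8192,
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.8,
    "max_num_seqs": 128
  }'
```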
## Architecture
```
vLLM-Proxy
├── API Layer (FastAPI)
│ ├── /v1/* endpoints (OpenAI compatible)
│ └── /models/* endpoints (Management)
├── Model Manager
│ ├── Lifecycle management
│ ├── Port allocation
│ └── Persistence layer
└── vLLM Instances (Coming Soon)
├── Model A (port 8001)
├── Model B (port 8002)
└── Model C (port 8003)
```
## API Documentation
Once the server is running, you can access the interactive API documentation at:
- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`
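FastAPI also serves the machine-readable OpenAPI schema (by default at `/openapi.json`), which is useful for generating clients:
```bash
# Fetch the raw OpenAPI schema
curl http://localhost:8000/openapi.json
```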
## Development
### Project Structure
```
src/
├── main.py # FastAPI application
├── models/ # Data models
│ └── model.py # Model dataclass with vLLM configurations
├── services/ # Business logic
│ ├── model_manager.py # Model lifecycle management
│ └── persistence.py # JSON file persistence
├── endpoints/ # API endpoints
│ ├── models.py # Model CRUD operations
│ └── v1/ # OpenAI v1 compatible endpoints
│ ├── models.py # Models listing
│ ├── chat.py # Chat completions (placeholder)
│ ├── completions.py # Text completions (placeholder)
│ ├── embeddings.py # Embeddings (placeholder)
│ └── misc.py # Other v1 endpoints
└── data/ # Persistent storage
└── models.json # Saved model configurations
```
### Adding Dependencies
```bash
# Add a runtime dependency
uv add package-name
# Add a development dependency
uv add --dev package-name
```
## Roadmap
### ✅ Completed
- [x] Model CRUD operations
- [x] OpenAI v1/models endpoint
- [x] Model persistence
- [x] All OpenAI v1 endpoint placeholders
- [x] Streaming support structure
- [x] Interactive API documentation
### 🚧 High Priority
- [ ] vLLM process management
- [ ] Chat completions implementation
- [ ] Text completions implementation
- [ ] Server-Sent Events streaming
- [ ] Request proxying to vLLM instances
### 🔄 Medium Priority
- [ ] Embeddings endpoint
- [ ] Model health monitoring
- [ ] Load balancing
- [ ] Error recovery
### 📊 Low Priority
- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring
- [ ] Content moderation
## License
MIT
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.