# vLLM-Proxy

A REST API proxy that manages multiple vLLM instances, working around vLLM's limitation of serving only one model per process. This daemon provides OpenAI-compatible endpoints while managing the individual model instances in the background.
## Features

- 🚀 **Multiple Model Management**: Run multiple vLLM models simultaneously
- 🔄 **OpenAI Compatible**: Drop-in replacement for OpenAI API v1 endpoints
- 💾 **Persistent Configuration**: Models persist across server restarts
- 🎯 **Automatic Routing**: Requests are automatically routed to the correct model instance
- 📊 **RESTful API**: Full CRUD operations for model management
- ⚡ **Fast & Async**: Built with FastAPI for high performance
## Quick Start

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) package manager
- CUDA-capable GPU (for running vLLM models)
### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy

# Install dependencies
uv sync
```
### Running the Server

```bash
# Start the proxy server
uv run python src/main.py

# Or run on a different port
APP_PORT=8081 uv run python src/main.py
```

The server will start on `http://localhost:8000` by default.

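To verify the proxy is up, you can query the management API from Python as well. A minimal sketch using only the standard library, assuming the server is running on the default host and port:

```python
import json
import urllib.request

# List the models currently registered with the proxy.
with urllib.request.urlopen("http://localhost:8000/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```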
## API Usage

### Model Management
#### Create a Model

```bash
curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096
  }'
```
#### List Models

```bash
# Full details (admin view)
curl http://localhost:8000/models

# OpenAI compatible format
curl http://localhost:8000/v1/models
```
#### Update a Model

```bash
curl -X PUT http://localhost:8000/models/{model_id} \
  -H "Content-Type: application/json" \
  -d '{
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.8
  }'
```
#### Delete a Model

```bash
curl -X DELETE http://localhost:8000/models/{model_id}
```
### Chat Completions (Placeholder - TODO)

```bash
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
```
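Because the proxy exposes OpenAI-compatible `/v1` endpoints, the official `openai` Python client should work against it once chat completions are implemented. A sketch, assuming the `openai` package is installed and a model named `llama-3.2` has been registered as shown above:

```python
from openai import OpenAI

# Point the client at the proxy; no API key is enforced yet
# (authentication is still on the roadmap).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming chat completion
response = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming chat completion (when implemented)
stream = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```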
### Text Completions (Placeholder - TODO)

```bash
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "stream": true
  }'
```
### Embeddings (Placeholder - TODO)

```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-model",
    "input": "The food was delicious and the waiter was friendly."
  }'
```
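Once the endpoint is implemented, the same OpenAI-compatible surface should let the `openai` Python client request embeddings. A sketch, assuming an embedding model has been registered under the name used above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.embeddings.create(
    model="text-embedding-model",
    input="The food was delicious and the waiter was friendly.",
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector
```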
## Configuration

### Environment Variables

Create a `.env` file in the `src` directory:

```env
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000

# Data directory for persistence
DATA_DIR=./data

# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
```
### Model Parameters

When creating a model, you can configure the standard vLLM engine parameters, including:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `model` | HuggingFace model ID, local path, or URL | Required |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | 1 |
| `pipeline_parallel_size` | Number of GPUs for pipeline parallelism | 1 |
| `max_model_len` | Maximum sequence length | Auto |
| `dtype` | Data type (auto, float16, bfloat16, float32) | auto |
| `quantization` | Quantization method (awq, gptq, etc.) | None |
| `trust_remote_code` | Allow remote code execution | false |
| `gpu_memory_utilization` | GPU memory fraction to use (0-1) | 0.9 |
| `max_num_seqs` | Maximum concurrent sequences | 256 |

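For illustration, the model-creation request can also be sent from Python with several of these parameters set. A sketch using only the standard library; the field names assume the request body shown in "Create a Model" plus the parameters listed above:

```python
import json
import urllib.request

payload = {
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "bfloat16",
    "max_model_len": 8192,
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.8,
    "max_num_seqs": 128,
}

request = urllib.request.Request(
    "http://localhost:8000/models",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as resp:
    print(resp.status, json.load(resp))
```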
## Architecture

```
vLLM-Proxy
│
├── API Layer (FastAPI)
│   ├── /v1/* endpoints (OpenAI compatible)
│   └── /models/* endpoints (Management)
│
├── Model Manager
│   ├── Lifecycle management
│   ├── Port allocation
│   └── Persistence layer
│
└── vLLM Instances (Coming Soon)
    ├── Model A (port 8001)
    ├── Model B (port 8002)
    └── Model C (port 8003)
```
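As a rough illustration of the routing idea (not the actual implementation), the proxy can be pictured as looking up the port of the vLLM instance that serves the requested model and forwarding the request there. The sketch below assumes an async HTTP client such as `httpx` and uses hypothetical names:

```python
import httpx  # assumed dependency for this sketch

# Hypothetical mapping maintained by the Model Manager's port allocation.
MODEL_PORTS = {"llama-3.2": 8001, "mistral-7b": 8002}

async def forward_chat_completion(body: dict) -> dict:
    """Forward an OpenAI-style request to the instance serving body['model']."""
    port = MODEL_PORTS[body["model"]]
    async with httpx.AsyncClient() as client:
        upstream = await client.post(
            f"http://localhost:{port}/v1/chat/completions",
            json=body,
            timeout=None,  # generation can take a while
        )
        return upstream.json()
```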
## API Documentation

Once the server is running, you can access the interactive API documentation at:

- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`
## Development

### Project Structure

```
src/
├── main.py                # FastAPI application
├── models/                # Data models
│   └── model.py           # Model dataclass with vLLM configurations
├── services/              # Business logic
│   ├── model_manager.py   # Model lifecycle management
│   └── persistence.py     # JSON file persistence
├── endpoints/             # API endpoints
│   ├── models.py          # Model CRUD operations
│   └── v1/                # OpenAI v1 compatible endpoints
│       ├── models.py      # Models listing
│       ├── chat.py        # Chat completions (placeholder)
│       ├── completions.py # Text completions (placeholder)
│       ├── embeddings.py  # Embeddings (placeholder)
│       └── misc.py        # Other v1 endpoints
└── data/                  # Persistent storage
    └── models.json        # Saved model configurations
```
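For orientation, the persistence layer boils down to reading and writing `data/models.json`. The sketch below is illustrative only and does not mirror the actual `persistence.py` code; it assumes model configurations are stored as a list of JSON objects:

```python
import json
from pathlib import Path

DATA_FILE = Path("data/models.json")

def save_models(models: list[dict]) -> None:
    # Write the registered model configurations to disk.
    DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
    DATA_FILE.write_text(json.dumps(models, indent=2))

def load_models() -> list[dict]:
    # Load previously saved configurations, if any.
    if not DATA_FILE.exists():
        return []
    return json.loads(DATA_FILE.read_text())
```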
### Adding Dependencies

```bash
# Add a runtime dependency
uv add package-name

# Add a development dependency
uv add --dev package-name
```
## Roadmap

### ✅ Completed
- [x] Model CRUD operations
- [x] OpenAI v1/models endpoint
- [x] Model persistence
- [x] All OpenAI v1 endpoint placeholders
- [x] Streaming support structure
- [x] Interactive API documentation

### 🚧 High Priority
- [ ] vLLM process management
- [ ] Chat completions implementation
- [ ] Text completions implementation
- [ ] Server-Sent Events streaming
- [ ] Request proxying to vLLM instances

### 🔄 Medium Priority
- [ ] Embeddings endpoint
- [ ] Model health monitoring
- [ ] Load balancing
- [ ] Error recovery

### 📊 Low Priority
- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring
- [ ] Content moderation
## License

MIT
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.