first steps

2025-09-09 06:48:51 +02:00
parent 6e2b456dbb
commit 838f1c737e
21 changed files with 1411 additions and 0 deletions

README.md

# vLLM-Proxy
A REST API proxy that manages multiple vLLM instances, working around the limitation that a single vLLM process can serve only one model at a time. The daemon exposes OpenAI-compatible endpoints while managing the model instances in the background.
## Features
- 🚀 **Multiple Model Management**: Run multiple vLLM models simultaneously
- 🔄 **OpenAI Compatible**: Drop-in replacement for OpenAI API v1 endpoints
- 💾 **Persistent Configuration**: Models persist across server restarts
- 🎯 **Automatic Routing**: Requests are automatically routed to the correct model instance
- 📊 **RESTful API**: Full CRUD operations for model management
- ⚡ **Fast & Async**: Built with FastAPI for high performance
## Quick Start
### Prerequisites
- Python 3.13+
- [uv](https://github.com/astral-sh/uv) package manager
- CUDA-capable GPU (for running vLLM models)
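A quick way to sanity-check these prerequisites from a shell:
```bash
# Check prerequisite versions
python3 --version   # should report 3.13 or newer
uv --version
nvidia-smi          # confirms a CUDA-capable GPU is visible
```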
### Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy
# Install dependencies
uv sync
```
### Running the Server
```bash
# Start the proxy server
uv run python src/main.py
# Or run on a different port
APP_PORT=8081 uv run python src/main.py
```
The server will start on `http://localhost:8000` by default.
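Once it is up, a quick way to confirm the proxy is responding is to hit the OpenAI-compatible model listing (described in detail below):
```bash
# Should return an OpenAI-style model list (likely empty until you create a model)
curl http://localhost:8000/v1/models
```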
## API Usage
### Model Management
#### Create a Model
```bash
curl -X POST http://localhost:8000/models \
-H "Content-Type: application/json" \
-d '{
"name": "llama-3.2",
"model": "meta-llama/Llama-3.2-1B-Instruct",
"dtype": "float16",
"max_model_len": 4096
}'
```
#### List Models
```bash
# Full details (admin view)
curl http://localhost:8000/models
# OpenAI compatible format
curl http://localhost:8000/v1/models
```
#### Update a Model
```bash
curl -X PUT http://localhost:8000/models/{model_id} \
-H "Content-Type: application/json" \
-d '{
"max_model_len": 8192,
"gpu_memory_utilization": 0.8
}'
```
#### Delete a Model
```bash
curl -X DELETE http://localhost:8000/models/{model_id}
```
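The `{model_id}` placeholder in the update and delete calls comes from the full listing above. As a sketch, assuming the admin listing returns entries with `id` and `name` fields (not confirmed here), `jq` can capture the id for you:
```bash
# Hypothetical helper: look up the id of the model named "llama-3.2"
# (assumes the /models response is a JSON array with "id" and "name" fields)
MODEL_ID=$(curl -s http://localhost:8000/models \
  | jq -r '.[] | select(.name == "llama-3.2") | .id')

# Reuse it for the delete call
curl -X DELETE "http://localhost:8000/models/${MODEL_ID}"
```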
### Chat Completions (Placeholder - TODO)
```bash
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": true
}'
```
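Once streaming lands (see Roadmap), responses are planned to arrive as Server-Sent Events. curl buffers output by default, so the `-N`/`--no-buffer` flag helps show chunks as they are produced. A minimal sketch:
```bash
# -N disables curl's output buffering so SSE chunks print immediately
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```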
### Text Completions (Placeholder - TODO)
```bash
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.7
}'
# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"stream": true
}'
```
### Embeddings (Placeholder - TODO)
```bash
curl -X POST http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-model",
"input": "The food was delicious and the waiter was friendly."
}'
```
## Configuration
### Environment Variables
Create a `.env` file in the `src` directory:
```env
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000
# Data directory for persistence
DATA_DIR=./data
# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
```
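As with `APP_PORT` in the quick-start example, these settings can presumably also be passed inline at launch instead of through `.env` (the data directory below is purely illustrative):
```bash
# Override the port and data directory for a single run
APP_PORT=8080 DATA_DIR=/var/lib/vllm-proxy uv run python src/main.py
```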
### Model Parameters
When creating a model, you can set the standard vLLM engine parameters, including:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `model` | HuggingFace model ID, local path, or URL | Required |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | 1 |
| `pipeline_parallel_size` | Number of GPUs for pipeline parallelism | 1 |
| `max_model_len` | Maximum sequence length | Auto |
| `dtype` | Data type (auto, float16, bfloat16, float32) | auto |
| `quantization` | Quantization method (awq, gptq, etc.) | None |
| `trust_remote_code` | Allow execution of custom code shipped with the model repository | false |
| `gpu_memory_utilization` | GPU memory fraction to use (0-1) | 0.9 |
| `max_num_seqs` | Maximum concurrent sequences | 256 |
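As a sketch, a create request that exercises a few more of these parameters might look like this (the name and values are illustrative, not recommendations):
```bash
curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2-long",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "bfloat16",
    "max_model_len": 8192,
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.8,
    "max_num_seqs": 128
  }'
```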
## Architecture
```
vLLM-Proxy
├── API Layer (FastAPI)
│ ├── /v1/* endpoints (OpenAI compatible)
│ └── /models/* endpoints (Management)
├── Model Manager
│ ├── Lifecycle management
│ ├── Port allocation
│ └── Persistence layer
└── vLLM Instances (Coming Soon)
├── Model A (port 8001)
├── Model B (port 8002)
└── Model C (port 8003)
```
## API Documentation
Once the server is running, you can access the interactive API documentation at:
- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`
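FastAPI also serves the machine-readable OpenAPI schema (by default at `/openapi.json`), which is useful for generating clients:
```bash
# Fetch the raw OpenAPI schema
curl http://localhost:8000/openapi.json
```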
## Development
### Project Structure
```
src/
├── main.py # FastAPI application
├── models/ # Data models
│ └── model.py # Model dataclass with vLLM configurations
├── services/ # Business logic
│ ├── model_manager.py # Model lifecycle management
│ └── persistence.py # JSON file persistence
├── endpoints/ # API endpoints
│ ├── models.py # Model CRUD operations
│ └── v1/ # OpenAI v1 compatible endpoints
│ ├── models.py # Models listing
│ ├── chat.py # Chat completions (placeholder)
│ ├── completions.py # Text completions (placeholder)
│ ├── embeddings.py # Embeddings (placeholder)
│ └── misc.py # Other v1 endpoints
└── data/ # Persistent storage
└── models.json # Saved model configurations
```
### Adding Dependencies
```bash
# Add a runtime dependency
uv add package-name
# Add a development dependency
uv add --dev package-name
```
## Roadmap
### ✅ Completed
- [x] Model CRUD operations
- [x] OpenAI v1/models endpoint
- [x] Model persistence
- [x] All OpenAI v1 endpoint placeholders
- [x] Streaming support structure
- [x] Interactive API documentation
### 🚧 High Priority
- [ ] vLLM process management
- [ ] Chat completions implementation
- [ ] Text completions implementation
- [ ] Server-Sent Events streaming
- [ ] Request proxying to vLLM instances
### 🔄 Medium Priority
- [ ] Embeddings endpoint
- [ ] Model health monitoring
- [ ] Load balancing
- [ ] Error recovery
### 📊 Low Priority
- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring
- [ ] Content moderation
## License
MIT
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.