vLLM-Proxy

A REST API proxy that manages multiple vLLM instances, working around vLLM's limitation of loading only one model per process. The daemon exposes OpenAI-compatible endpoints and routes each request to the matching model instance running in the background.

Features

  • 🚀 Multiple Model Management: Run multiple vLLM models simultaneously
  • 🔄 OpenAI Compatible: Drop-in replacement for OpenAI API v1 endpoints
  • 💾 Persistent Configuration: Models persist across server restarts
  • 🎯 Automatic Routing: Requests are automatically routed to the correct model instance
  • 📊 RESTful API: Full CRUD operations for model management
  • ⚡ Fast & Async: Built with FastAPI for high performance

Quick Start

Prerequisites

  • Python 3.13+
  • uv package manager
  • CUDA-capable GPU (for running vLLM models)

Installation

# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy

# Install dependencies
uv sync

Running the Server

# Start the proxy server
uv run python src/main.py

# Or run on a different port
APP_PORT=8081 uv run python src/main.py

The server will start on http://localhost:8000 by default.
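
To verify the proxy is up, you can query the OpenAI-compatible model list described below. A minimal check in Python, assuming the requests package is installed (curl http://localhost:8000/v1/models works just as well):

import requests

# Ask the proxy for its model list via the OpenAI-compatible endpoint
response = requests.get("http://localhost:8000/v1/models")
response.raise_for_status()
print(response.json())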

API Usage

Model Management

Create a Model

curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096
  }'

List Models

# Full details (admin view)
curl http://localhost:8000/models

# OpenAI compatible format
curl http://localhost:8000/v1/models

Update a Model

curl -X PUT http://localhost:8000/models/{model_id} \
  -H "Content-Type: application/json" \
  -d '{
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.8
  }'

Delete a Model

curl -X DELETE http://localhost:8000/models/{model_id}
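
The same management endpoints can be driven from Python. The helper below is a minimal sketch rather than part of the project, and assumes the requests package; the fields mirror the curl examples above.

import requests

BASE_URL = "http://localhost:8000"

def create_model(config: dict) -> dict:
    # POST /models with a vLLM configuration (see "Model Parameters" below)
    return requests.post(f"{BASE_URL}/models", json=config).json()

def list_models() -> dict:
    # GET /models returns the full admin view
    return requests.get(f"{BASE_URL}/models").json()

def update_model(model_id: str, changes: dict) -> dict:
    # PUT /models/{model_id} with only the fields to change
    return requests.put(f"{BASE_URL}/models/{model_id}", json=changes).json()

def delete_model(model_id: str) -> None:
    # DELETE /models/{model_id}
    requests.delete(f"{BASE_URL}/models/{model_id}").raise_for_status()

created = create_model({
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096,
})
print(created)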

Chat Completions (Placeholder - TODO)

# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
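
Because the proxy mirrors the OpenAI API, the official openai Python client should also work against it once these endpoints are implemented; the same approach applies to /v1/completions and /v1/embeddings. A sketch, assuming the openai package is installed and that no API key is enforced (the proxy currently has no authentication):

from openai import OpenAI

# Point the client at the proxy instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming chat completion
reply = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(reply.choices[0].message.content)

# Streaming chat completion (when implemented)
stream = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")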

Text Completions (Placeholder - TODO)

# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "stream": true
  }'

Embeddings (Placeholder - TODO)

curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-model",
    "input": "The food was delicious and the waiter was friendly."
  }'
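
With the same openai client configuration as in the chat example above, an embeddings request would look like this (a sketch; text-embedding-model is the placeholder name from the curl example, not a bundled model):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Request an embedding through the proxy (placeholder endpoint for now)
result = client.embeddings.create(
    model="text-embedding-model",
    input="The food was delicious and the waiter was friendly.",
)
print(len(result.data[0].embedding))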

Configuration

Environment Variables

Create a .env file in the src directory:

# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000

# Data directory for persistence
DATA_DIR=./data

# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
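
How the application reads these is up to its settings loader; as a rough sketch, the equivalent lookups with plain os.environ and the defaults listed above would be (the project may use a different mechanism, e.g. python-dotenv):

import os

# Fall back to the documented defaults when a variable is not set
APP_HOST = os.environ.get("APP_HOST", "0.0.0.0")
APP_PORT = int(os.environ.get("APP_PORT", "8000"))
DATA_DIR = os.environ.get("DATA_DIR", "./data")
HF_TOKEN = os.environ.get("HF_TOKEN")  # optional, only needed for gated models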

Model Parameters

When creating a model, you can configure all vLLM parameters:

Parameter                 Description                                       Default
model                     Hugging Face model ID, local path, or URL         Required
tensor_parallel_size      Number of GPUs for tensor parallelism             1
pipeline_parallel_size    Number of GPUs for pipeline parallelism           1
max_model_len             Maximum sequence length                           Auto
dtype                     Data type (auto, float16, bfloat16, float32)      auto
quantization              Quantization method (awq, gptq, etc.)             None
trust_remote_code         Allow remote code execution                       false
gpu_memory_utilization    GPU memory fraction to use (0-1)                  0.9
max_num_seqs              Maximum concurrent sequences                      256
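
For example, a create request that sets several of these parameters at once might look like the following (a sketch; the model ID and values are illustrative, and valid combinations depend on your hardware and on vLLM itself):

import requests

# Illustrative: a quantized model split across two GPUs
config = {
    "name": "llama-quantized",
    "model": "your-org/your-awq-model",  # illustrative model ID, not a real repository
    "tensor_parallel_size": 2,
    "max_model_len": 8192,
    "dtype": "auto",
    "quantization": "awq",
    "gpu_memory_utilization": 0.85,
    "max_num_seqs": 128,
}
response = requests.post("http://localhost:8000/models", json=config)
print(response.json())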

Architecture

vLLM-Proxy
    │
    ├── API Layer (FastAPI)
    │   ├── /v1/* endpoints (OpenAI compatible)
    │   └── /models/* endpoints (Management)
    │
    ├── Model Manager
    │   ├── Lifecycle management
    │   ├── Port allocation
    │   └── Persistence layer
    │
    └── vLLM Instances (Coming Soon)
        ├── Model A (port 8001)
        ├── Model B (port 8002)
        └── Model C (port 8003)
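
Request routing itself is not implemented yet (the vLLM instances are marked "Coming Soon" above), but the intent can be sketched: the Model Manager keeps a mapping from model name to the port of its vLLM instance, and /v1 requests are forwarded to the matching backend. A hypothetical illustration assuming the httpx package for async HTTP, not the project's actual code:

import httpx

# Hypothetical registry kept by the Model Manager: model name -> local vLLM port
MODEL_PORTS = {"llama-3.2": 8001, "mistral-7b": 8002}

async def forward_chat_completion(payload: dict) -> dict:
    # Pick the backend instance based on the "model" field of the request
    port = MODEL_PORTS[payload["model"]]
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"http://localhost:{port}/v1/chat/completions", json=payload
        )
        response.raise_for_status()
        return response.json()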

API Documentation

Once the server is running, you can access the interactive API documentation at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Development

Project Structure

src/
├── main.py                 # FastAPI application
├── models/                 # Data models
│   └── model.py           # Model dataclass with vLLM configurations
├── services/               # Business logic
│   ├── model_manager.py   # Model lifecycle management
│   └── persistence.py     # JSON file persistence
├── endpoints/              # API endpoints
│   ├── models.py          # Model CRUD operations
│   └── v1/                # OpenAI v1 compatible endpoints
│       ├── models.py      # Models listing
│       ├── chat.py        # Chat completions (placeholder)
│       ├── completions.py # Text completions (placeholder)
│       ├── embeddings.py  # Embeddings (placeholder)
│       └── misc.py        # Other v1 endpoints
└── data/                  # Persistent storage
    └── models.json        # Saved model configurations

Adding Dependencies

# Add a runtime dependency
uv add package-name

# Add a development dependency
uv add --dev package-name

Roadmap

✅ Completed

  • Model CRUD operations
  • OpenAI v1/models endpoint
  • Model persistence
  • All OpenAI v1 endpoint placeholders
  • Streaming support structure
  • Interactive API documentation

🚧 High Priority

  • vLLM process management
  • Chat completions implementation
  • Text completions implementation
  • Server-Sent Events streaming
  • Request proxying to vLLM instances

🔄 Medium Priority

  • Embeddings endpoint
  • Model health monitoring
  • Load balancing
  • Error recovery

📊 Low Priority

  • Authentication/API keys
  • Rate limiting
  • Metrics and monitoring
  • Content moderation

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
