vLLM-Proxy
A REST API proxy that manages multiple vLLM instances, working around vLLM's limitation of loading only one model per process. The daemon exposes OpenAI-compatible endpoints and routes each request to the correct model instance in the background.
Features
- 🚀 Multiple Model Management: Run multiple vLLM models simultaneously
- 🔄 OpenAI Compatible: Drop-in replacement for OpenAI API v1 endpoints
- 💾 Persistent Configuration: Models persist across server restarts
- 🎯 Automatic Routing: Requests are automatically routed to the correct model instance
- 📊 RESTful API: Full CRUD operations for model management
- ⚡ Fast & Async: Built with FastAPI for high performance
Quick Start
Prerequisites
- Python 3.13+
- uv package manager
- CUDA-capable GPU (for running vLLM models)
Installation
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy
# Install dependencies
uv sync
Running the Server
# Start the proxy server
uv run python src/main.py
# Or run on a different port
APP_PORT=8081 uv run python src/main.py
The server will start on http://localhost:8000 by default.
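Because the proxy is OpenAI compatible, implemented endpoints such as /v1/models can also be queried with the official openai Python client. A minimal sketch, assuming the openai package is installed and no API key is enforced (authentication is still on the roadmap):
# list_models.py — query the proxy's OpenAI-compatible /v1/models endpoint
from openai import OpenAI
# Point the client at the proxy instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model in client.models.list().data:
    print(model.id)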
API Usage
Model Management
Create a Model
curl -X POST http://localhost:8000/models \
-H "Content-Type: application/json" \
-d '{
"name": "llama-3.2",
"model": "meta-llama/Llama-3.2-1B-Instruct",
"dtype": "float16",
"max_model_len": 4096
}'
List Models
# Full details (admin view)
curl http://localhost:8000/models
# OpenAI compatible format
curl http://localhost:8000/v1/models
Update a Model
curl -X PUT http://localhost:8000/models/{model_id} \
-H "Content-Type: application/json" \
-d '{
"max_model_len": 8192,
"gpu_memory_utilization": 0.8
}'
Delete a Model
curl -X DELETE http://localhost:8000/models/{model_id}
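The same management calls can be scripted from Python with the requests library. This is an illustrative sketch rather than project code, and it assumes the create response includes the new model's id, which the update and delete endpoints take in the URL path:
# manage_models.py — illustrative CRUD round-trip against the management API
import requests
BASE = "http://localhost:8000"
# Create a model (assumes the response body contains an "id" field)
created = requests.post(f"{BASE}/models", json={
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096,
}).json()
model_id = created["id"]
# Update it, then remove it again
requests.put(f"{BASE}/models/{model_id}", json={"max_model_len": 8192})
requests.delete(f"{BASE}/models/{model_id}")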
Chat Completions (Placeholder - TODO)
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": true
}'
Text Completions (Placeholder - TODO)
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.7
}'
# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2",
"prompt": "Once upon a time",
"max_tokens": 50,
"stream": true
}'
Embeddings (Placeholder - TODO)
curl -X POST http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-model",
"input": "The food was delicious and the waiter was friendly."
}'
Configuration
Environment Variables
Create a .env file in the src directory:
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000
# Data directory for persistence
DATA_DIR=./data
# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
Model Parameters
When creating a model, you can configure all vLLM parameters:
| Parameter | Description | Default |
|---|---|---|
| model | HuggingFace model ID, local path, or URL | Required |
| tensor_parallel_size | Number of GPUs for tensor parallelism | 1 |
| pipeline_parallel_size | Number of GPUs for pipeline parallelism | 1 |
| max_model_len | Maximum sequence length | Auto |
| dtype | Data type (auto, float16, bfloat16, float32) | auto |
| quantization | Quantization method (awq, gptq, etc.) | None |
| trust_remote_code | Allow remote code execution | false |
| gpu_memory_utilization | GPU memory fraction to use (0-1) | 0.9 |
| max_num_seqs | Maximum concurrent sequences | 256 |
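As an example, a create request combining several of the parameters above might look like this (an illustrative requests sketch; the model name and quantization choice are placeholders, and field names are assumed to match the table):
# create_quantized_model.py — example payload combining several vLLM parameters
import requests
payload = {
    "name": "my-awq-model",
    "model": "your-org/your-awq-model",   # placeholder HuggingFace model ID
    "dtype": "auto",
    "quantization": "awq",                # assumes an AWQ-quantized checkpoint
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.8,
    "max_num_seqs": 128,
}
requests.post("http://localhost:8000/models", json=payload)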
Architecture
vLLM-Proxy
│
├── API Layer (FastAPI)
│ ├── /v1/* endpoints (OpenAI compatible)
│ └── /models/* endpoints (Management)
│
├── Model Manager
│ ├── Lifecycle management
│ ├── Port allocation
│ └── Persistence layer
│
└── vLLM Instances (Coming Soon)
├── Model A (port 8001)
├── Model B (port 8002)
└── Model C (port 8003)
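Request proxying to the per-model vLLM instances is still on the roadmap, but the routing idea the diagram implies is simple: look up the port allocated to the requested model and forward the request body unchanged. A minimal sketch of that idea (not the project's actual implementation), assuming the httpx async client:
# routing sketch — map a model name to its vLLM instance and forward the request
import httpx
# Hypothetical allocation kept by the Model Manager (model name -> local port)
MODEL_PORTS = {"llama-3.2": 8001, "mistral-7b": 8002}
async def forward_chat(payload: dict) -> dict:
    port = MODEL_PORTS[payload["model"]]
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"http://localhost:{port}/v1/chat/completions", json=payload
        )
        resp.raise_for_status()
        return resp.json()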
API Documentation
Once the server is running, you can access the interactive API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Development
Project Structure
src/
├── main.py # FastAPI application
├── models/ # Data models
│ └── model.py # Model dataclass with vLLM configurations
├── services/ # Business logic
│ ├── model_manager.py # Model lifecycle management
│ └── persistence.py # JSON file persistence
├── endpoints/ # API endpoints
│ ├── models.py # Model CRUD operations
│ └── v1/ # OpenAI v1 compatible endpoints
│ ├── models.py # Models listing
│ ├── chat.py # Chat completions (placeholder)
│ ├── completions.py # Text completions (placeholder)
│ ├── embeddings.py # Embeddings (placeholder)
│ └── misc.py # Other v1 endpoints
└── data/ # Persistent storage
└── models.json # Saved model configurations
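Model configurations are persisted to models.json by the persistence layer and reloaded on restart. The exact schema is defined in src/models/model.py; purely for illustration, and assuming the default DATA_DIR of ./data, reading the saved file back could look like this:
# load_saved_models.py — illustrative read of the persisted configuration
import json
from pathlib import Path
models_file = Path("data") / "models.json"
if models_file.exists():
    saved = json.loads(models_file.read_text())
    print(f"{len(saved)} model configuration(s) on disk")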
Adding Dependencies
# Add a runtime dependency
uv add package-name
# Add a development dependency
uv add --dev package-name
Roadmap
✅ Completed
- Model CRUD operations
- OpenAI v1/models endpoint
- Model persistence
- All OpenAI v1 endpoint placeholders
- Streaming support structure
- Interactive API documentation
🚧 High Priority
- vLLM process management
- Chat completions implementation
- Text completions implementation
- Server-Sent Events streaming
- Request proxying to vLLM instances
🔄 Medium Priority
- Embeddings endpoint
- Model health monitoring
- Load balancing
- Error recovery
📊 Low Priority
- Authentication/API keys
- Rate limiting
- Metrics and monitoring
- Content moderation
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.