# vLLM-Proxy

A REST API proxy that manages multiple vLLM instances, working around vLLM's limitation of loading only one model per process. The daemon provides OpenAI-compatible endpoints while managing multiple model instances in the background.

## Features

- 🚀 **Multiple Model Management**: Run multiple vLLM models simultaneously
- 🔄 **OpenAI Compatible**: Drop-in replacement for OpenAI API v1 endpoints
- 💾 **Persistent Configuration**: Models persist across server restarts
- 🎯 **Automatic Routing**: Requests are automatically routed to the correct model instance
- 📊 **RESTful API**: Full CRUD operations for model management
- ⚡ **Fast & Async**: Built with FastAPI for high performance

## Quick Start

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) package manager
- CUDA-capable GPU (for running vLLM models)

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/vLLM-Proxy.git
cd vLLM-Proxy

# Install dependencies
uv sync
```

### Running the Server

```bash
# Start the proxy server
uv run python src/main.py

# Or run on a different port
APP_PORT=8081 uv run python src/main.py
```

The server will start on `http://localhost:8000` by default.
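
To confirm the proxy is up, you can query the OpenAI-compatible model listing endpoint. A quick check from Python using only the standard library (adjust host and port if you changed them):

```python
import json
import urllib.request

# List the models currently registered with the proxy (OpenAI-compatible endpoint)
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```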

## API Usage

### Model Management

#### Create a Model

```bash
curl -X POST http://localhost:8000/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096
  }'
```
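
The same request can be issued from Python. A minimal sketch using the third-party `requests` library (not a project dependency; the field values mirror the curl example above):

```python
import requests

# Register a new model instance with the proxy
payload = {
    "name": "llama-3.2",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "dtype": "float16",
    "max_model_len": 4096,
}
resp = requests.post("http://localhost:8000/models", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```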

#### List Models

```bash
# Full details (admin view)
curl http://localhost:8000/models

# OpenAI compatible format
curl http://localhost:8000/v1/models
```

#### Update a Model

```bash
curl -X PUT http://localhost:8000/models/{model_id} \
  -H "Content-Type: application/json" \
  -d '{
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.8
  }'
```

#### Delete a Model

```bash
curl -X DELETE http://localhost:8000/models/{model_id}
```

### Chat Completions (Placeholder - TODO)

```bash
# Non-streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Streaming chat completion (when implemented)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
```
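
Since the proxy aims to be a drop-in replacement for the OpenAI API, the official `openai` Python client should also work against these endpoints once they are implemented. A hedged sketch (the client is not a project dependency, and the placeholder API key assumes no authentication is configured):

```python
from openai import OpenAI

# Point the official OpenAI client at the proxy's /v1 endpoints
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming chat completion
response = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming chat completion (once SSE streaming is implemented)
stream = client.chat.completions.create(
    model="llama-3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```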

### Text Completions (Placeholder - TODO)

```bash
# Non-streaming completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Streaming completion (when implemented)
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "stream": true
  }'
```

### Embeddings (Placeholder - TODO)

```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-model",
    "input": "The food was delicious and the waiter was friendly."
  }'
```
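
Once implemented, embeddings should likewise be reachable through the `openai` client (hedged sketch; the model name is a placeholder for whatever embedding model you register):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Request an embedding vector from a registered embedding model (placeholder name)
result = client.embeddings.create(
    model="text-embedding-model",
    input="The food was delicious and the waiter was friendly.",
)
print(len(result.data[0].embedding))  # dimensionality of the returned vector
```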

## Configuration

### Environment Variables

Create a `.env` file in the `src` directory:

```env
# Server configuration
APP_HOST=0.0.0.0
APP_PORT=8000

# Data directory for persistence
DATA_DIR=./data

# Hugging Face token (for gated models)
HF_TOKEN=your_token_here
```

### Model Parameters

When creating a model, you can configure all vLLM parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `model` | HuggingFace model ID, local path, or URL | Required |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | 1 |
| `pipeline_parallel_size` | Number of GPUs for pipeline parallelism | 1 |
| `max_model_len` | Maximum sequence length | Auto |
| `dtype` | Data type (`auto`, `float16`, `bfloat16`, `float32`) | `auto` |
| `quantization` | Quantization method (awq, gptq, etc.) | None |
| `trust_remote_code` | Allow remote code execution | `false` |
| `gpu_memory_utilization` | Fraction of GPU memory to use (0-1) | 0.9 |
| `max_num_seqs` | Maximum concurrent sequences | 256 |
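
These parameters go in the same JSON body as the create-model request shown earlier. A purely illustrative payload for a quantized model split across two GPUs (all names and values below are hypothetical placeholders):

```python
# Hypothetical create-model payload combining several of the parameters above
payload = {
    "name": "my-awq-model",               # placeholder instance name
    "model": "your-org/your-awq-model",   # placeholder: an AWQ-quantized checkpoint
    "tensor_parallel_size": 2,            # shard across two GPUs
    "quantization": "awq",
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.8,
}
# POST this dict as JSON to http://localhost:8000/models (see "Create a Model" above)
```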

## Architecture

```
vLLM-Proxy
│
├── API Layer (FastAPI)
│   ├── /v1/* endpoints (OpenAI compatible)
│   └── /models/* endpoints (Management)
│
├── Model Manager
│   ├── Lifecycle management
│   ├── Port allocation
│   └── Persistence layer
│
└── vLLM Instances (Coming Soon)
    ├── Model A (port 8001)
    ├── Model B (port 8002)
    └── Model C (port 8003)
```
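
To make the routing idea concrete, here is a heavily simplified sketch (not the actual implementation) of how a request could be forwarded to the vLLM instance backing the requested model, assuming a hypothetical name-to-port registry maintained by the Model Manager:

```python
import urllib.request

# Hypothetical registry kept by the Model Manager: model name -> local vLLM port
MODEL_PORTS = {"llama-3.2": 8001, "mistral-7b": 8002}

def route(model_name: str, path: str, body: bytes) -> bytes:
    """Forward an OpenAI-style JSON request to the instance serving `model_name`."""
    port = MODEL_PORTS[model_name]          # unknown model -> KeyError (would map to a 404)
    url = f"http://127.0.0.1:{port}{path}"  # e.g. path = "/v1/chat/completions"
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```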

## API Documentation

Once the server is running, you can access the interactive API documentation at:

- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`

## Development

### Project Structure

```
src/
├── main.py                  # FastAPI application
├── models/                  # Data models
│   └── model.py             # Model dataclass with vLLM configurations
├── services/                # Business logic
│   ├── model_manager.py     # Model lifecycle management
│   └── persistence.py       # JSON file persistence
├── endpoints/               # API endpoints
│   ├── models.py            # Model CRUD operations
│   └── v1/                  # OpenAI v1 compatible endpoints
│       ├── models.py        # Models listing
│       ├── chat.py          # Chat completions (placeholder)
│       ├── completions.py   # Text completions (placeholder)
│       ├── embeddings.py    # Embeddings (placeholder)
│       └── misc.py          # Other v1 endpoints
└── data/                    # Persistent storage
    └── models.json          # Saved model configurations
```

### Adding Dependencies

```bash
# Add a runtime dependency
uv add package-name

# Add a development dependency
uv add --dev package-name
```

## Roadmap

### ✅ Completed

- [x] Model CRUD operations
- [x] OpenAI v1/models endpoint
- [x] Model persistence
- [x] All OpenAI v1 endpoint placeholders
- [x] Streaming support structure
- [x] Interactive API documentation

### 🚧 High Priority

- [ ] vLLM process management
- [ ] Chat completions implementation
- [ ] Text completions implementation
- [ ] Server-Sent Events streaming
- [ ] Request proxying to vLLM instances

### 🔄 Medium Priority

- [ ] Embeddings endpoint
- [ ] Model health monitoring
- [ ] Load balancing
- [ ] Error recovery

### 📊 Low Priority

- [ ] Authentication/API keys
- [ ] Rate limiting
- [ ] Metrics and monitoring
- [ ] Content moderation

## License

MIT

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.