CensorBot/README.md

# 🔒 CensorBot

A secure data sanitization tool for IT service companies that automatically detects and censors sensitive customer information using AI.

## Overview

CensorBot is a Python application that helps protect customer privacy by automatically identifying and replacing sensitive information with placeholders. It uses a small, efficient LLM (like DeepSeek) to process text locally, ensuring that sensitive data never leaves your control before being sent to external AI services.

## Features

- 🛡️ **Automatic Detection** - Identifies names, emails, phone numbers, addresses, SSNs, and more
- 🔄 **Real-time Processing** - Stream-based censoring for immediate feedback
- 🎯 **High Accuracy** - AI-powered detection understands context, not just patterns
- 💼 **Enterprise Ready** - Designed for IT service companies handling customer data
- 🌐 **Web Interface** - Clean, intuitive UI built with NiceGUI
- 📝 **30+ Test Examples** - Comprehensive test suite covering various scenarios

## Quick Start

### Prerequisites

- Python 3.8+
- [uv](https://github.com/astral-sh/uv) package manager
- An OpenAI-compatible API endpoint (e.g., DeepSeek, local LLM)

### Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/CensorBot.git
cd CensorBot
```

2. Install dependencies:
```bash
uv sync
```

3. Configure environment variables:
```bash
cp .env.example .env
# Edit .env with your API credentials
```

4. Run the application:
```bash
uv run python src/main.py
```

5. Open your browser to `http://localhost:8080`

## Configuration

Create a `.env` file with the following variables:

```env
# LLM Backend Configuration
BACKEND_BASE_URL=https://api.deepseek.com  # Your LLM API endpoint
BACKEND_API_TOKEN=your-api-token-here       # API authentication token
BACKEND_MODEL=deepseek-chat                 # Model to use for censoring
```

## Usage

1. **Paste Text**: Copy your text containing sensitive customer information into the input field
2. **Process**: Click "Censor Data" to automatically detect and replace sensitive information
3. **Copy Result**: Use the censored text safely with any external AI service

### What Gets Censored

- Personal names
- Email addresses
- Phone numbers
- Physical addresses
- Social Security Numbers
- Credit card numbers
- Bank account numbers
- Driver's license numbers
- Passport numbers
- Medical record numbers
- IP addresses
- Usernames and passwords
- Company names (in customer context)
- Dates of birth

## Project Structure

```
CensorBot/
├── src/
│   ├── main.py          # Main application with NiceGUI interface
│   ├── prompt.md        # System prompt for the censoring LLM
│   └── lib/
│       └── llm.py       # LLM integration module
├── examples/            # 30+ test cases with various sensitive data
│   ├── 01_customer_support.txt
│   ├── 02_medical_record.txt
│   └── ...
├── .env.example         # Environment variables template
├── pyproject.toml       # Project dependencies
└── CLAUDE.md           # AI assistant instructions
```

## Development

### Running Tests

Test the censoring with example files:
```bash
# The application loads a random example on startup
uv run python src/main.py
```

### Adding Dependencies

```bash
uv add <package-name>
```

### Project Commands

```bash
# Install dependencies
uv sync

# Run the application
uv run python src/main.py

# Format code (if configured)
uv run black src/

# Type checking (if configured)
uv run mypy src/
```

## Security Considerations

- **Local Processing**: Use a local or self-hosted LLM for maximum security
- **No Data Storage**: CensorBot doesn't store any processed text
- **API Security**: Keep your API tokens secure and never commit them
- **HTTPS Only**: Use HTTPS for API communications
- **Regular Updates**: Keep dependencies updated for security patches

## Use Cases

- **IT Support Tickets**: Sanitize customer tickets before using AI for solutions
- **Documentation**: Remove sensitive data from technical documentation
- **Training Data**: Prepare datasets for ML training without privacy concerns
- **Compliance**: Meet GDPR, HIPAA, and other privacy regulations
- **Knowledge Base**: Create sanitized versions of customer interactions

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Built with [NiceGUI](https://nicegui.io/) for the web interface
- Powered by [uv](https://github.com/astral-sh/uv) for fast Python package management
- AI censoring via OpenAI-compatible APIs

## Support

For issues, questions, or suggestions, please open an issue on GitHub.

---

**⚠️ Important**: This tool is designed to help protect privacy but should not be the only measure. Always review censored output and follow your organization's data protection policies.