This commit is contained in:
2025-08-29 21:33:33 +02:00
parent df4eeca9cb
commit 2b8271263d
36 changed files with 1439 additions and 0 deletions

1
.python-version Normal file
View File

@@ -0,0 +1 @@
3.12

77
CLAUDE.md Normal file
View File

@@ -0,0 +1,77 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
CensorBot is a Python application that acts as a data sanitization tool for IT service companies. It uses a small LLM (like DeepSeek) to automatically detect and censor sensitive customer information in text inputs. Users input text containing customer data, and CensorBot returns a sanitized version with all sensitive information replaced by placeholders. This censored text can then be safely used with any external LLM service (Claude, GPT-4, etc.) without risking data breaches. The application uses NiceGUI for the frontend.
## Key Architecture
### Core Components
- **Frontend**: NiceGUI-based web interface (to be implemented in `src/main.py`)
- **LLM Integration**: `src/lib/llm.py` provides async HTTP client for LLM API communication
- Supports both streaming and non-streaming responses
- Uses httpx for async HTTP requests
- Expects OpenAI-compatible chat completions API
### Configuration
- **Environment Variables** (via `.env` file):
- `BACKEND_BASE_URL`: Censoring LLM backend URL (e.g., DeepSeek API)
- `BACKEND_API_TOKEN`: API authentication token for the censoring LLM
- `BACKEND_MODEL`: Model to use for censoring (e.g., "deepseek-chat")
- **System Prompt**: Located in `src/prompt.md` - defines the censoring LLM's behavior for identifying and redacting sensitive data
## Development Commands
### Package Management (using uv)
```bash
# Install dependencies
uv sync
# Add a dependency
uv add <package>
# Run the application
uv run src/main.py
```
### Running the Application
```bash
# Run the NiceGUI application (once implemented)
uv run python src/main.py
```
## Important Implementation Notes
1. **LLM Integration**: The `get_response` function in `src/lib/llm.py` is fully functional and expects:
- Backend configuration with `BACKEND_BASE_URL`, `BACKEND_API_TOKEN` and `BACKEND_MODEL`
- Messages in OpenAI format with roles: "system", "assistant", "user"
- Returns async generators for both streaming and non-streaming modes
- Used exclusively for the censoring functionality
2. **Security Focus**: This application handles sensitive customer data. Always:
- Ensure proper data sanitization before and after LLM processing
- Never log or expose raw customer information
- Keep API tokens secure and never commit them
3. **Frontend Development**: When implementing the NiceGUI interface in `src/main.py`:
- Provide input field for text containing sensitive data
- Display censored output that users can copy
- Use async handlers to integrate with the LLM backend
- Implement proper error handling for API failures
- Consider showing before/after comparison
- Add copy-to-clipboard functionality for the censored text
4. **System Prompt**: The `src/prompt.md` file should contain clear instructions for the censoring LLM on:
- What constitutes customer information (names, addresses, phone numbers, emails, etc.)
- How to censor/redact sensitive data (e.g., replace with placeholders like [CUSTOMER_NAME], [EMAIL], etc.)
- Maintaining context while protecting privacy
- Ensuring the output remains useful for the downstream processing LLM
5. **Usage Flow**:
- User pastes text with customer data into CensorBot
- CensorBot uses small LLM to identify and replace sensitive information
- User receives censored text with placeholders
- User can copy censored text and use it with any external LLM service
- No direct integration with external LLMs - CensorBot is a standalone sanitization tool

View File

@@ -0,0 +1 @@
Hello, my name is Robert Johnson and I'm having issues with my account. My email is robert.j@techcorp.com and you can reach me at 555-123-4567. I live at 123 Main Street, Springfield, IL 62701. My account number is ACC-789456123.

View File

@@ -0,0 +1 @@
Patient Maria Garcia, DOB: 03/15/1985, MRN: MED-445566, visited our clinic on January 10, 2024. Her insurance ID is INS-778899-X. Contact phone: (312) 555-9876. Emergency contact: Juan Garcia at 312-555-6543.

View File

@@ -0,0 +1 @@
Customer Sarah Williams (SSN: 123-45-6789) requested a wire transfer from account 9876543210 to routing number 021000021. Her credit card ending in 4532 was used for verification. She can be contacted at sarah.w@finance.net or 202-555-0147.

View File

@@ -0,0 +1 @@
Username: jsmith2024, IP Address: 192.168.1.105. The user John Smith from DataTech Solutions reported that his password "SecurePass123!" was compromised. His employee ID is EMP-00456 and desk phone is ext. 3421.

View File

@@ -0,0 +1 @@
Order #ORD-2024-78945 for Michael Chen. Shipping to: 456 Oak Avenue, Apt 12B, San Francisco, CA 94102. Billing address same as shipping. Phone: 415-555-3698. Email: m.chen@webmail.com. Paid with Visa ending in 8901.

View File

@@ -0,0 +1 @@
This agreement is between Jennifer Anderson (Driver's License: DL-456789123) residing at 789 Elm Street, Boston, MA 02134, and Thomas Brown (Passport: P12345678) of 321 Pine Road, Cambridge, MA 02139. Contact: jennifer@lawfirm.com, thomas.brown@corporate.org.

View File

@@ -0,0 +1 @@
Passenger: David Lee, Passport Number: A12345678, Date of Birth: 07/22/1990. Flight booking confirmation: ABC123XYZ. Contact email: david.lee@travel.com, Mobile: +1-650-555-2468. Frequent flyer number: FF-998877665.

View File

@@ -0,0 +1 @@
I'm Lisa Martinez interested in the property at 567 Maple Drive. My current address is 890 Cedar Lane, Austin, TX 78701. You can reach me at 512-555-7890 or lisa.martinez@realestate.net. My pre-approval amount is $450,000 from Bank of Example, account ending in 3456.

View File

@@ -0,0 +1 @@
Policyholder: James Wilson, Policy #: POL-123456789, SSN: 987-65-4321. Claim for accident on 12/01/2023 at intersection of 5th Avenue and Main Street. Vehicle VIN: 1HGCM82633A123456. Contact: 303-555-4321, jwilson@insurance.co.

View File

@@ -0,0 +1 @@
Student Emma Thompson, ID: STU-20240156, enrolled in Computer Science program. Emergency contact: Mark Thompson (father) at 617-555-8765. Home address: 234 School Street, Newton, MA 02458. Email: emma.t@university.edu, DOB: 09/30/2002.

View File

@@ -0,0 +1,12 @@
From: alice.jones@company.com
To: bob.smith@client.org, carol.white@vendor.net
CC: manager@company.com
Hi Bob and Carol,
Following up on our call with Alice Jones (555-111-2222) and Bob Smith (555-333-4444). Alice's team at TechCorp (located at 100 Tech Way, Silicon Valley, CA 94000) will handle the implementation. Bob from ClientCo at 200 Business Blvd needs access by Friday.
Carol from VendorInc (Tax ID: 12-3456789) will provide licenses. Invoice to: Accounts Payable, 300 Finance Street, New York, NY 10001, Attn: David Brown.
Best regards,
Alice

View File

@@ -0,0 +1 @@
Our company newsletter features employee of the month Jessica Taylor from the Denver office. While we can't share personal details, Jessica (Employee ID: EMP-7890) has been with us for 5 years. For HR matters, contact her at 303-555-9999 or jessica.taylor@internal.company.com. Her public bio is available on our website, but her home address (456 Private Lane, Denver, CO 80201) and SSN (111-22-3333) remain confidential.

View File

@@ -0,0 +1,6 @@
Ticket #SUP-2024-5678
User: admin@smallbusiness.com
Server IP: 10.0.0.50 (Internal), 203.0.113.45 (Public)
Database connection string: Server=db.internal;User=dbadmin;Password=Str0ng!Pass#2024;Database=CustomerDB
Error log shows user 'john.doe' (ID: USR-445566) attempted login from IP 198.51.100.22 at 14:32:05 UTC.
Customer affected: SmallBiz LLC, Account #: BIZ-789456, Primary contact: owner@smallbusiness.com, Phone: 555-BIZHELP (555-249-4357).

View File

@@ -0,0 +1,3 @@
European customer Hans Müller (Passport: C01234567) from Hauptstraße 123, 10115 Berlin, Germany. Phone: +49 30 12345678, Email: hans.mueller@deutsch.de.
American partner: Jane Doe, 123 Main St, NY 10001, USA, +1-212-555-0199, jane@american.com, SSN: 555-12-9876.
Asian contact: 田中太郎 (Tanaka Taro), 〒100-0001 東京都千代田区, Japan, +81-3-1234-5678, tanaka@japanese.jp, Employee number: JP-00123.

View File

@@ -0,0 +1,5 @@
2024-01-15 10:30:45 ERROR: Authentication failed for user='mary.johnson@techfirm.com' from IP='192.168.100.55'
2024-01-15 10:30:46 INFO: Retry attempt with credentials: username='mary.johnson', password='MyP@ssw0rd123'
2024-01-15 10:30:47 SUCCESS: User authenticated. Session token: abc123xyz789, User ID: USR-998877
2024-01-15 10:30:48 INFO: Accessing customer record: John Adams, Account: 654321, Email: jadams@email.net
2024-01-15 10:30:49 DEBUG: SQL Query: SELECT * FROM users WHERE ssn='987-65-4321' AND dob='1980-05-15'

View File

@@ -0,0 +1,17 @@
REGISTRATION FORM SUBMISSION:
- First Name: Patricia
- Last Name: Robinson
- Date of Birth: 06/25/1988
- Social Security: 222-33-4444
- Email: patricia.r@emailprovider.com
- Phone: (424) 555-6789
- Address Line 1: 789 Sunset Boulevard
- Address Line 2: Suite 456
- City: Los Angeles
- State: CA
- ZIP: 90028
- Emergency Contact Name: Robert Robinson
- Emergency Contact Phone: 424-555-9876
- Insurance Provider: HealthCare Plus
- Policy Number: HCP-123456789
- Group Number: GRP-456

View File

@@ -0,0 +1,6 @@
[09:15] Mike Chen: Hey team, client ABC Corp (Tax ID: 98-7654321) needs the report
[09:16] Sarah Lin: Which contact? Is it still j.wright@abccorp.com?
[09:17] Mike Chen: No, new contact is Tom Baker, 555-0123, t.baker@abccorp.com
[09:18] Sarah Lin: Got it. Sending to their office at 500 Corporate Way, Suite 200
[09:19] Mike Chen: Include billing info: Account #ABC-2024-789, PO# 45678
[09:20] IT Support: FYI their VPN IP range is 203.0.113.0/24, firewall exceptions needed

View File

@@ -0,0 +1 @@
Customer François Dubois (Client ID: FR-12345) contacted us from Paris. Address: 24 Rue de la Paix, 75002 Paris, France. Téléphone: +33 1 23 45 67 89. His Spanish colleague María González (maria.gonzalez@empresa.es, +34 91 234 5678) from Calle Mayor 15, 28013 Madrid will handle the European coordination. Payment via IBAN: FR14 2004 1010 0505 0001 3M02 606.

View File

@@ -0,0 +1,4 @@
Order number: 555-123-4567 (looks like phone but it's an order number)
Real phone: 555-123-4567 (this is actually a phone number)
Reference code: john.smith@2024 (not an email)
Actual email: john.smith@example.com

View File

@@ -0,0 +1,5 @@
Customer J. Smith (only initial provided)
Phone: 555-1234 (missing area code)
Email: contact@ (incomplete email)
Address: Main Street (no number or city)
Card ending in: 1234

View File

@@ -0,0 +1 @@
Smith from Accounting called about the Johnson report. Contact Building 5, Room 302, Extension 4567. Reference case #12345 for details about 123 Main Street property.

View File

@@ -0,0 +1 @@
CEO John Anderson (public figure) discussed with customer John Anderson (private individual, ID: CUST-123, email: janderson@private.net). The CEO can be reached via our public line 1-800-COMPANY, while the customer's direct line is 555-987-6543.

View File

@@ -0,0 +1,3 @@
Phone: 555.123.4567 vs 555-123-4567 vs (555) 123-4567 vs +1 555 123 4567
SSN: 123-45-6789 vs 123456789 vs 123 45 6789
Date: 01/15/1990 vs 15-01-1990 vs January 15, 1990 vs 1990-01-15

View File

@@ -0,0 +1,3 @@
Website: https://user:password123@example.com/path/to/resource
File path: C:\Users\john.doe\Documents\SSN-123-45-6789.pdf
API endpoint: https://api.service.com/v1/users/12345/email/john@example.com

View File

@@ -0,0 +1,4 @@
Product codes: EMAIL-SCANNER-PRO, PHONE-HOLDER-XL
Book titles: "How to Protect Your SSN" by John Smith
Company names: Smith & Associates, 123 Solutions Inc.
Street names: Elizabeth Avenue, Johnson Boulevard

View File

@@ -0,0 +1,4 @@
Base64 email: am9obi5kb2VAZXhhbXBsZS5jb20= (john.doe@example.com)
URL encoded: john%40example.com
Hex phone: 0x5551234567
MD5 of SSN: 5d41402abc4b2a76b9719d911017c592

View File

@@ -0,0 +1,4 @@
Mathematical: The calculation 555-123-4567 equals -4135
Code variable: let customer_ssn = "123-45-6789";
Quote: She said "my email is jane@example.com"
JSON: {"name":"John Doe","phone":"555-0123","ssn":"111-22-3333"}

View File

@@ -0,0 +1,6 @@
123-45-6789
john.doe@example.com
555-123-4567
123 Main Street
Robert Johnson
CC: 4532-1111-2222-3333

View File

@@ -0,0 +1,4 @@
J*** D** called from 555-xxx-4567
Email: j****@example.com
SSN: ***-**-6789
Address: 1** Main Street, Spring*****, IL

View File

@@ -0,0 +1,4 @@
John Smith (Customer) - john.smith@customer.com - 555-111-2222
John Smith (Employee) - john.smith@company.com - 555-333-4444
John Smith (Vendor) - john.smith@vendor.com - 555-555-6666
Meeting with all three John Smiths scheduled for Tuesday.

9
pyproject.toml Normal file
View File

@@ -0,0 +1,9 @@
[project]
name = "censorbot"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"nicegui>=2.23.3",
]

174
src/main.py Normal file
View File

@@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""
CensorBot - Data Sanitization Tool
A NiceGUI-based application for removing sensitive customer information from text
"""
import asyncio
import os
import random
from typing import List
from dotenv import load_dotenv
from nicegui import ui
from lib import get_response, LLMBackend, LLMMessage
load_dotenv()
def get_random_example_text() -> str:
examples_dir = "examples"
# Get all .txt files
txt_files = [f for f in os.listdir(examples_dir) if f.endswith('.txt')]
if not txt_files:
raise FileNotFoundError("No .txt files found in examples directory")
# Pick random file
random_file = random.choice(txt_files)
file_path = os.path.join(examples_dir, random_file)
# Read and return content
with open(file_path, 'r', encoding='utf-8') as f:
return f.read()
async def main():
input_text: ui.textarea
output_text: ui.textarea
prompt: str
with open('src/prompt.md') as prompt_file:
prompt = prompt_file.read()
backend: LLMBackend = {'base_url': os.environ['BACKEND_BASE_URL'],
'api_token': os.environ['BACKEND_API_TOKEN'],
'model': os.environ['BACKEND_MODEL']}
async def censor_input():
messages: List[LLMMessage] = [
{'role': 'system', 'content': prompt},
{'role': 'user', 'content': input_text.value}
]
try:
# Stream the response with cancellation support
async for chunk in get_response(backend, messages, True): # type: ignore
# Check if task was cancelled
current_task = asyncio.current_task()
if current_task and current_task.cancelled():
break
if 'content' in chunk:
output_text.value += chunk['content']
print(chunk['content'])
# Small delay to allow UI updates and cancellation checks
await asyncio.sleep(0.01)
except asyncio.CancelledError:
ui.notify('Generation stopped by user', type='info')
# Save whatever content we have so far
return
# Application header
with ui.header(elevated=True).classes('q-pa-md'):
ui.label('🔒 CensorBot').classes('text-h4 text-weight-bold')
ui.label('Secure Data Sanitization for IT Service Companies').classes('text-subtitle1 text-grey-7')
# Main container
with ui.column().classes('w-full max-w-6xl mx-auto q-pa-lg q-gutter-md'):
# Input section
with ui.card().classes('w-full'):
ui.label('Original Text').classes('text-h6 text-weight-medium')
ui.label('Contains sensitive customer information').classes('text-caption text-grey-7')
input_text = ui.textarea(
placeholder='Paste your text here...\n\n'
'Example:\n'
'Customer John Smith called from 555-1234 about issue with account john@example.com',
value=get_random_example_text()
).classes('w-full').style('font-family: monospace').props('autogrow')
# Character count
char_count_label = ui.label('0 characters').classes('text-caption text-grey-6')
# Output section
with ui.card().classes('w-full'):
ui.label('Censored Text').classes('text-h6 text-weight-medium')
ui.label('Safe to use with external LLMs').classes('text-caption text-green-7')
output_text = ui.textarea(
placeholder='Censored text will appear here...\n\n'
'Example:\n'
'Customer [CUSTOMER_NAME] called from [PHONE_NUMBER] about issue with account [EMAIL]',
value=''
).classes('w-full').style('font-family: monospace; background-color: #f5f5f5').props('readonly autogrow')
# Copy button
with ui.row().classes('w-full justify-end q-gutter-sm'):
copy_button = ui.button('Copy to Clipboard', icon='content_copy').props('outline')
copy_button.disable()
# Action buttons
with ui.card().classes('w-full'):
with ui.row().classes('w-full justify-center q-gutter-md'):
clear_button = ui.button('Clear All', icon='clear').props('outline color=negative')
process_button = ui.button('Censor Data', icon='shield', on_click=censor_input).props('color=primary size=lg')
# Statistics section
with ui.expansion('Processing Statistics', icon='analytics').classes('w-full'):
with ui.row().classes('w-full q-gutter-md'):
with ui.column().classes('col'):
ui.label('Items Censored').classes('text-weight-medium')
stats_censored = ui.label('0').classes('text-h4 text-primary')
with ui.column().classes('col'):
ui.label('Processing Time').classes('text-weight-medium')
stats_time = ui.label('0.0s').classes('text-h4 text-primary')
with ui.column().classes('col'):
ui.label('Data Reduction').classes('text-weight-medium')
stats_reduction = ui.label('0%').classes('text-h4 text-primary')
# Event handlers (mockup only - no real functionality)
def update_char_count():
char_count_label.text = f'{len(input_text.value)} characters'
def mock_copy():
ui.notify('Text copied to clipboard (mockup)', type='positive')
def clear_all():
input_text.value = ''
output_text.value = ''
copy_button.disable()
stats_censored.text = '0'
stats_time.text = '0.0s'
stats_reduction.text = '0%'
update_char_count()
# Connect event handlers
input_text.on('input', update_char_count)
copy_button.on_click(mock_copy)
clear_button.on_click(clear_all)
# Footer
with ui.footer().classes('q-pa-md text-center'):
ui.label('CensorBot - Protecting Customer Privacy').classes('text-caption text-grey-6')
ui.label('⚠️ This is a mockup - no actual processing implemented yet').classes('text-caption text-orange')
# Run the application
if __name__ in {"__main__", "__mp_main__"}:
@ui.page('/')
async def _():
await main()
ui.run(
title='CensorBot - Data Sanitization Tool',
favicon='🔒',
show=False,
dark=False,
port=8080
)

43
src/prompt.md Normal file
View File

@@ -0,0 +1,43 @@
# Data Censoring Instructions
You are a data sanitization assistant. Your sole purpose is to identify and replace sensitive customer information with appropriate placeholders while maintaining the context and meaning of the text.
## What to Censor
Replace the following types of sensitive information:
1. **Personal Names**: Replace with `[NAME]` or `[CUSTOMER_NAME]`
2. **Email Addresses**: Replace with `[EMAIL]`
3. **Phone Numbers**: Replace with `[PHONE]`
4. **Physical Addresses**: Replace with `[ADDRESS]`
5. **Social Security Numbers**: Replace with `[SSN]`
6. **Credit Card Numbers**: Replace with `[CREDIT_CARD]`
7. **Bank Account Numbers**: Replace with `[ACCOUNT_NUMBER]`
8. **Driver's License Numbers**: Replace with `[LICENSE]`
9. **Passport Numbers**: Replace with `[PASSPORT]`
10. **Medical Record Numbers**: Replace with `[MRN]`
11. **IP Addresses**: Replace with `[IP_ADDRESS]`
12. **Usernames/User IDs**: Replace with `[USERNAME]`
13. **Passwords**: Replace with `[PASSWORD]`
14. **Company Names** (when context indicates it's customer data): Replace with `[COMPANY]`
15. **Dates of Birth**: Replace with `[DOB]`
## Rules
1. **Preserve Context**: Keep all non-sensitive text exactly as provided
2. **Maintain Structure**: Preserve formatting, punctuation, and spacing
3. **Be Consistent**: Use the same placeholder for the same entity throughout the text
4. **No Commentary**: Output ONLY the censored text, no explanations or additional text
5. **When in Doubt**: If something might be sensitive, censor it
## Example
Input:
"John Smith from Acme Corp called at 555-1234 about his account john.smith@acme.com. His credit card ending in 4567 was declined."
Output:
"[CUSTOMER_NAME] from [COMPANY] called at [PHONE] about his account [EMAIL]. His credit card ending in [CREDIT_CARD] was declined."
## Your Task
Censor the following text by replacing all sensitive information with appropriate placeholders. Output only the censored version:

1031
uv.lock generated Normal file

File diff suppressed because it is too large Load Diff