---
title: PII Detection with BERT
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---

# PII Detection with BERT

This Space demonstrates a BERT model fine-tuned to detect Personally Identifiable Information (PII) in text.

## Model Details

- **Base Model**: [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
- **Training Dataset**: [ai4privacy/pii-masking-43k](https://huggingface.co/datasets/ai4privacy/pii-masking-43k)
- **Task**: Token Classification / Named Entity Recognition (NER)
- **Number of Entity Types**: 27

## Detectable PII Types

The model can identify 27 different types of personal information:

### Identity Information
- NAME, USERNAME, DISPLAYNAME, GENDER, JOB

### Contact Information
- EMAIL, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE

### Financial Information
- CREDITCARDNUM, CREDITCARDISSUER, IBAN, BIC
- ACCOUNTNAME, ACCOUNTNUM, CURRENCY, COINADDRESS

### Technical Information
- IP, MAC, URL, USERAGENT, PASSWORD

### Other
- NUM, ORDINALDIRECTION

## How It Works

1. **Input**: The user provides text that may contain personal information
2. **Tokenization**: The text is split into tokens using the BERT tokenizer
3. **Classification**: Each token is classified into one of 27 entity types or "O" (no entity)
4. **Visualization**: Detected entities are highlighted in different colors

## Training Details

- Learning Rate: 5e-05
- Batch Size: 16 (train), 64 (eval)
- Epochs: 3
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Warmup Steps: 500

## Use Cases

- **Data Privacy**: Identify PII before sharing documents
- **Data Anonymization**: Find information that needs masking
- **Compliance**: Help meet GDPR and CCPA requirements
- **Security**: Detect sensitive information leaks

## Limitations

- Maximum input length: 512 tokens
- Optimized for English text
- May not detect all variations of PII
- Performance depends on text format and quality

## Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-username/your-space-name"  # Update after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"
# Truncate to the model's 512-token limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Map each token's highest-scoring class to its label name
predicted_ids = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i.item()] for i in predicted_ids]
```

## License

Apache 2.0
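
## Appendix: Masking Detected Spans

For the data anonymization use case listed above, detected entities still have to be replaced in the original text. The sketch below is one possible way to do that, not part of this Space's `app.py`: it assumes entities are already available as character spans `(start, end, label)`, which is the kind of information the `transformers` token-classification pipeline reports when run with an aggregation strategy. The helper name `mask_spans` and the sample spans are hypothetical and written by hand for illustration; they are not real model output.

```python
from typing import Iterable, Tuple

# (start, end, label) with `end` exclusive; this layout is an assumption
Span = Tuple[int, int, str]


def mask_spans(text: str, spans: Iterable[Span]) -> str:
    """Replace each character span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Overlapping span: skip it rather than guess how to merge
            continue
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # placeholder for the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining tail of the text
    return "".join(pieces)


text = "My name is John Smith and my email is john@example.com"
name, email = "John Smith", "john@example.com"
# Hand-written spans for illustration only
spans = [
    (text.index(name), text.index(name) + len(name), "NAME"),
    (text.index(email), text.index(email) + len(email), "EMAIL"),
]
masked = mask_spans(text, spans)
print(masked)  # My name is [NAME] and my email is [EMAIL]
```

Skipping overlapping spans keeps the helper simple; a production pipeline might instead merge them or keep the higher-scoring one.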