Model Card for deberta-v3-large-self-disclosure-detection
The model detects self-disclosures (personal information) in a sentence, framed as a binary token classification task. For example, "I am 22 years old and ..." is labeled ["DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "O", ...].
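Concretely, a sentence is split into whitespace-separated words and each word receives one label. A minimal sketch of that labeling for the example above (the truncated sentence is kept as-is):

words = "I am 22 years old and ...".split()
# -> ["I", "am", "22", "years", "old", "and", "..."]
labels = ["DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "O", "O"]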
The model is able to detect disclosures in the following 17 categories: "Age", "Age_Gender", "Appearance", "Education", "Family", "Finance", "Gender", "Health", "Husband_BF", "Location", "Mental_Health", "Occupation", "Pet", "Race_Nationality", "Relationship_Status", "Sexual_Orientation", "Wife_GF".
For more details, please read the paper: Reducing Privacy Risks in Online Self-Disclosures with Language Models.
By accessing this model, you agree to the following guidelines:
- Only use the model for research purposes.
- No redistribution without the author's agreement.
- Any derivative works created using this model must acknowledge the original author.
Model Description
- Model type: A binary token-classification model finetuned to detect self-disclosures
- Language(s) (NLP): English
- License: Creative Commons Attribution-NonCommercial
- Finetuned from model: microsoft/deberta-v3-large
Example Code
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoConfig, DataCollatorForTokenClassification
model_path = "douy/deberta-v3-large-self-disclosure-detection-binary"
config = AutoConfig.from_pretrained(model_path)
label2id = config.label2id
id2label = config.id2label
config.num_labels = 2  # binary task: "O" vs. "DISCLOSURE"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForTokenClassification.from_pretrained(model_path, config=config, device_map="cuda:0").eval()
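# Note: device_map="cuda:0" assumes a single GPU; pass device_map="cpu" (or
# another device string) if no GPU is available.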
def tokenize_and_align_labels(words):
    tokenized_inputs = tokenizer(
        words,
        padding=False,
        is_split_into_words=True,
    )
    # at inference time the true labels are unknown, so every word gets "O"
    word_ids = tokenized_inputs.word_ids(0)
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        # Special tokens have a word id that is None. We set the label to -100 so they are automatically
        # ignored in the loss function.
        if word_idx is None:
            label_ids.append(-100)
        # We set the label for the first token of each word.
        elif word_idx != previous_word_idx:
            label_ids.append(label2id["O"])
        # For the other tokens in a word, we set the label to -100
        else:
            label_ids.append(-100)
        previous_word_idx = word_idx
    tokenized_inputs["labels"] = label_ids
    return tokenized_inputs
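# Note: the "labels" built above are alignment scaffolding only; positions
# marked -100 (special tokens and non-first subwords) are ignored by the loss
# and skipped when reading out predictions below.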
class DisclosureDataset(Dataset):
    def __init__(self, inputs, tokenizer, tokenize_and_align_labels_function):
        self.inputs = inputs
        self.tokenizer = tokenizer
        self.tokenize_and_align_labels_function = tokenize_and_align_labels_function
    def __len__(self):
        return len(self.inputs)
    def __getitem__(self, idx):
        words = self.inputs[idx]
        tokenized_inputs = self.tokenize_and_align_labels_function(words)
        return tokenized_inputs
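# Padding is deliberately deferred: DataCollatorForTokenClassification pads
# input_ids per batch and pads the labels with -100 so shapes line up.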
    
    
sentences = [
    "I am a 23-year-old who is currently going through the last leg of undergraduate school.",
    "We also partnered with news and data providers to add up-to-date information and new visual designs for categories like weather, stocks, sports, news, and maps.",
    "My husband and I live in US.",
    "I was messing with advanced voice the other day and I was like, 'Oh, I can do this.'",
]
inputs = [sentence.split() for sentence in sentences]  # whitespace-split into word lists
data_collator = DataCollatorForTokenClassification(tokenizer)
dataset = DisclosureDataset(inputs, tokenizer, tokenize_and_align_labels)
dataloader = DataLoader(dataset, collate_fn=data_collator, batch_size=2)
total_predictions = []
for batch in dataloader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.inference_mode():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(-1)
    labels = batch["labels"]
    predictions = predictions.cpu().tolist()
    labels = labels.cpu().tolist()
    # keep predictions only at positions with a real label
    # (the first subword of each word); -100 positions are skipped
    true_predictions = []
    for i, label in enumerate(labels):
        true_pred = []
        for j, m in enumerate(label):
            if m != -100:
                true_pred.append(id2label[predictions[i][j]])
        true_predictions.append(true_pred)
    total_predictions.extend(true_predictions)
    
for word, pred in zip(inputs, total_predictions):
    for w, p in zip(word, pred):
        print(w, p)
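The loop prints one word per line alongside its predicted tag. Illustratively (actual predictions depend on the model), the first sentence would be expected to start along these lines:

I DISCLOSURE
am DISCLOSURE
a DISCLOSURE
23-year-old DISCLOSURE
who O
...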
Citation
@article{dou2023reducing,
  title={Reducing Privacy Risks in Online Self-Disclosures with Language Models},
  author={Dou, Yao and Krsek, Isadora and Naous, Tarek and Kabra, Anubha and Das, Sauvik and Ritter, Alan and Xu, Wei},
  journal={arXiv preprint arXiv:2311.09538},
  year={2023}
}