maartenvs committed · verified
Commit eb01413 · 1 Parent(s): 1c87743

Upload 5 files

Files changed (5):
  1. BIAS.md +5 -0
  2. EXPLAINABILITY.md +13 -0
  3. PRIVACY.md +9 -0
  4. README.md +119 -159
  5. SAFETY&SECURITY.md +6 -0
BIAS.md ADDED
@@ -0,0 +1,5 @@
+ Field | Response
+ :---|:---
+ Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | Not applicable for the training phase, as the model was trained exclusively on synthetic data. Stakeholder review with impacted groups is recommended during deployment to validate real-world performance.
+ Bias Metric (If Measured): | Strict F1 Score at a 0.3 threshold. Performance on evaluation datasets was: Argilla PII (0.70), AI4Privacy (0.64), and nvidia/Nemotron-PII (0.87). Demographic performance breakdowns are not available.
+ Measures taken to mitigate against unwanted bias: | Not applicable.
EXPLAINABILITY.md ADDED
@@ -0,0 +1,13 @@
+ Field | Response
+ :---|:---
+ Intended Task/Domain: | PII/PHI Detection: To detect and classify Personally Identifiable Information (PII) and Protected Health Information (PHI) in structured and unstructured text across domains like healthcare, finance, and legal.
+ Model Type: | Transformer (GLiNER architecture).
+ Intended Users: | Developers and data professionals implementing data governance, privacy compliance (GDPR, HIPAA), and content moderation workflows.
+ Output: | A list of dictionaries, where each dictionary contains the detected text, its label (e.g., SSN), start and end positions, and a confidence score.
+ Describe how the model works: | The model takes a text string as input and uses a non-generative transformer architecture to produce span-level entity annotations. It identifies and labels sensitive information across 55+ categories without generating new text.
+ Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable
+ Technical Limitations & Mitigation: | Limitation: Performance varies by domain, text format, and the confidence threshold chosen. Mitigation: NVIDIA recommends use-case-specific validation and human review for high-stakes deployments to ensure accuracy and safety.
+ Verified to have met prescribed NVIDIA quality standards: | Yes
+ Performance Metrics: | Strict F1 Score is the primary evaluation metric. The model also provides per-entity confidence scores in its output.
+ Potential Known Risks: | If the model does not work as intended, it could lead to false negatives (failing to detect PII) or false positives (incorrectly flagging non-sensitive data, causing unnecessary redaction).
+ Licensing: | Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
PRIVACY.md ADDED
@@ -0,0 +1,9 @@
+ Field | Response
+ :---|:---
+ Generatable or reverse engineerable personal data? | No
+ Personal data used to create this model? | No
+ How often is dataset reviewed? | The dataset was reviewed during its creation, model training, evaluation, and before release.
+ Is there provenance for all datasets used in training? | Yes
+ Does data labeling (annotation, metadata) comply with privacy laws? | Yes. Labels were automatically injected during the synthetic data generation process, so no real personal data was ever viewed or handled.
+ Is data compliant with data subject requests for data correction or removal, if such a request was made? | Not Applicable.
+ Applicable Privacy Policy: | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
README.md CHANGED
@@ -1,162 +1,122 @@
  ---
- license: apache-2.0
- language:
- - en
- library_name: gliner
- datasets:
- - nvidia/Nemotron-PII
- pipeline_tag: token-classification
- tags:
- - PII
- - PHI
- - GLiNER
- - information extraction
- - encoder
- - entity recognition
- - privacy
- ---
 
- # GLiNER-PII: Fine-Tuned Model for PII/PHI Detection
-
- GLiNER-PII is a fine-tuned successor to the Gretel GLiNER PII models. Built on the GLiNER bi-large base (`knowledgator/gliner-bi-large-v1.0`), it detects and classifies a broad range of Personally Identifiable Information (PII) and Protected Health Information (PHI) in **English text**. The model works with both structured and unstructured text and is non-generative, producing span-level entity annotations with confidence scores across 55+ categories.
-
- This model is intended for privacy-preserving NLP workflows such as de-identification, redaction, and compliance checks in healthcare, finance, legal, and enterprise data pipelines.
-
- For more information about the base GLiNER model, including its architecture and general capabilities, refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-large-v1.0).
-
- ## Training Data
-
- The model was fine-tuned on the `nvidia/Nemotron-PII` dataset, a synthetic, persona-grounded dataset containing 100,000 records across 50+ industries with span-level annotations for 55+ PII/PHI categories. The dataset was generated with NVIDIA NeMo Data Designer using synthetic personas grounded in U.S. Census data to ensure demographic realism and contextual consistency.
-
- **Dataset Details:**
- - **Size:** 100,000 records (50k train / 50k test)
- - **Domains:** 50+ industries (healthcare, finance, cybersecurity, etc.)
- - **Entity Types:** 55+ PII/PHI categories
- - **Locale Coverage:** US and international formats
- - **Content Types:** Both structured (forms, invoices) and unstructured (emails, notes) documents
-
- For detailed statistics on the dataset, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/nvidia/Nemotron-PII).
-
- ## Use Cases
-
- GLiNER-PII supports detection and redaction of sensitive information across regulated and enterprise scenarios:
-
- - **Healthcare**: Redact PHI in clinical notes, reports, and medical documents
- - **Finance**: Identify account numbers, SSNs, and transaction details in banking and insurance documents
- - **Legal**: Protect client information in contracts, filings, and discovery materials
- - **Enterprise Data Governance**: Scan documents, emails, and data stores for sensitive information
- - **Data Privacy Compliance**: Support GDPR, HIPAA, and CCPA workflows across varied document types
- - **Cybersecurity**: Detect sensitive data in logs, security reports, and incident records
- - **Content Moderation**: Flag personal information in user-generated content
-
- Note: performance varies by domain, format, and threshold, so validation and human review are recommended for high-stakes deployments.
-
- ## Installation & Usage
-
- Ensure you have Python installed, then install or update the `gliner` package (e.g., `pip install -U gliner`) and run:
-
- ```python
- from gliner import GLiNER
-
- # Load the fine-tuned GLiNER model
- model = GLiNER.from_pretrained("nvidia/gliner-pii")
-
- # Sample text containing PII/PHI entities
- text = """
- **Claim Denial Letter**
-
- **Date:** 2023-11-15
-
- **Claimant Information:**
- - **Claimant Name:** Nataly White
- - **Medical Record Number:** 1842-75-3924
-
- **Claim Denial Details:**
-
- Dear Nataly White,
-
- We are writing to inform you that your disability claim has been denied. The denial is based on the information provided in your medical record, numbered 1842-75-3924.
- After a thorough review, it has been determined that the medical evidence does not meet the criteria outlined in your policy for disability benefits.
- """
-
- # Define the labels for PII/PHI entities
- labels = [
-     "certificate_license_number",
-     "first_name",
-     "date_of_birth",
-     "ssn",
-     "medical_record_number",
-     "password",
-     "unique_id",
-     "phone_number",
-     "national_id",
-     "swift_bic",
-     "company_name",
-     "country",
-     "license_plate",
-     "tax_id",
-     "employee_id",
-     "pin",
-     "state",
-     "email",
-     "date_time",
-     "api_key",
-     "biometric_identifier",
-     "credit_debit_card",
-     "coordinate",
-     "device_identifier",
-     "city",
-     "postcode",
-     "bank_routing_number",
-     "vehicle_identifier",
-     "health_plan_beneficiary_number",
-     "url",
-     "ipv4",
-     "last_name",
-     "cvv",
-     "customer_id",
-     "date",
-     "user_name",
-     "street_address",
-     "ipv6",
-     "account_number",
-     "time",
-     "age",
-     "fax_number",
-     "county",
-     "gender",
-     "sexuality",
-     "political_view",
-     "race_ethnicity",
-     "religious_belief",
-     "language",
-     "blood_type",
-     "mac_address",
-     "http_cookie",
-     "employment_status",
-     "education_level",
-     "occupation"
- ]
-
- # Predict entities with a confidence threshold of 0.3
- entities = model.predict_entities(text, labels, threshold=0.3)
-
- # Display the detected entities
- print(f"{'start':>5} {'end':>5} {'text':<16} {'label'}")
- print("-" * 60)
- for e in entities:
-     print(f"{e['start']:>5} {e['end']:>5} {e['text']:<16} {e['label']}")
-
- # Expected output:
- # start   end text             label
- # ------------------------------------------------------------
- #    42    52 2023-11-15       date
- #   107   113 Nataly           first_name
- #   114   119 White            last_name
- #   152   164 1842-75-3924     medical_record_number
- #   204   210 Nataly           first_name
- #   211   216 White            last_name
- #   376   388 1842-75-3924     medical_record_number
- ```
 
 
 
+ # GLiNER-PII Model Overview
+
+ ### Description:
+ GLiNER-PII is a successor to the Gretel GLiNER PII/PHI models. Built on the GLiNER bi-large base, it detects and classifies a broad range of Personally Identifiable Information (PII) and Protected Health Information (PHI) in structured and unstructured text. It is non-generative and produces span-level entity annotations with confidence scores across 55+ categories. This model was developed by NVIDIA.
+
+ This model is ready for commercial/non-commercial use. <br>
+
+ ### License/Terms of Use
+ Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). <br>
+
+ ### Deployment Geography:
+ Global <br>
+
+ ### Use Case: <br>
+ GLiNER-PII supports detection and redaction of sensitive information across regulated and enterprise scenarios.
+
+ - **Healthcare**: Redact PHI in clinical notes, reports, and medical documents.
+ - **Finance**: Identify account numbers, SSNs, and transaction details in banking and insurance documents.
+ - **Legal**: Protect client information in contracts, filings, and discovery materials.
+ - **Enterprise Data Governance**: Scan documents, emails, and data stores for sensitive information.
+ - **Data Privacy Compliance**: Support GDPR, HIPAA, and CCPA workflows across varied document types.
+ - **Cybersecurity**: Detect sensitive data in logs, security reports, and incident records.
+ - **Content Moderation**: Flag personal information in user-generated content.
+
+ Note: performance varies by domain, format, and threshold, so validation and human review are recommended for high-stakes deployments. <br>
+
+ ### Release Date: <br>
+ Hugging Face 10/30/2025 via https://huggingface.co/nvidia/gliner-pii <br>
+
+ ## Reference(s):
+ - GLiNER base (Hugging Face): https://huggingface.co/knowledgator/gliner-bi-large-v1.0
+ - Gretel GLiNER PII/PHI models: https://huggingface.co/gretelai/gretel-gliner-bi-large-v1.0
+ - Training dataset: https://huggingface.co/datasets/nvidia/nemotron-pii
+ - GLiNER library: https://pypi.org/project/gliner/
+
+ ## Model Architecture:
+ **Architecture Type:** Transformer <br>
+
+ **Network Architecture:** GLiNER <br>
+
+ **This model was developed based on knowledgator/gliner-bi-large-v1.0** <br>
+ **Number of model parameters: 5.7 × 10^8** <br>
+
+ ## Input: <br>
+ **Input Type(s):** Text <br>
+ **Input Format:** UTF-8 string(s) <br>
+ **Input Parameters:** One-Dimensional (1D) <br>
+ **Other Properties Related to Input:** Supports structured and unstructured text <br>
+
+ ## Output: <br>
+ **Output Type(s):** Text <br>
+ **Output Format:** String <br>
+ **Output Parameters:** One-Dimensional (1D) <br>
+ **Other Properties Related to Output:** List of dictionaries with keys {text, label, start, end, score} <br>
+
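To make the output shape concrete, here is a minimal, self-contained sketch of consuming these dictionaries for downstream redaction. The entity list below is hand-written for illustration (it is not actual model output), and `redact` is our own helper, not part of the GLiNER API:

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` follows the documented output shape:
    {text, label, start, end, score}.
    """
    # Apply replacements right-to-left so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['label'].upper()}]" + text[e["end"]:]
    return text

# Illustrative entities (hand-written, not real model output).
sample_text = "Dear Nataly White, your record 1842-75-3924 was reviewed."
sample_entities = [
    {"text": "Nataly", "label": "first_name", "start": 5, "end": 11, "score": 0.98},
    {"text": "White", "label": "last_name", "start": 12, "end": 17, "score": 0.97},
    {"text": "1842-75-3924", "label": "medical_record_number",
     "start": 31, "end": 43, "score": 0.95},
]
print(redact(sample_text, sample_entities))
# Dear [FIRST_NAME] [LAST_NAME], your record [MEDICAL_RECORD_NUMBER] was reviewed.
```

Processing spans right-to-left avoids recomputing offsets after each substitution; per-entity `score` values could also be used to skip low-confidence spans.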
+ ## Software Integration:
+ **Runtime Engine(s):**
+ * PyTorch, GLiNER Python library <br>
+
+ **Supported Hardware Microarchitecture Compatibility:** <br>
+ * NVIDIA Ampere <br>
+ * NVIDIA Blackwell <br>
+ * NVIDIA Hopper <br>
+ * NVIDIA Lovelace <br>
+ * NVIDIA Pascal <br>
+ * NVIDIA Turing <br>
+ * NVIDIA Volta <br>
+ * CPU (x86_64) <br>
+
+ **Preferred/Supported Operating System(s):**
+ * Linux <br>
+
+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>
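Use-case-specific validation can be as simple as sweeping the confidence threshold on a small held-out labeled sample and keeping the operating point with the best strict F1. A sketch of the scoring logic only, with the model call omitted; `pick_threshold` and its inputs are illustrative names of ours, not part of the GLiNER API:

```python
def pick_threshold(scored_preds, gold, thresholds=(0.1, 0.3, 0.5, 0.7)):
    """Return (threshold, strict_f1) maximizing strict span-level F1.

    scored_preds: dicts with start, end, label, score (the model's output shape).
    gold: dicts with start, end, label.
    """
    gold_set = {(g["start"], g["end"], g["label"]) for g in gold}
    best = (thresholds[0], -1.0)
    for t in thresholds:
        # Keep only spans whose confidence clears the candidate threshold.
        kept = {(p["start"], p["end"], p["label"])
                for p in scored_preds if p["score"] >= t}
        tp = len(kept & gold_set)
        prec = tp / len(kept) if kept else 0.0
        rec = tp / len(gold_set) if gold_set else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best

# Toy example: a low-score spurious span hurts precision below 0.3.
preds = [{"start": 0, "end": 4, "label": "ssn", "score": 0.8},
         {"start": 5, "end": 9, "label": "ssn", "score": 0.2}]
gold = [{"start": 0, "end": 4, "label": "ssn"}]
print(pick_threshold(preds, gold))  # (0.3, 1.0)
```

In practice the per-entity scores would come from `model.predict_entities` on the validation sample; the chosen threshold then becomes the deployment operating point.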
+
+ ## Model Version(s):
+ - nvidia/gliner-pii
+ - Version: v1.0
+
+ ## Training and Evaluation Datasets:
+
+ ### Training Dataset
+
+ **Link:** [nvidia/nemotron-pii](https://huggingface.co/datasets/nvidia/nemotron-pii) <br>
+ **Data Modality:** Text <br>
+ **Text Training Data Size:** ~100k records (~10^5; <1B tokens) <br>
+ **Data Collection Method:** Synthetic <br>
+ **Labeling Method:** Synthetic <br>
+
+ **Properties:**
+ Synthetic persona-grounded dataset generated with NVIDIA NeMo Data Designer, spanning 50+ industries and 55+ entity types (U.S. and international formats). Includes both structured and unstructured records. Labels were automatically injected during generation.
+
+ ## Evaluation Datasets
+
+ * [Argilla PII](https://huggingface.co/argilla)
+ * [AI4Privacy](https://huggingface.co/ai4privacy)
+ * [Gretel PII Dataset V1/V2](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1)
+
+ **Data Collection Method:** Hybrid: Automated, Human <br>
+ **Labeling Method:** Hybrid: Automated, Human <br>
+
+ **Evaluation Results** <br>
+ Combined evaluation results across the Argilla PII, AI4Privacy, and nvidia/Nemotron-PII benchmarks:
+
+ | Benchmark           | GLiNER V1 (Strict F1) | GLiNER V2 (Strict F1) |
+ | ------------------- | --------------------: | --------------------: |
+ | Argilla PII         |                  0.64 |                  0.70 |
+ | AI4Privacy          |                  0.60 |                  0.64 |
+ | nvidia/Nemotron-PII |                  0.66 |                  0.87 |
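Strict F1, the metric used throughout this card, credits a prediction only when start, end, and label all match a gold span exactly. A minimal sketch (function and variable names are ours, not from the evaluation harness):

```python
def strict_f1(predicted, gold):
    # Exact-match criterion: (start, end, label) must all agree.
    pred_set = {(p["start"], p["end"], p["label"]) for p in predicted}
    gold_set = {(g["start"], g["end"], g["label"]) for g in gold}
    tp = len(pred_set & gold_set)
    prec = tp / len(pred_set) if pred_set else 0.0
    rec = tp / len(gold_set) if gold_set else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# A boundary-correct span with the wrong label counts as a miss.
pred = [{"start": 0, "end": 4, "label": "first_name"},
        {"start": 10, "end": 15, "label": "ssn"}]
gold = [{"start": 0, "end": 4, "label": "first_name"},
        {"start": 10, "end": 15, "label": "phone_number"}]
print(strict_f1(pred, gold))  # 0.5
```

This is why strict F1 runs lower than partial-match metrics: near-misses on boundaries or labels earn no credit.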
  ---

+ We evaluated the model using `threshold=0.3`. <br>
+
+ # Inference:
+ **Acceleration Engine:** PyTorch (via Hugging Face Transformers) <br>
+ **Test Hardware:** NVIDIA A100 (Ampere, PCIe/SXM) <br>
+
+ ## Ethical Considerations:
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
+
+ For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy Subcards. <br>
+
+ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br>
SAFETY&SECURITY.md ADDED
@@ -0,0 +1,6 @@
+ Field | Response
+ :---|:---
+ Model Application Field(s): | Healthcare, Finance, Legal, Enterprise Data Governance, Data Privacy Compliance, Cybersecurity, Content Moderation
+ Describe the life critical impact (if present). | Not Applicable.
+ Use Case Restrictions: | Abide by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
+ Model and dataset restrictions: | The principle of least privilege (PoLP) was applied, limiting access during dataset generation and model development. Dataset access was restricted during training, and dataset license constraints were adhered to.