TestingCapstone committed on
Commit 45ba90f · verified
1 Parent(s): e2b1818

Update README.md

Files changed (1)
  1. README.md +110 -72
README.md CHANGED
@@ -1,91 +1,116 @@
1
  ---
 
2
  license: apache-2.0
3
  base_model: bert-large-uncased
4
  tags:
5
- - generated_from_trainer
6
- - phishing
7
- - BERT
 
 
8
  metrics:
9
- - accuracy
10
- - precision
11
- - recall
12
  model-index:
13
- - name: bert-finetuned-phishing
14
- results: []
15
  widget:
16
- - text: https://www.verif22.com
17
- example_title: Phishing URL
18
- - text: Dear colleague, An important update about your email has exceeded your
19
- storage limit. You will not be able to send or receive all of your messages.
20
- We will close all older versions of our Mailbox as of Friday, June 12, 2023.
21
- To activate and complete the required information click here (https://ec-ec.squarespace.com).
22
- Account must be reactivated today to regenerate new space. Management Team
23
- example_title: Phishing Email
24
- - text: You have access to FREE Video Streaming in your plan. REGISTER with your email, password and
25
- then select the monthly subscription option. https://bit.ly/3vNrU5r
26
- example_title: Phishing SMS
27
- - text: if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};;
28
- var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
29
- var sprytextfield1 = new Spry.Widget.ValidationTextField("sprytextfield1", "email");
30
- example_title: Phishing Script
31
- - text: Hi, this model is really accurate :)
32
- example_title: Benign message
33
- datasets:
34
- - ealvaradob/phishing-dataset
 
 
 
 
35
  language:
36
- - en
37
- pipeline_tag: text-classification
 
 
 
 
 
38
  ---
39
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
40
- should probably proofread and complete it, then remove this comment. -->
41
 
42
- # BERT FINETUNED ON PHISHING DETECTION
43
 
44
- This model is a fine-tuned version of [bert-large-uncased](https://huggingface.co/bert-large-uncased) on an [phishing dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset),
45
- capable of detecting phishing in its four most common forms: URLs, Emails, SMS messages and even websites.
 
 
 
46
 
47
- It achieves the following results on the evaluation set:
48
 
49
- - Loss: 0.1953
50
- - Accuracy: 0.9717
51
- - Precision: 0.9658
52
- - Recall: 0.9670
53
- - False Positive Rate: 0.0249
54
 
55
- ## Model description
 
56
 
57
- BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion.
58
- This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why
59
- it can use lots of publicly available data) with an automatic process to generate inputs and labels from
60
- those texts.
61
 
62
- This model has the following configuration:
63
 
64
- - 24-layer
65
- - 1024 hidden dimension
66
- - 16 attention heads
67
- - 336M parameters
68
 
69
- ## Motivation and Purpose
 
 
 
70
 
71
- Phishing is one of the most frequent and most expensive cyber-attacks according to several security reports.
72
- This model aims to efficiently and accurately prevent phishing attacks against individuals and organizations.
73
- To achieve it, BERT was trained on a diverse and robust dataset containing: URLs, SMS Messages, Emails and
74
- Websites, which allows the model to extend its detection capability beyond the usual and to be used in various
75
- contexts.
76
 
77
- ### Training hyperparameters
78
 
79
- The following hyperparameters were used during training:
80
- - learning_rate: 2e-05
81
- - train_batch_size: 16
82
- - eval_batch_size: 16
83
- - seed: 42
84
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
85
- - lr_scheduler_type: linear
86
- - num_epochs: 4
87
 
88
- ### Training results
 
89
 
90
  | Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | False Positive Rate |
91
  |:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------:|:------:|:-------------------:|
@@ -94,10 +119,23 @@ The following hyperparameters were used during training:
94
  | 0.0389 | 3.0 | 11598 | 0.1779 | 0.9683 | 0.9778 | 0.9461 | 0.0156 |
95
  | 0.0091 | 4.0 | 15464 | 0.1953 | 0.9717 | 0.9658 | 0.9670 | 0.0249 |
96
 
 
97
 
98
- ### Framework versions
99
-
100
- - Transformers 4.34.1
101
- - Pytorch 2.1.1+cu121
102
- - Datasets 2.14.6
103
- - Tokenizers 0.14.1
 
 
1
  ---
2
+ pipeline_tag: text-classification
3
  license: apache-2.0
4
  base_model: bert-large-uncased
5
  tags:
6
+ - generated_from_trainer
7
+ - phishing
8
+ - BERT
9
+ - cybersecurity
10
+ - text-classification
11
  metrics:
12
+ - accuracy
13
+ - precision
14
+ - recall
15
  model-index:
16
+ - name: phishing-email-detector-capstone
17
+ results: []
18
  widget:
19
+ - text: https://www.verif22.com
20
+ example_title: Phishing URL
21
+ - text: >
22
+ Dear colleague,
23
+ An important update about your email has exceeded your storage limit.
24
+ You will not be able to send or receive messages until you reactivate your account.
25
+ We will close all older versions of our Mailbox as of Friday, June 12, 2023.
26
+ To activate and complete the required information, click here (https://ec-ec.squarespace.com).
27
+ Your account must be reactivated today to regenerate new space.
28
+ Management Team
29
+ example_title: Phishing Email
30
+ - text: >
31
+ You have access to FREE Video Streaming in your plan.
32
+ REGISTER with your email and password, then select the monthly subscription option.
33
+ https://bit.ly/3vNrU5r
34
+ example_title: Phishing SMS
35
+ - text: >
36
+ if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};
37
+ var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
38
+ var sprytextfield1 = new Spry.Widget.ValidationTextField("sprypassword1", "email");
39
+ example_title: Phishing Script
40
+ - text: Hi, this model is really accurate :)
41
+ example_title: Benign Message
42
  language:
43
+ - en
44
+ ---
45
+ # 🧠 Phishing Detection Model (BERT-Large-Uncased)
46
+
47
+ A transformer-based model fine-tuned to detect **phishing content** across multiple formats — including **emails, URLs, SMS messages, and scripts**.
48
+ Built on **BERT-Large-Uncased**, it leverages deep contextual understanding of language to classify text as *phishing* or *benign* with high accuracy.
49
+
50
  ---
 
 
51
 
52
+ ## 📌 Model Details
53
 
54
+ **Base model:** `bert-large-uncased`
55
+ **Architecture:** 24 layers, 1024 hidden size, 16 attention heads, ~336M parameters
56
+ **License:** Apache 2.0
57
+ **Language:** English
58
+ **Pipeline tag:** `text-classification`
59
 
60
+ ---
61
 
62
+ ## 🧩 Model Description
 
 
 
 
63
 
64
+ This model was trained to identify phishing-related content by analyzing linguistic and structural patterns commonly found in malicious communications.
65
+ By leveraging BERT’s bidirectional transformer architecture, it effectively detects phishing attempts even when the message appears legitimate or well-written.
66
 
67
+ ### Key Features
68
+ - Detects **phishing attempts** in text, emails, URLs, and scripts
69
+ - Useful for **cybersecurity applications**, such as email gateways or web filtering systems
70
+ - Capable of identifying **varied phishing tactics** (impersonation, link manipulation, credential harvesting, etc.)
71
 
72
+ ---
73
 
74
+ ## 🎯 Intended Uses
 
 
 
75
 
76
+ **Recommended use cases:**
77
+ - Classify messages, emails, and URLs as *phishing* or *benign*
78
+ - Integrate into automated **security pipelines**, email filtering tools, or chat moderation systems
79
+ - Aid in **phishing research** or awareness programs
80
 
81
+ **Limitations:**
82
+ - May trigger **false positives** on legitimate content with financial or urgent language
83
+ - Optimized for **English text** only
84
+ - Should be part of a **multi-layered defense strategy**, not a standalone cybersecurity control
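
One way to limit the false positives mentioned in the limitations is to act only on confident phishing predictions. The following is a minimal sketch, not part of the commit. It assumes each prediction is a dict with `label` and `score` keys, as the `transformers` text-classification pipeline returns. The label name `"phishing"` and the 0.90 threshold are hypothetical and should be checked against the model's own label mapping and tuned on held-out data.

```python
def should_flag(result, phishing_label="phishing", threshold=0.90):
    """Return True only for confident phishing predictions.

    `result` is one prediction dict such as {"label": ..., "score": ...}.
    `phishing_label` and `threshold` are assumptions, not values from the model card.
    """
    is_phishing = result["label"].lower() == phishing_label
    return is_phishing and result["score"] >= threshold


# Hand-written prediction dicts stand in for real pipeline output
predictions = [
    {"label": "phishing", "score": 0.98},
    {"label": "phishing", "score": 0.62},
    {"label": "benign", "score": 0.99},
]
flags = [should_flag(p) for p in predictions]
```

Predictions below the threshold could be routed to a second check or a human reviewer instead of being blocked outright, in line with the multi-layered approach the card recommends.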
 
85
 
86
+ ---
87
 
88
+ ## 📊 Evaluation Results
 
 
 
 
 
 
 
89
 
90
+ | Metric | Score |
91
+ |--------|--------|
92
+ | **Loss** | 0.1953 |
93
+ | **Accuracy** | 0.9717 |
94
+ | **Precision** | 0.9658 |
95
+ | **Recall** | 0.9670 |
96
+ | **False Positive Rate** | 0.0249 |
97
+
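
The card does not say how these metrics were computed. The sketch below, which is not part of the commit, shows the standard confusion-matrix definitions, assuming phishing is the positive class. The counts are invented purely to illustrate the formulas and are not the model's evaluation data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: phishing samples classified correctly/incorrectly.
    tn/fp: benign samples classified correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # flagged items that are truly phishing
        "recall": tp / (tp + fn),               # phishing items that were caught
        "false_positive_rate": fp / (fp + tn),  # benign items wrongly flagged
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow
metrics = classification_metrics(tp=8, fp=2, tn=18, fn=2)
```

Under these definitions, a false positive rate of 0.0249 would mean that roughly 2.5% of benign samples in the evaluation set were flagged as phishing.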
98
+ ---
99
+
100
+ ## ⚙️ Training Details
101
+
102
+ ### Hyperparameters
103
+ | Parameter | Value |
104
+ |------------|--------|
105
+ | **Learning rate** | 2e-05 |
106
+ | **Train batch size** | 16 |
107
+ | **Eval batch size** | 16 |
108
+ | **Seed** | 42 |
109
+ | **Optimizer** | Adam (β₁=0.9, β₂=0.999, ε=1e-08) |
110
+ | **LR scheduler** | Linear |
111
+ | **Epochs** | 4 |
112
+
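
To make the linear scheduler concrete, the sketch below (not part of the commit) computes the learning rate at each epoch boundary. It uses the initial rate of 2e-05 and the 15464 total steps reported in the training results table. It assumes zero warmup steps, which the card does not state. `linear_lr` is an illustrative helper, not a library function.

```python
def linear_lr(step, total_steps=15464, base_lr=2e-05, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero.

    `warmup_steps=0` is an assumption; the model card does not report warmup.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 15464 // 4  # 4 epochs, per the hyperparameter table
schedule = {epoch: linear_lr(epoch * steps_per_epoch) for epoch in range(5)}
```

Under these assumptions the rate falls by a quarter of its initial value each epoch and reaches zero at the final step.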
113
+ ### Training Results
114
 
115
  | Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | False Positive Rate |
116
  |:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------:|:------:|:-------------------:|
 
119
  | 0.0389 | 3.0 | 11598 | 0.1779 | 0.9683 | 0.9778 | 0.9461 | 0.0156 |
120
  | 0.0091 | 4.0 | 15464 | 0.1953 | 0.9717 | 0.9658 | 0.9670 | 0.0249 |
121
 
122
+ ---
123
 
124
+ ## 🧠 Example Inference
125
+
126
+ Try the model in Python using the `transformers` library:
127
+
128
+ ```python
+ from transformers import pipeline
+ # Load the phishing detection model
+ classifier = pipeline("text-classification", model="your-username/phishing-email-detector-capstone")
+ # Example texts
+ examples = [
+     "Dear colleague, your email storage is full. Click here to verify your account: https://secure-update-login.com",
+     "Hi team, the meeting starts at 2 PM today.",
+     "You have won a free gift card! Claim now at http://bit.ly/3xYzabc"
+ ]
+ # Run inference
+ for text in examples:
+     result = classifier(text)[0]
+     print(f"Text: {text}\nPrediction: {result['label']} (score: {result['score']:.4f})\n")