TestingCapstone/phishing-email-detector-capstone

@@ -1,91 +1,116 @@
 ---
 license: apache-2.0
 base_model: bert-large-uncased
 tags:
-- generated_from_trainer
-- phishing
-- BERT
 metrics:
-- accuracy
-- precision
-- recall
 model-index:
-- name: bert-finetuned-phishing
-  results: []
 widget:
-- text: https://www.verif22.com
-  example_title: Phishing URL
-- text: Dear colleague, An important update about your email has exceeded your
-    storage limit. You will not be able to send or receive all of your messages.
-    We will close all older versions of our Mailbox as of Friday, June 12, 2023.
-    To activate and complete the required information click here (https://ec-ec.squarespace.com).
-    Account must be reactivated today to regenerate new space. Management Team
-  example_title: Phishing Email
-- text: You have access to FREE Video Streaming in your plan. REGISTER with your email, password and
-    then select the monthly subscription option. https://bit.ly/3vNrU5r
-  example_title: Phishing SMS
-- text: if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};;
-    var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
-    var sprytextfield1 = new Spry.Widget.ValidationTextField("sprytextfield1", "email");
-  example_title: Phishing Script
-- text: Hi, this model is really accurate :)
-  example_title: Benign message
-datasets:
-- ealvaradob/phishing-dataset
 language:
-- en
-pipeline_tag: text-classification
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# BERT FINETUNED ON PHISHING DETECTION
-This model is a fine-tuned version of [bert-large-uncased](https://huggingface.co/bert-large-uncased) on an [phishing dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset),
-capable of detecting phishing in its four most common forms: URLs, Emails, SMS messages and even websites.
-It achieves the following results on the evaluation set:
-- Loss: 0.1953
-- Accuracy: 0.9717
-- Precision: 0.9658
-- Recall: 0.9670
-- False Positive Rate: 0.0249
-## Model description
-BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion.
-This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why
-it can use lots of publicly available data) with an automatic process to generate inputs and labels from
-those texts.
-This model has the following configuration:
-- 24-layer
-- 1024 hidden dimension
-- 16 attention heads
-- 336M parameters
-## Motivation and Purpose
-Phishing is one of the most frequent and most expensive cyber-attacks according to several security reports.
-This model aims to efficiently and accurately prevent phishing attacks against individuals and organizations.
-To achieve it, BERT was trained on a diverse and robust dataset containing: URLs, SMS Messages, Emails and
-Websites, which allows the model to extend its detection capability beyond the usual and to be used in various
-contexts.
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 2e-05
-- train_batch_size: 16
-- eval_batch_size: 16
-- seed: 42
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: linear
-- num_epochs: 4
-### Training results
 | Training Loss | Epoch | Step  | Validation Loss | Accuracy | Precision | Recall | False Positive Rate |
 |:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------:|:------:|:-------------------:|
@@ -94,10 +119,23 @@ The following hyperparameters were used during training:
 | 0.0389        | 3.0   | 11598 | 0.1779          | 0.9683   | 0.9778    | 0.9461 | 0.0156              |
 | 0.0091        | 4.0   | 15464 | 0.1953          | 0.9717   | 0.9658    | 0.9670 | 0.0249              |
-### Framework versions
-- Transformers 4.34.1
-- Pytorch 2.1.1+cu121
-- Datasets 2.14.6
-- Tokenizers 0.14.1

 ---
+pipeline_tag: text-classification
 license: apache-2.0
 base_model: bert-large-uncased
 tags:
+  - generated_from_trainer
+  - phishing
+  - BERT
+  - cybersecurity
+  - text-classification
 metrics:
+  - accuracy
+  - precision
+  - recall
 model-index:
+  - name: phishing-email-detector-capstone
+    results: []
 widget:
+  - text: https://www.verif22.com
+    example_title: Phishing URL
+  - text: >
+      Dear colleague,
+      An important update about your email has exceeded your storage limit.
+      You will not be able to send or receive messages until you reactivate your account.
+      We will close all older versions of our Mailbox as of Friday, June 12, 2023.
+      To activate and complete the required information, click here (https://ec-ec.squarespace.com).
+      Your account must be reactivated today to regenerate new space.
+      — Management Team
+    example_title: Phishing Email
+  - text: >
+      You have access to FREE Video Streaming in your plan.
+      REGISTER with your email and password, then select the monthly subscription option.
+      https://bit.ly/3vNrU5r
+    example_title: Phishing SMS
+  - text: >
+      if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};
+      var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
+      var sprytextfield1 = new Spry.Widget.ValidationTextField("sprytextfield1", "email");
+    example_title: Phishing Script
+  - text: Hi, this model is really accurate :)
+    example_title: Benign Message
 language:
+  - en
+---
+# 🧠 Phishing Detection Model (BERT-Large-Uncased)
+A transformer-based model fine-tuned to detect **phishing content** in several formats, including **emails, URLs, SMS messages, and scripts**.
+Built on **BERT-Large-Uncased**, it uses contextual language representations to classify text as *phishing* or *benign*, reaching 0.9717 accuracy on the evaluation set.
 ---
+## 📌 Model Details
+- **Base model:** `bert-large-uncased`
+- **Architecture:** 24 layers • 1024 hidden size • 16 attention heads • ~336M parameters
+- **License:** Apache 2.0
+- **Language:** English
+- **Pipeline tag:** `text-classification`
+---
+## 🧩 Model Description
+This model was trained to identify phishing content by analyzing linguistic and structural patterns that are common in malicious communications.
+Because BERT's bidirectional transformer architecture reads each token in the context of the whole input, the model can flag phishing attempts even when the message appears legitimate or well written.
+### Key Features
+- Detects **phishing attempts** in emails, URLs, SMS messages, and scripts
+- Useful for **cybersecurity applications**, such as email gateways or web filtering systems
+- Capable of identifying **varied phishing tactics** (impersonation, link manipulation, credential harvesting, etc.)
+---
+## 🎯 Intended Uses
+**Recommended use cases:**
+- Classify messages, emails, and URLs as *phishing* or *benign*
+- Integrate into automated **security pipelines**, email filtering tools, or chat moderation systems
+- Aid in **phishing research** or awareness programs
+**Limitations:**
+- May trigger **false positives** on legitimate content with financial or urgent language (the false positive rate on the evaluation set is 0.0249)
+- Optimized for **English text** only
+- Should be part of a **multi-layered defense strategy**, not a standalone cybersecurity control
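One way to act on the limitations above is to avoid treating every positive prediction as a block decision. The sketch below is a minimal illustration, not part of the model: the `triage` helper, the `"phishing"` label string, and both thresholds are assumptions chosen for the example, and the labels the model actually emits should be checked in its configuration before use. It takes a prediction dictionary of the form a `text-classification` pipeline returns (`label` and `score`) and routes it to one of three outcomes.

```python
from typing import Dict, Union

# Assumption: the positive class is reported as "phishing". Check the model's
# id2label mapping, which may use names such as "LABEL_1" instead.
PHISHING_LABEL = "phishing"


def triage(prediction: Dict[str, Union[str, float]],
           block_at: float = 0.90,
           review_at: float = 0.50) -> str:
    """Map one pipeline prediction to "block", "review", or "allow".

    Both thresholds are hypothetical and should be tuned on held-out data.
    """
    label = str(prediction["label"]).lower()
    score = float(prediction["score"])

    # Benign predictions pass through regardless of confidence.
    if label != PHISHING_LABEL:
        return "allow"
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "allow"


print(triage({"label": "phishing", "score": 0.97}))  # block
print(triage({"label": "phishing", "score": 0.62}))  # review
print(triage({"label": "benign", "score": 0.99}))    # allow
```

Sending mid-confidence predictions to human review keeps the classifier as one layer among several rather than the final decision.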
+---
+## 📊 Evaluation Results
+| Metric | Score |
+|--------|--------|
+| **Loss** | 0.1953 |
+| **Accuracy** | 0.9717 |
+| **Precision** | 0.9658 |
+| **Recall** | 0.9670 |
+| **False Positive Rate** | 0.0249 |
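Two further figures follow directly from the table and may help when comparing against other classifiers. The short calculation below derives the F1 score (the harmonic mean of precision and recall) and the specificity (one minus the false positive rate); neither value is reported in the original card, so both are derived here only from the numbers above.

```python
# Evaluation metrics copied from the table above.
precision = 0.9658
recall = 0.9670
false_positive_rate = 0.0249

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Specificity (true negative rate) is the complement of the false positive rate.
specificity = 1 - false_positive_rate

print(f"F1: {f1:.4f}")                    # F1: 0.9664
print(f"Specificity: {specificity:.4f}")  # Specificity: 0.9751
```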
+---
+## ⚙️ Training Details
+### Hyperparameters
+| Parameter | Value |
+|------------|--------|
+| **Learning rate** | 2e-05 |
+| **Train batch size** | 16 |
+| **Eval batch size** | 16 |
+| **Seed** | 42 |
+| **Optimizer** | Adam (β₁=0.9, β₂=0.999, ε=1e-08) |
+| **LR scheduler** | Linear |
+| **Epochs** | 4 |
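To make the linear scheduler concrete, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup and a decay to zero at the final step (the `transformers` Trainer defaults, which this card does not confirm), and it takes the total of 15464 steps from the step counts reported in the training results table. `linear_lr` is a hypothetical helper written for this example, not a library function.

```python
BASE_LR = 2e-05       # learning rate from the hyperparameter table
TOTAL_STEPS = 15464   # final step in the training results table (epoch 4)


def linear_lr(step: int, base_lr: float = BASE_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate after `step` optimizer updates under linear decay to zero.

    Assumes zero warmup steps; with warmup the early part of the curve differs.
    """
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


steps_per_epoch = TOTAL_STEPS // 4  # 4 epochs -> 3866 steps per epoch
for epoch in range(5):
    step = epoch * steps_per_epoch
    print(f"epoch {epoch}: step {step:5d}, lr {linear_lr(step):.2e}")
```

At step 11598, the end of epoch 3, three quarters of the schedule have elapsed, so the rate under these assumptions is 5e-06.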
+### Training Results
 | Training Loss | Epoch | Step  | Validation Loss | Accuracy | Precision | Recall | False Positive Rate |
 |:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------:|:------:|:-------------------:|
 | 0.0389        | 3.0   | 11598 | 0.1779          | 0.9683   | 0.9778    | 0.9461 | 0.0156              |
 | 0.0091        | 4.0   | 15464 | 0.1953          | 0.9717   | 0.9658    | 0.9670 | 0.0249              |
+---
+## 🧠 Example Inference
+Try the model in Python using the `transformers` library:
+```python
+from transformers import pipeline
+# Load the phishing detection model ("your-username" is a placeholder for the account that hosts it)
+classifier = pipeline("text-classification", model="your-username/phishing-email-detector-capstone")
+# Example texts
+examples = [
+    "Dear colleague, your email storage is full. Click here to verify your account: https://secure-update-login.com",
+    "Hi team, the meeting starts at 2 PM today.",
+    "You have won a free gift card! Claim now at http://bit.ly/3xYzabc"
+]
+# Run inference
+for text in examples:
+    result = classifier(text)[0]
+    print(f"Text: {text}\nPrediction: {result['label']} (score: {result['score']:.4f})\n")