1. Model Details
| Attribute | Value |
|---|---|
| Developed by | Petercusin (Guisheng Pan) |
| Model Architecture | DistilBERT |
| Activation Function | GELU |
| Dimensions (hidden size) | 768 |
| Size | 255M |
| Hidden Dimensions (feed-forward) | 3072 |
| Attention Dropout | 0.1 |
| Dropout | 0.1 |
| Sequence Classification Dropout | 0.2 |
| Number of Heads | 12 |
| Number of Layers | 6 |
| Max Position Embeddings | 512 |
| Vocabulary Size | 30522 |
| Initializer Range | 0.02 |
| Tied Weights | True |
| Problem Type | Multi-Label Classification |
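
As a rough cross-check of the table, the parameter count can be estimated from the listed dimensions. The sketch below assumes the standard DistilBERT layout (four attention projections and a two-layer feed-forward block per layer, each followed by LayerNorm), a classification head made of a 768-to-768 pre-classifier and a 768-to-42 classifier, and float32 storage. `NUM_LABELS = 42` is taken from the Training Data section, and the helper names are illustrative. Under those assumptions the total is about 67M parameters, which occupy roughly 255 MiB, so the "Size" entry of 255M may refer to weight storage rather than to the number of parameters.

```python
# Values copied from the Model Details table.
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
DIM = 768          # "Dimensions"
HIDDEN_DIM = 3072  # "Hidden Dimensions" (feed-forward block)
N_LAYERS = 6
NUM_LABELS = 42    # assumption: one output per category in the Training Data table


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a LayerNorm: scale plus shift."""
    return 2 * n


# Word embeddings + position embeddings + embedding LayerNorm.
embeddings = VOCAB_SIZE * DIM + MAX_POSITIONS * DIM + layer_norm(DIM)

# One transformer layer: Q, K, V and output projections, then the feed-forward block.
attention = 4 * linear(DIM, DIM)
feed_forward = linear(DIM, HIDDEN_DIM) + linear(HIDDEN_DIM, DIM)
per_layer = attention + layer_norm(DIM) + feed_forward + layer_norm(DIM)

# Sequence classification head: pre-classifier followed by classifier.
head = linear(DIM, DIM) + linear(DIM, NUM_LABELS)

total_params = embeddings + N_LAYERS * per_layer + head
size_mib = total_params * 4 / 2**20  # assumption: 4 bytes per float32 parameter

print(f"Estimated parameters: {total_params:,}")
print(f"Estimated float32 size: {size_mib:.1f} MiB")
```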
2. Model Description
This model classifies English news text, such as headlines and article excerpts, into 42 topical categories (listed under Training Data). It can be used for news categorization, content organization, and topic-based filtering.
3. How to Get Started with the Model
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 26 08:48:07 2025
@author: Petercusin
"""
import torch
import torch.nn.functional as F
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
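# Step 1: Load the trained model and tokenizer from the Hugging Face Hub
# (pass a local directory path instead if you have saved the model locally)
tokenizer = DistilBertTokenizer.from_pretrained("Petercusin/English-news-category-classifier")
model = DistilBertForSequenceClassification.from_pretrained("Petercusin/English-news-category-classifier")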
# Step 2: Define a function to preprocess the input text
def preprocess_text(text):
    inputs = tokenizer(text, padding='max_length', truncation=True, return_tensors='pt')
    return inputs
# Step 3: Define a function to make predictions
def predict(text):
    # Preprocess the input text
    inputs = preprocess_text(text)
    # Make predictions
    with torch.no_grad():
        outputs = model(**inputs)
    # Get the predicted class probabilities
    logits = outputs.logits
    probabilities = F.softmax(logits, dim=1).squeeze().tolist()
    predicted_class_id = torch.argmax(logits, dim=1).item()
    return predicted_class_id, probabilities
# Step 4: Load the label map from the model's configuration
label_map = model.config.id2label
# Example usage
new_titles = [
    "Stock markets reach all-time high amid economic recovery",
    "Scientists discover new species in Amazon rainforest",
    "Congress passes new bill on healthcare reforms",
    "The stairway to love: Chongqing's real-life fairy tale",
    "African delegation take in Shanghai sights on Huangpu cruise",
    "China expected to achieve higher grain output in 2025: report",
    "China continued its dominance at the 2025 World Aquatics Diving World Cup in Guadalajara, sweeping all four gold medals on the third day of competitions on Saturday, along with one silver.",
    "A 'DeepSeek moment for AI agents' as China launches Manus",
    "Developed by Monica.im, Manus achieved top scores on the GAIA (General AI Assistant) benchmark, exceeding those of OpenAI's GPT (generative pre-trained transformer) tools. GAIA is a real-world benchmark for general AI assistants.",
    "This week and without warning, a horrid video popped up on my phone. A puppy had its mouth and paws bound with tape, and was hanging in a plastic bag by the motorway. I immediately flicked past, but the image stayed with me. This was something I didn't want to see, yet there it was at 11am on a Tuesday."
]
for v in new_titles:
    input_text = v
    predicted_class_id, probabilities = predict(input_text)
    predicted_category = label_map[predicted_class_id]
    print(f"Predicted category: {predicted_category}")
    print(f"Text to classify: {v}")
    predicted_probability = probabilities[predicted_class_id]
    print(f"Probability of the predicted category: {predicted_probability:.4f}\n")
Result
Predicted category: BUSINESS
Text to classify: Stock markets reach all-time high amid economic recovery
Probability of the predicted category: 0.5707
Predicted category: SCIENCE
Text to classify: Scientists discover new species in Amazon rainforest
Probability of the predicted category: 0.5186
Predicted category: POLITICS
Text to classify: Congress passes new bill on healthcare reforms
Probability of the predicted category: 0.6175
Predicted category: ARTS
Text to classify: The stairway to love: Chongqing's real-life fairy tale
Probability of the predicted category: 0.2746
Predicted category: WORLDPOST
Text to classify: African delegation take in Shanghai sights on Huangpu cruise
Probability of the predicted category: 0.4686
Predicted category: GREEN
Text to classify: China expected to achieve higher grain output in 2025: report
Probability of the predicted category: 0.2889
Predicted category: SPORTS
Text to classify: China continued its dominance at the 2025 World Aquatics Diving World Cup in Guadalajara, sweeping all four gold medals on the third day of competitions on Saturday, along with one silver.
Probability of the predicted category: 0.4540
Predicted category: TECH
Text to classify: A 'DeepSeek moment for AI agents' as China launches Manus
Probability of the predicted category: 0.3297
Predicted category: TECH
Text to classify: Developed by Monica.im, Manus achieved top scores on the GAIA (General AI Assistant) benchmark, exceeding those of OpenAI's GPT (generative pre-trained transformer) tools. GAIA is a real-world benchmark for general AI assistants.
Probability of the predicted category: 0.8065
Predicted category: GOOD NEWS
Text to classify: This week and without warning, a horrid video popped up on my phone. A puppy had its mouth and paws bound with tape, and was hanging in a plastic bag by the motorway. I immediately flicked past, but the image stayed with me. This was something I didn't want to see, yet there it was at 11am on a Tuesday.
Probability of the predicted category: 0.1350
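
Several of the predictions above have low probabilities (0.1350 for the last text, for example), so it can help to inspect more than the top class. The helper below is a minimal sketch and not part of the original script: `top_k_categories` is a hypothetical name, and it assumes a list of probabilities like the one returned by `predict` and an ID-to-label mapping such as `model.config.id2label`. The toy values in the usage lines are invented for illustration.

```python
def top_k_categories(probabilities, id2label, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    # Pair each probability with its class ID, then sort by probability.
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id2label[class_id], prob) for class_id, prob in ranked[:k]]


# Toy example with three made-up classes (not real model output).
toy_probabilities = [0.1, 0.6, 0.3]
toy_id2label = {0: "ARTS", 1: "TECH", 2: "SCIENCE"}

for label, prob in top_k_categories(toy_probabilities, toy_id2label, k=2):
    print(f"{label}: {prob:.4f}")
```

With the real model, the same call would be `top_k_categories(probabilities, label_map)` after `predict` has run.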
4. Training Data
The model was trained on a dataset of news articles categorized into 42 different domains. The categories include:
| ID and category (0-20) | ID and category (21-41) |
|---|---|
| 0 LATINO VOICES | 21 WORLD NEWS |
| 1 ARTS | 22 QUEER VOICES |
| 2 CULTURE & ARTS | 23 PARENTING |
| 3 HOME & LIVING | 24 MONEY |
| 4 ARTS & CULTURE | 25 SPORTS |
| 5 THE WORLDPOST | 26 POLITICS |
| 6 GOOD NEWS | 27 WELLNESS |
| 7 FIFTY | 28 GREEN |
| 8 CRIME | 29 BUSINESS |
| 9 RELIGION | 30 TECH |
| 10 PARENTS | 31 ENVIRONMENT |
| 11 TASTE | 32 WOMEN |
| 12 WORLDPOST | 33 U.S. NEWS |
| 13 EDUCATION | 34 HEALTHY LIVING |
| 14 ENTERTAINMENT | 35 DIVORCE |
| 15 FOOD & DRINK | 36 MEDIA |
| 16 TRAVEL | 37 WEDDINGS |
| 17 STYLE & BEAUTY | 38 BLACK VOICES |
| 18 IMPACT | 39 STYLE |
| 19 WEIRD NEWS | 40 COMEDY |
| 20 COLLEGE | 41 SCIENCE |
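
The table above corresponds to a simple ID-to-label mapping. The sketch below rebuilds it as Python dictionaries of the kind the getting-started script reads from `model.config.id2label`. The variable names are illustrative, and the loaded model's own configuration remains the authoritative source.

```python
# Category names in ID order (0-41), transcribed from the table above.
CATEGORIES = [
    "LATINO VOICES", "ARTS", "CULTURE & ARTS", "HOME & LIVING", "ARTS & CULTURE",
    "THE WORLDPOST", "GOOD NEWS", "FIFTY", "CRIME", "RELIGION", "PARENTS",
    "TASTE", "WORLDPOST", "EDUCATION", "ENTERTAINMENT", "FOOD & DRINK",
    "TRAVEL", "STYLE & BEAUTY", "IMPACT", "WEIRD NEWS", "COLLEGE",
    "WORLD NEWS", "QUEER VOICES", "PARENTING", "MONEY", "SPORTS", "POLITICS",
    "WELLNESS", "GREEN", "BUSINESS", "TECH", "ENVIRONMENT", "WOMEN",
    "U.S. NEWS", "HEALTHY LIVING", "DIVORCE", "MEDIA", "WEDDINGS",
    "BLACK VOICES", "STYLE", "COMEDY", "SCIENCE",
]

# Forward and inverse mappings between class IDs and category names.
id2label = dict(enumerate(CATEGORIES))
label2id = {label: class_id for class_id, label in id2label.items()}

print(len(id2label))         # number of categories
print(id2label[29])          # category for class ID 29
print(label2id["SCIENCE"])   # class ID for SCIENCE
```

Several categories overlap closely (for example ARTS, CULTURE & ARTS, and ARTS & CULTURE, or WORLDPOST and THE WORLDPOST), which may make some predictions harder to tell apart.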
5. Evaluation
The model was evaluated on a test set, with the following results:
- Evaluation Loss: 1.6844
- Evaluation Accuracy: 0.5371
- Evaluation F1 Score: 0.5282
- Evaluation Precision: 0.5347
- Evaluation Recall: 0.5371
- Evaluation Runtime: 584.58 seconds
- Evaluation Samples per Second: 8.622
- Evaluation Steps per Second: 0.539
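
Two consistency checks can be run on the figures above. First, multiplying the runtime by the reported throughput suggests a test set of roughly 5,040 samples evaluated in about 315 steps, or around 16 samples per step; these are inferences from the reported numbers, not values stated by the author. Second, the reported recall (0.5371) is identical to the accuracy, which is what support-weighted averaging produces; the averaging mode is not stated, so treat that as an assumption. The helper functions and toy labels below are illustrative only.

```python
from collections import Counter

# Figures copied from the evaluation list above.
runtime_s = 584.58
samples_per_s = 8.622
steps_per_s = 0.539

n_samples = runtime_s * samples_per_s      # implied number of test samples
n_steps = runtime_s * steps_per_s          # implied number of evaluation steps
batch_size = samples_per_s / steps_per_s   # implied samples per step

print(f"~{n_samples:.0f} samples, ~{n_steps:.0f} steps, ~{batch_size:.0f} per step")


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (hits[c] / support[c]) for c in support)


# Toy labels, invented for illustration (not the model's test data).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

# Both values agree up to floating-point rounding.
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```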
6. Model Card Contact
Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation of Shanghai International Studies University. Email: [email protected]
Model tree for Petercusin/English-news-category-classifier: base model distilbert/distilbert-base-uncased