1. Model Details

Attribute	Value
Developed by	Petercusin (Guisheng Pan)
Model Architecture	DistilBERT
Activation Function	GELU
Dimensions	768
Size	255 MB (67M parameters, F32)
Hidden Dimensions	3072
Attention Dropout	0.1
Dropout	0.1
Sequence Classification Dropout	0.2
Number of Heads	12
Number of Layers	6
Max Position Embeddings	512
Vocabulary Size	30522
Initializer Range	0.02
Tied Weights	True
Problem Type	Multi-Class Classification (one category per text)
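
As a rough cross-check of the table, the parameter count can be estimated from the listed dimensions. The sketch below assumes the standard Hugging Face DistilBERT layout (word and position embeddings, six transformer blocks with biases and two layer norms each, and a `pre_classifier` plus `classifier` head) and the 42 output labels listed in the Training Data section. The function name is illustrative and not part of any library.

```python
def distilbert_classifier_params(vocab=30522, max_pos=512, dim=768,
                                 hidden=3072, layers=6, num_labels=42):
    """Estimate the parameter count of a DistilBERT sequence classifier."""
    layer_norm = 2 * dim                      # weight + bias
    embeddings = vocab * dim + max_pos * dim + layer_norm
    attention = 4 * (dim * dim + dim)         # q, k, v and output projections
    feed_forward = (dim * hidden + hidden) + (hidden * dim + dim)
    block = attention + feed_forward + 2 * layer_norm
    # Assumed head: pre_classifier (dim x dim) followed by classifier (dim x num_labels)
    head = (dim * dim + dim) + (dim * num_labels + num_labels)
    return embeddings + layers * block + head


total = distilbert_classifier_params()
print(f"{total:,} parameters")
print(f"{total * 4 / 2**20:.1f} MiB at 4 bytes per parameter (F32)")
```

Under these assumptions the estimate comes to about 67M parameters, or roughly 255 MiB in F32, which lines up with the size listed in the table.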

2. Model Description

This model classifies English news text, such as headlines or article excerpts, into one of 42 news categories. It can be used for news categorization, content organization, and topic-based filtering.
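
Topic-based filtering could look roughly like the sketch below. It assumes predictions are already available as (text, category, probability) triples; the three sample triples are taken from the example output later in this card, while `filter_by_topic` and the 0.5 threshold are hypothetical choices made for illustration.

```python
from collections import defaultdict


def filter_by_topic(predictions, wanted, min_probability=0.5):
    """Keep texts whose predicted category is wanted and confident enough."""
    selected = defaultdict(list)
    for text, category, probability in predictions:
        if category in wanted and probability >= min_probability:
            selected[category].append(text)
    return dict(selected)


# (text, predicted category, probability of the predicted category)
predictions = [
    ("Stock markets reach all-time high amid economic recovery", "BUSINESS", 0.5707),
    ("Scientists discover new species in Amazon rainforest", "SCIENCE", 0.5186),
    ("A 'DeepSeek moment for AI agents' as China launches Manus", "TECH", 0.3297),
]

science_and_tech = filter_by_topic(predictions, wanted={"SCIENCE", "TECH"})
print(science_and_tech)
```

With this threshold only the SCIENCE item is kept; the TECH item is dropped because its probability is below 0.5. The right threshold depends on how much noise the application can tolerate.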

⚙️ 3. How to Get Started with the Model

# -*- coding: utf-8 -*-
"""
Created on Sat Apr 26 08:48:07 2025

@author: Petercusin
"""

import torch
import torch.nn.functional as F
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Step 1: Load the trained model and tokenizer.
# The Hub repository ID is used here; a local directory path also works.
MODEL_ID = "Petercusin/English-news-category-classifier"
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_ID)
model = DistilBertForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()  # make sure dropout is disabled for inference

# Step 2: Define a function to preprocess the input text
def preprocess_text(text):
    # Truncate to the model's 512-token limit; a single text needs no padding
    inputs = tokenizer(text, truncation=True, return_tensors='pt')
    return inputs

# Step 3: Define a function to make predictions
def predict(text):
    # Preprocess the input text
    inputs = preprocess_text(text)

    # Make predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class probabilities
    logits = outputs.logits
    probabilities = F.softmax(logits, dim=1).squeeze().tolist()
    predicted_class_id = torch.argmax(logits, dim=1).item()

    return predicted_class_id, probabilities

# Step 4: Load the label map from the model's configuration
label_map = model.config.id2label

# Example usage
new_titles = [
    "Stock markets reach all-time high amid economic recovery",
    "Scientists discover new species in Amazon rainforest",
    "Congress passes new bill on healthcare reforms",
    "The stairway to love: Chongqing's real-life fairy tale",
    "African delegation take in Shanghai sights on Huangpu cruise",
    "China expected to achieve higher grain output in 2025: report",
    "China continued its dominance at the 2025 World Aquatics Diving World Cup in Guadalajara, sweeping all four gold medals on the third day of competitions on Saturday, along with one silver.",
    "A 'DeepSeek moment for AI agents' as China launches Manus",
    "Developed by Monica.im, Manus achieved top scores on the GAIA (General AI Assistant) benchmark, exceeding those of OpenAI's GPT (generative pre-trained transformer) tools. GAIA is a real-world benchmark for general AI assistants.",
    "This week and without warning, a horrid video popped up on my phone. A puppy had its mouth and paws bound with tape, and was hanging in a plastic bag by the motorway. I immediately flicked past, but the image stayed with me. This was something I didn’t want to see, yet there it was at 11am on a Tuesday.",
]


for text in new_titles:
    predicted_class_id, probabilities = predict(text)
    predicted_category = label_map[predicted_class_id]
    predicted_probability = probabilities[predicted_class_id]
    print(f"Predicted category: {predicted_category}")
    print(f"Text to classify: {text}")
    print(f"Probability of the predicted category: {predicted_probability:.4f}\n")

Result

Predicted category: BUSINESS
Text to classify: Stock markets reach all-time high amid economic recovery
Probability of the predicted category: 0.5707

Predicted category: SCIENCE
Text to classify: Scientists discover new species in Amazon rainforest
Probability of the predicted category: 0.5186

Predicted category: POLITICS
Text to classify: Congress passes new bill on healthcare reforms
Probability of the predicted category: 0.6175

Predicted category: ARTS
Text to classify: The stairway to love: Chongqing's real-life fairy tale
Probability of the predicted category: 0.2746

Predicted category: WORLDPOST
Text to classify: African delegation take in Shanghai sights on Huangpu cruise
Probability of the predicted category: 0.4686

Predicted category: GREEN
Text to classify: China expected to achieve higher grain output in 2025: report
Probability of the predicted category: 0.2889

Predicted category: SPORTS
Text to classify: China continued its dominance at the 2025 World Aquatics Diving World Cup in Guadalajara, sweeping all four gold medals on the third day of competitions on Saturday, along with one silver.
Probability of the predicted category: 0.4540

Predicted category: TECH
Text to classify: A 'DeepSeek moment for AI agents' as China launches Manus
Probability of the predicted category: 0.3297

Predicted category: TECH
Text to classify: Developed by Monica.im, Manus achieved top scores on the GAIA (General AI Assistant) benchmark, exceeding those of OpenAI's GPT (generative pre-trained transformer) tools. GAIA is a real-world benchmark for general AI assistants.
Probability of the predicted category: 0.8065

Predicted category: GOOD NEWS
Text to classify: This week and without warning, a horrid video popped up on my phone. A puppy had its mouth and paws bound with tape, and was hanging in a plastic bag by the motorway. I immediately flicked past, but the image stayed with me. This was something I didn’t want to see, yet there it was at 11am on a Tuesday.
Probability of the predicted category: 0.1350
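
Several of the predictions above have a low top probability (for example 0.1350), so it can help to look at the runners-up as well. The following is a minimal sketch, assuming `probabilities` is the list returned by `predict` and `id2label` is `model.config.id2label`. The helper `top_k` is hypothetical, and the toy values are made up for illustration; they are not model output.

```python
def top_k(probabilities, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id2label[index], probability) for index, probability in ranked[:k]]


# Toy stand-ins for model.config.id2label and the output of predict()
toy_id2label = {0: "CRIME", 1: "GOOD NEWS", 2: "WEIRD NEWS", 3: "TECH"}
toy_probabilities = [0.30, 0.35, 0.25, 0.10]

for label, probability in top_k(toy_probabilities, toy_id2label, k=2):
    print(f"{label}: {probability:.4f}")
```

When the top two probabilities are close, as in the toy values, the single predicted category is better read as a weak suggestion than as a firm label.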

4. Training Data

The model was trained on a dataset of news articles labeled with 42 categories. The label IDs and category names are:

ID	Category	ID	Category
0	LATINO VOICES	21	WORLD NEWS
1	ARTS	22	QUEER VOICES
2	CULTURE & ARTS	23	PARENTING
3	HOME & LIVING	24	MONEY
4	ARTS & CULTURE	25	SPORTS
5	THE WORLDPOST	26	POLITICS
6	GOOD NEWS	27	WELLNESS
7	FIFTY	28	GREEN
8	CRIME	29	BUSINESS
9	RELIGION	30	TECH
10	PARENTS	31	ENVIRONMENT
11	TASTE	32	WOMEN
12	WORLDPOST	33	U.S. NEWS
13	EDUCATION	34	HEALTHY LIVING
14	ENTERTAINMENT	35	DIVORCE
15	FOOD & DRINK	36	MEDIA
16	TRAVEL	37	WEDDINGS
17	STYLE & BEAUTY	38	BLACK VOICES
18	IMPACT	39	STYLE
19	WEIRD NEWS	40	COMEDY
20	COLLEGE	41	SCIENCE
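
Some label names in the table overlap (for example ARTS, CULTURE & ARTS, and ARTS & CULTURE, or WORLDPOST and THE WORLDPOST). If an application prefers coarser topics, one option is to sum the probabilities of labels it treats as equivalent. The grouping below is an illustrative assumption and not part of the model; `merge_probabilities` is a hypothetical helper, and the toy label IDs do not match the table.

```python
# Hypothetical grouping chosen for illustration; adjust to the application.
MERGED_LABELS = {
    "ARTS": "ARTS & CULTURE",
    "CULTURE & ARTS": "ARTS & CULTURE",
    "THE WORLDPOST": "WORLDPOST",
}


def merge_probabilities(probabilities, id2label, merged=MERGED_LABELS):
    """Sum per-label probabilities into coarser groups."""
    totals = {}
    for index, probability in enumerate(probabilities):
        label = id2label[index]
        group = merged.get(label, label)  # labels without a mapping stay as they are
        totals[group] = totals.get(group, 0.0) + probability
    return totals


# Toy example with four labels and made-up probabilities
toy_id2label = {0: "ARTS", 1: "CULTURE & ARTS", 2: "ARTS & CULTURE", 3: "SPORTS"}
toy_probabilities = [0.25, 0.125, 0.25, 0.375]

merged_totals = merge_probabilities(toy_probabilities, toy_id2label)
print(merged_totals)
```

In the toy example SPORTS is the single most probable label, but the three arts labels together outweigh it once merged, which is the situation this kind of grouping is meant to catch.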

5. Evaluation

The model was evaluated on a test set, with the following results:

Metric	Value
Loss	1.6844
Accuracy	0.5371
F1 Score	0.5282
Precision	0.5347
Recall	0.5371
Runtime	584.58 seconds
Samples per Second	8.622
Steps per Second	0.539
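
A few quantities that the card does not state directly can be estimated from the reported figures. The sketch below is plain arithmetic on the rounded values above, so the results are approximate. The chance-level baselines assume uniform random guessing over 42 labels and that the reported loss is a cross-entropy in nats, which is an assumption rather than something the card confirms.

```python
import math

runtime_s = 584.58
samples_per_s = 8.622
steps_per_s = 0.539
num_labels = 42

test_samples = runtime_s * samples_per_s   # approximate test-set size
eval_steps = runtime_s * steps_per_s       # approximate number of evaluation batches
batch_size = samples_per_s / steps_per_s   # approximate evaluation batch size

chance_accuracy = 1 / num_labels           # expected accuracy of uniform guessing
uniform_loss = math.log(num_labels)        # cross-entropy of a uniform prediction

print(f"Test samples: about {round(test_samples)}")
print(f"Evaluation steps: about {round(eval_steps)}")
print(f"Batch size: about {round(batch_size)}")
print(f"Chance accuracy: {chance_accuracy:.4f}")
print(f"Uniform-prediction loss: {uniform_loss:.4f}")
```

On these estimates the test set holds roughly 5,040 samples evaluated in batches of about 16. The reported accuracy of 0.5371 and loss of 1.6844 compare with about 0.024 and 3.74 for uniform guessing.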

🤝 6. Model Card Contact

Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation, Shanghai International Studies University
Email: [email protected]

Model: Petercusin/English-news-category-classifier
Base model: distilbert/distilbert-base-uncased
Model size: 67M params
Tensor type: F32 (Safetensors)
