1. Model Details

| Attribute | Value |
|---|---|
| Developed by | Petercusin (Guisheng Pan) |
| Model Architecture | DistilBERT |
| Activation Function | GELU |
| Dimensions | 768 |
| Size | 255M |
| Hidden Dimensions | 3072 |
| Attention Dropout | 0.1 |
| Dropout | 0.1 |
| Sequence Classification Dropout | 0.2 |
| Number of Heads | 12 |
| Number of Layers | 6 |
| Max Position Embeddings | 512 |
| Vocabulary Size | 30522 |
| Initializer Range | 0.02 |
| Tied Weights | True |
| Problem Type | Multi-Label Classification |
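
The dimensions listed above are enough to estimate the parameter count by hand. The sketch below is a back-of-the-envelope calculation, not output from the model itself. It assumes the standard DistilBERT layout (learned position embeddings, bias terms on every linear layer, a LayerNorm after each sub-block, and a pre-classifier plus classifier head) and a 42-way output taken from the category list in the Training Data section. Under those assumptions the total comes to roughly 67M parameters, and the "255M" size would be consistent with the F32 weights measured in MiB.

```python
# Rough parameter count for a DistilBERT sequence classifier.
# Assumptions (not stated in the table): standard DistilBERT layout and a
# 42-way classification head (42 = number of categories in Training Data).
VOCAB, MAX_POS, DIM, HIDDEN, LAYERS, NUM_LABELS = 30522, 512, 768, 3072, 6, 42


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * DIM  # scale and shift vectors
embeddings = VOCAB * DIM + MAX_POS * DIM + layer_norm
attention = 4 * linear(DIM, DIM)  # query, key, value, output projections
feed_forward = linear(DIM, HIDDEN) + linear(HIDDEN, DIM)
per_layer = attention + layer_norm + feed_forward + layer_norm
head = linear(DIM, DIM) + linear(DIM, NUM_LABELS)  # pre-classifier + classifier

total_params = embeddings + LAYERS * per_layer + head
size_mib = total_params * 4 / 2**20  # 4 bytes per F32 weight

print(f"{total_params:,} parameters")  # 66,985,770 parameters
print(f"{size_mib:.1f} MiB in F32")    # 255.5 MiB in F32
```

The number of attention heads does not appear in the calculation because splitting the 768-dimensional projections into 12 heads does not change the number of weights.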

2. Model Description

This model classifies English news articles into 42 categories (listed under Training Data). It can be used for news categorization, content organization, and topic-based filtering.

โš™๏ธ3. How to Get Started with the Model

# -*- coding: utf-8 -*-
"""
Created on Sat Apr 26 08:48:07 2025

@author: Petercusin
"""

import torch
import torch.nn.functional as F
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

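# Step 1: Load the trained model and tokenizer.
# The name below refers to a local directory; to download from the Hugging Face Hub,
# use the full repo id "Petercusin/English-news-category-classifier" instead.
tokenizer = DistilBertTokenizer.from_pretrained("English-news-category-classifier")
model = DistilBertForSequenceClassification.from_pretrained("English-news-category-classifier")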

# Step 2: Define a function to preprocess the input text
def preprocess_text(text):
    inputs = tokenizer(text, padding='max_length', truncation=True, return_tensors='pt')
    return inputs

# Step 3: Define a function to make predictions
def predict(text):
    # Preprocess the input text
    inputs = preprocess_text(text)

    # Make predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class probabilities
    logits = outputs.logits
    probabilities = F.softmax(logits, dim=1).squeeze().tolist()
    predicted_class_id = torch.argmax(logits, dim=1).item()

    return predicted_class_id, probabilities

# Step 4: Load the label map from the model's configuration
label_map = model.config.id2label

# Example usage
new_titles = [
    "Stock markets reach all-time high amid economic recovery",
    "Scientists discover new species in Amazon rainforest",
    "Congress passes new bill on healthcare reforms",
    "The stairway to love: Chongqing's real-life fairy tale",
    "African delegation take in Shanghai sights on Huangpu cruise",
    "China expected to achieve higher grain output in 2025: report",
    "China continued its dominance at the 2025 World Aquatics Diving World Cup in Guadalajara, sweeping all four gold medals on the third day of competitions on Saturday, along with one silver.",
    "A 'DeepSeek moment for AI agents' as China launches Manus",
    "Developed by Monica.im, Manus achieved top scores on the GAIA (General AI Assistant) benchmark, exceeding those of OpenAI's GPT (generative pre-trained transformer) tools. GAIA is a real-world benchmark for general AI assistants.",
    "This week and without warning, a horrid video popped up on my phone. A puppy had its mouth and paws bound with tape, and was hanging in a plastic bag by the motorway. I immediately flicked past, but the image stayed with me. This was something I didn’t want to see, yet there it was at 11am on a Tuesday.",
]


for text in new_titles:
    predicted_class_id, probabilities = predict(text)
    predicted_category = label_map[predicted_class_id]
    predicted_probability = probabilities[predicted_class_id]

    print(f"Predicted category: {predicted_category}")
    print(f"Text to classify: {text}")
    print(f"Probability of the predicted category: {predicted_probability:.4f}\n")

Result

Predicted category: BUSINESS
Text to classify: Stock markets reach all-time high amid economic recovery
Probability of the predicted category: 0.5707

Predicted category: SCIENCE
Text to classify: Scientists discover new species in Amazon rainforest
Probability of the predicted category: 0.5186

Predicted category: POLITICS
Text to classify: Congress passes new bill on healthcare reforms
Probability of the predicted category: 0.6175

Predicted category: ARTS
Text to classify: The stairway to love: Chongqing's real-life fairy tale
Probability of the predicted category: 0.2746

Predicted category: WORLDPOST
Text to classify: African delegation take in Shanghai sights on Huangpu cruise
Probability of the predicted category: 0.4686

Predicted category: GREEN
Text to classify: China expected to achieve higher grain output in 2025: report
Probability of the predicted category: 0.2889

Predicted category: SPORTS
Text to classify: China continued its dominance at the 2025 World Aquatics Diving World Cup in Guadalajara, sweeping all four gold medals on the third day of competitions on Saturday, along with one silver.
Probability of the predicted category: 0.4540

Predicted category: TECH
Text to classify: A 'DeepSeek moment for AI agents' as China launches Manus
Probability of the predicted category: 0.3297

Predicted category: TECH
Text to classify: Developed by Monica.im, Manus achieved top scores on the GAIA (General AI Assistant) benchmark, exceeding those of OpenAI's GPT (generative pre-trained transformer) tools. GAIA is a real-world benchmark for general AI assistants.
Probability of the predicted category: 0.8065

Predicted category: GOOD NEWS
Text to classify: This week and without warning, a horrid video popped up on my phone. A puppy had its mouth and paws bound with tape, and was hanging in a plastic bag by the motorway. I immediately flicked past, but the image stayed with me. This was something I didn’t want to see, yet there it was at 11am on a Tuesday.
Probability of the predicted category: 0.1350
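
Several of the predictions above have low probabilities (0.1350 for the last example), so it can help to look at more than the single top category. The following is a minimal sketch of a top-k ranking over softmax probabilities. The helper names `softmax` and `top_k`, the four-label map, and the logits are all hypothetical, chosen only to keep the example self-contained.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (max subtracted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probabilities, id2label, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked[:k]]


# Hypothetical logits and a four-label map, for illustration only.
example_id2label = {0: "BUSINESS", 1: "MONEY", 2: "POLITICS", 3: "TECH"}
example_logits = [2.0, 1.5, 0.1, -1.0]

for label, p in top_k(softmax(example_logits), example_id2label, k=2):
    print(f"{label}: {p:.4f}")
```

With the script above, the same helper should work as `top_k(probabilities, label_map)`, assuming `probabilities` is the list returned by `predict` and `label_map` is keyed by integer class ID.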

4. Training Data

The model was trained on a dataset of news articles categorized into 42 different domains. The categories include:

| ID | Category | ID | Category |
|---|---|---|---|
| 0 | LATINO VOICES | 21 | WORLD NEWS |
| 1 | ARTS | 22 | QUEER VOICES |
| 2 | CULTURE & ARTS | 23 | PARENTING |
| 3 | HOME & LIVING | 24 | MONEY |
| 4 | ARTS & CULTURE | 25 | SPORTS |
| 5 | THE WORLDPOST | 26 | POLITICS |
| 6 | GOOD NEWS | 27 | WELLNESS |
| 7 | FIFTY | 28 | GREEN |
| 8 | CRIME | 29 | BUSINESS |
| 9 | RELIGION | 30 | TECH |
| 10 | PARENTS | 31 | ENVIRONMENT |
| 11 | TASTE | 32 | WOMEN |
| 12 | WORLDPOST | 33 | U.S. NEWS |
| 13 | EDUCATION | 34 | HEALTHY LIVING |
| 14 | ENTERTAINMENT | 35 | DIVORCE |
| 15 | FOOD & DRINK | 36 | MEDIA |
| 16 | TRAVEL | 37 | WEDDINGS |
| 17 | STYLE & BEAUTY | 38 | BLACK VOICES |
| 18 | IMPACT | 39 | STYLE |
| 19 | WEIRD NEWS | 40 | COMEDY |
| 20 | COLLEGE | 41 | SCIENCE |
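
The IDs in this list are what the model returns as class indices. As a sketch, the mapping can be reproduced as a plain dictionary, for example to decode stored predictions without loading the model. The variable names `CATEGORIES`, `id2label`, and `label2id` are illustrative; the category names are transcribed from the list above.

```python
# Category names in ID order (0 to 41), transcribed from the list above.
CATEGORIES = [
    "LATINO VOICES", "ARTS", "CULTURE & ARTS", "HOME & LIVING", "ARTS & CULTURE",
    "THE WORLDPOST", "GOOD NEWS", "FIFTY", "CRIME", "RELIGION",
    "PARENTS", "TASTE", "WORLDPOST", "EDUCATION", "ENTERTAINMENT",
    "FOOD & DRINK", "TRAVEL", "STYLE & BEAUTY", "IMPACT", "WEIRD NEWS",
    "COLLEGE", "WORLD NEWS", "QUEER VOICES", "PARENTING", "MONEY",
    "SPORTS", "POLITICS", "WELLNESS", "GREEN", "BUSINESS",
    "TECH", "ENVIRONMENT", "WOMEN", "U.S. NEWS", "HEALTHY LIVING",
    "DIVORCE", "MEDIA", "WEDDINGS", "BLACK VOICES", "STYLE",
    "COMEDY", "SCIENCE",
]

id2label = dict(enumerate(CATEGORIES))                  # class ID -> name
label2id = {label: i for i, label in id2label.items()}  # name -> class ID

print(len(id2label))        # 42
print(id2label[29])         # BUSINESS
print(label2id["SCIENCE"])  # 41
```

Several names overlap in meaning (for example ARTS, CULTURE & ARTS, and ARTS & CULTURE, or THE WORLDPOST, WORLDPOST, and WORLD NEWS), which may partly explain why the predicted probability is often split across related categories.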

5. Evaluation

The model was evaluated on a test set, and the following metrics were obtained:

  • Evaluation Loss: 1.6844
  • Evaluation Accuracy: 0.5371
  • Evaluation F1 Score: 0.5282
  • Evaluation Precision: 0.5347
  • Evaluation Recall: 0.5371
  • Evaluation Runtime: 584.58 seconds
  • Evaluation Samples per Second: 8.622
  • Evaluation Steps per Second: 0.539
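
The runtime and throughput figures allow a rough estimate of the test set size and evaluation batch size, neither of which is stated in this card. The sketch below derives them by simple arithmetic; the results are estimates inferred from the reported numbers. The chance-level accuracy assumes uniform guessing over the 42 categories and is included only as a reference point for the 0.5371 accuracy.

```python
# Figures reported in the evaluation list above.
runtime_s = 584.58
samples_per_s = 8.622
steps_per_s = 0.539

# Derived estimates (rounded; not stated in the model card).
n_samples = round(runtime_s * samples_per_s)
n_steps = round(runtime_s * steps_per_s)
batch_size = round(samples_per_s / steps_per_s)

# Reference point: a classifier guessing uniformly among 42 categories.
chance_accuracy = 1 / 42

print(n_samples, n_steps, batch_size)             # 5040 315 16
print(f"chance accuracy: {chance_accuracy:.4f}")  # chance accuracy: 0.0238
```

The reported recall equals the reported accuracy, which would be consistent with support-weighted averaging of the per-class metrics, although the averaging method is not stated.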

๐Ÿค 6. Model Card Contact

Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation, Shanghai International Studies University
Email: [email protected]

Model size: 67M params (Safetensors, tensor type F32)