title: DATA103_filipino_spam_detection
emoji: 🚀
colorFrom: red
colorTo: purple
sdk: streamlit
app_port: 8501
app_file: app.py
tags:
  - streamlit
pinned: false
short_description: spam classifier in the filipino context
license: mit

Scam/Spam SMS Checker and the Efficacy of the SIM Registration Act

Problem Statement

Spam text messages in the Philippines have increased in recent years, with more than 6 million SMS scams reported in 2024. In an article published by GMA News Online, Undersecretary Alexander Ramos, executive director of the Cybercrime Investigation and Coordinating Center, said that one widespread scheme used by scammers is the imitation of official brands.

The SIM Registration Act, implemented in 2022, was created to reduce such scams and cybercrimes. However, the problem appears to persist. Through this project, the group aims to detect spam SMS using machine learning techniques, measure how frequently it occurs, and compare that frequency with data from before the implementation of the SIM Registration Act to assess whether the law has been effective.

Dataset Description

Merged Filipino SMS Messages

The merged dataset comprises three SMS message datasets available online, all set in the Filipino context and curated for this project. We avoided the UCI SMS Repository because it does not contain messages received specifically by Filipinos.
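As a rough illustration of the merge step, the sketch below concatenates several SMS datasets and drops duplicate messages. The column names `text` and `label`, the helper `merge_sms_datasets`, and the toy rows are assumptions made for this example; they are not taken from the project's actual data or code.

```python
import pandas as pd


def merge_sms_datasets(frames):
    """Concatenate SMS datasets that share 'text' and 'label' columns (assumed names)."""
    merged = pd.concat(frames, ignore_index=True)
    # Strip surrounding whitespace so trivially different copies count as duplicates.
    merged["text"] = merged["text"].astype(str).str.strip()
    merged = merged.drop_duplicates(subset="text").reset_index(drop=True)
    return merged[["text", "label"]]


# Toy stand-ins for the three source datasets (hypothetical rows).
dataset_a = pd.DataFrame(
    {"text": ["Nanalo ka ng P10,000! I-click ang link."], "label": ["spam"]}
)
dataset_b = pd.DataFrame(
    {
        "text": ["Kita tayo mamaya sa mall.", "Nanalo ka ng P10,000! I-click ang link. "],
        "label": ["ham", "spam"],
    }
)
dataset_c = pd.DataFrame({"text": ["Your OTP is 123456."], "label": ["ham"]})

merged = merge_sms_datasets([dataset_a, dataset_b, dataset_c])
print(merged)
```

Deduplicating on the message text is one possible design choice; it keeps the same message from appearing in both the training and test portions later on.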

How to use app

Simply use one of the trained classifier models to check whether the message entered in the input prompt is spam. Alternatively, you may select a sample from our test.csv file to see how well the models perform on SMS messages.

Model training

The project uses a train-validation-test split with cross-validation (CV) training, with hyperparameters tuned on each fold run.
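One way to read this setup is sketched below with scikit-learn: a test set is held out first, and the validation folds come from cross-validation inside a grid search on the remaining data. The toy messages, the `alpha` grid, the `f1_macro` scoring, and the use of 3 folds (chosen only because the toy set is tiny) are assumptions for illustration, not the project's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages, only to make the sketch runnable.
spam = [
    "Congrats! You won P50,000. Claim now at this link",
    "Nanalo ka ng libreng load! I-click ang link para ma-claim",
    "Your account is locked. Verify now at this link",
    "Libre ang cash bonus mo, mag-register na sa link",
    "You won a prize! Send your OTP to claim",
    "Claim your free reward now, click the link",
    "Mag-claim ng bonus ngayon, i-click ang link",
    "Verify your e-wallet account now to claim cash",
]
ham = [
    "Kita tayo mamaya sa mall pagkatapos ng klase",
    "Nasaan ka na? Nandito na ako sa bahay",
    "Please send the report before lunch tomorrow",
    "Salamat sa tulong mo kahapon",
    "Uwi ka na ba? Magluluto ako ng hapunan",
    "Meeting moved to 3pm, see you there",
    "Happy birthday! Kain tayo sa Sabado",
    "Pakidala ng payong, umuulan dito",
]
texts = spam + ham
labels = ["spam"] * len(spam) + ["ham"] * len(ham)

# Hold out a test set; validation happens inside the CV folds below.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
param_grid = {"clf__alpha": [0.1, 0.5, 1.0]}  # assumed grid
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Every candidate in the grid is fitted and scored on each fold.
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)  # macro F1 on the held-out test set
print(search.best_params_, round(test_score, 3))
```

Placing the vectorizer inside the pipeline means TF-IDF is refitted on the training part of each fold, so no vocabulary statistics leak from the validation fold.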

The group considers four (4) traditional, explainable classifiers commonly used for spam detection: two variants of Naive Bayes (NB), multinomial and complement (the latter noted to handle class imbalance well), Support Vector Machine (SVM), and Random Forest.
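The four classifiers could be set up in scikit-learn roughly as follows. This is a minimal sketch: the helper `build_pipelines`, the dictionary keys, the choice of `LinearSVC` as the SVM variant, the default hyperparameters, and the toy messages are all assumptions for illustration rather than the project's confirmed setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipelines(random_state=42):
    """Return one TF-IDF + classifier pipeline per model (hypothetical helper)."""
    classifiers = {
        "complement_NB": ComplementNB(),
        "multinomial_NB": MultinomialNB(),
        # LinearSVC is an assumed choice; the source only says "Support Vector Machine".
        "support_vector_machine": LinearSVC(random_state=random_state),
        "random_forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
    }
    return {
        name: Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
        for name, clf in classifiers.items()
    }


# Hypothetical toy training data.
train_texts = [
    "You won a prize! Click the link to claim",
    "Nanalo ka ng libreng load, i-click ang link",
    "Claim your free cash bonus now",
    "Verify your account now at this link",
    "Kita tayo mamaya sa mall",
    "Salamat sa tulong mo kahapon",
    "Meeting moved to 3pm, see you there",
    "Uwi ka na ba? Magluluto ako ng hapunan",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

pipelines = build_pipelines()
predictions = {}
for name, pipe in pipelines.items():
    pipe.fit(train_texts, train_labels)
    predictions[name] = pipe.predict(["Claim your free prize now, click the link"])[0]

print(predictions)
```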

The project uses MLflow to track training and artifacts (evaluation metrics) for each run whenever a model is trained. We have put all of this under a single function.

Summary of best model configuration and model metrics

The model training step above already covers how the model metrics are extracted; all evaluation metrics and visualizations are saved as artifacts in MLflow. Presented below is the training summary from the initial run, using preprocessor=tfidf and cv_folds=5 for all models considered in the study.

The models considered are the following: complement_NB (cNB), multinomial_NB (mNB), random_forest (rf), and support_vector_machine (svm).