deploying models trained on oversampled ham messages

- app.py +100 -8
- img/3_run/cNB/confusion_matrix_complementNB_9497e0a34f76466b8c6ab91aa4ec433e.png +0 -0
- img/3_run/cNB/cv_performance_9497e0a34f76466b8c6ab91aa4ec433e.png +0 -0
- img/3_run/mNB/confusion_matrix_multinomialNB_5c755dd20b2c44aa92dff382a9a9073f.png +0 -0
- img/3_run/mNB/cv_performance_5c755dd20b2c44aa92dff382a9a9073f.png +0 -0
- img/3_run/rf/confusion_matrix_random_forest_498f3ef34e954cdeb074bce4766180af.png +0 -0
- img/3_run/rf/cv_performance_498f3ef34e954cdeb074bce4766180af.png +0 -0
- img/3_run/svm/confusion_matrix_svm_05edf44674ee4ffeb25b1284d0a08e83.png +0 -0
- img/3_run/svm/cv_performance_05edf44674ee4ffeb25b1284d0a08e83.png +0 -0
- models/cNB_model_3.pkl +3 -0
- models/mNB_model_3.pkl +3 -0
- models/rf_model_3.pkl +3 -0
- models/svm_model_3.pkl +3 -0
- requirements.txt +1 -0
app.py CHANGED

@@ -8,12 +8,15 @@ import sklearn
 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
 from wordcloud import WordCloud
 from PIL import Image
+from imblearn.pipeline import Pipeline as ImbPipeline
+from imblearn.over_sampling import RandomOverSampler
+

 MODEL_PATHS = {
-    "
-    "
-    "
-    "Complement Naive Bayes": "models/
+    "Random Forest": "models/rf_model_3.pkl",
+    "Multinomial Naive Bayes": "models/mNB_model_3.pkl",
+    "Support Vector Machine": "models/svm_model_3.pkl",
+    "Complement Naive Bayes": "models/cNB_model_3.pkl",
 }

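The new `imblearn` imports accompany the switch to models pickled from resampling pipelines: a pickled `ImbPipeline` can only be unpickled on a machine where `imbalanced-learn` is installed (hence the `requirements.txt` bump at the bottom of this commit). The diff does not show how `MODEL_PATHS` is consumed, so the loader below is only a hedged sketch; the `load_model` helper and the assumption that each `.pkl` holds a fitted text pipeline accepting raw strings are mine, not the app's.

```python
import pickle

# Hypothetical loader -- the app's real loading code is outside this diff.
# pickle.load() re-imports the imblearn/sklearn classes referenced inside
# the file, so unpickling fails with ModuleNotFoundError if
# imbalanced-learn is missing from the environment.
def load_model(path: str):
    with open(path, "rb") as f:
        return pickle.load(f)

# Assumes the pickled object is a full text pipeline (vectorizer included),
# so it can score raw message strings directly.
model = load_model("models/cNB_model_3.pkl")  # any entry from MODEL_PATHS
print(model.predict(["Congratulations! You won a free SIM upgrade."]))
```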
@@ -244,6 +247,13 @@ with DemoTAB:
             st.warning("Unable to compute token contribution for this model.")
     else:
         st.warning("Please input text to classify.")
+    st.markdown("---")
+    st.markdown("""
+    ## Changelogs:
+    - Version 2 (August 2, 2025): Improvements across `precision` and `recall` metrics in training, achieved by randomly oversampling the ham class on `X_train` in the training pipeline using the `imbalanced-learn` package. The latest deployed models were trained under these run parameters.
+    - Version 1 (July 28, 2025): Initial demo of the project with 4 traditional ML classifiers using the TF-IDF vectorizer.
+    """)
+
 with DataCurationTAB:
     st.markdown("""
     Data cleaning and pre-processing are necessary because we are considering three datasets with different contexts. Below is a summary of the data treatment and insights used to produce the versions of the dataset.
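The Version 2 entry credits the metric gains to random oversampling of the ham class. As a minimal illustration of what `RandomOverSampler` does (toy rows and counts, not the project's data), the default `sampling_strategy` duplicates minority-class samples until the classes are balanced; the 2:1 target this project actually states is shown in the pipeline sketch after the app.py diff.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler

# Toy feature rows and labels; 0 = ham (the minority class, as in the run
# tables where support is {0:95, 1:615}) and 1 = spam (the majority).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [1, 1, 1, 1, 1, 0]

# Default sampling_strategy='auto' duplicates minority rows until balanced.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print(Counter(y))      # Counter({1: 5, 0: 1})
print(Counter(y_res))  # Counter({1: 5, 0: 5})
```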
@@ -898,7 +908,7 @@ with TrainingPipelineTAB:

     st.markdown("### Example output of the run cell; see also a sample of the MLflow UI")
     st.code(sample_run, language="python")
-    img1 = Image.open("img/
+    img1 = Image.open("img/mlflow_ui_1.png")
     st.image(img1, caption="MLFlow UI", use_container_width=True)


@@ -910,7 +920,8 @@ with ModelEvaluationTAB:
     With this, presented below is the training summary produced under the initial run parameters `preprocessor=tfidf` and `cv_folds=5` for all models considered in the study.

     Models considered are the following: `complement_NB (cNB)`, `multinomial_NB (mNB)`, `random_forest (rf)`, `support_vector_machine (svm)`
-
+
+    REVISION (August 4, 2025): Upon advice, we oversampled the ham class. Improvements in `precision` and `recall` are observed in the 3rd Run.""")

     st.markdown("""

@@ -923,7 +934,7 @@ with ModelEvaluationTAB:
     """)

     st.markdown("---")
-    st.markdown("## 1st Run")
+    st.markdown("## 1st Run (July 28, 2025)")

     st.markdown(
         """This run used the default `ngram_range` for the TF-IDF vectorizer, which is `(1, 1)`. This means that only single-word tokens `(unigrams)` were considered during feature extraction."""
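As a quick check of that claim, a toy corpus (not from the project's datasets) shows the vocabulary the default `ngram_range=(1, 1)` produces: single words only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["free entry win cash", "see you at lunch"]  # toy messages

vec = TfidfVectorizer()  # default ngram_range=(1, 1)
vec.fit(corpus)
print(vec.get_feature_names_out())
# ['at' 'cash' 'entry' 'free' 'lunch' 'see' 'win' 'you']
```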
@@ -1003,7 +1014,7 @@ with ModelEvaluationTAB:
     st.image(img9, caption="CV Performance", use_container_width=True)

     st.markdown("---")
-    st.write("## 2nd Run")
+    st.write("## 2nd Run (July 28, 2025)")
     st.markdown("""
     For this second run, the `ngram_range` for the TF-IDF vectorizer was changed to `(1, 2)`.
     This means that unigrams from the previous run and two-word tokens `(bigrams)` were considered during feature extraction.
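The same toy corpus as in the 1st Run sketch shows what `(1, 2)` adds: every adjacent word pair joins the unigram vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["free entry win cash", "see you at lunch"]  # same toy messages

vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
vec.fit(corpus)
print(vec.get_feature_names_out())
# ['at' 'at lunch' 'cash' 'entry' 'entry win' 'free' 'free entry'
#  'lunch' 'see' 'see you' 'win' 'win cash' 'you' 'you at']
```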
@@ -1081,6 +1092,86 @@ with ModelEvaluationTAB:
     with col2:
         st.image(img18, caption="CV Performance", use_container_width=True)

+    st.markdown("## 3rd Run (August 4, 2025)")
+    st.markdown("""
+    For this third run, we address the class imbalance by oversampling the ham class using the `imbalanced-learn` package, with the aim that the ham class is twice the size of the spam class during training in the pipeline.
+    Improvements in `precision` and `recall` are observed overall. The performance metrics are shown below, with `random forest` and `complement NB` considered the best models.
+    """)
+
+    st.markdown("""### 3rd Run Validation Accuracy""")
+    st.markdown("""
+    | models | val_accuracy | precision_0 | recall_0 | f1-score_0 | support | hyper_params | tfidf_range |
+    |---|---|---|---|---|---|---|---|
+    | cNB | 0.96 | 0.86 | 0.84 | 0.85 | {0:95, 1:615} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+    | mNB | 0.95 | 0.75 | 0.91 | 0.83 | {0:95, 1:615} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+    | rf | 0.97 | 0.79 | 0.86 | 0.82 | {0:95, 1:615} | 'classifier__n_estimators': [50, 100, 200]<br>'classifier__max_depth': [None, 10, 20]<br>'classifier__min_samples_split': [1, 2, 5, 10] | (1,2) + X_train = {2(ham):spam} |
+    | svm | 0.96 | 0.96 | 0.72 | 0.82 | {0:95, 1:615} | 'classifier__C': [0.1, 1, 10]<br>'classifier__kernel': ['linear', 'rbf'] | (1,2) + X_train = {2(ham):spam} |
+    """)
+
+    st.markdown("""
+    ### 3rd Run Test Accuracy
+
+    | models | test_accuracy | precision_0 | recall_0 | f1-score_0 | support | hyper_params | tfidf_range |
+    |---|---|---|---|---|---|---|---|
+    | cNB | 0.89 | 0.89 | 0.86 | 0.87 | {0:125, 1:889} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+    | mNB | 0.95 | 0.76 | 0.91 | 0.83 | {0:125, 1:889} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+    | rf | 0.96 | 0.80 | 0.92 | 0.86 | {0:125, 1:889} | 'classifier__n_estimators': [50, 100, 200]<br>'classifier__max_depth': [None, 10, 20]<br>'classifier__min_samples_split': [1, 2, 5, 10] | (1,2) + X_train = {2(ham):spam} |
+    | svm | 0.96 | 0.96 | 0.71 | 0.82 | {0:125, 1:889} | 'classifier__C': [0.1, 1, 10]<br>'classifier__kernel': ['linear', 'rbf'] | (1,2) + X_train = {2(ham):spam} |
+    """)
+    st.markdown("#### SVM Performance Metrics")
+    img19 = Image.open(
+        "img/3_run/svm/confusion_matrix_svm_05edf44674ee4ffeb25b1284d0a08e83.png"
+    )
+    img20 = Image.open(
+        "img/3_run/svm/cv_performance_05edf44674ee4ffeb25b1284d0a08e83.png"
+    )
+    col1, col2 = st.columns(2)
+    with col1:
+        st.image(img19, caption="Confusion Matrix", use_container_width=True)
+    with col2:
+        st.image(img20, caption="CV Performance", use_container_width=True)
+
+    st.markdown("#### RF Performance Metrics")
+    img21 = Image.open(
+        "img/3_run/rf/confusion_matrix_random_forest_498f3ef34e954cdeb074bce4766180af.png"
+    )
+    img22 = Image.open(
+        "img/3_run/rf/cv_performance_498f3ef34e954cdeb074bce4766180af.png"
+    )
+    col1, col2 = st.columns(2)
+    with col1:
+        st.image(img21, caption="Confusion Matrix", use_container_width=True)
+    with col2:
+        st.image(img22, caption="CV Performance", use_container_width=True)
+
+    st.markdown("#### Multinomial Naive Bayes Performance Metrics")
+    img23 = Image.open(
+        "img/3_run/mNB/confusion_matrix_multinomialNB_5c755dd20b2c44aa92dff382a9a9073f.png"
+    )
+    img24 = Image.open(
+        "img/3_run/mNB/cv_performance_5c755dd20b2c44aa92dff382a9a9073f.png"
+    )
+    col1, col2 = st.columns(2)
+    with col1:
+        st.image(img23, caption="Confusion Matrix", use_container_width=True)
+    with col2:
+        st.image(img24, caption="CV Performance", use_container_width=True)
+
+    st.markdown("#### Complement Naive Bayes Performance Metrics")
+    img25 = Image.open(
+        "img/3_run/cNB/confusion_matrix_complementNB_9497e0a34f76466b8c6ab91aa4ec433e.png"
+    )
+    img26 = Image.open(
+        "img/3_run/cNB/cv_performance_9497e0a34f76466b8c6ab91aa4ec433e.png"
+    )
+    col1, col2 = st.columns(2)
+    with col1:
+        st.image(img25, caption="Confusion Matrix", use_container_width=True)
+    with col2:
+        st.image(img26, caption="CV Performance", use_container_width=True)
+
 with ConTAB:
     st.markdown("""
     ## Conclusion
@@ -1093,6 +1184,7 @@ with ConTAB:
     - Considers traditional machine learning classifiers of the two NB variants, SVM, and RF.
    - Demo app available on `HuggingFace Space` for further collaboration and feedback with target stakeholders and SIM users. """)

+    st.markdown("---")
     st.markdown("""
     ## Recommendations

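The training script itself is not part of this commit, but the pieces it leaves visible (the `ImbPipeline` and `RandomOverSampler` imports, the `(1, 2)` TF-IDF range, the `classifier__*` grids in the 3rd Run tables, and the stated `2(ham):spam` target) suggest a pipeline along the following lines. This is a hedged sketch, not the project's code; in particular, the `two_to_one_ham` helper and the grid-search wiring are assumptions.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB

# Assumed reading of "X_train = {2(ham):spam}": oversample ham (label 0)
# until it is twice the spam count. sampling_strategy accepts a callable
# that takes y and returns {class_label: desired_count}.
def two_to_one_ham(y):
    counts = Counter(y)
    return {0: 2 * counts[1]}

# imblearn's Pipeline, unlike sklearn's, accepts resampling steps and runs
# them during fit only -- so in cross-validation each validation fold stays
# untouched by the oversampler.
pipe = ImbPipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("sampler", RandomOverSampler(sampling_strategy=two_to_one_ham, random_state=42)),
    ("classifier", ComplementNB()),
])

# The classifier__ prefix seen in the tables' hyper_params column addresses
# the named pipeline step, as GridSearchCV expects.
param_grid = {"classifier__alpha": [0.3, 0.5, 1.0, 1.5]}
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X_train, y_train)  # X_train: raw message strings
```

Swapping `ComplementNB()` for the other estimators, each with its own `classifier__*` grid from the tables, would reproduce the per-model runs.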
img/3_run/cNB/confusion_matrix_complementNB_9497e0a34f76466b8c6ab91aa4ec433e.png ADDED
img/3_run/cNB/cv_performance_9497e0a34f76466b8c6ab91aa4ec433e.png ADDED
img/3_run/mNB/confusion_matrix_multinomialNB_5c755dd20b2c44aa92dff382a9a9073f.png ADDED
img/3_run/mNB/cv_performance_5c755dd20b2c44aa92dff382a9a9073f.png ADDED
img/3_run/rf/confusion_matrix_random_forest_498f3ef34e954cdeb074bce4766180af.png ADDED
img/3_run/rf/cv_performance_498f3ef34e954cdeb074bce4766180af.png ADDED
img/3_run/svm/confusion_matrix_svm_05edf44674ee4ffeb25b1284d0a08e83.png ADDED
img/3_run/svm/cv_performance_05edf44674ee4ffeb25b1284d0a08e83.png ADDED
models/cNB_model_3.pkl ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:32371782b918891e72f0f771ab1fbe31926dfa88f3102e822aefb4b20b825fd5
+size 1119394
models/mNB_model_3.pkl ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9b82ab40ba9273c87bad415634d7b84e7a5284c8b12a7562abbec0c36b9a0333
+size 984278
models/rf_model_3.pkl ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:466411d34de5afeeba78b64984f7fdc2ee3dfd17b98fd84fab87e89a2b09c02b
+size 4130299
models/svm_model_3.pkl ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3b47ef7c63836664c4c56797fa231db1557ecfeace0727e522e8246040d469b1
+size 801049
requirements.txt CHANGED

@@ -5,3 +5,4 @@ scikit-learn>=1.0
 plotly>=5.9.0
 wordcloud>=1.8.1
 Pillow>=8.0.0
+imbalanced-learn>=0.12.4