ferds003 committed
Commit 020c30f · 1 Parent(s): 04d8f1c

deploying models trained on oversampled ham messages
app.py CHANGED
@@ -8,12 +8,15 @@ import sklearn
  from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
  from wordcloud import WordCloud
  from PIL import Image

  MODEL_PATHS = {
- "Support Vector Machine": "models/svm_model_2.pkl",
- "Random Forest": "models/rf_model_2.pkl",
- "Multinomial Naive Bayes": "models/mNB_model_2.pkl",
- "Complement Naive Bayes": "models/cNB_model_2.pkl",
  }

@@ -244,6 +247,13 @@ with DemoTAB:
  st.warning("Unable to compute token contribution for this model.")
  else:
  st.warning("Please input text to classify.")

  with DataCurationTAB:
  st.markdown("""
  Data cleaning and pre-processing are necessary as we are considering three datasets with different contexts. Below is a summary of the data treatment and the insights used to make the versions of the dataset.
@@ -898,7 +908,7 @@ with TrainingPipelineTAB:

  st.markdown("### Example output of a run of the cell; see also a sample of the MLflow UI")
  st.code(sample_run, language="python")
- img1 = Image.open("img/mlflow_ui.png")
  st.image(img1, caption="MLflow UI", use_container_width=True)

@@ -910,7 +920,8 @@ with ModelEvaluationTAB:
  With this, presented below is the training summary done under the initial run parameters `preprocessor=tfidf` and `cv_folds=5` for all models considered in the study.

  Models considered are the following: `complement_NB (cNB)`, `multinomial_NB (mNB)`, `random_forest (rf)`, `support_vector_machine (svm)`
- """)

  st.markdown("""
@@ -923,7 +934,7 @@ with ModelEvaluationTAB:
  """)

  st.markdown("---")
- st.markdown("## 1st Run")

  st.markdown(
  """This run used the default `ngram_range` for the TF-IDF vectorizer, which is `(1, 1)`. This means that only single-word tokens `(unigrams)` were considered during feature extraction."""
@@ -1003,7 +1014,7 @@ with ModelEvaluationTAB:
  st.image(img9, caption="CV Performance", use_container_width=True)

  st.markdown("---")
- st.write("## 2nd Run")
  st.markdown("""
  For this second run, the `ngram_range` for the TF-IDF vectorizer was changed to `(1, 2)`.
  This means that unigrams from the previous run and two-word tokens `(bigrams)` were considered during feature extraction.
@@ -1081,6 +1092,86 @@ with ModelEvaluationTAB:
  with col2:
  st.image(img18, caption="CV Performance", use_container_width=True)

  with ConTAB:
  st.markdown("""
  ## Conclusion
@@ -1093,6 +1184,7 @@ with ConTAB:
  - Considers traditional machine learning classifiers of the two NB variants, SVM, and RF.
  - Demo app available in `HuggingFace Space` for further collaboration and feedback to target stakeholders and SIM users. """)

  st.markdown("""
  ## Recommendations

  from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
  from wordcloud import WordCloud
  from PIL import Image
+ from imblearn.pipeline import Pipeline as ImbPipeline
+ from imblearn.over_sampling import RandomOverSampler
+

  MODEL_PATHS = {
+ "Random Forest": "models/rf_model_3.pkl",
+ "Multinomial Naive Bayes": "models/mNB_model_3.pkl",
+ "Support Vector Machine": "models/svm_model_3.pkl",
+ "Complement Naive Bayes": "models/cNB_model_3.pkl",
  }
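The new `MODEL_PATHS` entries point at the pickled version-3 models added in this commit. The loading code itself is outside this diff's hunks; a minimal sketch, assuming plain `pickle` serialization and a hypothetical `load_model` helper, might look like:

```python
# Hedged sketch: how the app might load one of the version-3 pickles
# registered in MODEL_PATHS. load_model and plain pickle serialization
# are assumptions; the actual loading code is not shown in this diff.
import pickle

MODEL_PATHS = {
    "Random Forest": "models/rf_model_3.pkl",
    "Multinomial Naive Bayes": "models/mNB_model_3.pkl",
    "Support Vector Machine": "models/svm_model_3.pkl",
    "Complement Naive Bayes": "models/cNB_model_3.pkl",
}

def load_model(name: str):
    """Load the pickled pipeline registered under a display name."""
    with open(MODEL_PATHS[name], "rb") as f:
        return pickle.load(f)
```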
  st.warning("Unable to compute token contribution for this model.")
  else:
  st.warning("Please input text to classify.")
+ st.markdown("---")
+ st.markdown("""
+ ## Changelogs:
+ - Version 2 (August 2, 2025): Improvements across `precision` and `recall` metrics in training by randomly oversampling the ham class in `X_train` within the training pipeline, using the `imbalanced-learn` package. The latest deployed models were trained under these run params.
+ - Version 1 (July 28, 2025): Initial demo of the project with 4 traditional ML classifiers using a TF-IDF vectorizer.
+ """)
+

  with DataCurationTAB:
  st.markdown("""
  Data cleaning and pre-processing are necessary as we are considering three datasets with different contexts. Below is a summary of the data treatment and the insights used to make the versions of the dataset.
 
  st.markdown("### Example output of a run of the cell; see also a sample of the MLflow UI")
  st.code(sample_run, language="python")
+ img1 = Image.open("img/mlflow_ui_1.png")
  st.image(img1, caption="MLflow UI", use_container_width=True)

 
  With this, presented below is the training summary done under the initial run parameters `preprocessor=tfidf` and `cv_folds=5` for all models considered in the study.

  Models considered are the following: `complement_NB (cNB)`, `multinomial_NB (mNB)`, `random_forest (rf)`, `support_vector_machine (svm)`
+
+ REVISION (August 4, 2025): Upon advice, we oversampled the ham class. Improvements in `precision` and `recall` are observed in the 3rd Run.""")

  st.markdown("""

934
  """)
935
 
936
  st.markdown("---")
937
+ st.markdown("## 1st Run (July 28, 2025)")
938
 
939
  st.markdown(
940
  """This run used the default `ngram_range` for the TF-IDF vectorizer, which is `(1, 1)`. This means that only single-word tokens `(unigrams)` were considered during feature extraction."""
 
  st.image(img9, caption="CV Performance", use_container_width=True)

  st.markdown("---")
+ st.write("## 2nd Run (July 28, 2025)")
  st.markdown("""
  For this second run, the `ngram_range` for the TF-IDF vectorizer was changed to `(1, 2)`.
  This means that unigrams from the previous run and two-word tokens `(bigrams)` were considered during feature extraction.
 
  with col2:
  st.image(img18, caption="CV Performance", use_container_width=True)

+ st.markdown("## 3rd Run (August 4, 2025)")
+ st.markdown("""
+ For this third run, we address the class imbalance by oversampling the ham class using the `imbalanced-learn` package, aiming for the ham class to be twice the size of the spam class during training in the pipeline.
+ Improvements in `precision` and `recall` are observed overall. The performance metrics are shown below, with `random forest` and `complement NB` considered the best models.
+ """)
+
+ st.markdown("""### 3rd_Run Validation Accuracy""")
+ st.markdown("""
+ | models | val_accuracy | precision_0 | recall_0 | f1-score_0 | support | hyper_params | tfidf_range |
+ |---|---|---|---|---|---|---|---|
+ | cNB | 0.96 | 0.86 | 0.84 | 0.85 | {0:95,1:615} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+ | mNB | 0.95 | 0.75 | 0.91 | 0.83 | {0:95,1:615} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+ | rf | 0.97 | 0.79 | 0.86 | 0.82 | {0:95,1:615} | 'classifier__n_estimators':[50, 100, 200]<br>'classifier__max_depth':[None, 10, 20]<br>'classifier__min_samples_split':[1,2,5,10] | (1,2) + X_train = {2(ham):spam} |
+ | svm | 0.96 | 0.96 | 0.72 | 0.82 | {0:95,1:615} | 'classifier__C': [0.1, 1, 10]<br>'classifier__kernel': ['linear','rbf'] | (1,2) + X_train = {2(ham):spam} |
+
+ """)
+
+ st.markdown("""
+
+ ### 3rd_Run Test Accuracy
+
+ | models | test_accuracy | precision_0 | recall_0 | f1-score_0 | support | hyper_params | tfidf_range |
+ |---|---|---|---|---|---|---|---|
+ | cNB | 0.89 | 0.89 | 0.86 | 0.87 | {0:125,1:889} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+ | mNB | 0.95 | 0.76 | 0.91 | 0.83 | {0:125,1:889} | 'classifier__alpha': [0.3, 0.5, 1.0, 1.5] | (1,2) + X_train = {2(ham):spam} |
+ | rf | 0.96 | 0.80 | 0.92 | 0.86 | {0:125,1:889} | 'classifier__n_estimators':[50, 100, 200]<br>'classifier__max_depth':[None, 10, 20]<br>'classifier__min_samples_split':[1,2,5,10] | (1,2) + X_train = {2(ham):spam} |
+ | svm | 0.96 | 0.96 | 0.71 | 0.82 | {0:125,1:889} | 'classifier__C': [0.1, 1, 10]<br>'classifier__kernel': ['linear','rbf'] | (1,2) + X_train = {2(ham):spam} |
+ """)
+ st.markdown("#### SVM Performance Metrics")
+ img19 = Image.open(
+ "img/3_run/svm/confusion_matrix_svm_05edf44674ee4ffeb25b1284d0a08e83.png"
+ )
+ img20 = Image.open(
+ "img/3_run/svm/cv_performance_05edf44674ee4ffeb25b1284d0a08e83.png"
+ )
+ col1, col2 = st.columns(2)
+ with col1:
+ st.image(img19, caption="Confusion Matrix", use_container_width=True)
+ with col2:
+ st.image(img20, caption="CV Performance", use_container_width=True)
+
+ st.markdown("#### RF Performance Metrics")
+ img21 = Image.open(
+ "img/3_run/rf/confusion_matrix_random_forest_498f3ef34e954cdeb074bce4766180af.png"
+ )
+ img22 = Image.open(
+ "img/3_run/rf/cv_performance_498f3ef34e954cdeb074bce4766180af.png"
+ )
+ col1, col2 = st.columns(2)
+ with col1:
+ st.image(img21, caption="Confusion Matrix", use_container_width=True)
+ with col2:
+ st.image(img22, caption="CV Performance", use_container_width=True)
+
+ st.markdown("#### Multinomial Naive Bayes Performance Metrics")
+ img23 = Image.open(
+ "img/3_run/mNB/confusion_matrix_multinomialNB_5c755dd20b2c44aa92dff382a9a9073f.png"
+ )
+ img24 = Image.open(
+ "img/3_run/mNB/cv_performance_5c755dd20b2c44aa92dff382a9a9073f.png"
+ )
+ col1, col2 = st.columns(2)
+ with col1:
+ st.image(img23, caption="Confusion Matrix", use_container_width=True)
+ with col2:
+ st.image(img24, caption="CV Performance", use_container_width=True)
+
+ st.markdown("#### Complement Naive Bayes Performance Metrics")
+ img25 = Image.open(
+ "img/3_run/cNB/confusion_matrix_complementNB_9497e0a34f76466b8c6ab91aa4ec433e.png"
+ )
+ img26 = Image.open(
+ "img/3_run/cNB/cv_performance_9497e0a34f76466b8c6ab91aa4ec433e.png"
+ )
+ col1, col2 = st.columns(2)
+ with col1:
+ st.image(img25, caption="Confusion Matrix", use_container_width=True)
+ with col2:
+ st.image(img26, caption="CV Performance", use_container_width=True)
+
  with ConTAB:
  st.markdown("""
  ## Conclusion

  - Considers traditional machine learning classifiers of the two NB variants, SVM, and RF.
  - Demo app available in `HuggingFace Space` for further collaboration and feedback to target stakeholders and SIM users. """)
+ st.markdown("---")
  st.markdown("""
  ## Recommendations

img/3_run/cNB/confusion_matrix_complementNB_9497e0a34f76466b8c6ab91aa4ec433e.png ADDED
img/3_run/cNB/cv_performance_9497e0a34f76466b8c6ab91aa4ec433e.png ADDED
img/3_run/mNB/confusion_matrix_multinomialNB_5c755dd20b2c44aa92dff382a9a9073f.png ADDED
img/3_run/mNB/cv_performance_5c755dd20b2c44aa92dff382a9a9073f.png ADDED
img/3_run/rf/confusion_matrix_random_forest_498f3ef34e954cdeb074bce4766180af.png ADDED
img/3_run/rf/cv_performance_498f3ef34e954cdeb074bce4766180af.png ADDED
img/3_run/svm/confusion_matrix_svm_05edf44674ee4ffeb25b1284d0a08e83.png ADDED
img/3_run/svm/cv_performance_05edf44674ee4ffeb25b1284d0a08e83.png ADDED
models/cNB_model_3.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:32371782b918891e72f0f771ab1fbe31926dfa88f3102e822aefb4b20b825fd5
+ size 1119394

models/mNB_model_3.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9b82ab40ba9273c87bad415634d7b84e7a5284c8b12a7562abbec0c36b9a0333
+ size 984278

models/rf_model_3.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:466411d34de5afeeba78b64984f7fdc2ee3dfd17b98fd84fab87e89a2b09c02b
+ size 4130299

models/svm_model_3.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3b47ef7c63836664c4c56797fa231db1557ecfeace0727e522e8246040d469b1
+ size 801049
requirements.txt CHANGED
@@ -5,3 +5,4 @@ scikit-learn>=1.0
  plotly>=5.9.0
  wordcloud>=1.8.1
  Pillow>=8.0.0
+ imbalanced-learn>=0.12.4