kangourous commited on
Commit
f15e87b
·
verified ·
1 Parent(s): 367e849

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +65 -37
README.md CHANGED
@@ -8,71 +8,99 @@ pinned: false
8
  ---
9
 
10
 
11
- # Audio classification - XGBoost and small deep neural network
12
 
13
- This is an ensemble of 2 models (XGBoost and small CNN) for the Audio task of the Frugal AI Challenge 2024. Instead of giving binary labels (0 or 1), both models predict a probability, between 0 and 1. This allow to make a tradeof between precision and recall by setting a threshold between 0 and 1.
14
- Both models are trained independently, and their prediction has been averaged.
15
 
16
- ### Intended Use
 
 
17
 
18
- - **Primary intended uses**: Identify illegal logging in forests
19
 
 
20
 
21
  ## Training Data
22
 
23
- The model uses the rfcx/frugalai dataset:
24
- - Size: ~50000 examples
25
- - Split: 70% train, 30% test
26
- - Binary labels | 0: Chainsaw; 1: Environnement
 
 
 
 
27
 
28
- ### Data Loading
 
 
 
29
 
30
- Most of the audio are 3 seconds long, with a sampling rate of 12000, wich means that each row of the dataset contains 36000 elements.
31
- A few files have a bigger sampling rate, so we resample them at 12000. For audios that are smaller than 3 seconds, we add a reverse padding at the end.
32
- Raw audio data are stored in a numpy array of size (n, 36000), with float 16 precision to gain in memory usage (there is no drop in precision compared to float 32).
33
 
34
- ## Model Description : XGBoost
35
- This is an XGBoost Regressor, with probability in output, composed of 3000 trees.
36
 
37
- ### Input features
38
 
39
- XGBoost uses as input features:
40
- - **MFCC** : 55 mfcc are retains. A window size of 1024 is used for nfft. I took the mean and standard deviation along the spatial axis (110 features)
41
- - **Mel Spectrogram** : The mel spectrogram is calculated with a window size of 1024 for nfft, and 55 mel.
42
- I took the mean and standard deviation along the spatial axis (110 features). In addition, I added the standard deviation of the delta coefficient of the spectrogram,
43
- in order to capture the caracteristic signature of the chainsaw sound, when it goes from idle to full load (55 more features).
44
 
 
 
 
 
 
 
 
 
 
 
45
 
46
  ### Training Details
47
- I used the python library xgboost. I trained the model using cuda, with a learning rate of 0.02. No data augmentation were used.
48
- The notebook used for the training : [notebook](/spaces/kangourous/submission-audio-task/blob/main/notebooks/XGBoost_train.ipynb)
49
 
 
 
 
 
 
 
 
 
50
 
51
- ## Model Description : CNN
52
- This is an CNN, with a sigmoid activation, of almost 1M parameters.
53
 
54
- ### Input features
55
- This model uses as input features **Log Mel Spectrogram**, calculated with a window size of 1024 for nfft, and 100 mel.
 
 
 
 
56
 
57
  ### Training Details
58
- Pytorch was used to trained this model, with an Adam Optimizer with a learning rate of 0.001, a Binary Cross Entropy Loss. I randomly added sound labeled
59
- as environnement, to add more noise to the dataset, without changing the labels.
60
 
61
- The notebook used for the training : [notebook](/spaces/kangourous/submission-audio-task/blob/main/notebooks/CNN_training.ipynb)
 
 
 
 
62
 
 
 
 
63
 
64
  ## Performance
65
- In this challenge we mesure accuracy and energy consumption of the model. The generation of the spectrograms and MFCCs, (and model inference of course) are included in the energy consumption tracking.
66
- However, I didn't include the data loading part.
67
 
68
  ### Metrics
69
- - **XGBoost Accuracy**: ~95.3%
70
- - **CNN Accuracy**: ~95.7%
71
- - **Total Accuracy**: ~96.1%
72
- - **Total Energy consumption in Wh (on nvidia T4)**: ~0.164
73
 
74
- ![image](/spaces/kangourous/submission-audio-task/resolve/main/recall_precision.png)
 
 
 
 
 
75
 
 
76
 
77
  ## Environmental Impact
78
 
 
8
  ---
9
 
10
 
11
+ # Audio Classification - XGBoost and Small Deep Neural Network
12
 
13
+ This is an ensemble of two models (XGBoost and a small CNN) for the audio classification task of the Frugal AI Challenge 2024. Instead of providing binary labels (0 or 1), both models predict a probability between 0 and 1. This allows a trade-off between precision and recall by setting a threshold within this range. Both models are trained independently, and their predictions are averaged.
 
14
 
15
+ ---
16
+
17
+ ## Intended Use
18
 
19
+ - **Primary intended use**: Identifying illegal logging in forests.
20
 
21
+ ---
22
 
23
  ## Training Data
24
 
25
+ The model uses the `rfcx/frugalai` dataset:
26
+ - **Size**: ~50,000 examples.
27
+ - **Split**: 70% training, 30% testing.
28
+ - **Binary labels**:
29
+ - `0`: Chainsaw.
30
+ - `1`: Environment.
31
+
32
+ ### Data Preprocessing
33
 
34
+ Most audio samples are 3 seconds long, with a sampling rate of 12,000 Hz. This means each row of the dataset contains 36,000 elements.
35
+ - **Resampling**: Audio files with a higher sampling rate are downsampled to 12,000 Hz.
36
+ - **Padding**: Audio files shorter than 3 seconds are padded with their reversed signal at the end.
37
+ - **Storage**: Raw audio data is stored in a NumPy array of size `(n, 36,000)` with `float16` precision to reduce memory usage without significant precision loss compared to `float32`.
38
 
39
+ ---
 
 
40
 
41
+ ## Model Description: XGBoost
 
42
 
43
+ This is an XGBoost regressor that outputs probabilities. It consists of 3,000 trees.
44
 
45
+ ### Input Features
 
 
 
 
46
 
47
+ XGBoost uses the following input features:
48
+ - **MFCC**:
49
+ - 55 MFCCs are retained.
50
+ - Calculated with a window size of 1,024 for `nfft`.
51
+ - Mean and standard deviation are taken along the spatial axis (resulting in 110 features).
52
+ - **Mel Spectrogram**:
53
+ - Calculated with a window size of 1,024 for `nfft` and 55 mel bands.
54
+ - Mean and standard deviation along the spatial axis (110 features).
55
+ - Standard deviation of the delta coefficients of the spectrogram (55 additional features). This captures the characteristic signature of chainsaw sounds transitioning from idle to full load.
56
+ - (See [Exploratory Data Analysis](/spaces/kangourous/submission-audio-task/blob/main/notebooks/EDA.ipynb))
57
 
58
  ### Training Details
 
 
59
 
60
+ - Framework: Python library `xgboost`.
61
+ - Training on GPU using CUDA.
62
+ - Learning rate: `0.02`.
63
+ - No data augmentation was used.
64
+
65
+ **Training notebook**: [XGBoost Training Notebook](/spaces/kangourous/submission-audio-task/blob/main/notebooks/XGBoost_train.ipynb)
66
+
67
+ ---
68
 
69
+ ## Model Description: CNN
 
70
 
71
+ This is a small convolutional neural network (CNN) with sigmoid activation and approximately 1M parameters.
72
+
73
+ ### Input Features
74
+
75
+ The CNN uses **Log Mel Spectrograms** as input features:
76
+ - Calculated with a window size of 1,024 for `nfft` and 100 mel bands.
77
 
78
  ### Training Details
 
 
79
 
80
+ - Framework: PyTorch.
81
+ - Optimizer: Adam.
82
+ - Learning rate: `0.001`.
83
+ - Loss function: Binary Cross-Entropy Loss.
84
+ - Data augmentation: Additional environment-labeled sounds were added to increase dataset noise without modifying the labels.
85
 
86
+ **Training notebook**: [CNN Training Notebook](/spaces/kangourous/submission-audio-task/blob/main/notebooks/CNN_training.ipynb)
87
+
88
+ ---
89
 
90
  ## Performance
91
+
92
+ In this challenge, accuracy and energy consumption are measured. The generation of spectrograms, MFCCs, and model inference are included in the energy consumption tracking. However, data loading is not included.
93
 
94
  ### Metrics
 
 
 
 
95
 
96
+ - **XGBoost Accuracy**: ~95.3%
97
+ - **CNN Accuracy**: ~95.7%
98
+ - **Ensemble Accuracy**: ~96.1%
99
+ - **Total Energy Consumption (in Wh, on NVIDIA T4)**: ~0.164
100
+
101
+ ![Precision-Recall Curve](/spaces/kangourous/submission-audio-task/resolve/main/recall_precision.png)
102
 
103
+ ---
104
 
105
  ## Environmental Impact
106