# Audio Classification - XGBoost and Small Deep Neural Network

This is an ensemble of two models (XGBoost and a small CNN) for the audio classification task of the Frugal AI Challenge 2024. Instead of providing binary labels (0 or 1), both models predict a probability between 0 and 1. This allows a trade-off between precision and recall by setting a threshold within this range. Both models are trained independently, and their predictions are averaged.
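
A minimal sketch of the averaging and thresholding described above (the array values are illustrative, not real model outputs):

```python
import numpy as np

# Illustrative per-clip probabilities from each independently trained model.
xgb_probs = np.array([0.92, 0.10, 0.55])
cnn_probs = np.array([0.88, 0.05, 0.70])

# Ensemble prediction: the plain average of the two probabilities.
ensemble_probs = (xgb_probs + cnn_probs) / 2

# Any threshold in (0, 1) trades precision against recall.
threshold = 0.5
labels = (ensemble_probs >= threshold).astype(int)
```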

---

## Intended Use

- **Primary intended use**: Identifying illegal logging in forests.

---

## Training Data

The model uses the `rfcx/frugalai` dataset (a loading sketch follows this list):

- **Size**: ~50,000 examples.
- **Split**: 70% training, 30% testing.
- **Binary labels**:
  - `0`: Chainsaw.
  - `1`: Environment.
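
A minimal loading sketch using the Hugging Face `datasets` library; the split and field names below are assumptions, not documented guarantees:

```python
from datasets import load_dataset

# Load by the dataset identifier used above.
ds = load_dataset("rfcx/frugalai")

example = ds["train"][0]  # split name assumed
label = example["label"]  # 0: chainsaw, 1: environment
audio = example["audio"]  # raw audio payload (field name assumed)
```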

### Data Preprocessing

Most audio samples are 3 seconds long, with a sampling rate of 12,000 Hz, so each row of the dataset contains 36,000 elements. The steps below are sketched in code after the list.

- **Resampling**: Audio files with a higher sampling rate are downsampled to 12,000 Hz.
- **Padding**: Audio files shorter than 3 seconds are padded with their reversed signal at the end.
- **Storage**: Raw audio data is stored in a NumPy array of shape `(n, 36000)` with `float16` precision to reduce memory usage without significant precision loss compared to `float32`.
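
A minimal sketch of these three steps, assuming `librosa` for resampling (the authors' exact helpers live in the training notebooks):

```python
import numpy as np
import librosa

TARGET_SR = 12_000
TARGET_LEN = 3 * TARGET_SR  # 36,000 samples per clip

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    # Resampling: bring higher-rate files down to 12,000 Hz.
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    # Padding: append the reversed signal to clips shorter than 3 s
    # (assumes clips are at least half the target length; shorter ones
    # would need repeated padding).
    if audio.size < TARGET_LEN:
        audio = np.concatenate([audio, audio[::-1][: TARGET_LEN - audio.size]])
    return audio[:TARGET_LEN]

# Storage: one float16 array of shape (n, 36000) halves memory vs float32.
# clips = np.stack([preprocess(a, sr) for a, sr in raw_data]).astype(np.float16)
```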

---

## Model Description: XGBoost

This is an XGBoost regressor that outputs probabilities. It consists of 3,000 trees.

### Input Features

XGBoost uses the following input features (sketched in code after this list):

- **MFCC**:
  - 55 MFCCs are retained.
  - Calculated with a window size of 1,024 for `nfft`.
  - Mean and standard deviation are taken along the time axis (resulting in 110 features).
- **Mel Spectrogram**:
  - Calculated with a window size of 1,024 for `nfft` and 55 mel bands.
  - Mean and standard deviation along the time axis (110 features).
  - Standard deviation of the delta coefficients of the spectrogram (55 additional features). This captures the characteristic signature of chainsaw sounds transitioning from idle to full load.

(See the [Exploratory Data Analysis](/spaces/kangourous/submission-audio-task/blob/main/notebooks/EDA.ipynb) notebook.)
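
A minimal sketch of this extraction with `librosa`; hop length and other unstated parameters are left at library defaults (an assumption):

```python
import numpy as np
import librosa

def extract_features(audio: np.ndarray, sr: int = 12_000) -> np.ndarray:
    # 55 MFCCs with an FFT window of 1,024 samples.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=55, n_fft=1024)
    # Mel spectrogram with 55 mel bands and the same window.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=55, n_fft=1024)
    # Delta (first-order difference) of the spectrogram over time.
    delta = librosa.feature.delta(mel)
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),  # 110 MFCC features
        mel.mean(axis=1), mel.std(axis=1),    # 110 mel features
        delta.std(axis=1),                    # 55 delta features
    ])                                        # 275 features per clip
```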

### Training Details

- Framework: Python library `xgboost`.
- Training on GPU using CUDA.
- Learning rate: `0.02`.
- No data augmentation was used.

**Training notebook**: [XGBoost Training Notebook](/spaces/kangourous/submission-audio-task/blob/main/notebooks/XGBoost_train.ipynb)
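
A minimal configuration sketch using the `xgboost` scikit-learn API (assumes `xgboost` >= 2.0 for the `device` argument; the objective and tree method are not stated in this card):

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=3000,   # 3,000 trees
    learning_rate=0.02,
    device="cuda",       # train on the GPU
)
# model.fit(X_train, y_train)
# probs = model.predict(X_test).clip(0, 1)  # keep outputs in [0, 1]
```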

---

## Model Description: CNN

This is a small convolutional neural network (CNN) with a sigmoid output activation and approximately 1M parameters.

### Input Features

The CNN uses **log mel spectrograms** as input features (see the sketch after this list):

- Calculated with a window size of 1,024 for `nfft` and 100 mel bands.
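
A minimal PyTorch sketch of a network of this shape; the layer sizes are illustrative assumptions, not the authors' exact architecture (see the training notebook below for that):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative CNN over log mel spectrograms of shape (1, 100, frames)."""

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(1)
        # Sigmoid turns the logit into a probability in [0, 1].
        return torch.sigmoid(self.head(x))
```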

### Training Details

- Framework: PyTorch.
- Optimizer: Adam.
- Learning rate: `0.001`.
- Loss function: Binary Cross-Entropy Loss.
- Data augmentation: Additional environment-labeled sounds were added to increase dataset noise without modifying the labels.

**Training notebook**: [CNN Training Notebook](/spaces/kangourous/submission-audio-task/blob/main/notebooks/CNN_training.ipynb)
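
A minimal training-step sketch matching these settings; `SmallCNN` refers to the illustrative model sketched above, and batch shapes are assumptions:

```python
import torch
from torch import nn

model = SmallCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()  # binary cross-entropy on sigmoid outputs

def train_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    # batch: (B, 1, 100, frames) log mel spectrograms;
    # targets: (B, 1) floats in {0.0, 1.0}.
    optimizer.zero_grad()
    loss = criterion(model(batch), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```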

---

## Performance

In this challenge, both accuracy and energy consumption are measured. Spectrogram generation, MFCC computation, and model inference are included in the energy consumption tracking; data loading is not.
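
A minimal sketch of wrapping inference in an energy tracker; using `codecarbon` here is an assumption about the tooling, and the helper names are hypothetical:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
tracker.start()

# features = make_spectrograms_and_mfccs(clips)  # hypothetical helper
# probs = (xgb_model.predict(features) + cnn_model(batches)) / 2

tracker.stop()  # writes energy/emissions data to emissions.csv by default
```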

### Metrics

- **XGBoost Accuracy**: ~95.3%
- **CNN Accuracy**: ~95.7%
- **Ensemble Accuracy**: ~96.1%
- **Total Energy Consumption (in Wh, on NVIDIA T4)**: ~0.164

![Accuracy and energy consumption](https://huggingface.co/spaces/kangourous/submission-audio-task/resolve/main/images/accuracy.png)

---

## Environmental Impact