Update README.md
@@ -34,26 +34,26 @@ The model uses the `rfcx/frugalai` dataset:
Most audio samples are 3 seconds long, with a sampling rate of 12,000 Hz. This means each row of the dataset contains 36,000 elements.

- **Resampling**: Audio files with a higher sampling rate are downsampled to 12,000 Hz.
- **Padding**: Audio files shorter than 3 seconds are padded with their reversed signal at the end.
- **Storage**: Raw audio data is stored in a NumPy array of size `(n, 36000)` with `float16` precision to reduce memory usage without significant precision loss compared to `float32`.
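The preprocessing code itself isn't shown in this excerpt; a minimal sketch of the three steps above, assuming `scipy.signal.resample_poly` for the resampling step (the notebooks may use a different resampler):

```python
import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 12_000          # target sampling rate (Hz)
TARGET_LEN = 3 * TARGET_SR  # 3 s -> 36,000 samples per row

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    """Resample to 12 kHz, pad short clips with their reversed signal, cast to float16."""
    if sr != TARGET_SR:
        # Downsample by the rational factor TARGET_SR / sr.
        audio = resample_poly(audio, TARGET_SR, sr)
    while len(audio) < TARGET_LEN:
        # Append the reversed signal until the clip reaches 3 s.
        audio = np.concatenate([audio, audio[::-1][: TARGET_LEN - len(audio)]])
    return audio[:TARGET_LEN].astype(np.float16)

# Rows are stacked into the (n, 36000) float16 array described above.
dataset = np.stack([
    preprocess(np.random.randn(72_000), 24_000),  # 3 s at 24 kHz -> downsampled
    preprocess(np.random.randn(24_000), 12_000),  # 2 s at 12 kHz -> padded
])
```

Reversed-signal padding keeps the waveform continuous at the clip boundary, unlike the hard discontinuity zero-padding would introduce.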
---
## Model Description: XGBoost
This is an XGBoost regressor that outputs probabilities. It consists of 3000 trees.
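The excerpt states only the tree count; a hedged sketch of how such a model might be configured with the `xgboost` package — every parameter except `n_estimators=3000` is an illustrative assumption:

```python
# Only n_estimators is stated in this README; the other entries are assumptions.
xgb_params = {
    "n_estimators": 3000,            # 3000 boosted trees (from the README)
    "objective": "binary:logistic",  # lets a regressor emit probabilities in [0, 1]
    "tree_method": "hist",           # common fast default, not confirmed by the source
}
# model = xgboost.XGBRegressor(**xgb_params)  # fit on the 275-dim feature vectors below
# probabilities = model.predict(features)     # values in [0, 1]
```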
### Input Features
XGBoost uses the following input features:
- **MFCC**:
  - 55 MFCCs are retained.
  - Calculated with a window size of 1024 for `nfft`.
  - Mean and standard deviation are taken along the time axis (resulting in 110 features).
- **Mel Spectrogram**:
  - Calculated with a window size of 1024 for `nfft` and 55 mel bands.
  - Mean and standard deviation along the time axis (110 features).
  - Standard deviation of the delta coefficients of the spectrogram (55 additional features). This captures the characteristic signature of chainsaw sounds transitioning from idle to full load.

(See the [Exploratory Data Analysis](/spaces/kangourous/submission-audio-task/blob/main/notebooks/EDA.ipynb) notebook.)
### Training Details

@@ -81,7 +81,8 @@ The CNN uses **Log Mel Spectrograms** as input features:

- Optimizer: Adam.
- Learning rate: `0.001`.
- Loss function: Binary Cross-Entropy Loss.
- Data augmentation: Mixup on pairs of raw audio clips, such that the second element is always an environment sample (labeled 1) and the mixed audio takes the label of the first element. This puts more variance into the environment class and harder samples into the chainsaw class.
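A minimal numpy sketch of this mixup scheme; only the pairing and labeling rules come from the text, while the Beta-distributed mixing coefficient and its dominance constraint are assumptions (the notebook may use a fixed ratio):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(first: np.ndarray, env: np.ndarray, first_label: int,
          alpha: float = 0.2) -> tuple[np.ndarray, int]:
    """Blend a clip with an environment clip (label 1); keep the first clip's label."""
    lam = rng.beta(alpha, alpha)  # mixing coefficient (assumed Beta-distributed)
    lam = max(lam, 1.0 - lam)     # keep the labeled clip dominant (assumption)
    return lam * first + (1.0 - lam) * env, first_label

# A chainsaw clip (label 0) mixed with background noise stays labeled chainsaw,
# yielding harder chainsaw samples; env/env pairs add variance to the env class.
mixed, label = mixup(rng.standard_normal(36_000), rng.standard_normal(36_000),
                     first_label=0)
```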
**Training notebook**: [CNN Training Notebook](/spaces/kangourous/submission-audio-task/blob/main/notebooks/CNN_training.ipynb)

@@ -111,7 +112,8 @@ Environmental impact is tracked using CodeCarbon, measuring:

This tracking helps establish a baseline for the environmental impact of model deployment and inference.
## Limitations

- The model can misclassify other motor sounds (for example, a motocross engine mistaken for a chainsaw).
- The model is optimized to run on specific hardware (an NVIDIA GPU).
## Ethical Considerations

- Environmental impact is tracked to promote awareness of AI's carbon footprint.