Update README.md
README.md
CHANGED
@@ -1,12 +1,157 @@
---
title: Siswati-English Linguistic Translation Tool
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.33.2
app_file: app.py
pinned: false
license: apache-2.0
tags:
- translation
- siswati
- linguistics
- african-languages
- nlp
- research
- corpus-analysis
- bantu-languages
- m2m100
- multilingual
---

# 🔬 Siswati-English Linguistic Translation Tool

An AI-powered translation system with built-in linguistic analysis, designed for linguists, researchers, and language documentation projects working with Siswati and English.

## Features

### Translation Capabilities
- **Bidirectional Translation**: High-quality English ↔ Siswati translation
- **Advanced Model Architecture**: Built on M2M100 transformer models
- **Batch Processing**: Process multiple texts simultaneously for corpus analysis
- **Real-time Analysis**: Instant linguistic metrics and feature detection

### Linguistic Analysis
- **Morphological Complexity**: Word length, sentence structure analysis
- **Lexical Diversity**: Vocabulary richness measurements
- **Language-Specific Features**: Siswati agglutination, click consonants, tone markers
- **Translation Ratios**: Comparative analysis between source and target languages
- **Statistical Metrics**: Character count, word count, sentence segmentation

### 🔬 Research Tools
- **Translation History**: Track and analyze translation patterns over time
- **CSV Export**: Research-ready data export for statistical analysis
- **Corpus Management**: Batch processing for linguistic corpora
- **Performance Metrics**: Processing time and efficiency tracking

## 🗣️ About Siswati

**Siswati** (also known as **Swati** or **Swazi**) is a Bantu language spoken by approximately 2.3 million people, primarily in:
- 🇸🇿 **Eswatini** (Kingdom of Eswatini) - Official language
- 🇿🇦 **South Africa** - One of 12 official languages

### Linguistic Features
- **Language Family**: Niger-Congo → Bantu → Southeast Bantu
- **Script**: Latin alphabet
- **Characteristics**: Agglutinative morphology, click consonants, tonal
- **ISO Code**: ss (ISO 639-1), ssw (ISO 639-3)

## Model Information

This tool uses transformer models developed by the **Data Science for Social Impact Research Group**:

- **English → Siswati**: `dsfsi/en-ss-m2m100-combo`
- **Siswati → English**: `dsfsi/ss-en-m2m100-combo`

Both models are based on Meta's M2M100 architecture and fine-tuned specifically for Siswati-English translation pairs.
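
As a rough illustration of how the two models listed above might be selected by translation direction, here is a minimal sketch. It assumes the identifiers resolve on the Hugging Face Hub, that the `transformers` library is installed, and that the fine-tuned tokenizers keep M2M100's `en` and `ss` language codes; `MODEL_IDS` and `load_translator` are hypothetical names, not taken from this Space's `app.py`.

```python
# Model identifiers as listed above. That they resolve on the Hugging Face Hub
# is an assumption of this sketch.
MODEL_IDS = {
    "en-ss": "dsfsi/en-ss-m2m100-combo",
    "ss-en": "dsfsi/ss-en-m2m100-combo",
}


def load_translator(direction: str):
    """Return a transformers translation pipeline for 'en-ss' or 'ss-en'."""
    if direction not in MODEL_IDS:
        raise ValueError(f"unknown direction: {direction!r}")
    # Imported lazily because transformers (plus a backend such as PyTorch)
    # is a heavy dependency.
    from transformers import pipeline

    src_lang, tgt_lang = direction.split("-")
    # Assumption: the fine-tuned tokenizers accept M2M100's "en"/"ss" codes.
    return pipeline(
        "translation",
        model=MODEL_IDS[direction],
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
```

Calling the returned pipeline on a string yields a list of dictionaries with a `translation_text` key. The first call downloads the model weights, so it needs network access.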

## 🎯 Use Cases

### For Linguists & Researchers
- **Language Documentation**: Analyze translation patterns and linguistic features
- **Corpus Studies**: Process large text collections with batch translation
- **Comparative Analysis**: Study morphological and syntactic differences
- **Quality Assessment**: Evaluate translation adequacy and fluency

### For Educators & Students
- **Language Learning**: Understand translation patterns and linguistic structures
- **Academic Research**: Export data for statistical analysis and publications
- **Computational Linguistics**: Study machine translation for low-resource languages

### For Community & Cultural Projects
- **Language Preservation**: Support Siswati language documentation efforts
- **Cultural Exchange**: Facilitate communication between English and Siswati speakers
- **Content Translation**: Assist in translating educational and cultural materials

## Getting Started

1. **Single Translation**: Enter text and select translation direction
2. **Batch Processing**: Upload `.txt` files or paste multiple lines for corpus analysis
3. **Analysis Export**: Use the research tools to export translation data as CSV
4. **Linguistic Study**: Explore the real-time analysis features for detailed insights
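
The batch and export steps above can be sketched in a few lines. This is an illustrative outline rather than the tool's actual implementation: `batch_to_csv`, its column names, and the `translate` callable are assumptions standing in for whatever the app uses internally.

```python
import csv
import io
from typing import Callable


def batch_to_csv(raw: str, translate: Callable[[str], str], direction: str) -> str:
    """Translate each non-empty line of `raw` and return the results as CSV text."""
    # One input text per line; blank lines are skipped (assumed batch format).
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Hypothetical column layout for the exported file.
    writer.writerow(["direction", "source", "translation"])
    for line in lines:
        writer.writerow([direction, line, translate(line)])
    return buffer.getvalue()
```

Passing the translation function in as a parameter keeps the sketch independent of any particular model, so it can be tried with a stand-in such as `str.upper` before a real translator is plugged in.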

## Linguistic Metrics Explained

### Text Complexity
- **Word Count**: Total number of words in the text
- **Character Count**: Total characters including spaces and punctuation
- **Sentence Count**: Number of sentences detected
- **Average Word Length**: Mean character length per word
- **Lexical Diversity**: Ratio of unique words to total words (vocabulary richness)
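
To make these definitions concrete, the following sketch computes all five metrics for a passage. The tokenisation and sentence-splitting rules are assumptions (the README does not specify them), and `text_metrics` is a hypothetical name.

```python
import re


def text_metrics(text: str) -> dict:
    """Compute the five text-complexity metrics for one passage."""
    # Words: runs of letters/digits, optionally joined by ' or - (assumed rule).
    words = re.findall(r"[^\W_]+(?:['-][^\W_]+)*", text.lower())
    # Sentences: non-empty chunks between ., ! or ? (assumed rule).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(text),  # includes spaces and punctuation
        "sentence_count": len(sentences),
        "avg_word_length": sum(map(len, words)) / word_count if word_count else 0.0,
        # Type-token ratio: unique words divided by total words.
        "lexical_diversity": len(set(words)) / word_count if word_count else 0.0,
    }
```

For `"The cat sat. The cat ran!"` this gives 6 words, 25 characters, 2 sentences, an average word length of 3.0, and a lexical diversity of 4/6, since "the" and "cat" each occur twice.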

### Translation Analysis
- **Word Ratio**: Target word count / Source word count
- **Character Ratio**: Target character count / Source character count
- **Processing Time**: Time taken for model inference
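
A minimal sketch of these three measurements follows. It assumes whitespace tokenisation for the word ratio, and `translation_ratios` and `timed` are hypothetical helper names rather than functions from the tool.

```python
import time
from typing import Callable, Tuple


def translation_ratios(source: str, target: str) -> dict:
    """Return target/source ratios for word and character counts."""
    src_words, tgt_words = source.split(), target.split()
    return {
        "word_ratio": len(tgt_words) / len(src_words) if src_words else 0.0,
        "char_ratio": len(target) / len(source) if source else 0.0,
    }


def timed(translate: Callable[[str], str], text: str) -> Tuple[str, float]:
    """Run `translate` on `text` and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = translate(text)
    return result, time.perf_counter() - start
```

A word ratio below 1 is what one would expect when translating from English into an agglutinative language, because several English words can map onto a single longer word.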

### Siswati-Specific Features
- **Agglutination Detection**: Flags words longer than 10 characters as potentially agglutinated
- **Click Consonants**: Count of the letters c, q, and x, which write click consonants
- **Tone Markers**: Detection of combining acute (U+0301) and grave (U+0300) accent marks
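
These heuristics are simple enough to sketch directly. The version below is an assumption about how they could be implemented, not the tool's own code; `siswati_features` is a hypothetical name. Note that counting the letters c, q, and x only approximates the number of click consonants.

```python
import unicodedata

CLICK_LETTERS = set("cqx")          # letters used to write clicks
TONE_MARKS = {"\u0301", "\u0300"}   # combining acute and grave accents


def siswati_features(text: str) -> dict:
    """Apply the three surface heuristics to one passage."""
    # NFD splits precomposed letters such as "á" into base letter + accent,
    # so both composed and decomposed input are counted the same way.
    decomposed = unicodedata.normalize("NFD", text)
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    return {
        # Length threshold from the README: more than 10 characters.
        "long_words": [w for w in words if len(w) > 10],
        "click_letters": sum(ch in CLICK_LETTERS for ch in text.lower()),
        "tone_marks": sum(ch in TONE_MARKS for ch in decomposed),
    }
```

For the arbitrary test string `"Siyakwemukela lapha, cela úhlale."` (chosen for illustration, not vetted Siswati), the sketch flags `Siyakwemukela` as a long word and counts one click letter and one tone mark.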

## Academic Usage

If you use this tool in your research, please cite the original models:

```bibtex
@misc{dsfsi-siswati-translation,
  title={Siswati-English Translation Models},
  author={Marivate, Vukosi and Lastrucci, Richard},
  year={2024},
  publisher={Data Science for Social Impact Research Group},
  url={https://github.com/dsfsi/}
}
```

## Related Resources

- **Model Repositories**: [En-Ss Model](https://github.com/dsfsi/en-ss-m2m100-combo) | [Ss-En Model](https://github.com/dsfsi/ss-en-m2m100-combo)
- **Research Group**: [DSFSI](https://dsfsi.github.io/)
- **Feedback**: [Research Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/viewform)

## Contributing

We welcome contributions from the linguistic and NLP communities! Areas of interest:
- Improving translation quality
- Adding more linguistic analysis features
- Expanding to other African languages
- Enhancing the user interface for research workflows

## License

This project is licensed under the Apache 2.0 License. The underlying models may have their own licensing terms; please check the individual model repositories.

## Supporting African Languages

This tool is part of a broader effort to support African language technology and computational linguistics research. By providing advanced NLP tools for Siswati, we aim to:

- Preserve and promote African languages in the digital age
- Support linguistic research and documentation
- Enable better communication across language barriers
- Contribute to the development of multilingual AI systems

---

**Built with ❤️ for the African NLP community**