The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15 • 7
CST5: Data Augmentation for Code-Switched Semantic Parsing Paper • 2211.07514 • Published Nov 14, 2022 • 1
MT5 release Collection The MT5 release follows the T5 family, but is pretrained on multilingual data. The updated UMT5 models are pretrained on an updated corpus. • 10 items • Updated Jul 10 • 22
Article Introducing the Polish ASR Leaderboard (PAL) and Benchmark Intended Grouping of Open Speech (BIGOS) Corpora Jul 10, 2024 • 4
The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP Paper • 2510.05644 • Published Oct 7 • 23
Article Automated Discovery of High-Performance GPU Kernels with OpenEvolve Jun 27 • 23
H-Net Collection The family of hierarchical networks (H-Nets) from https://arxiv.org/abs/2507.07955 • 8 items • Updated Jul 11 • 20
OmniGEC Collection This is a collection of multilingual silver-standard datasets and models for the task of Grammatical Error Correction (GEC). • 9 items • Updated Sep 19 • 8
Article Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM Mar 12 • 471
Gemma 3 Collection All versions of Google's new multimodal models including QAT in 1B, 4B, 12B, and 27B sizes. In GGUF, dynamic 4-bit and 16-bit formats. • 55 items • Updated 23 days ago • 93
MT Quality Estimation Collection Models for reference-free quality estimation of machine translation • 10 items • Updated Jan 29 • 4
GTE models Collection General Text Embedding Models Released by Tongyi Lab of Alibaba Group • 21 items • Updated Jan 21 • 32
OWLS: Scaling Laws for Speech Recognition and Translation Collection 🦉 A suite of Whisper-style models from 250M to 18B parameters. Trained on up to 360K hours of data. 16k sampling rate. • 8 items • Updated May 3 • 7
Article From Llasa to Llasagna: Finetuning LLaSA to generates Italian speech and other languages Feb 11 • 33
NeMo Curator - Classifier Models Collection Classifier models that can be used in NeMo Curator for labelling/filtering datasets. • 11 items • Updated 1 day ago • 24
Ukrainian Text-to-Speech datasets Collection Five voices: Mykyta, Oleksa, Lada, Kateryna or Tetiana • 6 items • Updated Feb 26 • 4
Crimean Tatar Text-to-Speech datasets Collection Three voices: Abibullah, Sevil, or Arslan • 4 items • Updated May 27 • 2