Hugging Face
Luca Di Liello (lucadiliello)
7 followers · 5 following
https://lucadiliello.github.io
lucadiliello
AI & ML interests
Applied Scientist II in Amazon AGI
Recent Activity
Reacted to lysandre's post with 🚀 · 8 days ago
We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez! v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025. Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!
Reacted to Norod78's post with 🔥 · 26 days ago
Multilingual Tokenization Showdown: Analyzing 12 LLM Tokenizers Across 204 Languages.

First, I created a dataset with Wikipedia's "Cat" article text in 272 languages: https://huggingface.co/datasets/Norod78/WikiCat-Multilingual

For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "characters per token" and "words per token" ratios. The higher these ratios are, the more information each token represents on average for that language (perhaps allowing an LLM to learn more per parameter if trained on a dataset in that language).

You can see a slideshow summary of the results here: https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html

I hope I interpreted the results correctly; I've made the code available on GitHub, so you can re-create the raw results JSONL with this repo: https://github.com/Norod/wikicat-tokenizer-eval

Post on X: https://x.com/Norod78/status/1984366900550266999
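The per-language metric the post describes (tokenize the text, then divide character and word counts by token count) can be sketched as below. `toy_tokenize` is a hypothetical stand-in for a real LLM tokenizer; the actual analysis uses 12 real tokenizers, which are not reproduced here.

```python
# Sketch of the "characters per token" / "words per token" computation
# described in the post. `toy_tokenize` is a hypothetical stand-in for
# a real LLM tokenizer.

def toy_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: split on whitespace, then chunk each
    word into pieces of at most 4 characters."""
    return [word[i:i + 4]
            for word in text.split()
            for i in range(0, len(word), 4)]

def ratios(text: str, tokenize) -> tuple[float, float]:
    """Return (characters per token, words per token) for `text`."""
    tokens = tokenize(text)
    return len(text) / len(tokens), len(text.split()) / len(tokens)

# Higher ratios mean each token carries more text on average:
chars_per_tok, words_per_tok = ratios("the cat sat on the mat", toy_tokenize)
```

In the post's setup, the same text (per language) is run through each real tokenizer in turn, so the ratios can be compared across tokenizers and across languages.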
Reacted to Norod78's post with 👍 · 26 days ago (same post as above)
Organizations
None yet
lucadiliello's activity
New activity in lucadiliello/STORIES about 1 year ago:
"Hi Luca, what's the source of this dataset?" · #1 opened about 1 year ago by liyucheng (3 replies)
New activity in mistral-community/pixtral-12b about 1 year ago:
"Ask for guidance on batch inference" · #15 opened about 1 year ago by nguyen-brat (5 replies)
New activity in lucadiliello/BLEURT-20 almost 2 years ago:
"Error tokenization" · #2 opened almost 2 years ago by explorista (1 reply)
New activity in dandelin/vilt-b32-finetuned-coco almost 2 years ago:
"Batch inference" · #1 opened over 2 years ago by luckylight (2 replies)
New activity in lucadiliello/bart-small about 2 years ago:
"Request: DOI" · #3 opened about 2 years ago by arshiyak (1 reply)
New activity in mistralai/Mistral-7B-v0.1 about 2 years ago:
"Which Mistral datacenter was used for training?" · #25 opened about 2 years ago by niko32 (2 replies)
New activity in lucadiliello/newsqa over 2 years ago:
"Does lucadiliello/newsqa have any unanswerable questions?" · #2 opened over 2 years ago by MaggiePai (1 reply)
New activity in tiiuae/falcon-40b over 2 years ago:
"In addition to 'text-generation', can falcon be used for other tasks like summarization, QA, etc.?" · #37 opened over 2 years ago by VS9205 (👍 1 · 3 replies)
New activity in lucadiliello/bart-small over 2 years ago:
"Adding `safetensors` variant of this model" · #1 opened over 2 years ago by SFconvertbot
New activity in microsoft/bloom-deepspeed-inference-int8 over 2 years ago:
"Can I load these weights into a model using 8 GPUs?" · #2 opened about 3 years ago by bournezz (2 replies)