Hugging Face
Luca Di Liello (lucadiliello)
7 followers · 5 following
https://lucadiliello.github.io
lucadiliello
AI & ML interests
Applied Scientist II in Amazon AGI
Recent Activity
Reacted to lysandre's post with 🚀 · 8 days ago
We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez! v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025. Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!
Reacted to Norod78's post with 🔥 · 26 days ago
Multilingual Tokenization Showdown: Analyzing 12 LLM Tokenizers Across 204 Languages.

First, I created a dataset with Wikipedia's "Cat" article text in 272 languages: https://huggingface.co/datasets/Norod78/WikiCat-Multilingual

For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "characters per token" and "words per token" ratios. The higher these ratios are, the more information each token represents on average for that language (perhaps allowing an LLM to learn more per parameter if trained on a dataset in that language).

You can see a slideshow summary of the results here: https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html

I hope I interpreted the results correctly; I've made the code available on GitHub, so you can re-create the raw results JSONL with this repo: https://github.com/Norod/wikicat-tokenizer-eval

Post on X: https://x.com/Norod78/status/1984366900550266999
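The per-language metric the post describes (tokenize the text, then divide character and word counts by token count) can be sketched as below. `toy_tokenize` is a hypothetical stand-in for a real LLM tokenizer; the actual analysis uses 12 real tokenizers, which are not reproduced here.

```python
# Sketch of the "characters per token" / "words per token" computation
# described in the post. `toy_tokenize` is a hypothetical stand-in for
# a real LLM tokenizer.

def toy_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: split on whitespace, then chunk each
    word into pieces of at most 4 characters."""
    return [word[i:i + 4]
            for word in text.split()
            for i in range(0, len(word), 4)]

def ratios(text: str, tokenize) -> tuple[float, float]:
    """Return (characters per token, words per token) for `text`."""
    tokens = tokenize(text)
    return len(text) / len(tokens), len(text.split()) / len(tokens)

# Higher ratios mean each token carries more text on average:
chars_per_tok, words_per_tok = ratios("the cat sat on the mat", toy_tokenize)
```

In the post's setup, the same text (per language) is run through each real tokenizer in turn, so the ratios can be compared across tokenizers and across languages.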
Reacted to Norod78's post with 👍 · 26 days ago (same post as above)
Organizations
None yet
lucadiliello's activity
New activity in lucadiliello/STORIES about 1 year ago:
"Hi Luca, what's the source of this dataset?" · #1 opened about 1 year ago by liyucheng (3 replies)
New activity in mistral-community/pixtral-12b about 1 year ago:
"Ask for guidance on batch inference" · #15 opened about 1 year ago by nguyen-brat (5 replies)
New activity in lucadiliello/BLEURT-20 almost 2 years ago:
"Error tokenization" · #2 opened almost 2 years ago by explorista (1 reply)
New activity in dandelin/vilt-b32-finetuned-coco almost 2 years ago:
"Batch inference" · #1 opened over 2 years ago by luckylight (2 replies)
New activity in lucadiliello/bart-small about 2 years ago:
"Request: DOI" · #3 opened about 2 years ago by arshiyak (1 reply)
New activity in mistralai/Mistral-7B-v0.1 about 2 years ago:
"Which Mistral datacenter was used for training?" · #25 opened about 2 years ago by niko32 (2 replies)
New activity in lucadiliello/newsqa over 2 years ago:
"Does lucadiliello/newsqa have any unanswerable questions?" · #2 opened over 2 years ago by MaggiePai (1 reply)
New activity in tiiuae/falcon-40b over 2 years ago:
"In addition to 'text-generation', can falcon be used for other tasks like summarization, QA, etc.?" · #37 opened over 2 years ago by VS9205 (👍 1 · 3 replies)
New activity in lucadiliello/bart-small over 2 years ago:
"Adding `safetensors` variant of this model" · #1 opened over 2 years ago by SFconvertbot
New activity in microsoft/bloom-deepspeed-inference-int8 over 2 years ago:
"Can I load these weights into a model using 8 GPUs?" · #2 opened about 3 years ago by bournezz (2 replies)