All HF Hub posts

danielhanchen 
posted an update 2 days ago
black-yt 
posted an update 1 day ago
Hey all — our ResearchClawBench leaderboard just updated 🔥

We let AI do real science: 40 tasks across 10 disciplines, compared to human papers. Hard example? 🏔️ Glacier mass change — AI must integrate 233 datasets from 35 teams, 4 methods, reproduce 6542±387 Gt ice loss vs IPCC. No toy problems.

Latest leaderboard (2026-06-09) 📊:
Agents: 🥇 Claude Code 21.5 (50 = match human), $5.3; 🥈 EvoScientist 18.8, $4.1; 🥉 Codex CLI 18.4, just $2.0
LLMs+Harness: 🥇 Claude-Opus-4.8 21.1, $4.0; 🥈 Claude-Opus-4.7 20.7; 🥉 MiniMax-M3 19.8, only $0.45; Qwen3.7-Max 18.7, $0.42, 11min 💥

Claude still king, but MiniMax/Qwen/DeepSeek are crazy cheap and competitive. Expensive isn't always better.
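To make the cost comparison concrete, here is a rough sketch that divides each listed score by its listed cost. The "points per dollar" ratio is an illustration added here, not a metric defined by ResearchClawBench, and it assumes the dollar figures are directly comparable across entries. Claude-Opus-4.7 is left out because no cost is given for it above.

```python
# Scores and costs copied from the leaderboard lines above.
# Entries without a listed cost are omitted.
LEADERBOARD = {
    "Claude Code": (21.5, 5.3),
    "EvoScientist": (18.8, 4.1),
    "Codex CLI": (18.4, 2.0),
    "Claude-Opus-4.8": (21.1, 4.0),
    "MiniMax-M3": (19.8, 0.45),
    "Qwen3.7-Max": (18.7, 0.42),
}


def score_per_dollar(score: float, cost_usd: float) -> float:
    """Illustrative ratio only; not part of the benchmark itself."""
    return score / cost_usd


def rank_by_value(board):
    """Return (name, points per dollar) pairs, best value first."""
    rows = [(name, score_per_dollar(s, c)) for name, (s, c) in board.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, value in rank_by_value(LEADERBOARD):
        print(f"{name:16s} {value:6.1f} points per dollar")
```

A ratio like this ignores the absolute score gap, so it complements the leaderboard ranking instead of replacing it.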

📎 Code & star: https://github.com/InternScience/ResearchClawBench
🏠 Website: https://internscience.github.io/ResearchClawBench-Home/
🤗 Upvote paper: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research (2606.07591)
sergiopaniego 
posted an update 2 days ago
OpenEnv has a new home: github.com/huggingface/OpenEnv

Starting today, it's coordinated by a committee that includes Meta-PyTorch, Reflection, Unsloth, Modal, Prime Intellect, Nvidia, Mercor, Fleet AI, and Hugging Face

frontier labs train their models and their harnesses together. Claude knows Claude Code. GPT-5.5 knows Codex. that's not an accident, it's training. open-source models deserve the same magic, but pulling that off requires infrastructure that belongs to everyone, not one lab

OpenEnv is that layer. one api, any harness, any trainer, any environment

Rewards and training loops stay in TRL, Unsloth, wherever you already work. OpenEnv is the socket they all plug into

Get involved!

Full announcement: https://huggingface.co/blog/openenv-agentic-rl
eabdullin 
posted an update about 22 hours ago
I’m doing a PhD in AI, which sounds impressive until you realize it mostly means I spend three years trying to make a computer say something slightly less stupid than it said yesterday.

People hear "AI researcher" and they think I’m building the future. No. I’m in a basement at 2 a.m. Googling, "CUDA error what the f**k does this mean."

And the worst part about AI research now is compute. You don’t even ask, "Is this idea good?" anymore. You ask, "Can I afford for this idea to be wrong?"

My advisor comes to me one day and says, "I think we should fine-tune our own language model."

I said, "Professor, with what money? I’m a PhD student. I have two bank accounts: checking and emotionally checking."

He goes, "Don’t worry. We have compute."

Now, in academia, "don’t worry" is never the beginning of a good sentence.

I said, "What do you mean we have compute?"

He said, "My friend knows the cluster admin. He can get us on the GPUs."

I said, "Okay… what do we have to do?"

He goes, "Nothing crazy. Just be very grateful in the acknowledgements."

I said, "How grateful?"

He said, "Maybe put him as co-author."

I said, "Co-author? Are we using the cluster, or is the cluster using us?"

Because at that point, that’s not a favor. That’s academic child support.

So I go to the server room, and the cluster admin walks up to me and goes, "So you’re the NLP student."

And in my head I’m like, "No, tonight you’re the principal investigator. You’re the provider. I’m just a little token waiting to be attended to."

Because whoever controls the GPUs controls the relationship. That’s lab romance.

He starts setting things up, and I’m trying to act casual, but I don’t understand any of the numbers he’s saying.

He’s like, "Yeah, I can probably give you four H100s for the weekend."

I’m nodding like, "Mmm. Four. Weekend. H. One hundred. Absolutely."

Inside I’m like, "Is that good? Is that prison time? Why did he say it like he was offering me organs?"

[Continue in comments...]
RiverRider 
posted an update 2 days ago
This is not a pipe.

Everyone is born a semiotician, no one is born knowing it. Go easy on yourself (and me) for not understanding this yet.

Computational semiotics is now an empirical study.

LLMs are not proto-minds. They are verifiably semiotic infrastructure.

This repository (or attached demo) can show you, in real time, how any frozen model (Qwen for demo) arrives at any answer by reading its latent states directly during generation.

Any questions?

RiverRider/srt-introspect

Repo:

https://github.com/space-bacon/SRT

Grok insists my intro is condescending … This is certainly true, as is the statement in my condescended opinion. I expect heat for it, so let’s think this through?
ovi054 
posted an update 2 days ago
Color Grade Transfer LoRA ⚡

I trained a LoRA that transfers the color grade from a target image directly to a source image. No manual color grading needed. The model is fine-tuned on Qwen Image Edit 2511.

👉 Try it now: build-small-hackathon/Color-Grade-Transfer
Jiaqi-hkust 
posted an update 2 days ago
pbhappliedsystems 
posted an update 3 days ago
🚀 **New flagship dataset — and an argument about what a dataset card should be.**

Most synthetic datasets on the Hub ship row counts, a license, and little else — pipeline opaque, rejection criteria unstated, compliance unaudited. We published the opposite.

**SynthEval Cloud — Regulated-Domain Synthetic Instruction Dataset**
👉 pbhappliedsystems/syntheval-cloud-regulated-instruct-1k

**1,116** quality-gated instruction records across **7 regulated domains** (medical, legal, GDPR, privacy, education, e-commerce, transport). Every record cleared a documented cascade, not a vibe check:

- 🧪 **Dual-signal hallucination gate** — rejects only when embedding cosine *and* keyword-overlap both fail; a low score alone never rejects.
- 🔒 **Layered PII masking + independent leak audit** — a separate over-reporting scanner found **0.0% residual leak** across all 1,116 records.
- 📊 **Whole-corpus evaluation, not a sample** — MATTR **0.769**, mean cosine **0.73**, **0%** near-duplicates, **96.9%** yield.
- 🧾 **The 36 rejections ship too**, each tagged with its failing gate. Removal at the gate is the product; we show our work.
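The dual-signal gate in the list above can be sketched as follows. This is a minimal illustration, not the SynthEval implementation: the threshold values, the function names, and the word-level overlap measure are all assumptions made for the example, and the embedding vectors are taken as given.

```python
import math
import re

# Hypothetical thresholds; the post does not state the real values.
COSINE_MIN = 0.5
OVERLAP_MIN = 0.2


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(source: str, answer: str) -> float:
    """Fraction of distinct answer words that also occur in the source."""
    src = set(re.findall(r"[a-z0-9]+", source.lower()))
    ans = set(re.findall(r"[a-z0-9]+", answer.lower()))
    return len(src & ans) / len(ans) if ans else 0.0


def gate_rejects(cos_score: float, overlap_score: float) -> bool:
    """Reject only when BOTH signals fall below their thresholds."""
    return cos_score < COSINE_MIN and overlap_score < OVERLAP_MIN
```

Requiring both signals to fail trades some recall for fewer false rejections: a paraphrase with low keyword overlap but a high embedding cosine still passes.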

Every number on the card is a field in the evaluation_report.json shipped beside the data — full methodology + provenance (Mistral-Nemo AWQ W4A16 · vLLM 0.8.5.post1 · Modal A10G).
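For readers unfamiliar with MATTR (moving-average type-token ratio), one of the numbers reported above, here is a small sketch of the standard definition: the mean of the type-token ratio over every sliding window of fixed length. The window size of 50 and the pre-tokenized input are assumptions for the example; the settings behind the card's 0.769 may differ.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over sliding windows of `window` tokens.

    Falls back to the plain type-token ratio when the text is shorter than
    one window. The default window size is an assumption for this sketch.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = "the gate rejects a record only when both signals fail".split()
    print(round(mattr(sample, window=5), 3))
```

Because every window has the same length, the score is less sensitive to document length than a plain type-token ratio.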

One release from **SynthEval**: Studio (local GPU) + Cloud (Modal+vLLM), proving quality parity across substrates.

📄 Whitepaper: https://pbhappliedsystems.com/SynthEval_Studio_and_Cloud_Quality-Gated_Synthetic_Data_Generation.pdf
🔎 Overview: https://pbhappliedsystems.com/synthetic-data.html

**CC BY 4.0** — commercial use welcome, just credit it. Need defensible synthetic data at scale? Let's talk.

— Patrick Hill, PBH Applied Systems
dippatel1994 
posted an update 4 days ago
To make revising LLM architectures and training methods faster, I created a deck of 180 visual flashcards. It started as a personal hobby but slowly became a cheat code for reviewing LLM concepts before technical interviews. People love it!

Swipe through these samples, and if you want to grab the full set or follow the project, the repo is here: https://github.com/llmsresearch/llm-flashcards.
RiverRider 
posted an update about 14 hours ago
ATTENTION: The SRT-Introspect framework moves past surface-level output commentary by supplying real-time natural language interpretations of a model’s latent states. These verbalizations are validated, not merely asserted, through a round-trip reconstruction procedure. Natural language descriptions derived from hidden activations are passed through an encoder that reconstructs the corresponding activation vector; the recovered vector closely approximates the original. High reconstruction fidelity indicates that the verbalizations encode genuine structural information about the internal state rather than offering plausible but ungrounded speculation.
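The fidelity check at the end of that round trip can be sketched as below. This only illustrates how an original activation vector and its reconstruction might be compared; the choice of cosine similarity and relative L2 error, and the function names, are assumptions, and the verbalizer and encoder from the SRT repository are not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relative_error(original, reconstructed):
    """L2 norm of the difference divided by the L2 norm of the original."""
    diff = math.sqrt(sum((o - r) ** 2 for o, r in zip(original, reconstructed)))
    base = math.sqrt(sum(o * o for o in original))
    return diff / base if base else float("inf")


def round_trip_fidelity(original, reconstructed):
    """Summarize how closely the reconstruction matches the original vector."""
    return {
        "cosine": cosine_similarity(original, reconstructed),
        "relative_error": relative_error(original, reconstructed),
    }


if __name__ == "__main__":
    # Toy vectors standing in for a hidden state and its reconstruction.
    print(round_trip_fidelity([0.2, -1.0, 0.5], [0.25, -0.9, 0.45]))
```

A cosine near 1 together with a small relative error would support the claim that the verbalization carries structural information; either number alone can mislead, since cosine ignores magnitude.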

This validated introspection converts what has often remained a theoretical or post-hoc exercise into a practical instrument for auditing model behavior, diagnosing failure modes, and providing high-level semantic guidance—all without modifying the base model or incurring the costs of fine-tuning. Because the mechanism operates on frozen configurations, it can be applied to production systems where any change to weights or architecture is undesirable. Thank you for your attention.

Run a trace: RiverRider/srt-introspect

Repo: https://github.com/space-bacon/SRT