PrediBench: Testing AI models on prediction markets • By charles-azam and 1 other • Sep 24 • 5
ScreenSuite - The most comprehensive evaluation suite for GUI Agents! • Jun 6 • 54
Introducing smolagents: simple agents that write actions in code. • Dec 31, 2024 • 1.14k
Expert Support case study: Bolstering a RAG app with LLM-as-a-Judge • Oct 28, 2024 • 29