igormolybog's Collections

evals
		
	updated
			
 
				
				
	
	
	
			
Holistic Evaluation of Text-To-Image Models
Paper • 2311.04287 • Published • 16
			
 
	
	 
	
	
	
			
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Paper • 2311.07463 • Published • 15
			
 
	
	 
	
	
	
			
Trusted Source Alignment in Large Language Models
Paper • 2311.06697 • Published • 12
			
 
	
	 
	
	
	
			
DiLoCo: Distributed Low-Communication Training of Language Models
Paper • 2311.08105 • Published • 16
			
 
	
	 
	
	
	
			
Instruction-Following Evaluation for Large Language Models
Paper • 2311.07911 • Published • 22
			
 
	
	 
	
	
	
			
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper • 2311.12022 • Published • 33
			
 
	
	 
	
	
	
			
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 238
			
 
	
	 
	
	
	
			
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Paper • 2312.04724 • Published • 21
			
 
	
	 
	
	
	
			
Evaluation of Large Language Models for Decision Making in Autonomous Driving
Paper • 2312.06351 • Published • 6
			
 
	
	 
	
	
	
			
PromptBench: A Unified Library for Evaluation of Large Language Models
Paper • 2312.07910 • Published • 19
			
 
	
	 
	
	
	
			
TrustLLM: Trustworthiness in Large Language Models
Paper • 2401.05561 • Published • 69
			
 
	
	 
	
	
	
			
OLMo: Accelerating the Science of Language Models
Paper • 2402.00838 • Published • 84
			
 
	
	 
	
	
	
			
Can Large Language Models Understand Context?
Paper • 2402.00858 • Published • 23
			
 
	
	 
	
	
	
			
Design2Code: How Far Are We From Automating Front-End Engineering?
Paper • 2403.03163 • Published • 97
			
 
	
	 
	
	
	
			
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 53
			
 
	
	 
	
	
	
			
Long-context LLMs Struggle with Long In-context Learning
Paper • 2404.02060 • Published • 37
			
 
	
	 
	
	
	
			
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Paper • 2410.05363 • Published • 45
			
 
	
	 
	
	
	
			
LongGenBench: Long-context Generation Benchmark
Paper • 2410.04199 • Published • 22
			
 
	
	 
	
	
	
			
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
Paper • 2410.05254 • Published • 84