walterShen's Collections

Code LMs Evaluation
- Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code (arXiv:2311.07989)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv:2310.06770)
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (arXiv:2401.03065)
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming (arXiv:2402.14261)
- CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code (arXiv:2302.05527)
- Copilot Refinement: Addressing Code Smells in Copilot-Generated Python Code (arXiv:2401.14176)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model (arXiv:2310.06266)
- TACO: Topics in Algorithmic COde generation dataset (arXiv:2312.14852)
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (arXiv:2310.11248)
- DevEval: Evaluating Code Generation in Practical Software Projects (arXiv:2401.06401)
- CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models (arXiv:2309.01940)
- Improving Natural Language Capability of Code Large Language Model (arXiv:2401.14242)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (arXiv:2305.01210)
- A Static Evaluation of Code Completion by Large Language Models (arXiv:2306.03203)
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems (arXiv:2306.03091)
- MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation (arXiv:2208.08227)
- Large Language Models Are State-of-the-Art Evaluators of Code Generation (arXiv:2304.14317)
- Textbooks Are All You Need II: phi-1.5 technical report (arXiv:2309.05463)
- Textbooks Are All You Need (arXiv:2306.11644)
- Evaluating Large Language Models Trained on Code (arXiv:2107.03374)
- Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv:2403.03163)
- Large Language Models Meet NL2Code: A Survey (arXiv:2212.09420)
- Large Language Models for Software Engineering: A Systematic Literature Review (arXiv:2308.10620)
- Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey (arXiv:2310.17903)
- A Survey on Pretrained Language Models for Neural Code Intelligence (arXiv:2212.10079)
- An Empirical Comparison of Pre-Trained Models of Source Code (arXiv:2302.04026)
- Towards an Understanding of Large Language Models in Software Engineering Tasks (arXiv:2308.11396)
- StarCoder 2 and The Stack v2: The Next Generation (arXiv:2402.19173)
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (arXiv:2401.14196)
- Unsupervised Evaluation of Code LLMs with Round-Trip Correctness (arXiv:2402.08699)
- Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities (arXiv:2311.16169)
- Magicoder: Source Code Is All You Need (arXiv:2312.02120)
- On the Effectiveness of Large Language Models in Domain-Specific Code Generation (arXiv:2312.01639)
- CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation (arXiv:2311.08588)
- Fusion-Eval: Integrating Evaluators with LLMs (arXiv:2311.09204)
- CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models (arXiv:2302.00288)
- Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain (arXiv:2310.14053)
- Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation (arXiv:2308.10335)
- CodeScore: Evaluating Code Generation by Learning Code Execution (arXiv:2301.09043)
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (arXiv:2306.14898)
- Language Models for Code Completion: A Practical Evaluation (arXiv:2402.16197)
- DevBench: A Comprehensive Benchmark for Software Development (arXiv:2403.08604)
- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models (arXiv:2404.03543)
- CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring (arXiv:2305.12050)
- Stable Code Technical Report (arXiv:2404.01226)
- CodeShell Technical Report (arXiv:2403.15747)
- A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond (arXiv:2403.14734)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations (arXiv:2307.02762)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535)
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (arXiv:2404.18796)