- Ultra-Sparse Memory Network
  Paper • 2411.12364 • Published • 23
- Hyper-Connections
  Paper • 2409.19606 • Published • 24
- Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
  Paper • 2411.03884 • Published • 28
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
  Paper • 2501.16975 • Published • 31
Collections including paper arxiv:2501.16975 
						
					
- distilbert/distilbert-base-uncased-finetuned-sst-2-english
  Text Classification • 67M • Updated • 6.41M • 842
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
  Paper • 2501.16975 • Published • 31
- SmolVLM: Redefining small and efficient multimodal models
  Paper • 2504.05299 • Published • 200
- VILA^2: VILA Augmented VILA
  Paper • 2407.17453 • Published • 41
- Octopus v4: Graph of language models
  Paper • 2404.19296 • Published • 118
- Octo-planner: On-device Language Model for Planner-Action Agents
  Paper • 2406.18082 • Published • 48
- Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
  Paper • 2408.15518 • Published • 42
- Compression Represents Intelligence Linearly
  Paper • 2404.09937 • Published • 28
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
  Paper • 2404.06395 • Published • 24
- Long-context LLMs Struggle with Long In-context Learning
  Paper • 2404.02060 • Published • 37
- Are large language models superhuman chemists?
  Paper • 2404.01475 • Published • 19
- Evolving Deeper LLM Thinking
  Paper • 2501.09891 • Published • 115
- PaSa: An LLM Agent for Comprehensive Academic Paper Search
  Paper • 2501.10120 • Published • 52
- Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
  Paper • 2501.09775 • Published • 33
- ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
  Paper • 2501.10132 • Published • 22
- Video Creation by Demonstration
  Paper • 2412.09551 • Published • 9
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  Paper • 2412.07589 • Published • 48
- Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
  Paper • 2412.06531 • Published • 72
- APOLLO: SGD-like Memory, AdamW-level Performance
  Paper • 2412.05270 • Published • 38
- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 57
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 52
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 44
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 63
- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
  Paper • 2404.08801 • Published • 66
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
  Paper • 2404.07839 • Published • 47
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
  Paper • 2404.05892 • Published • 39
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 146
- Rho-1: Not All Tokens Are What You Need
  Paper • 2404.07965 • Published • 93
- VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
  Paper • 2404.10667 • Published • 22
- Instruction-tuned Language Models are Better Knowledge Learners
  Paper • 2402.12847 • Published • 26
- DoRA: Weight-Decomposed Low-Rank Adaptation
  Paper • 2402.09353 • Published • 29