kaizuberbuehler's Collections

LM Inference
updated

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 625 upvotes

BitNet: Scaling 1-bit Transformers for Large Language Models
Paper • 2310.11453 • Published • 105 upvotes

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 107 upvotes

TransformerFAM: Feedback attention is working memory
Paper • 2404.09173 • Published • 43 upvotes

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Paper • 2404.08801 • Published • 66 upvotes

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Paper • 2404.07143 • Published • 111 upvotes

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Paper • 2404.05892 • Published • 39 upvotes

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Paper • 2404.05726 • Published • 23 upvotes

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Paper • 2402.13753 • Published • 116 upvotes

How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
Paper • 2404.14047 • Published • 45 upvotes

SnapKV: LLM Knows What You are Looking for Before Generation
Paper • 2404.14469 • Published • 27 upvotes

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 80 upvotes

Octopus v4: Graph of language models
Paper • 2404.19296 • Published • 118 upvotes

Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Paper • 2404.18911 • Published • 30 upvotes

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Paper • 2405.00732 • Published • 121 upvotes

Imp: Highly Capable Large Multimodal Models for Mobile Devices
Paper • 2405.12107 • Published • 29 upvotes

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Paper • 2405.21060 • Published • 67 upvotes

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Paper • 2406.07522 • Published • 40 upvotes

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
Paper • 2406.07394 • Published • 29 upvotes

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Paper • 2407.21787 • Published • 13 upvotes

ThinK: Thinner Key Cache by Query-Driven Pruning
Paper • 2407.21018 • Published • 32 upvotes

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
Paper • 2409.11055 • Published • 17 upvotes

Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
Paper • 2409.17422 • Published • 25 upvotes

Thinking LLMs: General Instruction Following with Thought Generation
Paper • 2410.10630 • Published • 21 upvotes

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Paper • 2409.17066 • Published • 28 upvotes

Efficiently Serving LLM Reasoning Programs with Certaindex
Paper • 2412.20993 • Published • 37 upvotes

Token-Budget-Aware LLM Reasoning
Paper • 2412.18547 • Published • 46 upvotes

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
Paper • 2411.19943 • Published • 63 upvotes

Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
Paper • 2411.17691 • Published • 13 upvotes

Star Attention: Efficient LLM Inference over Long Sequences
Paper • 2411.17116 • Published • 55 upvotes

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Paper • 2411.10958 • Published • 56 upvotes

BitNet a4.8: 4-bit Activations for 1-bit LLMs
Paper • 2411.04965 • Published • 69 upvotes

1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
Paper • 2410.16144 • Published • 5 upvotes

FlatQuant: Flatness Matters for LLM Quantization
Paper • 2410.09426 • Published • 16 upvotes

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
Paper • 2410.05265 • Published • 33 upvotes

Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 89 upvotes

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Paper • 2501.12895 • Published • 61 upvotes

Qwen2.5-1M Technical Report
Paper • 2501.15383 • Published • 72 upvotes

Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Paper • 2501.19324 • Published • 39 upvotes

DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
Paper • 2502.01142 • Published • 24 upvotes

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Paper • 2502.05003 • Published • 43 upvotes

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
Paper • 2502.04416 • Published • 12 upvotes

Paper • 2502.06786 • Published • 32 upvotes

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Paper • 2502.05609 • Published • 19 upvotes

TransMLA: Multi-head Latent Attention Is All You Need
Paper • 2502.07864 • Published • 58 upvotes

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
Paper • 2502.08910 • Published • 148 upvotes

Diverse Inference and Verification for Advanced Reasoning
Paper • 2502.09955 • Published • 18 upvotes

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Paper • 2502.11089 • Published • 165 upvotes

LightThinker: Thinking Step-by-Step Compression
Paper • 2502.15589 • Published • 31 upvotes

MoBA: Mixture of Block Attention for Long-Context LLMs
Paper • 2502.13189 • Published • 17 upvotes

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper • 2502.18137 • Published • 58 upvotes

SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
Paper • 2503.07605 • Published • 68 upvotes

TTRL: Test-Time Reinforcement Learning
Paper • 2504.16084 • Published • 120 upvotes

φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
Paper • 2503.13288 • Published • 51 upvotes

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Paper • 2503.16257 • Published • 25 upvotes

FFN Fusion: Rethinking Sequential Computation in Large Language Models
Paper • 2503.18908 • Published • 19 upvotes

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
Paper • 2503.19950 • Published • 12 upvotes

AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
Paper • 2503.19693 • Published • 76 upvotes

Efficient Inference for Large Reasoning Models: A Survey
Paper • 2503.23077 • Published • 46 upvotes

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
Paper • 2503.21614 • Published • 42 upvotes

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Paper • 2504.06261 • Published • 110 upvotes

C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
Paper • 2504.07964 • Published • 61 upvotes

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Paper • 2504.04823 • Published • 31 upvotes

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
Paper • 2504.05897 • Published • 20 upvotes

Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Paper • 2503.20533 • Published • 12 upvotes

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
Paper • 2504.08791 • Published • 136 upvotes

BitNet b1.58 2B4T Technical Report
Paper • 2504.12285 • Published • 75 upvotes

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
Paper • 2504.11651 • Published • 31 upvotes

Efficient Reasoning Models: A Survey
Paper • 2504.10903 • Published • 20 upvotes

Sleep-time Compute: Beyond Inference Scaling at Test-time
Paper • 2504.13171 • Published • 15 upvotes

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
Paper • 2504.07891 • Published • 5 upvotes

EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Paper • 2504.15133 • Published • 25 upvotes