Tempo14's Collections
Attention

Selective Attention Improves Transformer • arXiv:2410.02703 • 24 upvotes
Differential Transformer • arXiv:2410.05258 • 179 upvotes
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention • arXiv:2410.05076 • 8 upvotes
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs • arXiv:2410.13276 • 29 upvotes
Star Attention: Efficient LLM Inference over Long Sequences • arXiv:2411.17116 • 55 upvotes
KV Shifting Attention Enhances Language Modeling • arXiv:2411.19574 • 9 upvotes
Entropy-Guided Attention for Private LLMs • arXiv:2501.03489 • 14 upvotes
Not All Language Model Features Are Linear • arXiv:2405.14860 • 41 upvotes
Your Transformer is Secretly Linear • arXiv:2405.12250 • 158 upvotes
MiniMax-01: Scaling Foundation Models with Lightning Attention • arXiv:2501.08313 • 298 upvotes
Tensor Product Attention Is All You Need • arXiv:2501.06425 • 89 upvotes
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models • arXiv:2501.13629 • 48 upvotes
TransMLA: Multi-head Latent Attention Is All You Need • arXiv:2502.07864 • 58 upvotes
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention • arXiv:2502.11089 • 165 upvotes
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information • arXiv:2502.14258 • 26 upvotes
How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads • arXiv:2505.15865 • 4 upvotes
Learning to Skip the Middle Layers of Transformers • arXiv:2506.21103 • 18 upvotes
Limitations of Normalization in Attention Mechanism • arXiv:2508.17821 • 7 upvotes
Native Hybrid Attention for Efficient Sequence Modeling • arXiv:2510.07019 • 16 upvotes
Attention Sinks in Diffusion Language Models • arXiv:2510.15731 • 47 upvotes
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning • arXiv:2510.19338 • 108 upvotes
arXiv:2510.23052 • 28 upvotes
Kimi Linear: An Expressive, Efficient Attention Architecture • arXiv:2510.26692 • 85 upvotes