
MathyAIwithMike
This episode dives into the "Mixture-of-Recursions" (MoR) paper, exploring how it improves large language model efficiency. MoR combines parameter efficiency with adaptive computation by dynamically adjusting each token's 'thinking' depth: simpler tokens pass through the shared layer stack fewer times, while harder tokens receive additional recursion steps. Coupled with smart KV-cache management, this cuts memory and compute costs, speeding up both training and inference. The discussion covers the two routing strategies (expert-choice vs. token-choice) and the paper's KV-cache handling, highlighting MoR's potential to reach better performance with fewer resources.
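
To make the routing idea concrete, here is a minimal sketch of expert-choice routing over a shared recursive block, under simplifying assumptions: toy dimensions, a single linear layer as the router, sigmoid gating to keep routing differentiable, and no KV-cache handling. The names (`SharedBlock`, `MoRSketch`, `keep_frac`) are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One transformer-style block whose weights are reused at every recursion step."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class MoRSketch(nn.Module):
    """Applies the shared block up to `max_depth` times; at each step an
    expert-choice router keeps only the top-k tokens for further recursion."""
    def __init__(self, d_model: int = 64, max_depth: int = 3, keep_frac: float = 0.5):
        super().__init__()
        self.block = SharedBlock(d_model)    # parameters shared across all depths
        self.router = nn.Linear(d_model, 1)  # scores each token's need for more compute
        self.max_depth = max_depth
        self.keep_frac = keep_frac

    def forward(self, x):
        batch, seq, _ = x.shape
        active = torch.ones(batch, seq, dtype=torch.bool, device=x.device)
        rows = torch.arange(batch, device=x.device).unsqueeze(1)
        for _ in range(self.max_depth):
            scores = self.router(x).squeeze(-1)        # (batch, seq)
            k = max(1, int(self.keep_frac * seq))
            topk = scores.topk(k, dim=-1).indices      # expert-choice: pick top-k tokens
            chosen = torch.zeros_like(active)
            chosen[rows, topk] = True
            active = active & chosen                   # a token stays dropped once it exits
            updated = self.block(x)
            # gate the update by the router score so routing remains differentiable
            gate = torch.sigmoid(scores).unsqueeze(-1)
            x = torch.where(active.unsqueeze(-1), x + gate * (updated - x), x)
        return x

if __name__ == "__main__":
    model = MoRSketch()
    tokens = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
    print(model(tokens).shape)       # torch.Size([2, 16, 64])
```

Token-choice routing would instead assign each token its full recursion depth up front from a single router call, trading the balanced per-step compute of expert-choice for a simpler, causally safer decision.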