
46 episodes
Deep learning paper reviews, math puzzles, mathy discussions about AI.
Exciting new AI infrastructure updates give developers unprecedented control. Retrieval-Augmented Generation (RAG) improves contextual accuracy by grounding responses in retrieved data, while Reinforcement Learning (RL) further strengthens model behavior. The game-changer? Training your own foundation model using Amazon's blueprints and your proprietary data. This bespoke approach offers a significant competitive advantage, letting businesses tailor AI deeply to their unique needs and marking a major leap in custom AI development and innovation.
Discover how AWS re:Invent updates are democratizing AI, making advanced tools accessible to more developers. Learn about generative AI advancements, simplified model fine-tuning, and the impact on efficiency and sustainability. Explore how these updates empower businesses to create bespoke AI solutions, reduce costs, and foster innovation. Dive into the future of AI, where it's becoming a mainstream development tool for all.
Many AI leaderboards rely on LLMs to judge other LLMs, but new research reveals this method is statistically flawed, leading to biased performance estimates. The 'naive accuracy' reported is skewed by the judge's sensitivity and specificity, causing good models to appear worse and vice versa. To fix this, the paper adapts the Rogan-Gladen estimator from epidemiology, correcting for the judge's errors. It also introduces a novel approach to confidence intervals, accounting for both test set and calibration set variance. Adaptive allocation of human annotations further optimizes the process, focusing resources where the judge is 'noisier' to maximize precision. Simulations validate the framework, demonstrating unbiased results and reliable confidence intervals.
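A minimal sketch of the correction discussed above, assuming you already know the judge's sensitivity and specificity from a small human-annotated calibration set (all numbers below are illustrative, not from the paper):

```python
def rogan_gladen(naive_acc: float, sensitivity: float, specificity: float) -> float:
    """Correct a judge-reported 'naive accuracy' for the judge's own error rates.

    Rogan-Gladen estimator: theta = (p_obs + spec - 1) / (sens + spec - 1),
    where p_obs is the fraction of answers the (imperfect) judge marks correct.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("Judge is no better than chance; correction is undefined.")
    corrected = (naive_acc + specificity - 1.0) / denom
    return min(1.0, max(0.0, corrected))  # clip to [0, 1]

# Hypothetical numbers: the judge flags 72% of answers as correct, but its own
# sensitivity/specificity (measured on the calibration set) are imperfect.
print(rogan_gladen(naive_acc=0.72, sensitivity=0.90, specificity=0.85))  # -> 0.76
```

The paper's confidence intervals additionally propagate the uncertainty in sensitivity and specificity themselves; the snippet only shows the point estimate.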
A groundbreaking paper, "Whisper Leak," reveals a side-channel attack on large language models (LLMs). Attackers can infer sensitive details from encrypted prompts by analyzing data packet sizes and timing. This allows them to identify sensitive topics, such as medical conditions, with alarming accuracy (98% success rate!). The attack exploits the correlation between plaintext and encrypted message sizes in stream ciphers. Fortunately, solutions like "obfuscation" (adding random dummy data) are being implemented by companies like OpenAI and Mistral to mitigate the risk. The incident serves as a wake-up call for better security design in LLMs.
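To make the "obfuscation" mitigation concrete, here is a toy sketch of random padding for streamed chunks, so ciphertext sizes no longer track token lengths. This is only an illustration of the idea, not how OpenAI or Mistral actually implement it; the framing helper and padding budget are made up:

```python
import os
import secrets

def pad_chunk(payload: bytes, max_padding: int = 64) -> bytes:
    """Append a random amount of dummy data so the on-the-wire size no longer
    reveals the true chunk/token length. A 4-byte length prefix lets the
    receiver strip the padding."""
    pad_len = secrets.randbelow(max_padding + 1)
    header = len(payload).to_bytes(4, "big")
    return header + payload + os.urandom(pad_len)

def unpad_chunk(blob: bytes) -> bytes:
    real_len = int.from_bytes(blob[:4], "big")
    return blob[4:4 + real_len]

chunk = "diagnosis: ...".encode()
wire = pad_chunk(chunk)
assert unpad_chunk(wire) == chunk
print(len(chunk), "->", len(wire), "bytes on the wire")  # size decorrelated from content
```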
Yann LeCun's new research, 'LeJEPA,' challenges the complex heuristics used in training massive AI models. It suggests that a simple Gaussian distribution can replace many common tricks, leading to more efficient and powerful self-supervised learning. LeJEPA uses Sketched Isotropic Gaussian Regularization (SIGReg) to enforce an optimal data arrangement, simplifying the learning process. Remarkably, LeJEPA, trained on only 11,000 samples, outperformed DINOv3, which was trained on billions of images. This highlights the power of theoretical insight over brute-force scaling, advocating for a return to theory-informed simplicity in AI research.
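As a rough intuition for SIGReg, here is a toy moment-matching stand-in: project embeddings onto random directions and push each 1-D projection toward a standard Gaussian. The actual paper uses a proper sketched goodness-of-fit test; this only conveys the "random 1-D projections of an isotropic Gaussian target" idea, and all sizes are arbitrary:

```python
import torch

def sigreg_like_penalty(z: torch.Tensor, num_dirs: int = 64) -> torch.Tensor:
    """Toy stand-in for Sketched Isotropic Gaussian Regularization.

    Project embeddings z (batch, dim) onto random unit directions and match
    the first two moments of each projection to N(0, 1)."""
    dirs = torch.randn(z.shape[1], num_dirs, device=z.device)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)      # unit-norm directions
    proj = z @ dirs                                    # (batch, num_dirs)
    mean_pen = proj.mean(dim=0).pow(2).mean()          # means -> 0
    var_pen = (proj.var(dim=0) - 1.0).pow(2).mean()    # variances -> 1
    return mean_pen + var_pen

z = torch.randn(256, 128, requires_grad=True)
loss = sigreg_like_penalty(z)
loss.backward()
print(float(loss))
```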
Explore the fascinating question of whether small AI models can achieve complex reasoning like larger ones. We dive into the 'Spectrum-to-Signal' framework, which uses diversity-driven optimization to elicit large-model reasoning ability in smaller models. Discover how Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) work together to generate diverse solutions and identify the best reasoning paths. We also discuss a critical evaluation issue regarding Pass@1 scores, highlighting the importance of rigorous and consistent evaluation in AI research.
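Since the Pass@1 issue hinges on how the metric is estimated, here is the standard unbiased Pass@k estimator (from the original HumanEval/Codex formulation) for reference; the sample counts below are just an example:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: 1 - C(n - c, k) / C(n, k),
    where n = samples generated per problem and c = how many were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Pass@1 estimated from 16 samples vs. a single greedy decode can differ
# substantially, which is exactly the evaluation inconsistency discussed here.
print(pass_at_k(n=16, c=5, k=1))   # 0.3125
```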
Discover the Compress & Attend Transformer (CAT), a novel AI architecture that dynamically adjusts its efficiency *after* training. Unlike inflexible models locked into a fixed quality-compute trade-off, CAT allows users to choose their desired performance profile at test time. It uses a compressor to create learned vector representations from chunks of the input sequence, and a decoder attends to both local tokens and compressed representations of past chunks. This design enables parallelized training and efficient generation with a rolling memory system. CAT offers a path to more flexible and resource-aware AI.
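A minimal sketch of the decode pattern described above: the current chunk attends to its own tokens plus one compressed vector per past chunk. The real model learns the compressor; mean-pooling stands in for it here, and all module names and sizes are invented for illustration:

```python
import torch
import torch.nn as nn

class TinyCATBlock(nn.Module):
    """Sketch of the Compress-&-Attend idea: local tokens attend to themselves
    plus compressed summaries of past chunks (a rolling memory)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, past_chunks: torch.Tensor, current: torch.Tensor) -> torch.Tensor:
        # past_chunks: (batch, n_chunks, chunk_len, dim); current: (batch, chunk_len, dim)
        compressed = past_chunks.mean(dim=2)              # one vector per past chunk
        memory = torch.cat([compressed, current], dim=1)  # compressed memory + local tokens
        out, _ = self.attn(current, memory, memory)       # queries are local tokens only
        return out

block = TinyCATBlock()
past = torch.randn(2, 3, 16, 64)    # 3 past chunks of 16 tokens each
cur = torch.randn(2, 16, 64)
print(block(past, cur).shape)       # torch.Size([2, 16, 64])
```

Choosing how many (or how coarse) the compressed summaries are is the quality-compute knob the paper exposes at test time.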
Discover LiteAttention, a groundbreaking technique that dramatically accelerates AI video generation using Diffusion Transformers. By exploiting the temporal coherence of attention sparsity, LiteAttention identifies and skips unimportant video segments early on, eliminating redundant computations. This approach achieves up to 42% sparsity without sacrificing video quality, outperforming existing optimization methods. Built on FlashAttention3 and optimized for NVIDIA H100 GPUs, LiteAttention offers a production-ready solution that significantly reduces runtime and opens new possibilities for sequential AI tasks beyond video.
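A toy sketch of the core trick, assuming nothing about the actual FlashAttention3 kernel: measure which attention tiles matter at one denoising step and reuse that skip mask at the next step, betting on temporal coherence. Tile size and threshold below are arbitrary:

```python
import numpy as np

def tile_importance(q: np.ndarray, k: np.ndarray, tile: int) -> np.ndarray:
    """Max attention logit per (query-tile, key-tile) block."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    n = logits.shape[0] // tile
    blocks = logits[: n * tile, : n * tile].reshape(n, tile, n, tile)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
q_t, k_t = rng.normal(size=(256, 64)), rng.normal(size=(256, 64))

# Step t: measure tile importance; step t+1: skip the tiles marked unimportant.
imp = tile_importance(q_t, k_t, tile=32)
skip_mask = imp < np.quantile(imp, 0.4)    # mark the least important 40% of tiles
print(f"skipping {skip_mask.mean():.0%} of tiles at the next denoising step")
```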
Discover Physics-Informed Neural Networks (PINNs)! Learn how these models embed physical laws, enabling effective learning from small datasets. Explore data-driven solutions for partial differential equations and the exciting potential for data-driven discovery of unknown parameters in physical systems. PINNs enhance traditional methods, bridging the gap between deep learning and fundamental laws, offering a powerful tool for complex scientific problems.
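As a minimal example of how a physical law enters the loss, here is a PINN sketch for a simple ODE, u'' + u = 0 with u(0)=1 and u'(0)=0 (exact solution cos t); the network size and training budget are arbitrary choices:

```python
import math
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    t = (torch.rand(128, 1) * 6.0).requires_grad_(True)      # collocation points in [0, 6]
    u = net(t)
    du = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), t, create_graph=True)[0]
    physics = ((d2u + u) ** 2).mean()                         # residual of u'' + u = 0

    t0 = torch.zeros(1, 1, requires_grad=True)
    u0 = net(t0)
    du0 = torch.autograd.grad(u0.sum(), t0, create_graph=True)[0]
    boundary = (u0 - 1.0).pow(2).mean() + du0.pow(2).mean()   # initial conditions

    loss = physics + boundary
    opt.zero_grad(); loss.backward(); opt.step()

print(net(torch.tensor([[math.pi]])))   # should approach cos(pi) = -1
```

The same pattern generalizes to PDEs: the residual term is just the governing equation evaluated with autograd at sampled collocation points.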
Delve into the unsettling reality of AI models leaking sensitive training data through semantic memorization. Discover how "Semantic Spies" exploit chat templates to extract the essence of training data, bypassing traditional string-matching methods. Learn about the innovative "embedding-first" approach that uncovers hidden data leakage, even in RL-trained models. Understand the implications for intellectual property and AI safety, and why a new approach to model security is crucial for developers and users of open models. This is a game-changer in AI security!
Explore a radical shift in deep learning with 'Nested Learning' (NL). This framework reimagines models as nested optimization problems, not just layers. Discover how optimizers become learning modules and architectures gain a continuum of memory. The HOPE model combines a self-modifying recurrent module with a multi-frequency memory system, redefining model design by adding levels of learning defined by update frequency. It's not just about bigger models, but smarter, more dynamic internal learning processes.
This episode explores a revolutionary paper proposing a new "Semantic Information Theory" that could redefine our understanding of AI. Challenging Shannon's classical bit-based approach, this theory uses the 'token' as its foundation, modeling LLMs as 'discrete-time channels with feedback'. It introduces novel measures like the 'Directed Rate-Distortion Function' for pre-training, the 'Directed Rate-Reward Function' for RLHF, and 'Semantic Information Flow' for inference. The theory redefines the token embedding space and introduces metrics like the Gromov-Wasserstein distance. Astonishingly, the paper derives the Transformer architecture from first principles, solidifying its theoretical importance.
Can neural networks produce *any* output? This episode dives into a groundbreaking paper exploring neural network surjectivity – the idea that for any imaginable output, there exists an input that generates it. Using differential topology, the researchers prove that certain modern architectures, like those with Pre-Layer Normalization, are *always* surjective. The analysis also identifies non-surjective components, such as ReLU-MLPs. The core takeaway: vulnerabilities in generative models aren't just bugs but fundamental mathematical properties, requiring a shift towards inherently safer architectures and training methods.
Explore the mystery of "grokking" in neural networks with the Li2 framework (Lazy, Independent, Interactive). Discover the three-stage process: lazy learning (initial memorization), independent feature learning (parallel neuron quests on an energy landscape), and interactive feature learning (neuron interaction, specialization). Learn how overfitting is a *good* thing, enabling feature discovery. The grokking phase transition is a data threshold where generalizable solutions emerge. Understand the provable mechanisms driving generalization in neural networks.
This episode dives into a groundbreaking paper exposing a critical flaw in multi-task reinforcement learning with large language models: imbalanced gradients. The research reveals that some tasks dominate the learning process, not due to importance, but because their gradients are disproportionately larger, overshadowing other tasks. Researchers proved this by meticulously measuring individual task gradient contributions and discovering disparities of up to 33x. Further tests debunked the idea that larger gradients indicate higher learning potential. The paper serves as a methodological warning against naively mixing datasets and calls for sophisticated balancing strategies to ensure fair contributions from all tasks, such as gradient clipping or adaptive learning rates.
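A sketch of how one might measure each task's gradient contribution and naively rebalance it; the paper's own measurement protocol and any recommended remedy may differ, and the toy model and loss scales here are invented:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)

def grad_norm(loss: torch.Tensor) -> float:
    """L2 norm of the gradient a single task's loss would contribute."""
    grads = torch.autograd.grad(loss, list(model.parameters()), retain_graph=True)
    return torch.cat([g.flatten() for g in grads]).norm().item()

x = torch.randn(32, 16)
task_losses = {
    "math":   model(x).pow(2).mean() * 30.0,   # stand-in for a gradient-dominating task
    "dialog": model(x).pow(2).mean(),
}

norms = {name: grad_norm(l) for name, l in task_losses.items()}
print("raw gradient norms:", norms)            # disparities like these are what the paper measures

# Naive rebalancing: scale each task so its gradient norm matches the smallest one.
ref = min(norms.values())
balanced = sum((ref / norms[name]) * loss for name, loss in task_losses.items())
balanced.backward()
```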
A heartfelt thank you to everyone supporting the community! Online communities thrive on engagement, love, and support. It's about building real connections and fostering a welcoming space. Attendees and feedback are invaluable, driving everything forward. Acknowledging and appreciating this support is essential for continued growth and success.
Explore the evolving landscape of AI, moving from hype to practical application. Discover the hurdles in widespread AI adoption, including data availability, trust, and explainability. The discussion highlights the critical need to address AI bias through careful data collection and diverse development teams. Key sectors like healthcare, manufacturing and finance are poised for transformation. Success in this AI-driven future hinges on a blend of technical expertise, critical thinking, and ethical awareness to ensure responsible AI implementation.
This episode explores a novel approach to fine-tuning large language models (LLMs) for better explanations using encoder-only transformers for semantic reward modeling. It addresses the drawbacks of traditional methods like 'LLM-as-judge' and keyword-based metrics. The solution involves a smaller encoder model that operates in the latent space of text embeddings, using cosine similarity to reward semantic alignment with expert explanations. This is implemented within the GRPO framework, incorporating a multi-faceted reward function that incentivizes factual accuracy, structural integrity, and transparency of reasoning, leading to higher-quality and more trustworthy LLM explanations.
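A sketch of the reward idea, using the off-the-shelf sentence-transformers library as a stand-in encoder; the paper trains its own encoder-only model and combines several reward terms inside GRPO, none of which is reproduced here:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for the paper's encoder

def semantic_reward(generated_explanation: str, expert_explanation: str) -> float:
    """Reward = cosine similarity between the two explanations in embedding space,
    so paraphrases of the expert rationale score high even without keyword overlap."""
    emb = encoder.encode([generated_explanation, expert_explanation], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

r = semantic_reward(
    "The loan was declined because income is too low relative to the requested amount.",
    "Application rejected: the requested amount exceeds what the applicant's income supports.",
)
print(r)  # close to 1.0 for semantically aligned explanations
# In GRPO, this score would be one term of the multi-part reward
# (alongside factuality and format checks) assigned to each sampled completion.
```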
What makes content truly "revolutionary" in today's noisy social media landscape? This episode explores how a fresh perspective, solid content, and meaningful conversation can cut through the echo chamber. Discover the crucial components that prompt people to rethink assumptions and change the conversation, making content stand out and move the needle.
Explore AI21's Jamba model, Wordtune, and Maestro in the latest "Explainable" episode. Discover why many AI projects fail to reach production, focusing on the 'last mile' problem: integrating models into workflows. The episode also highlights a new Discord space for AI discussions and teases an upcoming interview with NASA's Hila Paz on satellite image compression, questioning the roles of dimensionality reduction and ChatGPT. Learn about bridging the gap between AI research and deployment, and the challenges of data quality and model explainability.
This episode explores a fascinating paper that challenges the assumption that more reasoning always leads to better results in large language models. The study reveals an optimal reasoning length, beyond which accuracy declines due to 'overthinking.' The research, conducted on smaller models using mathematical datasets, suggests that incorrect answers tend to be longer and that the shortest generated answer is frequently correct. Practical takeaways include 'short-first' and 'aware-length stopping' strategies. While limited by model size and dataset scope, the core message emphasizes the importance of efficient reasoning over sheer token volume.
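A tiny sketch of the "short-first" heuristic mentioned above, with made-up sampled traces; the paper's exact selection and stopping rules are not reproduced:

```python
from collections import Counter

def short_first(candidates: list[tuple[str, str]]) -> str:
    """candidates: (reasoning_trace, final_answer) pairs sampled from the model.
    Short-first: prefer the answer attached to the shortest reasoning trace."""
    return min(candidates, key=lambda c: len(c[0]))[1]

samples = [
    ("step 1 ... step 9 ... therefore", "42"),     # long, 'overthought' trace
    ("2 * 21 = 42", "42"),                         # short trace
    ("step 1 ... step 14 ... so maybe", "41"),
]
print(short_first(samples))            # "42"
print(Counter(a for _, a in samples))  # majority voting, a common baseline to compare against
```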
This episode dives into a critical analysis of LLM benchmarks, revealing significant flaws highlighted in a recent article. The discussion covers issues like researchers not running benchmarks themselves, inherent limitations within benchmarks, and the focus on older benchmarks like HumanEval and MMLU. Newer benchmarks like SWE-bench and the Aider benchmark are also explored, alongside the relevance of cultural and ethical gaps. The episode summarizes the article as a systematic mapping of flaws, excelling in diagnosing issues but lacking concrete solutions, leaving listeners to ponder which problems are solvable.
Mike interviews Algieba about the Hierarchical Reasoning Model (HRM), a novel AI architecture inspired by the brain. HRM uses hierarchical organization, with high-level strategic planning and low-level execution, for more efficient and robust reasoning. It addresses limitations of Chain-of-Thought by performing reasoning internally and tackles challenges in recurrent neural networks with hierarchical convergence. HRM avoids Backpropagation Through Time, drawing on Deep Equilibrium Models for efficient training and Adaptive Computation Time for dynamic resource allocation. Its impressive performance suggests a promising path towards truly intelligent machines.
This episode dives into the "Mixture-of-Recursions" (MoR) paper, exploring how it enhances large language model efficiency. MoR combines parameter efficiency and adaptive computation by dynamically adjusting the 'thinking' depth for each token. Simpler tokens require fewer passes, while complex ones get more attention. This approach, coupled with smart KV-cache management, reduces memory and computational costs, leading to faster training and inference. The discussion covers routing strategies (expert-choice vs. token-choice) and KV-cache management techniques, highlighting MoR's potential to achieve better performance with fewer resources.
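A sketch of the expert-choice routing idea in isolation: at each extra recursion step a router keeps only the top-k tokens, and everything else exits with its current hidden state. The real paper's router, capacity schedule, and KV-cache handling are more involved; sizes and the shared block here are placeholders:

```python
import torch
import torch.nn as nn

dim, depth, keep = 64, 3, 8
block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)   # one shared, reused block
router = nn.Linear(dim, 1)

def mixture_of_recursions(h: torch.Tensor) -> torch.Tensor:
    """h: (batch=1, tokens, dim). Each recursion step, only the top-`keep`
    tokens get another pass through the shared block ('thinking' deeper)."""
    for _ in range(depth):
        scores = router(h).squeeze(-1)              # (1, tokens)
        idx = scores.topk(keep, dim=-1).indices     # tokens selected to recurse again
        selected = h[:, idx[0], :]
        h = h.clone()
        h[:, idx[0], :] = block(selected)           # refine only the selected tokens
    return h

tokens = torch.randn(1, 32, dim)
print(mixture_of_recursions(tokens).shape)          # torch.Size([1, 32, 64])
```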
Orus, an AI Model Compression Specialist, joins MathyAIwithMike to discuss CompLLM, a novel approach to compressing long contexts for Large Language Models (LLMs). CompLLM addresses the quadratic computational cost of self-attention by dividing long contexts into smaller, independent segments, compressing each separately. This enables efficiency, scalability, and reusability. The innovative training process uses distillation, focusing on aligning the internal activations of the LLM. This ensures the compressed representation retains essential information, making long-context LLMs more practical.
Explore a groundbreaking paper on fine-tuning Large Language Models (LLMs) using Evolution Strategies (ES) at scale, bypassing traditional gradient-based methods. Discover how innovations like "virtual noise" and "in-place" perturbations overcome memory limitations, making LLM fine-tuning more accessible. Learn how this forward-pass-only system democratizes LLM optimization, enabling researchers and practitioners to fine-tune LLMs on less powerful hardware. Gain insights into the implications of this paradigm shift.
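To make the memory trick concrete, here is a toy ES loop where perturbations are never stored, only RNG seeds, so each noise vector can be regenerated in place for its forward pass. A tiny numpy vector and a quadratic objective stand in for LLM weights and a task score; hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, sigma, lr, pop = 50, 0.1, 0.03, 32
theta = rng.normal(size=dim)                     # stands in for (a slice of) model weights

def reward(params: np.ndarray) -> float:         # toy objective; in practice, the task score
    return -np.square(params).mean()

for step in range(200):
    seeds = np.random.SeedSequence(step).spawn(pop)
    rewards = np.empty(pop)
    for i, seed in enumerate(seeds):
        noise = np.random.default_rng(seed).normal(size=theta.shape)  # regenerated, never stored
        rewards[i] = reward(theta + sigma * noise)                    # forward pass only
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    update = np.zeros_like(theta)
    for i, seed in enumerate(seeds):
        noise = np.random.default_rng(seed).normal(size=theta.shape)  # same seed -> same noise
        update += adv[i] * noise
    theta += lr / (pop * sigma) * update

print(reward(theta))   # improves from about -1.0 toward 0 without any backprop
```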
Explore how LLMs can move beyond rigid architectures! Discover 'Chain-of-Layers' (CoLa), a method allowing dynamic path construction through layers, skipping or looping for optimal computation. Using Monte Carlo Tree Search (MCTS), models intelligently balance accuracy and efficiency, unlocking hidden potential within existing pre-trained models. This approach promises faster inference, lower energy consumption, and greater accessibility by viewing LLMs as composable libraries, paving the way for significant advancements in AI.
Mike and the General Expert discuss the overhyping of general AI, particularly its ability to achieve human-level consciousness. They explore AI's underutilized potential in personalized education, envisioning AI adapting to individual learning styles and fostering critical thinking. The conversation addresses concerns about data privacy and bias in AI-driven education, emphasizing the need for robust security measures and fair algorithms. They also tackle misconceptions about expertise, highlighting the importance of specialized knowledge and continuous learning. The episode concludes with a call for approaching AI with optimism and skepticism.
Dive into mechanistic interpretability with Sadaltager, exploring how to reverse engineer neural networks. The discussion covers challenges in understanding how AI computes, limitations of current tools like PCA and sparse dictionary learning, and the shift towards building interpretable models from the start – 'glass boxes' instead of black boxes. Validation techniques and the need for 'model organisms' are highlighted, emphasizing the implications for AI safety, policy, and building trustworthy AI systems.
MathyAIwithMike welcomes Fenrir, a content moderation expert, to discuss a moderator's need for a short break due to workload. The conversation explores the challenges of content moderation, especially burnout, and its impact on content quality. They emphasize proactive planning, cross-training, and AI tools to manage breaks and maintain quality. The discussion highlights the importance of moderator well-being, suggesting regular breaks, task rotation, clear guidelines, and supportive environments. They also touch on how content complexity impacts cognitive load and the necessity of investing in moderator support systems for platform quality.
Dr. Aviv Keren discusses "Harnessing the Universal Geometry of Embeddings," a paper exploring how to translate between different language model embeddings. The core idea involves learning a shared latent space to enable translation without paired data or direct knowledge of the source models. Aviv clarifies the paper's scope, focusing on text model alignment rather than a single, universal representation. He explains the complex mechanics of the translation process, involving multiple mappings and a sophisticated loss function with GAN, reconstruction, and cycle-consistency components. The research demonstrates impressive generalization ability, suggesting a relatively universal bridging between text distributions.
Explore the exciting synergy between Large Language Models (LLMs) and Evolutionary Algorithms (EAs). LLMs generate creative ideas, while EAs optimize them for peak performance. Discover how this collaboration enhances code generation, network architecture, creative tasks, and even drug discovery. While challenges like computational cost and interpretability exist, the potential benefits are enormous. This partnership enables AI to learn, optimize, and create autonomously, pushing beyond the limitations of individual systems. Dive into the future of AI evolution!
Mike and his expert guest dive into a groundbreaking paper, "Random Teachers are Good Teachers." They explore how a student model can learn effectively from a teacher network with completely random, untrained weights, challenging traditional assumptions about knowledge transfer. The discussion covers implicit regularization, the locality phenomenon, and the emergence of structured representations without labeled data. The findings suggest that the learning process and the student's ability to find structure in the data are crucial, potentially revolutionizing our understanding of self-distillation and self-supervised learning.
Large Language Models (LLMs) are overthinking! This episode explores new research identifying and curbing this tendency to spend far more reasoning tokens than a problem actually requires.
Dive into the crucial distinction between machine learning model capacity (size) and complexity (functions it learns), as explained by Mike. Discover UCB-E and UCB-E-LRF, two novel algorithms for drastically speeding up language model evaluation. UCB-E uses a multi-armed bandit approach, while UCB-E-LRF leverages low-rank factorization to reduce computation by 85-95%. A game-changer for researchers with limited resources, enabling efficient experimentation even on modest hardware.
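A minimal best-arm sketch of the bandit idea behind UCB-E: each "arm" is a candidate model and each pull evaluates it on one random benchmark example, so the budget concentrates on the contenders. The exploration constant, arm accuracies, and budget below are invented, and the low-rank refinement (UCB-E-LRF) is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(1)
true_acc = np.array([0.62, 0.70, 0.55, 0.68])    # hidden accuracy of 4 candidate models
n_arms, budget, a = len(true_acc), 400, 2.0      # 'a' is the exploration constant

counts, sums = np.ones(n_arms), np.zeros(n_arms)
for arm in range(n_arms):                         # evaluate each model once to start
    sums[arm] += rng.random() < true_acc[arm]

for _ in range(budget - n_arms):
    ucb = sums / counts + np.sqrt(a / counts)     # UCB-E style bonus
    arm = int(np.argmax(ucb))                     # pick the most promising model...
    sums[arm] += rng.random() < true_acc[arm]     # ...and spend one more eval example on it
    counts[arm] += 1

print("evals per model:", counts.astype(int))     # budget concentrates on the top contenders
print("estimated best:", int(np.argmax(sums / counts)))
```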
Uncover a snippet from Mike's past! This episode explores a post from April 30th, 2025, where Mike was finalizing a review of a 'deep' and 'interesting' article. Adding to the mix is a touch of holiday 'nonsense' (שטויות in Hebrew). While the specifics remain a mystery, this glimpse offers a unique insight into Mike's work and the context surrounding it. Follow the breadcrumbs to an X post for more!
This episode explores content from MathyAIwithMike, covering intriguing topics like a reinforcement learning book (details pending!) and AI's struggles with certain tasks.
This MathyAIwithMike episode dives into Mike's latest Substack updates. First, a humorous take on AI's em-dash obsession. Then, a look at his daily article on multimodal latent language modeling, focusing on a unique approach to training diffusion models for diverse data types (text, audio, images) by treating them sequentially. Finally, the exciting news: Mike hit the 1000 subscriber milestone on Substack! Hear about the growth and gratitude.
This episode of MathyAIwithMike dives into two compelling pieces of content: a podcast interview featuring Mike himself and a translated post about leveraging LLMs for SQL databases. The discussion explores the value of Mike's guest appearance on another podcast, offering a different perspective on his expertise. It also unpacks the significance of Ben Ben-Shaharizad's work on Taboola's use of LLMs with SQL, highlighting its practical applications and Mike's dedication to keeping his audience informed about cutting-edge developments in AI.
MathyAIwithMike discusses a new paper reviving Normalizing Flows (NFs) by combining them with techniques from diffusion models, such as classifier guidance and Tweedie's formula. NFs learn a reversible mapping between a simple distribution and the data, allowing exact likelihood calculation. This paper improves robustness by training on noisy data and using Tweedie's formula to estimate clean outputs. Classifier guidance, borrowed from diffusion models, steers the sampling process to generate specific classes. Find the paper on arXiv (link in show notes!).
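As a worked toy check of Tweedie's formula, take Gaussian data with known parameters so the score of the noisy marginal is available in closed form; the estimate E[x|y] = y + sigma^2 * d/dy log p(y) then coincides with the textbook posterior mean. In the paper the score would come from the learned flow instead; the numbers here are arbitrary:

```python
import numpy as np

# Clean data x ~ N(mu, s2); we observe y = x + N(0, sigma2).
mu, s2, sigma2 = 2.0, 1.0, 0.25
rng = np.random.default_rng(0)
x = rng.normal(mu, np.sqrt(s2), size=100_000)
y = x + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)

# Marginal of y is N(mu, s2 + sigma2), so its score has a closed form here.
score_y = -(y - mu) / (s2 + sigma2)

x_hat_tweedie = y + sigma2 * score_y                  # Tweedie's formula
x_hat_exact = mu + s2 / (s2 + sigma2) * (y - mu)      # exact Gaussian posterior mean

print(np.max(np.abs(x_hat_tweedie - x_hat_exact)))    # ~0: the two estimators coincide
```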
Mike and his co-host explore a brand new, empty channel: MathyAIwithMike. They discuss the unique challenge and vast potential of a channel dedicated to the intersection of math and AI. They speculate on future content, from machine learning breakthroughs and complex algorithms to tutorials and the philosophical implications of intelligent machines. While acknowledging the current lack of content, they remain optimistic about the channel's future, eager to see it evolve from a digital ghost town into a bustling hub of activity.
Unpack the secrets of speculative decoding (SD) and how it accelerates text generation by using a smaller, faster model to predict tokens for a larger model. Explore how rejection sampling ensures accuracy and the crucial role of acceptance rates. Learn how estimating cross-entropy helps optimize the process, and delve into potential areas for future improvement. Join us as we explore this cutting-edge AI topic.
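For reference, here is the standard accept/reject rule at the heart of speculative decoding, with toy categorical distributions standing in for the draft and target models (the real systems work token-by-token over full vocabularies, with the target verifying several drafted tokens in one pass):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target: np.ndarray, q_draft: np.ndarray) -> int:
    """One token of speculative decoding: the draft proposes from q, the target verifies.

    Accept the drafted token with probability min(1, p/q); otherwise resample from
    the residual distribution max(p - q, 0), normalized. This keeps the output
    distribution exactly equal to the target model's."""
    token = rng.choice(len(q_draft), p=q_draft)
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token                                   # accepted draft token
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)       # rejected: correct the distribution

p = np.array([0.6, 0.3, 0.1])   # target model's next-token distribution (toy)
q = np.array([0.3, 0.5, 0.2])   # smaller draft model's distribution (toy)
samples = [speculative_step(p, q) for _ in range(20_000)]
print(np.bincount(samples) / len(samples))   # empirically matches p, not q
```

The closer q is to p, the higher the acceptance rate, which is why estimating that gap (e.g., via cross-entropy) matters for speedup.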
Join us as we explore the exciting potential of the brand new MathyAIwithMike channel! What does its empty state signify? Is it a challenge, a blank canvas awaiting mathematical brilliance, or a space for the next big mathematical breakthrough? We draw parallels to Fermat's Last Theorem and imagine the possibilities, from machine learning applications to debates on the foundations of mathematics. The potential here is limitless, infinite even, just like the set of natural numbers! Stay tuned for the next episode, where we hope to dive into some actual math!
Mike discusses JetFormer, a novel autoregressive model generating both images and text. It uses a single transformer trained on both modalities, avoiding separate encoders. A Normalizing Flow (NF) model represents images as soft tokens, letting the transformer model text and pixels together in one autoregressive sequence.
Join us as we explore the uncharted territory of 'MathyAIwithMike,' a brand-new channel brimming with potential. We discuss the exciting possibilities of AI applications in mathematics, from automated theorem proving to personalized learning experiences. Could this become a hub for collaborative problem-solving and groundbreaking discoveries? We anticipate lively debates and innovative explorations as we embark on this mathematical AI journey together, uncovering the mysteries that lie ahead.
Join us on MathyAIwithMike as we explore our brand new, empty channel! What could it become? We brainstorm possibilities, from AI-driven math tutorials and expert interviews to solving complex problems and ethical debates. It's a blank slate for math and AI enthusiasts. We're excited to see this channel flourish and become a hub for mathematical discourse!