Interesting Papers

Papers I find notable. This is opinionated curation, not a comprehensive survey.

2024

KAN 2.0: Kolmogorov-Arnold Networks Meet Science

MIT

Ziming Liu, Pingchuan Ma, Yixuan Wang, Wojciech Matusik, Max Tegmark

47 citations

Paper: KAN 2.0: Kolmogorov-Arnold Networks Meet Science

GitHub repo: pykan

The paper frames a tension between connectionist deep learning and the symbolic structure of science, and proposes Kolmogorov–Arnold Networks (KANs) as a bridge. It organizes KANs around three scientific-discovery goals (finding relevant features, exposing modular structure, and recovering symbolic formulas) and emphasizes a bidirectional flow: injecting prior scientific knowledge into KANs, and extracting interpretable laws from trained networks. It highlights new tooling in the pykan ecosystem, including MultKAN with explicit multiplication nodes, kanpiler for compiling symbolic expressions into KANs, and a tree converter that turns KANs (or general networks) into tree graphs, and it demonstrates discovering conserved quantities, Lagrangians, symmetries, and constitutive relations in example physics settings.
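
To make the KAN/MultKAN idea concrete, here is a minimal, self-contained sketch rather than the pykan API: a learnable univariate function on every edge, sum nodes, and an explicit multiplication node in a MultKAN-style layer. The edge functions below are tiny per-edge MLPs instead of the spline parameterization the paper uses, and all sizes, class names, and the toy target are illustrative assumptions.

```python
# Sketch of the KAN idea (not pykan): every edge carries a learnable scalar
# function, node outputs sum the edge outputs, and a MultKAN-style layer adds
# an explicit multiplication node.
import torch
import torch.nn as nn


class EdgeFunctions(nn.Module):
    """One learnable scalar function phi_{j,i}(x_i) per (input i, output j) edge."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 8):
        super().__init__()
        # Small per-edge MLP parameters (a placeholder for the paper's splines).
        self.w1 = nn.Parameter(torch.randn(out_dim, in_dim, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(out_dim, in_dim, hidden))
        self.w2 = nn.Parameter(torch.randn(out_dim, in_dim, hidden) * 0.1)

    def forward(self, x):                        # x: (batch, in_dim)
        x = x[:, None, :, None]                  # (batch, 1, in_dim, 1)
        h = torch.tanh(x * self.w1 + self.b1)    # (batch, out_dim, in_dim, hidden)
        return (h * self.w2).sum(-1)             # phi values: (batch, out_dim, in_dim)


class KANLayer(nn.Module):
    """Sum node: y_j = sum_i phi_{j,i}(x_i)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.edges = EdgeFunctions(in_dim, out_dim)

    def forward(self, x):
        return self.edges(x).sum(-1)             # (batch, out_dim)


class MultKANLayer(nn.Module):
    """MultKAN-style node: multiply two summed subnodes instead of only adding."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.edges = EdgeFunctions(in_dim, 2 * out_dim)  # two subnodes per output

    def forward(self, x):
        s = self.edges(x).sum(-1)                # (batch, 2 * out_dim)
        a, b = s.chunk(2, dim=-1)
        return a * b                             # explicit multiplication node


if __name__ == "__main__":
    # Toy target with multiplicative structure: f(x1, x2) = x1 * sin(3 * x2).
    x = torch.rand(256, 2) * 2 - 1
    y = x[:, 0] * torch.sin(3 * x[:, 1])
    model = nn.Sequential(KANLayer(2, 4), MultKANLayer(4, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for step in range(200):
        loss = ((model(x).squeeze(-1) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final MSE: {loss.item():.4f}")
```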

2025

Continuous Thought Machines (CTM)

Sakana AI

Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones

35 citations

Paper: Continuous Thought Machines

GitHub repo: ctm

The Continuous Thought Machine (CTM) is a neural network architecture that reintroduces neural timing as a foundational element by integrating neuron-level temporal processing and neural synchronization. Unlike standard networks, which abstract away the dynamics of individual neurons, the CTM makes neural dynamics its core representation through two innovations: neuron-level temporal processing, where each neuron applies its own weight parameters to a history of incoming signals, and neural synchronization used directly as a latent representation. The CTM demonstrates strong performance across diverse tasks, including ImageNet-1K classification, 2D maze solving, sorting, parity computation, question answering, and reinforcement learning, and it naturally supports adaptive computation, stopping earlier on simpler inputs and continuing longer on harder ones.
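
A rough sketch of those two mechanisms, under simplifying assumptions (per-neuron linear weights over the history instead of the per-neuron MLPs in the paper, a plain linear synapse model, a dense synchronization matrix, and made-up sizes and class names), not the Sakana AI implementation:

```python
# Sketch of the two CTM ingredients: (1) each neuron owns private weights over a
# sliding window of its pre-activation history ("neuron-level temporal
# processing"), and (2) pairwise inner products of post-activation histories
# ("neural synchronization") form the latent used for the readout.
import torch
import torch.nn as nn


class ContinuousThoughtSketch(nn.Module):
    def __init__(self, in_dim, n_neurons=32, history=8, n_ticks=16, out_dim=10):
        super().__init__()
        self.history, self.n_ticks = history, n_ticks
        # "Synapse" model producing pre-activations from input + current state.
        self.synapse = nn.Linear(in_dim + n_neurons, n_neurons)
        # Per-neuron weights over the pre-activation history (unique per neuron).
        self.nlm_w = nn.Parameter(torch.randn(n_neurons, history) * 0.1)
        self.nlm_b = nn.Parameter(torch.zeros(n_neurons))
        # Readout from the flattened synchronization matrix.
        self.readout = nn.Linear(n_neurons * n_neurons, out_dim)

    def forward(self, x):                                    # x: (batch, in_dim)
        b, n = x.shape[0], self.nlm_w.shape[0]
        pre_hist = torch.zeros(b, n, self.history)           # pre-activation window
        post_hist = []                                       # post-activations per tick
        state = torch.zeros(b, n)
        for _ in range(self.n_ticks):                        # internal "thought" ticks
            pre = self.synapse(torch.cat([x, state], dim=-1))
            pre_hist = torch.cat([pre_hist[:, :, 1:], pre[:, :, None]], dim=-1)
            # Each neuron applies its own weights to its own history.
            state = torch.tanh((pre_hist * self.nlm_w).sum(-1) + self.nlm_b)
            post_hist.append(state)
        z = torch.stack(post_hist, dim=-1)                   # (batch, neurons, ticks)
        sync = torch.einsum("bit,bjt->bij", z, z) / self.n_ticks   # synchronization
        return self.readout(sync.flatten(1))


if __name__ == "__main__":
    model = ContinuousThoughtSketch(in_dim=20)
    logits = model(torch.randn(4, 20))
    print(logits.shape)   # torch.Size([4, 10])
```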

2026

Attention Residuals (AttnRes)

Kimi Team

Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Zhilin Yang, Yulun Du, Yuxin Wu, Xinyu Zhou

0 citations

Paper: Attention Residuals

Residual connections with PreNorm are standard in modern LLMs, but they sum every layer output with fixed unit weights. That uniform aggregation lets hidden states grow unchecked with depth and progressively dilutes each layer's contribution. Attention Residuals (AttnRes) replace fixed accumulation with softmax attention over prior layer outputs so each layer can mix earlier representations with learned, input-dependent weights. Block AttnRes groups layers into blocks and attends over block-level summaries to cut memory and communication cost while keeping most of the benefit; with cache-friendly pipeline communication it is framed as a practical drop-in for standard residuals. Scaling-law results are reported across sizes, and the authors integrate AttnRes into Kimi Linear (48B total / 3B active) on 1.4T tokens, reporting more stable magnitudes and gradients across depth and gains on downstream tasks.
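
A minimal sketch of the core mechanism as described above: per-token softmax attention over the stack of earlier layer outputs replaces the fixed unit-weight residual sum. The query source, projection sizes, and the stand-in sublayer are assumptions for illustration, not the Kimi Team implementation, and the Block AttnRes variant is not shown.

```python
# Sketch of the AttnRes idea: each block's input is an input-dependent,
# softmax-weighted mixture of the embeddings and all earlier blocks' outputs,
# instead of their fixed-unit-weight sum.
import torch
import torch.nn as nn


class AttnResBlock(nn.Module):
    def __init__(self, d_model, d_key=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Stand-in sublayer (a small MLP) for the usual attention/FFN block.
        self.layer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.q = nn.Linear(d_model, d_key)   # query from the most recent output
        self.k = nn.Linear(d_model, d_key)   # keys over all earlier outputs
        self.scale = d_key ** -0.5

    def forward(self, outputs):
        # outputs: list of (batch, seq, d_model) tensors, embeddings plus each
        # earlier block's output; attention over this depth axis replaces the
        # fixed residual accumulation.
        hist = torch.stack(outputs, dim=2)                      # (b, s, depth, d)
        q = self.q(outputs[-1])                                 # (b, s, d_key)
        k = self.k(hist)                                        # (b, s, depth, d_key)
        weights = torch.softmax(
            torch.einsum("bsd,bsld->bsl", q, k) * self.scale, dim=-1
        )                                                       # learned, input-dependent
        h = torch.einsum("bsl,bsld->bsd", weights, hist)        # mixed residual stream
        return self.layer(self.norm(h))                         # this block's output


if __name__ == "__main__":
    blocks = [AttnResBlock(128) for _ in range(4)]
    outputs = [torch.randn(2, 16, 128)]          # token embeddings
    for blk in blocks:
        outputs.append(blk(outputs))             # each block sees all prior outputs
    print(outputs[-1].shape)                     # torch.Size([2, 16, 128])
```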

2026

Mixture-of-Depths Attention (MoDA)

HUST VL Lab

Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang

0 citations

Paper: Mixture-of-Depths Attention

GitHub repo: MoDA

Deeper LLMs often show signal degradation: features from shallow layers get diluted by repeated residual updates and are hard to recover deeper in the stack. Mixture-of-depths attention (MoDA) lets each attention head attend both to the current layer's sequence KV pairs and to depth KV pairs from earlier layers, so representations from different depths stay in play. The authors describe a hardware-oriented kernel that tames non-contiguous memory access and report about 97.3% of FlashAttention-2 efficiency at 64K sequence length. On 1.5B-parameter models they report consistent gains over strong baselines: roughly 0.2 lower average perplexity on ten validation sets and about 2.11% higher average downstream performance on ten tasks, with roughly 3.7% FLOPs overhead. They also find MoDA works better with post-norm than with pre-norm in their setup.
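
A minimal single-head sketch of the idea, under assumptions (no causal mask, placeholder projections for the depth KVs, and none of the paper's fused-kernel optimizations): each query scores both the current layer's sequence keys and per-position keys built from earlier layers' hidden states, and a single softmax mixes the two sources.

```python
# Sketch of mixture-of-depths attention: queries attend jointly over the current
# layer's sequence KV pairs and extra "depth" KV pairs derived from the same
# position's hidden states at earlier layers.
import math
import torch
import torch.nn as nn


class MoDASelfAttentionSketch(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k_seq = nn.Linear(d_model, d_model)
        self.v_seq = nn.Linear(d_model, d_model)
        self.k_depth = nn.Linear(d_model, d_model)   # projects earlier-layer states
        self.v_depth = nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x, earlier_states):
        # x: (b, s, d) current-layer hidden states
        # earlier_states: (b, s, L, d) per-position states from L earlier layers
        q = self.q(x)                                              # (b, s, d)
        k = self.k_seq(x)                                          # sequence keys
        v = self.v_seq(x)
        kd = self.k_depth(earlier_states)                          # depth keys (b, s, L, d)
        vd = self.v_depth(earlier_states)

        seq_scores = torch.einsum("bqd,bkd->bqk", q, k) * self.scale      # (b, s, s)
        depth_scores = torch.einsum("bqd,bqld->bql", q, kd) * self.scale  # (b, s, L)
        # One softmax over both sequence positions and depth entries.
        attn = torch.softmax(torch.cat([seq_scores, depth_scores], -1), -1)
        a_seq, a_depth = attn.split([k.shape[1], kd.shape[2]], dim=-1)

        out = torch.einsum("bqk,bkd->bqd", a_seq, v)               # sequence contribution
        out = out + torch.einsum("bql,bqld->bqd", a_depth, vd)     # depth contribution
        return out


if __name__ == "__main__":
    mod = MoDASelfAttentionSketch(d_model=64)
    x = torch.randn(2, 10, 64)
    earlier = torch.randn(2, 10, 3, 64)      # states from 3 earlier layers
    print(mod(x, earlier).shape)             # torch.Size([2, 10, 64])
```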