Interesting Papers
Papers I find notable. An opinionated curation, not a comprehensive survey.

KAN 2.0: Kolmogorov-Arnold Networks Meet Science
MIT
47 citations
Paper: KAN 2.0: Kolmogorov-Arnold Networks Meet Science
GitHub repo: pykan
The paper frames a tension between connectionist deep learning and the symbolic structure of science, and proposes Kolmogorov–Arnold Networks (KANs) as a bridge. It organizes KANs around three scientific-discovery goals (finding relevant features, exposing modular structure, and recovering symbolic formulas) and a bidirectional flow: injecting prior scientific knowledge into KANs, and extracting interpretable laws from trained networks. It also introduces new tooling in the pykan ecosystem: MultKAN with explicit multiplication nodes, kanpiler to compile symbolic expressions into KANs, and a tree converter that turns KANs (or general networks) into tree graphs. The authors demonstrate discovering conserved quantities, Lagrangians, symmetries, and constitutive relations in example physics settings.
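To make the core KAN idea concrete, here is a minimal NumPy sketch, not the pykan implementation: every edge of a layer applies its own learnable univariate function (parameterized here with a fixed Gaussian RBF basis as a stand-in for pykan's B-splines), nodes sum their incoming edge outputs, and a MultKAN-style multiplication node combines two incoming sums by multiplying them. All names and shapes below are illustrative assumptions.

```python
import numpy as np

def rbf_basis(x, centers, width=0.5):
    # Evaluate Gaussian radial basis functions at x -> shape (..., n_basis)
    return np.exp(-((x[..., None] - centers) ** 2) / (2 * width ** 2))

class KANLayer:
    """One KAN layer: each edge i->j applies its own learnable 1-D function,
    written here as a weighted sum of fixed RBF bases (a simplification of
    the spline parameterization used in pykan)."""
    def __init__(self, d_in, d_out, n_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(-2.0, 2.0, n_basis)
        # coeffs[i, j, k]: coefficient of basis k on the edge function i -> j
        self.coeffs = 0.1 * rng.standard_normal((d_in, d_out, n_basis))

    def __call__(self, x):                        # x: (batch, d_in)
        phi = rbf_basis(x, self.centers)          # (batch, d_in, n_basis)
        # Each output node sums the univariate edge functions feeding it
        return np.einsum("bik,ijk->bj", phi, self.coeffs)

def mult_node(u, v):
    # MultKAN-style multiplication node: multiplies two incoming sums,
    # letting the network express products without deep addition stacks
    return u * v
```

Because each edge function is a linear combination of simple bases, a trained edge can later be matched against a library of symbolic primitives, which is the hook the paper uses for formula extraction.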

Continuous Thought Machines (CTM)
Sakana AI
35 citations
Paper: Continuous Thought Machines
GitHub repo: ctm
The Continuous Thought Machine (CTM) is a neural architecture that reintroduces neural timing as a foundational element by integrating neuron-level temporal processing and neural synchronization. Unlike standard networks that abstract away the complexity of individual neurons, the CTM makes neural dynamics its core representation through two innovations: neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals, and neural synchronization as a latent representation. The CTM performs well across diverse tasks, including ImageNet-1K classification, 2D maze solving, sorting, parity computation, question answering, and reinforcement learning, and it naturally supports adaptive computation: it can stop early on simpler inputs or keep processing on harder ones.
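The two mechanisms can be sketched in a few lines of NumPy. This is a deliberately simplified reading, not the authors' code: each neuron applies its own private weights to a sliding window of its recent history (the paper uses small per-neuron MLPs; a single linear map stands in here), and the synchronization latent is built from pairwise inner products of post-activation traces over internal "thought" ticks. Sizes and names are illustrative.

```python
import numpy as np

def neuron_level_update(history, W, b):
    """Each neuron applies its OWN weights to its own recent history
    (one linear map per neuron, simplifying the CTM's neuron-level models).
    history: (n_neurons, memory_len); W: (n_neurons, memory_len); b: (n_neurons,)."""
    return np.tanh(np.einsum("nm,nm->n", history, W) + b)

def synchronization(traces):
    """Pairwise synchronization over post-activation traces, the CTM-style
    latent used for readout. traces: (n, ticks) -> (n, n)."""
    return traces @ traces.T / traces.shape[1]

# Toy rollout over internal ticks
rng = np.random.default_rng(0)
n, m, ticks = 4, 3, 5
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
history = np.zeros((n, m))
trace = []
for _ in range(ticks):
    z = neuron_level_update(history, W, b)
    history = np.concatenate([history[:, 1:], z[:, None]], axis=1)  # slide window
    trace.append(z)
S = synchronization(np.stack(trace, axis=1))  # (n, n) synchronization latent
```

Because the synchronization matrix is computed fresh at every tick, a readout on it can be queried at any point in the rollout, which is what makes tick-by-tick adaptive halting natural in this design.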

Attention Residuals (AttnRes)
Kimi Team
0 citations
Paper: Attention Residuals
Residual connections with PreNorm are standard in modern LLMs, but they sum every layer output with fixed unit weights. That uniform aggregation lets hidden states grow unchecked with depth and progressively dilutes each layer's contribution. Attention Residuals (AttnRes) replace fixed accumulation with softmax attention over prior layer outputs so each layer can mix earlier representations with learned, input-dependent weights. Block AttnRes groups layers into blocks and attends over block-level summaries to cut memory and communication cost while keeping most of the benefit; with cache-friendly pipeline communication it is framed as a practical drop-in for standard residuals. Scaling-law results are reported across sizes, and the authors integrate AttnRes into Kimi Linear (48B total / 3B active) on 1.4T tokens, reporting more stable magnitudes and gradients across depth and gains on downstream tasks.
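The core mechanism can be illustrated with a small NumPy sketch. This is an assumed simplification, not the paper's implementation: each token forms a query from the newest layer output, scores a key derived from every prior layer's output, and blends the stack with softmax weights, replacing the uniform unit-weight sum of a plain residual. `Wq` and `Wk` are hypothetical per-model projections; one scalar weight per layer per token is assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(layer_outputs, Wq, Wk):
    """Mix prior layer outputs with learned, input-dependent weights
    (AttnRes-style sketch) instead of summing them uniformly.
    layer_outputs: list of (seq, d) arrays, newest last."""
    H = np.stack(layer_outputs)                       # (L, seq, d)
    q = H[-1] @ Wq                                    # queries from newest output
    k = H @ Wk                                        # (L, seq, d) per-layer keys
    scores = np.einsum("sd,lsd->sl", q, k) / np.sqrt(q.shape[-1])
    w = softmax(scores, axis=-1)                      # (seq, L), sums to 1 per token
    return np.einsum("sl,lsd->sd", w, H)              # blended hidden state
```

Since the weights are normalized per token, the blended state cannot grow linearly with depth the way an unweighted residual sum can, which matches the paper's motivation of keeping hidden-state magnitudes stable.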

Mixture-of-Depths Attention (MoDA)
HUST VL Lab
0 citations
Paper: Mixture-of-Depths Attention
GitHub repo: MoDA
Deeper LLMs often show signal degradation: features from shallow layers get diluted by repeated residual updates and are hard to recover deeper in the stack. Mixture-of-depths attention (MoDA) lets each attention head attend both to the current layer's sequence KV pairs and to depth KV pairs from earlier layers, so representations from different depths stay in play. The authors describe a hardware-oriented kernel that tames non-contiguous memory access and report about 97.3% of FlashAttention-2 efficiency at 64K sequence length. On 1.5B-parameter models they report consistent gains over strong baselines: roughly 0.2 lower average perplexity on ten validation sets and about 2.11% higher average downstream performance on ten tasks, with roughly 3.7% FLOPs overhead. They also find MoDA works better with post-norm than with pre-norm in their setup.
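A single-head NumPy sketch of the attention pattern, under stated assumptions: this is a simplified reading of MoDA, with no causal masking, no multi-head logic, and none of the hardware-oriented kernel work; "depth KV" is modeled as extra key/value rows cached from earlier layers and simply concatenated with the current layer's sequence KV before a standard softmax attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, kv_current, kv_depth):
    """One head attends jointly over the current layer's sequence KV pairs
    and depth KV pairs cached from earlier layers (simplified sketch).
    q: (seq, d); kv_current, kv_depth: (K, V) tuples of (n, d) arrays."""
    K = np.concatenate([kv_current[0], kv_depth[0]])  # (seq + n_depth, d)
    V = np.concatenate([kv_current[1], kv_depth[1]])
    scores = q @ K.T / np.sqrt(q.shape[-1])           # joint score table
    return softmax(scores) @ V                        # shallow features stay reachable
```

The concatenation makes the degradation fix explicit: a deep layer can put attention mass directly on shallow-layer representations rather than hoping they survived many residual updates.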