Unlocking Transformer Efficiency: The Routing and Filtering Structure of Attention

Discover how decomposing attention into routing and filtering components reveals inefficiencies in transformer models, leading to new, stable, and highly efficient AI architectures.

      In the rapidly evolving landscape of artificial intelligence, transformer models have become the backbone of many advanced applications, from natural language processing to computer vision. Central to their power is the "attention mechanism," a complex calculation that allows models to weigh the importance of different parts of an input sequence. However, despite their success, current transformer designs often allocate computational resources uniformly across all layers, leading to inefficiencies. A recent academic paper, "The Routing and Filtering Structure of Attention," from Shafayeth Jamil and Rehan Kapadia at the University of Southern California, sheds light on this hidden inefficiency by dissecting the attention mechanism into its fundamental components and proposing a path toward more optimized and scalable AI architectures.

Unpacking Attention: Beyond the Black Box

      At its core, the attention mechanism calculates an "interaction matrix" (often denoted QKᵀ) that determines how much each part of an input sequence relates to every other part. Traditionally, this matrix has been treated as a single, complex operation. However, this paper reveals that the QKᵀ matrix actually entwines two distinct computations:

  • Routing: This is the skew-symmetric component, responsible for redistributing information directionally between different positions in the input. Think of it like a network of one-way streets where information is channeled from a "source" to a "sink." If token A routes information to token B, token B experiences an equal and opposite directional preference towards A. This conserves information flow, meaning information moves but is neither created nor destroyed.
  • Filtering: This is the symmetric component, which scales the mutual relevance between tokens. It acts like a two-way street, where if token A finds token B relevant, token B equally finds A relevant. This component selectively amplifies or suppresses the importance of certain relationships, much like a filter bank adjusting gains on different signal frequencies.

      These two functions have different computational requirements and depth profiles within a neural network, but standard attention mechanisms blend them into one opaque operation, which makes it hard to see where capacity is being used and where it is wasted.
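
The split described above is a standard linear-algebra identity: any square matrix is the sum of a symmetric part and a skew-symmetric part. The sketch below applies it to a randomly initialized kernel of the form W_Q W_Kᵀ. The matrix sizes, the random weights, and the helper name `split_interaction` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def split_interaction(m):
    """Split a square matrix into symmetric and skew-symmetric parts."""
    sym = 0.5 * (m + m.T)    # filtering-like part: sym[i, j] == sym[j, i]
    skew = 0.5 * (m - m.T)   # routing-like part: skew[i, j] == -skew[j, i]
    return sym, skew

rng = np.random.default_rng(0)
d_model, d_head = 16, 4                      # hypothetical sizes
w_q = rng.normal(size=(d_model, d_head))     # stand-in for learned query weights
w_k = rng.normal(size=(d_model, d_head))     # stand-in for learned key weights
kernel = w_q @ w_k.T                         # (d_model, d_model) interaction kernel

sym, skew = split_interaction(kernel)

# The two parts are orthogonal under the Frobenius inner product,
# so their squared norms add up to the squared norm of the kernel.
total = np.linalg.norm(kernel) ** 2
sym_frac = np.linalg.norm(sym) ** 2 / total
skew_frac = np.linalg.norm(skew) ** 2 / total
print(f"filtering share: {sym_frac:.3f}, routing share: {skew_frac:.3f}")
```

Because the two shares sum to one, comparing them per head is one simple way to ask which component dominates. Whether the paper uses exactly this energy ratio is not stated in this article.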

The Hidden Inefficiency of Standard Attention

      To understand how these components operate in practice, researchers analyzed 1,776 attention heads across five widely used pretrained transformer models, including different versions of GPT-2 and BERT-base (Source: Jamil & Kapadia, 2026). Their findings were striking: in standard attention mechanisms, the filtering component overwhelmingly dominates the routing component. Furthermore, routing consistently operates at a very low "rank" on real-world text.

      In technical terms, "rank" refers to the number of independent patterns or dimensions a matrix can represent, so a low rank implies a simpler interaction. The study found that while the attention mechanism's underlying "weight kernel" (the learned parameters) allocated significant capacity for routing, the heads rarely used that capacity on real text. This suggests that routing's potential for complex, directional information flow is largely underutilized in existing transformer designs, hidden by the more dominant filtering computations. In practice, computational resources are being allocated to capabilities that are not fully exercised.
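
To make "rank" concrete, the sketch below computes one common soft measure, the effective rank, defined as the exponential of the Shannon entropy of the normalized singular values. This definition is an assumption chosen for illustration; the article does not say which rank measure the paper uses. The test matrices are synthetic.

```python
import numpy as np

def effective_rank(m, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), p = normalized singular values."""
    s = np.linalg.svd(m, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                       # drop numerically zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)

# Synthetic matrix with exactly two independent directions.
u = rng.normal(size=(32, 2))
v = rng.normal(size=(32, 2))
low = u @ v.T

# Synthetic dense matrix with no imposed low-rank structure.
full = rng.normal(size=(32, 32))

print("low-rank matrix:", effective_rank(low))
print("dense matrix:   ", effective_rank(full))
```

Unlike the exact algebraic rank, this measure changes smoothly as singular values shrink, which makes it convenient for comparing heads or layers of a trained model.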

Introducing S–D Attention: A Stable & Disentangled Approach

      To overcome the entanglement of routing and filtering and reveal their true operational dynamics, the authors introduced a novel diagnostic parameterization called S–D attention. In this architecture, the interaction matrix is explicitly constructed as L = S − D, where 'S' is a learned skew-symmetric matrix (for routing) and 'D' is a learned positive diagonal matrix (for filtering). This structural separation offers a critical advantage: it guarantees stability, specifically Re(λ) ≤ 0 for every eigenvalue.
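
The stability claim follows from a short argument: for a unit eigenvector v of L = S − D, the term v*Sv is purely imaginary when S is skew-symmetric, so the real part of the eigenvalue equals −v*Dv, which is negative when D has a positive diagonal. The sketch below checks this numerically. The way S and D are built here (an antisymmetrized raw matrix and a softplus of raw values) is an assumed parameterization for illustration, and `build_sd_matrix` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def softplus(x):
    """Smooth map from real numbers to strictly positive numbers."""
    return np.log1p(np.exp(x))

def build_sd_matrix(raw_s, raw_d):
    """Assemble L = S - D from unconstrained parameters (assumed scheme)."""
    s = raw_s - raw_s.T             # skew-symmetric by construction
    d = np.diag(softplus(raw_d))    # positive diagonal by construction
    return s - d

rng = np.random.default_rng(0)
n = 8                               # hypothetical head dimension
l_mat = build_sd_matrix(rng.normal(size=(n, n)), rng.normal(size=n))

eigvals = np.linalg.eigvals(l_mat)
max_real = eigvals.real.max()
print("largest real part of any eigenvalue:", max_real)
```

Because the constraint is built into the construction, it holds for every parameter value the optimizer can reach, rather than being encouraged by a penalty term.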

      This inherent stability is robust enough that S–D attention models can train reliably without "layer normalization," a common technique that stabilizes training by normalizing activations within each layer. The authors report training a 355M-parameter model stably without any normalization, which shows that a well-structured mechanism can achieve stability by design while reducing architectural complexity.

The "Spectral Cascade": A Blueprint for Efficient AI

      When routing and filtering are disentangled, and without the uniformizing effect of layer normalization, the routing mechanism self-organizes into a remarkable "spectral cascade." This refers to a hierarchical pattern of routing complexity across the model's layers:

  • Early Layers: The first layers of the network exhibit extremely low-rank routing, often converging to an "effective rank" of two. This means initial information redistribution is very simple, typically involving a single, broad rotation plane that gently redirects information across the sequence.
  • Deeper Layers: As the network deepens, the routing complexity gradually expands, reaching ten or more independent rotation planes in the final layers. This signifies a progressive increase in the sophistication of directional information flow as the model processes information more deeply.
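
The link between "effective rank of two" and "a single rotation plane" in the list above comes from a property of real skew-symmetric matrices: their nonzero singular values come in equal pairs, one pair per rotation plane. The sketch below builds synthetic skew-symmetric matrices with a chosen number of planes and recovers that number. The helper names, sizes, and frequencies are hypothetical, and the matrices are toy stand-ins for trained routing matrices.

```python
import numpy as np

def skew_from_planes(n, freqs, rng):
    """Build an n x n skew-symmetric matrix with one rotation plane per frequency."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))    # random orthogonal basis
    block = np.zeros((n, n))
    for i, w in enumerate(freqs):                   # requires 2 * len(freqs) <= n
        block[2 * i, 2 * i + 1] = w
        block[2 * i + 1, 2 * i] = -w
    return q @ block @ q.T

def count_rotation_planes(s, tol=1e-8):
    """Count rotation planes as half the number of non-negligible singular values."""
    sv = np.linalg.svd(s, compute_uv=False)
    return int((sv > tol * sv.max()).sum()) // 2

rng = np.random.default_rng(0)
early = skew_from_planes(16, [1.0], rng)                       # toy "early layer"
late = skew_from_planes(32, np.linspace(0.2, 1.0, 10), rng)    # toy "late layer"

print("early planes:", count_rotation_planes(early))
print("late planes: ", count_rotation_planes(late))
```

On trained weights the singular values decay gradually instead of dropping to zero, so a soft measure or a tuned tolerance would be needed in place of the hard threshold used here.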


      This spectral cascade appears consistently across models of varying scales, from 7 million to 355 million parameters. The inverse trend is observed for filtering, which collapses from a high-rank landscape in standard attention to a single scalar per head when routing is properly organized. This suggests that if routing effectively manages directional information, simpler filtering is sufficient.

      This cascade is not an inherent property of S–D attention itself, but rather the natural organization the optimizer discovers when routing and filtering are separated and not forced into uniformity by normalization. It offers a crucial measurement tool that precisely identifies which layers genuinely require high-rank, quadratic attention, and which can be replaced by more computationally efficient, low-rank, or linear mechanisms.

Practical Implications for AI System Design

      The insights from the routing–filtering decomposition and the spectral cascade are highly actionable for designing more efficient and scalable AI systems:

  • Targeted Simplification: The cascade predicts that early layers, operating at low routing rank, are prime candidates for simplification. For example, linearizing the first seven layers of a 125M S–D attention model resulted in a minimal perplexity increase of less than 5%. In contrast, applying the same intervention to standard attention causes the model to collapse. This "linearizable region" widens with network depth, providing a clear roadmap for where to introduce simpler, faster attention variants.
  • Hybrid Architectures: By replacing the initial four layers with a more efficient ELU+1 linear attention, models achieved performance within 1.4% of the baseline with full head dimension. This demonstrates a principled way to build hybrid transformer architectures, similar to those being explored by models like Jamba and Nemotron-H, but guided by measurement rather than exhaustive search.
  • Parameter Reduction: Architectures designed with this cascade in mind reduced attention parameters by 47–65% in the reported experiments, at a perplexity cost of 3.9% to 8.4%. A smaller computational footprint of this kind matters for deploying advanced AI in resource-constrained environments. For example, ARSA Technology’s AI Box Series, designed for edge AI deployments, could leverage such insights to deliver powerful, real-time analytics with optimized hardware.
  • Enhanced Deployability: The ability to train stable models without layer normalization and to identify layers suitable for simpler attention mechanisms contributes directly to the deployability of AI. This is particularly relevant for applications that demand low latency, minimal compute, and on-premise processing, areas where ARSA excels in developing custom AI solutions and AI Video Analytics.
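
For readers unfamiliar with the ELU+1 linear attention mentioned in the list above, the sketch below shows the general idea as it is commonly formulated in the linear-transformer literature: a positive feature map replaces the softmax, which lets the key-value summary be computed once and reused for every query. This is a non-causal, single-head illustration under assumed shapes. It is not the paper's implementation, and a language model would need a causal variant based on running sums.

```python
import numpy as np

def elu_plus_one(x):
    """Feature map ELU(x) + 1: x + 1 for x > 0, exp(x) otherwise (always positive)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Non-causal linear attention; cost grows linearly with sequence length."""
    fq, fk = elu_plus_one(q), elu_plus_one(k)    # (n, d) each
    kv = fk.T @ v                                # (d, d_v) summary of keys and values
    z = fk.sum(axis=0)                           # (d,) normalizer summary
    return (fq @ kv) / (fq @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same computation written with an explicit (n, n) weight matrix."""
    fq, fk = elu_plus_one(q), elu_plus_one(k)
    scores = fq @ fk.T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d, d_v = 12, 4, 6                             # hypothetical sizes
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d_v))

out = linear_attention(q, k, v)
print("max difference from quadratic form:",
      np.abs(out - quadratic_reference(q, k, v)).max())
```

The two functions return the same result, but the linear version never forms the n-by-n matrix, which is where the savings come from in layers that the cascade marks as low-rank.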


      "The Routing and Filtering Structure of Attention" provides a framework for understanding and optimizing transformer models. The decomposition makes the spectral budget "legible," and the cascade makes it "actionable." Together they support a shift from brute-force architectural search to principled, measurement-driven design, paving the way for a next generation of AI that is not only powerful but also efficient and adaptable across various industries.

      To explore how these advancements can be integrated into your enterprise AI and IoT strategies, feel free to contact ARSA for a free consultation.

      Source: Jamil, S., & Kapadia, R. (2026). The Routing and Filtering Structure of Attention. arXiv preprint arXiv:2605.18826. https://arxiv.org/abs/2605.18826