Beyond T-Shirt Sizing: Why Traditional Agile Estimation Fails in AI Projects

Discover the five fatal assumptions that undermine T-shirt sizing in AI development. Learn why AI's non-linear nature demands an adaptive estimation framework like Checkpoint Sizing for project success.

The Unpredictable Nature of AI Projects: A New Estimation Challenge

      Agile estimation techniques, particularly T-shirt sizing, have long been cornerstones of successful software development. Their simplicity and focus on relative sizing—categorizing tasks as Small (S), Medium (M), Large (L), or Extra-Large (XL)—have provided software teams with an effective way to scope work, communicate complexity, and plan releases. This approach relies on the intuitive idea that past experience with similar tasks allows for reasonably confident future estimations. However, as organizations increasingly embark on ambitious artificial intelligence initiatives, these traditional methods are proving to be systematically misleading.

      When applied to the unique landscape of AI development, especially projects involving large language models (LLMs) and intricate multi-agent systems, the estimates frequently unravel. This isn't just about minor scheduling adjustments; it can lead to projects extending by months or encountering unforeseen technical roadblocks that were entirely off the initial radar. For instance, what appears to be a straightforward "chatbot" task can explode in complexity due to challenges like data quality, rigorous evaluation needs, and the inherently unpredictable nature of multi-turn conversations. The significant "slippage" observed in these scenarios is often not a failing of the team, but rather an intrinsic characteristic of the AI problem itself, as highlighted in a recent preprint by Soundaramourty et al. (2026) titled "Five Fatal Assumptions: Why T-Shirt Sizing Systematically Fails for AI Projects" (arXiv:2602.17734).

Why Standard Estimation Falls Short: The "Fatal Five" Assumptions

      The fundamental issue lies in the mental model traditionally used for software development, which simply doesn't align with the realities of AI. T-shirt sizing, despite its utility, rests upon five implicit assumptions that hold true for conventional software but systematically collapse when confronted with the non-linear and data-dependent world of AI. Understanding these assumptions is critical for anyone involved in planning and executing AI projects.

      These five fatal assumptions are: (1) linear effort scaling, (2) repeatability from prior experience, (3) effort-duration fungibility, (4) task decomposability, and (5) deterministic completion criteria. Each of these, when violated in an AI context, leads to significant underestimation of scope, integration complexity, and achievable timelines.

Assumption 1: Linear Effort Scaling - The Cascade Effect

      In traditional software development, effort often scales linearly. Adding a new field to a database or a button to a user interface generally requires a predictable amount of work, and doubling a feature's scope might roughly double the effort. This predictability allows for straightforward T-shirt sizing. However, AI projects defy this linearity.

      A minor change in data or model requirements can trigger a tight-coupling effect, where a small alteration cascades through the entire AI stack, necessitating extensive retraining, re-validation, and integration work. For example, enhancing a vision AI system to recognize a new object class might seem small, but gathering and labeling new data, retraining, and then re-evaluating performance across all existing classes can demand disproportionate, non-linear effort. The challenge is exacerbated in multi-agent systems, where interaction complexity grows non-linearly, following an N(N−1) pattern, so even small additions can have outsized impact.
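      The N(N−1) growth is easy to see concretely. This short sketch (our illustration, not code from the paper) counts the directed communication channels in a fully connected multi-agent system:

```python
def interaction_channels(n_agents: int) -> int:
    """Directed pairwise channels among N agents: each agent can
    message every other agent, giving N * (N - 1) channels."""
    return n_agents * (n_agents - 1)

for n in (2, 3, 5, 10):
    print(n, "agents ->", interaction_channels(n), "channels")
# 2 agents -> 2, 3 -> 6, 5 -> 20, 10 -> 90
```

      Going from 9 to 10 agents adds 18 new channels (72 to 90), which is why "just add one more agent" is rarely a small task.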

Assumption 2: Repeatability from Prior Experience - The Illusion of Familiarity

      Traditional software estimation benefits greatly from prior experience; if a team has built a user authentication module before, the next one can be estimated with high confidence. The patterns are largely reusable, and the environment is typically controlled. For AI, this assumption is an illusion.

      While an organization might have experience deploying a face recognition API in one environment, replicating it for another scenario can introduce entirely new complexities. Subtle differences in data distribution, sensor types, environmental conditions (e.g., lighting, occlusions), or new privacy regulations mean that a "similar" AI task is often a unique experimental endeavor. Each deployment is less about repeating a proven recipe and more about discovering a new one, making direct experience less reliable for sizing.

Assumption 3: Effort-Duration Fungibility - Time is Not Always Money

      In many traditional software tasks, adding more resources (e.g., engineers) can proportionally reduce the duration of a project, assuming tasks are sufficiently parallelizable. This concept of "effort-duration fungibility" underpins many project management strategies. However, this often doesn't hold true in AI development.

      Many critical AI tasks are inherently sequential and experimental. Model training, large-scale data labeling, hyperparameter tuning, and rigorous validation require dedicated calendar time and iterative refinement, regardless of headcount. Throwing more engineers at an unstable model or a complex data-annotation pipeline won't necessarily speed it up; it can even add coordination overhead. Machine learning workflows compound the problem: they are built around experimental iteration and ongoing maintenance against model drift and decay, which leaves much of the work hard to parallelize.
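      One way to make this concrete is an Amdahl's-law-style calculation (our illustration, not from the source): if a fixed fraction of the work is sequential, extra engineers give sharply diminishing returns. The 60% serial fraction below is a hypothetical number chosen for the example.

```python
def wall_clock_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup is capped by the serial fraction,
    no matter how many workers attack the parallelizable remainder."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# If 60% of an AI task is sequential (training runs, tuning loops,
# validation cycles), the ceiling is 1 / 0.6 ~= 1.67x:
print(wall_clock_speedup(0.6, 4))    # ~1.43x with 4 engineers
print(wall_clock_speedup(0.6, 100))  # ~1.66x with 100 engineers
```

      Quadrupling the team buys less than a 1.5x reduction in duration here, which is the effort-duration non-fungibility the paper describes.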

Assumption 4: Task Decomposability - The Interconnected Web

      Traditional software tasks are often designed to be modular and decomposable, allowing different teams or individuals to work on independent components that are later integrated. This isolation reduces interdependencies and simplifies estimation. For AI systems, particularly complex ones, this assumption fundamentally breaks down.

      AI projects, especially those involving LLMs and multi-agent systems, feature complex interaction surfaces and tight coupling between components. A change or bug in one part of an AI model or data pipeline can have unpredictable ripple effects across the entire system. Research on multi-agent coordination failures, for instance, identifies 14 distinct failure modes clustered into issues of system design, inter-agent misalignment, and task verification, highlighting the profound interconnectedness. For example, a video analytics solution designed for traffic monitoring might involve intertwined modules for vehicle counting, classification, and congestion prediction, where the performance of one heavily influences the others. Breaking such systems into independent, estimable units is often highly challenging.

Assumption 5: Deterministic Completion Criteria - The Shifting Goalposts

      In traditional software, the "definition of done" is typically clear and deterministic: a feature is implemented, all tests pass, and it meets predefined functional requirements. This clarity allows teams to confidently estimate when a task will be completed. However, in AI, completion criteria are far less fixed and can feel like shifting goalposts.

      For LLM applications, meeting reliability and safety targets can be an ongoing challenge, as model behavior is often probabilistic and sensitive to input variations. The problem is compounded in multi-turn conversations, which have shown an average 39% performance degradation due to their unpredictable nature. "Done" for an AI project often means "performs acceptably under tested conditions," which can change dramatically with new data or evolving real-world scenarios. This inherent uncertainty makes it incredibly difficult to define a fixed end-state for many AI tasks, rendering deterministic estimations unreliable.
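      In practice, a probabilistic "definition of done" looks less like a unit test and more like a sampled reliability check. This sketch is our own hypothetical illustration (the paper does not prescribe this code): a system counts as done only if its success rate over many trials clears a target, and that verdict can flip when data or conditions shift.

```python
import random

def meets_reliability_target(run_model, n_trials: int, target: float) -> bool:
    """'Done' for a probabilistic system: success rate over sampled
    trials clears a threshold, not a single deterministic pass/fail."""
    successes = sum(bool(run_model()) for _ in range(n_trials))
    return successes / n_trials >= target

# Hypothetical model that succeeds ~90% of the time in isolation;
# against a 95% reliability target it will usually fail the gate.
random.seed(0)
flaky_model = lambda: random.random() < 0.9
print(meets_reliability_target(flaky_model, n_trials=500, target=0.95))
```

      A deterministic test either passes or it doesn't; this gate can legitimately return different answers as the input distribution drifts, which is exactly why fixed completion estimates break down.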

Introducing Checkpoint Sizing: A Human-Centric Alternative for AI

      Recognizing the systemic failures of traditional estimation, the authors propose Checkpoint Sizing as a more appropriate framework for AI projects. This human-centric, iterative approach moves away from rigid upfront estimations, instead advocating for explicit decision gates at key project milestones. At each checkpoint, teams reassess the scope, feasibility, and remaining effort based on concrete learnings and actual progress made during development, rather than relying on initial assumptions.

      Checkpoint Sizing embraces the experimental nature of AI, allowing for adaptability and continuous adjustment. It forces teams to acknowledge uncertainty upfront and plan for learning, making it a more realistic and effective method for navigating the complex and often unpredictable world of AI development.
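      As a rough sketch of the idea (the names and structure here are our own, not the paper's specification), a checkpoint can be modeled as an explicit decision gate where the continue/re-scope call is made from observed evidence rather than the original estimate:

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """One decision gate: reassess scope against what was actually learned."""
    name: str
    exit_question: str                     # what must be true to proceed
    learnings: list = field(default_factory=list)

    def decide(self, evidence_supports_continuing: bool) -> str:
        # The decision comes from evidence gathered so far,
        # not from the upfront T-shirt size.
        return "continue" if evidence_supports_continuing else "re-scope"

# Hypothetical gates for an LLM feature:
gates = [
    Checkpoint("data-audit", "Is labeled data sufficient and representative?"),
    Checkpoint("baseline-model", "Does a simple baseline clear the floor metric?"),
    Checkpoint("eval-harness", "Do offline metrics track user-facing quality?"),
]
for gate in gates:
    print(gate.name, "->", gate.decide(evidence_supports_continuing=True))
```

      The point of the structure is that re-scoping is a planned, legitimate outcome at every gate, not a failure of the original estimate.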

      The unique challenges of AI project estimation underscore the need for a strategic approach and experienced partners. ARSA Technology specializes in delivering production-ready AI and IoT solutions that address these complexities head-on. With expertise in areas like Computer Vision, Large Language Models, and industrial IoT, ARSA adopts a full-stack AI engineering approach to solve real-world operational problems.

      ARSA provides custom AI solutions, understanding the critical importance of data dependencies, rigorous experimental iteration, and robust evaluation design for achieving measurable business outcomes. By focusing on practical, reliable, and privacy-conscious deployments, ARSA helps enterprises navigate the intricacies of AI development and unlock true value.

      Successfully deploying AI solutions requires a partner who understands both cutting-edge machine learning and the operational realities of deployment. To learn more about how to bring your AI initiatives from concept to production with confidence, explore ARSA's proven solutions or contact ARSA for a free consultation.

      Source: Soundaramourty, R., Kilic, O., & Chenchaiah, R. (2026). Five Fatal Assumptions: Why T-Shirt Sizing Systematically Fails for AI Projects. arXiv preprint arXiv:2602.17734.