Why rebuilding beats automating in the race to production-grade AI

23 Apr

For the past 18 months, a pattern has been quietly repeating itself inside the world’s biggest companies. A bank automates its loan approvals and watches the cost of doing it triple. A retailer deploys an agent into customer service and finds it has just scaled a broken escalation process across every market it operates in. A hospital turns on a clinical documentation assistant and discovers that the data it is drawing from is a decade of inconsistency no one had the incentive to fix. The pilot looked brilliant. The production deployment is quietly, expensively failing.

The reflex, almost every time, is to blame the model. Tune it. Retrain it. Swap it out for a newer one. Bring in a consultancy. Add a committee. Write a governance framework. None of it works, because none of it touches the actual problem.

The actual problem is that the infrastructure most enterprises are running AI on was built for cloud-first strategies that assumed elastic, general-purpose workloads. It cannot handle AI economics. The processes agents are being asked to automate were designed for human workers, and they do not translate. The security perimeter most enterprises still rely on was built for a world where attackers moved at human speed. Agents and the adversaries exploiting them do not. Digital transformation, as most businesses have defined it for the last ten years, is solving the wrong problem. And the companies that have understood this are pulling permanently ahead.

The cloud assumption does not survive contact with AI economics

The first place the old foundations start to crack is the infrastructure bill.

Cloud was built for workloads that could scale up and down on demand, where the whole point was not knowing in advance how much compute you would need. AI turns that assumption inside out. It is data-driven first, latency-sensitive, and relentless in its appetite for GPU resources. Consumption patterns that looked like healthy elasticity when they were spreadsheets and SaaS look like runaway spend when they are models and agents. Generative AI infrastructure at production scale has, on average, been running three to five times initial projections. Finance teams that approved the original business case are now approving the fourth revision of it.

Haider Aziz, General Manager for META at VAST Data, argues the industry was solving the wrong problem long before AI made it obvious. By 2016, three decades of enterprise data infrastructure had been built around scarcity: spinning disks, tiered storage, fragmented architectures. A decade of cloud had been layered on top, built around convenience rather than performance. “We didn’t go through a moment of realisation,” Aziz said, “because we were founded on the basis that the industry was already solving the wrong problem.” VAST’s response was an architecture called Disaggregated Shared-Everything, which decouples compute and storage so they can scale independently while keeping the data globally shared, consistent and free of copies. “Rebuilding for AI starts with a simple idea,” Aziz said. “Stop moving data around to fit the system. Build a system that fits the data.” It is a rebuild, not a retrofit.

The temptation most enterprises face now is to paper over the mismatch by running two environments in parallel, one for the legacy workloads and one for AI. That is the wrong instinct. “The more you invest in that legacy architecture, it will make it much harder for you to leave that legacy architecture,” said Mohammad Abulhouf, Senior Systems Engineer Manager at Nutanix. Every optimisation made to the old stack deepens the sunk cost, and every process tuned against it adds another reason not to migrate. Fibre channel, still the backbone of most enterprise data centres, caps at 32 to 64 megabits per second per port. AI workloads need hundreds. The architecture is not neutral. It is quietly voting against the rebuild every day it stays in place.

What Abulhouf is arguing for instead is a unified platform that handles both AI and general-purpose workloads on the same infrastructure. “It has to be unified architecture because we believe in unification, we believe in simplifying, because you can build lots of simplicity out of unification,” he said. “It should be unified for AI and general purpose workload, and it should be unified with public cloud which will allow you to move everywhere smoothly and easily.” That portability is not a feature. It is the precondition for running AI at production scale without being trapped inside any single vendor’s economics.

The data layer has to come first

The second crack opens in the data layer, and it is the one that catches almost everyone by surprise.

The moment an agent moves from pilot to production is the moment it inherits everything the business did not fix about its data. Fragmentation across systems that were acquired rather than integrated. Ownership disputes between departments that have never agreed on a single source of truth. Definitions of “customer” that vary between the CRM, the billing system and the support desk. Gartner has attributed roughly 85% of all AI project failures to poor data quality, and predicts that 60% of projects unsupported by AI-ready data will be abandoned through 2026.

“Businesses discover how the actual data is tightly intertwined with outdated systems, fragmented, politically owned, and poorly managed,” said Avinash Gujje, Practice Head at Cloud Box Technologies. Agents surface all of it simultaneously. And the damage is not only technical. The governance frameworks, risk committees and operating models enterprises rely on were built for deterministic software, where the same input produced the same output every time. AI does not behave that way. “AI introduces probable outcomes while organisations are more used to yes-or-no kind of responses,” Gujje said. “Enterprises are not structured to handle ‘mostly right’ answers at scale. They lack clear accountability for AI decisions, escalation paths for failure, and processes to manage continuous model drift.”

The rebuild, as every serious practitioner now describes it, starts with the data. Walid Issa, Senior Manager, Solutions Engineering for Middle East and Africa at NetApp, traces the insight back to the point at which the industry realised automation on its own was not the answer. “Customers kept automating inefficient processes and still struggled with scale, cost, and complexity,” he said. “The real issue was fragmented data and inconsistent operations across environments. Once we focused on unifying data management and delivering consistent, automated workflows end-to-end, automation finally delivered meaningful impact.” The sequence matters. Unify the data first. Then automate on top of it. Reverse the order and the automation simply propagates the fragmentation faster.

Rebuilding means starting with the data layer underneath and designing it around data flow, performance and governance rather than retrofitting legacy systems. It means building hybrid reasoning architectures that use language models where probabilistic behaviour adds value and keep rule-based automation where the process actually requires determinism. The rebuild is not a software purchase. It is a redesign of how the business treats the thing AI runs on.

Security was designed for an enemy that no longer exists

The third crack is the one most boards are waking up to last, and the one that will embarrass them first.

Perimeter defence was built on an assumption the industry no longer gets to make: that you can draw a line around your estate and defend it. Data does not live inside a perimeter anymore. It lives across clouds, on-premises, at the edge, in SaaS tools no one has fully inventoried. Agents interact with it continuously, often without a human in the loop. The speed and scale of those interactions changes the problem completely, because the adversaries targeting them now operate at the same tempo. A human security team cannot match it.

The only viable defence against machine-speed attack is machine-speed defence. Zero-trust principles, continuous verification, micro-segmentation, anomaly detection and ransomware monitoring have to be built into the platform rather than layered on top of it. Issa describes the shift in data terms. “Perimeter-based security couldn’t keep up with machine-speed threats, especially as AI agents interact directly with data,” he said. “Our approach focuses on zero-trust principles, continuous verification, and real-time anomaly detection built into the data layer itself. By securing data, not just the perimeter, we can respond at machine speed and contain risks before they propagate.” It is the same logic that drove the data unification argument, applied to threat containment.

Aziz puts it more directly. “What you need instead is security that travels with the data. Policy, identity, and access controls that are embedded directly into the system, enforced in real time, and consistently applied no matter where the data lives or how it’s being used.” Security has to travel with the data, not wait for the adversary at the door.

The operational side of this matters just as much. Levent Ergin, Global Chief Strategist for Climate Sustainability and ESG at Informatica, argues that the consequences of skipping this work are now visible in the insurance market itself. “We now have insurers coming up with insurance for AI, for GenAI, for data poisoning,” he said, “because if you deploy something and you haven’t gone through the appropriate controls testing, maybe you’ve got a prompt injection attack, and all of a sudden your agent, which has access to very sensitive data, is now leaking data.” When insurers begin pricing a category of risk, they are pricing an expectation that it will materialise. Before any agent reaches production, Ergin argues, the controls around it have to be tested by a cross-functional committee that verifies whether zero trust is actually enforced rather than nominally in place. The testing cannot be cosmetic. “By the time you get to pre-production, you don’t want to be fixing data quality issues. You want to be looking at regression testing,” he said. The question being asked at that stage is not whether the agent works, but whether it has had unintended knock-on consequences anywhere else in the live environment.

Resilience is the test most enterprises have not run

Then the AWS outage happened.

In March, an Availability Zone in the UAE was impacted by objects that struck the data centre, and enterprises across the region discovered, in real time, how much of their critical workload sat in a single location they could not migrate away from. Disaster recovery plans written around the assumption that failures would be localised and technical did not survive a region-level event driven by geopolitics. The outage did not cause the resilience problem. It made it visible.

For production-grade AI, resilience cannot be a compliance afterthought. It has to be a first-class design principle. “Focus on operational resilience, which is identifying your key business processes, people, process, technology and most importantly the data that is supporting those critical processes,” Ergin said. “We saw cloud first, and now we’re seeing agentic enterprise. AWS in the Middle East is down. If you had thought about your operational resilience, you would already know your crown jewels in terms of process and data, and you would be able to switch from one cloud pod to another, or from AWS to GCP, or whatever multicloud solution.” The enterprises that have done that mapping can move. The ones that have not, and are simultaneously running agents with access to sensitive data, are carrying concentration risk on two axes at once. Language models are rented. Business context is owned. The companies that have internalised the difference are the ones that can deploy AI knowing they have backup plans if something goes wrong.

What rebuilding actually looks like

The objection at this point is usually the same. We cannot rebuild everything. We have spent a decade on the current stack.

The rebuild does not require starting from scratch. It requires stopping the investment in the legacy architecture and building the new one in parallel. As servers come up for refresh, workloads migrate onto the unified platform one at a time. Over three to four years, the legacy estate retires itself. Abulhouf describes the destination's operational shape in movement terms. “Train your model in the cloud, bring it to your main data centre, fine-tune it, and then push it to the edge where you can do all of the inferencing work. Bring it back, fine-tune it, bring it back to the public cloud. Keep playing with it, because that ease of management, ease of movement, mobility and portability brings lots and lots of efficiency.” That continuous movement is the actual shape of production-grade AI, and no infrastructure designed three decades ago for batch processing can deliver it.

Asked what he would do differently with the benefit of hindsight, Issa’s answer is short. “If we were to do anything differently, we would accelerate the shift to AI-ready architectures earlier instead of adapting existing workflows.” It is the same instinct that runs through every serious version of this conversation, and Gujje frames it as a question of organisational design rather than technology choice. “AI ships fastest in organisations designed for decision velocity, not technical perfection,” he said. “One accountable leader must own outcomes, including risk acceptance. Risk budgets must be defined upfront. Governance must run in parallel. When organisations design for decision velocity, the conversation shifts from ‘can we deploy AI?’ to ‘under what conditions will we deploy, learn, and adjust?’”

Enterprises still optimising their legacy transformation programmes are lagging. They are getting very good at the wrong thing. Every month they spend refining the old stack is a month when the gap with the businesses that have started rebuilding widens. Eventually, the gap becomes the market.

The Source Code Editorial