Chasing AI Autonomy Misses Near-Term Agentic Returns

Featured image: a worker wearing an advanced, full-body powered exoskeleton.

Apple recently published a paper asking large reasoning models (LRMs) to solve some simple but lengthy algorithmic challenges, such as the Towers of Hanoi disc-sorting puzzle. The models failed explosively: they could solve Towers of Hanoi (in which discs are shifted across pegs according to simple rules) with three discs but failed at eight or more. The paper showed that the models guess at the outcome of applying the rules, even when the algorithm is provided.

Apple’s findings aren’t unique. In a paper titled “Mind The Gap: Deep Learning Doesn’t Learn Deeply,” Subbarao Kambhampati writes that inspecting models’ inner workings shows that even models that succeed at algorithmic challenges aren’t faithful to the algorithms internally. In other words, models that get to the right answer may be using “alternate strategies,” akin to my teenager cramming imperial dynasties the night before exams: The correct answer is the name you recognize. Kambhampati argues that LRMs aren’t fundamentally different from the large language models (LLMs) they’re adapted from.

As statisticians say: All models are wrong, but some are useful.

Less Inference, More Algorithm

As Gary Marcus has written, LLMs “are no substitute for good, well-specified conventional algorithms.” So I prompted Claude to output the algorithm for me. I asked it to write a validator for the algorithm. Then I asked it to give me a demo application showing the solution.

The model solved the Towers of Hanoi on the first try.

The algorithm written by the LLM works better than calling the AI for the same result: It scales well, is more efficient than a machine learning (ML) model and should run as accurately as anything else in your browser. The model that wrote the code was aided by the many published reference solutions to this learn-to-code staple, which is also true of critical business questions like “Am I making money?” and “When does my pizza arrive?”
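For reference, here is a minimal sketch of the kind of conventional solution an LLM can produce on request. The function names and structure are my own illustration, not Claude’s actual output:

```typescript
// Sketch of a conventional Towers of Hanoi solver plus a validator.
// Illustrative only; not the code Claude generated in the anecdote above.

type Move = { from: number; to: number };

// Classic recursion: move n-1 discs aside, move the largest, move n-1 back on top.
function hanoi(n: number, from: number, to: number, via: number, moves: Move[] = []): Move[] {
  if (n === 0) return moves;
  hanoi(n - 1, from, via, to, moves);
  moves.push({ from, to });
  hanoi(n - 1, via, to, from, moves);
  return moves;
}

// Validator: replay the moves on three pegs and check the rules hold throughout.
function validate(n: number, moves: Move[]): boolean {
  const pegs: number[][] = [[], [], []];
  for (let disc = n; disc >= 1; disc--) pegs[0].push(disc); // largest disc at the bottom
  for (const { from, to } of moves) {
    const disc = pegs[from].pop();
    if (disc === undefined) return false;              // moved from an empty peg
    const top = pegs[to][pegs[to].length - 1];
    if (top !== undefined && top < disc) return false; // larger disc placed on a smaller one
    pegs[to].push(disc);
  }
  return pegs[2].length === n;                         // everything ended on the target peg
}

const moves = hanoi(8, 0, 2, 1);
console.log(moves.length, validate(8, moves)); // 255 true
```

Eight discs take 255 moves, the case where the models in Apple’s paper fell apart; the twenty-line recursion handles it instantly and deterministically.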

Experimentation and Intuition

To get value from foundation models, we need to point them at appropriately scoped problems. In practice, the work of scoping AI problems combines experimentation with developer intuition and, in an enterprise context, developer platforms that make small-batch experimentation safe to try. (Disclosure: I work on that at the VMware Tanzu Platform.) You can label this combination of full-stack development and model awareness “AI engineering.”

My teammate and AI engineer Brian Friedman says: “Effort is required … you have to provide the specifics of your org in a narrow manner in order to solicit specific and accurate responses. We need to view things like retrieval-augmented generation not as stopgaps or anti-patterns, but as the way forward for safe and effective use of AI.”
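In code, “providing the specifics of your org in a narrow manner” often looks like a small retrieval step in front of the model call. Here is a minimal retrieval-augmented generation sketch under assumed infrastructure; searchDocs and chat are placeholders, not any particular vendor’s API:

```typescript
// Minimal RAG sketch. `searchDocs` and `chat` are placeholders standing in for
// your vector store and model client; they are not a specific vendor API.

async function answerWithContext(question: string): Promise<string> {
  // 1. Retrieve a narrow slice of org-specific documents relevant to the question.
  const docs: string[] = await searchDocs(question, { topK: 5 });

  // 2. Assemble the model's context deliberately: instructions plus retrieved specifics.
  const prompt = [
    "Answer using only the provided context. If the context is insufficient, say so.",
    "Context:",
    ...docs.map((d, i) => `[${i + 1}] ${d}`),
    `Question: ${question}`,
  ].join("\n");

  // 3. One inference call, scoped to the retrieved material.
  return chat(prompt);
}

// Placeholder signatures for the assumed infrastructure.
declare function searchDocs(query: string, opts: { topK: number }): Promise<string[]>;
declare function chat(prompt: string): Promise<string>;
```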

Agents: Less Bad Than What We’ve Been Doing

This gets us to agents, the reason that reasoning models exist. It would be really lovely if reasoning models could think through long workflows as zero-shot solutions: Find me a flight, I don’t care how. The more likely situation is that we’ll keep writing software.

We’ll identify small intermediate goals and store the results. We’ll call services and tools and validators. We’ll implement algorithms in Java and Python and Go. We’ll get feedback from humans along the way. We’ll worry about latency and security and carbon emissions.

It’s likely that coding assistants like Claude, Devstral and Gemini can do some of that work for us. But the slow work of figuring out what users want and testing product market fit still has to happen.

Foundation models solve hard problems. I’m skeptical by nature, so I’ve spent the last five years testing LLMs with real-world problems, building code analysis, search, summary, checkout and customer support gadgets. Within the scope of “transport a relevant bit of JSON from a database to a UI,” the reasoning models are proving stable, accurate and fast. Integrations are cheap now. Classification works. You can throw business rules into an application with natural language — “Do this thing, unless it’s one of the following situations” — and it works pretty well on Day 1.
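As an illustration of throwing business rules into an application with natural language, here is a hedged sketch. The routing rules, the routeTicket helper and the chat client are invented placeholders, not a real system:

```typescript
// Sketch of natural-language business rules inside an application.
// `chat` is a placeholder model client; the rules and queue names are invented.

const ROUTING_RULES = `
Route the support ticket to exactly one queue: "billing", "outage", or "general".
Route by topic, unless it's one of the following situations:
- The customer mentions data loss: always route to "outage".
- The customer is on a trial plan and asks about pricing: route to "billing".
Respond with only the queue name.
`;

async function routeTicket(ticketText: string): Promise<string> {
  const answer = await chat(`${ROUTING_RULES}\n\nTicket:\n${ticketText}`);
  const queue = answer.trim().toLowerCase();
  // Validate the model's output against the closed set of allowed values.
  return ["billing", "outage", "general"].includes(queue) ? queue : "general";
}

declare function chat(prompt: string): Promise<string>;
```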

As a developer, the delightful surprises are starting to outnumber the terrifying ones.

The recent influencer walkback from “agents” to “agentic applications” is an encouraging hype-cycle correction. As I’ve written previously, existing software workflows are the near-term target for not-quite-agent adoption. We’ve solved natural language understanding in an incredibly general way. This is just now getting productized in high-value workflows (I’ll leave consumer chatbots like ChatGPT as a topic for another day). Enterprises start with domains that are easy to measure as dollars (sales team prioritization, customer support, site reliability engineering [SRE]), but the next decade will see that capability creeping into places like pizza restaurants and retail stockrooms as small but useful process improvements. The users don’t need to know there’s a foundation model in there. They just want better software.

Robots Do Laundry, but Only for Software Developers

The first wave of somewhat autonomous agents is running today on developer laptops. In a post modestly titled “My AI Skeptic Friends Are All Nuts,” software engineer Thomas Ptacek writes:

“People coding with LLMs today use agents. Agents get to poke around your codebase on their own. They author files directly. They run tools. They compile code, run tests, and iterate on the results. They also:

  • pull in arbitrary code from the tree, or from other trees online, into their context windows,
  • run standard Unix tools to navigate the tree and extract information,
  • interact with Git,
  • run existing tooling, like linters, formatters, and model checkers, and
  • make essentially arbitrary tool calls (that you set up) through MCP.”

My experience and various research support this. Ticketing, testing, code repos and deployment pipelines are represented as tools, often with the Model Context Protocol (MCP) connecting them. Bots connected to a code editor are allowing developers to ask questions, like “Hey Tanzu Platform, what data infrastructure is approved for this use case?” and get answers that are contextually shaped by what they’re working on.
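To make that concrete, here is a schematic sketch of what exposing a ticketing system as an agent-callable tool might look like. This shows the general pattern only, not the actual MCP SDK; the tool name and the ticketApi client are hypothetical:

```typescript
// Schematic sketch of the tool pattern only; not the actual MCP SDK.
// A ticketing system exposed as a named, schema-described tool an agent can call.

interface Tool {
  name: string;
  description: string;
  inputSchema: object;                        // JSON Schema describing the arguments
  run(args: Record<string, unknown>): Promise<string>;
}

// Hypothetical ticketing tool; `ticketApi` stands in for your real ticketing client.
const searchTickets: Tool = {
  name: "search_tickets",
  description: "Search open tickets by keyword and return short summaries.",
  inputSchema: {
    type: "object",
    properties: { query: { type: "string" } },
    required: ["query"],
  },
  async run(args) {
    const results = await ticketApi.search(String(args.query));
    return results.map((t) => `#${t.id}: ${t.title}`).join("\n");
  },
};

declare const ticketApi: {
  search(query: string): Promise<{ id: number; title: string }[]>;
};
```

The agent never sees the ticketing system’s internals; it sees a name, a description and a schema, and the application decides which calls are allowed.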

It’s a hint at what an “agentic” organization looks like for all knowledge workers: not a high-autonomy robot but steerable power armor wrapped around empowered humans, who can get started faster, work more safely and spend more time on the hard (or even fun) problems.

Meanwhile, builders keep building. My former Sprout Social colleague Kevin Stanton posts, “Can’t wait for this hype cycle to end. The only way through is building real stuff and ignoring the chatter.”

Run small tests. Measure results. Use realistic data under realistic constraints, assembling your model’s context inputs by hand if you have to, or better yet, using a platform to wire up data, compute, inference and eval components quickly and safely. All this power is not aimed at impressing your bosses with vaporware prototypes or throwing an AI press release at a slumping stock price, but at making better products. Try things. Show it to customers. Listen.

OK, but what about the big transformative changes? This is that. You’ll see it in hindsight.