
Jacek Pietsch
Principal Solution Architect
Hand an AI agent your tools and it looks like magic — until it spends thirteen dollars to buy one cheap item. A week inside personal LLM automation: where the model really fits, and what autonomy actually costs.

I have worked with LLMs since nearly the beginning — every major model, coding in projects small and large. From that vantage point the next step looked obvious: hand an agent all the tools — inbox, calendar, browser, card — describe the outcome, and let it run without me in the loop. So I built it. Not the engine — I built my own personal autonomous agent on top of Hermes, an existing agent framework: it lives on a server, takes instructions over Telegram, runs scheduled tasks on its own, and browses the web like a person. One week with it can be summed up in a single number. The agent was told to find one cheap product, add it to the cart, and pay. It did — and it spent about thirteen dollars to do it.
This is not an article about AI failing. AI works, and it works impressively. It is about what LLM automation actually is once you step out of the chat window into a system meant to run itself — and why the distance between a smooth demo and a dependable product is the entire job.
In a chat the model feels precise: you ask, it answers; you correct, it adapts. The conclusion writes itself — if it understands this well, I just describe the goal and it handles the means.
The catch: in a chat, you are the control loop. You read every answer, judge it, ask for the fix, decide when to stop — hundreds of micro-corrections made on reflex and never counted. Build an autonomous agent and you remove yourself from that loop. What looked like understanding turns out to be the impression of understanding, and it holds only while you are watching its hands.
An LLM is not a precise employee. It is an employee handed a two-sentence brief — who took it seriously.
Hand it thirty sentences instead and you get thirty interpretations, sometimes each one independently. "Understanding," as an LLM performs it, is not predictability.
If the model cannot be taken at its word, something deterministic has to keep it honest. In a working agent the model is maybe ten percent of the system; the rest is scaffolding that checks, retries, throttles, validates, guards credentials, and decides when to cut the model off. That scaffolding is the product. Look at where the model actually sits in three of my flows:
The pattern repeats: the model is a small, fenced reasoning component; the deterministic code around it wraps it, gates it, blinds it, and decides when to stop it. The model's intelligence is not the system's reliability — reliability is earned one operation at a time, in exactly the code no one puts in a demo.
There is no "plug it in and it works." Email triage is one project, with its own rules and calibration. A purchase at one store is a second, unrelated project. The same purchase at a different store is work from scratch — different layout, different anti-bot defenses, different checkout. The agent does not "learn stores" the way a person does after one trip; each scenario is fresh engineering, refined by hand over many failed runs before it holds.
And cost is not linear. At every step the model is re-fed the entire accumulated history, so each retry is billed on top of all the ones before it. That is how a single cheap purchase reached thirteen dollars: the store's defenses forced attempt after attempt, and every attempt re-billed the whole transcript. It worked — the order went through — but the success is what cost thirteen dollars.
That cost can be driven down — trimming re-sent history, narrowing the toolset, pinning one model provider. None of these is a switch; each is its own piece of deep engineering, stacked run after run. And even on the cheapest models, even fully optimized, the result lands right at the line where automating the job barely beats doing it by hand.
"The agent will do your shopping for you" meant, in practice: it can spend thirteen dollars doing one small piece of your shopping.
Business takeaway: before counting what automation saves, count two things it costs — the engineering required to make the per-run price bearable at all, and the price of every attempt that fights back.
This is not one homemade system's failing. The whole class — terminal-launched agents, autonomy frameworks — is early and unoptimized. A wave of agents with real graphical interfaces is arriving now, from research labs and enterprise vendors alike. That so many serious players are starting at once does not weaken the immaturity claim — it confirms it. The advice is unromantic: experiment, by all means; build a money-critical process on it, not yet.
Be honest to the end: this technology is superb where there is a hard truth to anchor to.
class always means the same thing; syntax cannot be read thirty ways. It is no accident that an AI assistant led most of the diagnosis and fixes that same week, and did it well.It is weak where it must create from scratch, or infer what you actually meant when you never fully said it. That frontier — between transforming what you were given and inventing what you did not say — is where the real front line of today's automation runs, and it deserves its own piece.
Access our exclusive whitepapers, expert webinars, and in-depth articles on the latest breakthroughs and strategic implications of advanced automation and AI.