LLM Automation — The Adventure and the Business Takeaways

Hand an AI agent your tools and it looks like magic — until it spends thirteen dollars to buy one cheap item. A week inside personal LLM automation: where the model really fits, and what autonomy actually costs.

Jacek Pietsch

Principal Solution Architect

I have worked with LLMs since nearly the beginning — every major model, coding in projects small and large. From that vantage point the next step looked obvious: hand an agent all the tools — inbox, calendar, browser, card — describe the outcome, and let it run without me in the loop. So I built it. Not the engine — I built my own personal autonomous agent on top of Hermes, an existing agent framework: it lives on a server, takes instructions over Telegram, runs scheduled tasks on its own, and browses the web like a person. One week with it can be summed up in a single number. The agent was told to find one cheap product, add it to the cart, and pay. It did — and it spent about thirteen dollars to do it.

This is not an article about AI failing. AI works, and it works impressively. It is about what LLM automation actually is once you step out of the chat window into a system meant to run itself — and why the distance between a smooth demo and a dependable product is the entire job.

The chat window hides who is doing the work

In a chat the model feels precise: you ask, it answers; you correct, it adapts. The conclusion writes itself — if it understands this well, I just describe the goal and it handles the means.

The catch: in a chat, you are the control loop. You read every answer, judge it, ask for the fix, decide when to stop — hundreds of micro-corrections made on reflex and never counted. Build an autonomous agent and you remove yourself from that loop. What looked like understanding turns out to be the impression of understanding, and it holds only while you are watching its hands.

An LLM is not a precise employee. It is an employee handed a two-sentence brief — who took it seriously.

Hand it thirty sentences instead and you get thirty interpretations, sometimes with each sentence read independently of the others. "Understanding," as an LLM performs it, is not predictability.

In a working agent, the model is roughly a tenth of the system

If the model cannot be taken at its word, something deterministic has to keep it honest. In a working agent the model is maybe ten percent of the system; the rest is scaffolding that checks, retries, throttles, validates, guards credentials, and decides when to cut the model off. That scaffolding is the product. Look at where the model actually sits in three of my flows:

Email triage is mostly Python. Fetching the mailbox, parsing, deduping, applying the unambiguous rules, logging, retrying, and the schedule that keeps it alive are all deterministic code. The LLM is invoked only for the ambiguous minority — and each such message in a fresh, isolated context, so cost and behavior stay bounded. The "AI inbox assistant" is a small reasoning call fenced inside a pipeline that is overwhelmingly plumbing.
In the purchase, the model proposes and the code disposes. The agent drives the browser, but it cannot move money. Which identity to check out as — personal or company — is decided deterministically, not by the model. Payment halts at a hard gate: an explicit human confirmation before a single transaction goes through. The autonomy is real right up to the line where a mistake would cost money; at that line, code takes the wheel.
The model never sees a password. Credentials live in a vault; a deterministic layer fetches and injects them at the moment of use, and the LLM operates blind to the actual values. The most security-sensitive step in the system is the one the "intelligence" is deliberately kept out of.
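The triage flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the Message fields, the two example rules, and the classify_with_llm callback are all assumptions made for the sketch. The point it shows is the routing: deterministic rules run first, and the model is called only when no rule applies, with one message per call and no shared history.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Message:
    # Hypothetical message shape; a real mailbox client returns more fields.
    msg_id: str
    sender: str
    subject: str
    body: str


def rule_based_label(msg: Message) -> Optional[str]:
    """Apply unambiguous rules; return None when no rule clearly applies."""
    sender = msg.sender.lower()
    subject = msg.subject.lower()
    # Both rules are illustrative assumptions, not rules from the article.
    if sender.endswith("@newsletter.example.com"):
        return "newsletter"
    if "invoice" in subject:
        return "finance"
    return None


def triage(
    messages: Iterable[Message],
    classify_with_llm: Callable[[Message], str],
    seen_ids: Optional[Set[str]] = None,
) -> Dict[str, str]:
    """Label each message once; call the model only for the ambiguous ones."""
    seen = set() if seen_ids is None else seen_ids
    labels: Dict[str, str] = {}
    for msg in messages:
        if msg.msg_id in seen:  # dedupe before any work is done
            continue
        seen.add(msg.msg_id)
        label = rule_based_label(msg)
        if label is None:
            # The callback receives a single message and nothing else,
            # so each model call starts from a fresh, isolated context.
            label = classify_with_llm(msg)
        labels[msg.msg_id] = label
    return labels
```

In this sketch the model call is a plain function argument, so the deterministic part can be tested with a stub and no network access.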

The pattern repeats: the model is a small, fenced reasoning component; the deterministic code around it wraps it, gates it, blinds it, and decides when to stop it. The model's intelligence is not the system's reliability — reliability is earned one operation at a time, in exactly the code no one puts in a demo.
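A rough sketch of the "gate it, blind it" part of that pattern, under stated assumptions: the CheckoutProposal shape, the COMPANY_STORES lookup, the {{secret:NAME}} placeholder syntax, and the confirm and charge callbacks are all hypothetical names invented for this illustration. The vault is modeled as a plain mapping; a real one would be an external secret store.

```python
import re
from dataclasses import dataclass
from typing import Callable, Mapping


class PaymentBlocked(Exception):
    """Raised when the human confirmation gate is not passed."""


@dataclass(frozen=True)
class CheckoutProposal:
    # What the model is allowed to produce: a proposal, never a payment.
    store: str
    total_cents: int


# Assumed configuration: which stores are checked out with the company identity.
COMPANY_STORES = frozenset({"office-supplies.example"})


def choose_identity(proposal: CheckoutProposal) -> str:
    """Pick the checkout identity from configuration, not from model output."""
    return "company" if proposal.store in COMPANY_STORES else "personal"


_PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_]+)\}\}")


def inject_secrets(template: str, vault: Mapping[str, str]) -> str:
    """Fill placeholders at the moment of use; the model only sees the template."""
    return _PLACEHOLDER.sub(lambda m: vault[m.group(1)], template)


def pay(
    proposal: CheckoutProposal,
    confirm: Callable[[CheckoutProposal, str], bool],
    charge: Callable[[str, int], str],
) -> str:
    """Charge only after an explicit human confirmation."""
    identity = choose_identity(proposal)
    if not confirm(proposal, identity):
        raise PaymentBlocked("human confirmation missing")
    return charge(identity, proposal.total_cents)
```

The design choice worth noting is that the model has no code path to charge: the only caller is pay, and pay refuses to proceed without the confirmation callback returning True.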

Every flow is its own project, and failure is metered

There is no "plug it in and it works." Email triage is one project, with its own rules and calibration. A purchase at one store is a second, unrelated project. The same purchase at a different store is work from scratch — different layout, different anti-bot defenses, different checkout. The agent does not "learn stores" the way a person does after one trip; each scenario is fresh engineering, refined by hand over many failed runs before it holds.

And cost is not linear. At every step the model is re-fed the entire accumulated history, so each retry is billed on top of all the ones before it. That is how a single cheap purchase reached thirteen dollars: the store's defenses forced attempt after attempt, and every attempt re-billed the whole transcript. It worked — the order went through — but the success is what cost thirteen dollars.
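The non-linearity can be made concrete with a small calculation. The sketch below assumes a simplified model in which every step adds the same number of tokens to the history and the full history is re-sent as input at every step; the token counts are illustrative, not measurements from the run described here.

```python
def billed_input_tokens(steps: int, tokens_per_step: int, system_tokens: int = 0) -> int:
    """Total input tokens billed when the whole history is re-sent at each step."""
    total = 0
    context = system_tokens
    for _ in range(steps):
        context += tokens_per_step  # history grows by one step's worth
        total += context            # and the whole history is billed again
    return total


print(billed_input_tokens(steps=10, tokens_per_step=1000))  # 55000
print(billed_input_tokens(steps=20, tokens_per_step=1000))  # 210000
```

Under these assumptions, doubling the number of steps roughly quadruples the billed input, because the total grows with the square of the step count rather than in proportion to it.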

That cost can be driven down — trimming re-sent history, narrowing the toolset, pinning one model provider. None of these is a switch; each is its own piece of deep engineering, stacked run after run. And even on the cheapest models, even fully optimized, the result lands right at the line where automating the job barely beats doing it by hand.
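As one illustration of the first lever, trimming re-sent history, here is a minimal sketch. It assumes the history is a list of dictionaries with a "role" key, a common convention in chat-style APIs, and that keeping the system message plus the most recent turns is acceptable. In practice, deciding what can safely be dropped is the hard part, which is why this is engineering rather than a switch.

```python
from typing import Dict, List


def trim_history(messages: List[Dict[str, str]], keep_last: int) -> List[Dict[str, str]]:
    """Keep all system messages plus the most recent keep_last other messages."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # rest[-0:] would return the whole list, so handle zero explicitly.
    recent = rest[-keep_last:] if keep_last else []
    return system + recent
```

With a fixed keep_last, the input re-sent per step stays bounded instead of growing with every retry, at the price of the model forgetting earlier attempts.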

"The agent will do your shopping for you" meant, in practice: it can spend thirteen dollars doing one small piece of your shopping.

Business takeaway: before counting what automation saves, count two things it costs — the engineering required to make the per-run price bearable at all, and the price of every attempt that fights back.

The tooling is early — across the field, not just here

This is not one homemade system's failing. The whole class (terminal-launched agents, autonomy frameworks) is early and unoptimized. A wave of agents with real graphical interfaces is arriving now, from research labs and enterprise vendors alike. That so many serious players are starting at once does not weaken the immaturity claim; it confirms it. The advice is unromantic: experiment, by all means, but do not build a money-critical process on it yet.

Where AI is genuinely excellent

Be honest to the end: this technology is superb where there is a hard truth to anchor to.

Programming. It is deterministic. The keyword class always means the same thing; syntax cannot be read thirty ways. It is no accident that an AI assistant led most of the diagnosis and fixes that same week, and did it well.
The "text calculator." Defined input, defined notion of the output: translate, summarize, classify, extract signal from mess. Transformation and extraction are its strength.

It is weak where it must create from scratch, or infer what you actually meant when you never fully said it. That frontier — between transforming what you were given and inventing what you did not say — is where the real front line of today's automation runs, and it deserves its own piece.
