I Built My Own AI Coding Agent. Then I Actually Used It.

June 10, 2026
9 min read
By Rahat Kabir

For the past couple of months I’ve been building astra-claw, a CLI coding agent in Python. Tool loop, streaming, persistent sessions, memory, an approval gate for file edits. The kind of project where every week you add a new capability and feel a little proud of yourself.

Then I did the thing most side project builders never do: I stopped adding features and started using it. Real tasks, real files, a friction log, and one rule: don’t fix anything mid-week, just write it down and move on.

Building an agent teaches you architecture. Using one teaches you what actually matters. Here are the four things the log taught me.

The setup

Astra-claw is a terminal agent: you type a request, it streams a response, and it can call tools to run shell commands, read and write files, search, and save memories. File edits go through a preview-and-approve gate where you see a diff and answer y/n/a. I pointed it at a sandbox folder and gave it normal, boring tasks. The kind a real user would give it.

[Image: astra-claw startup banner]

Story 1: It has no eyes, so it built some

I asked it to read a PNG, a scanned page I wanted text from. Astra-claw has no image support. My message pipeline assumes everything is a string; there’s no multimodal anything. The file reference got rejected as binary, and I expected that to be the end of it.

It wasn’t. The agent decided to build its own OCR pipeline. Here’s the whole thing: failures in red, the pivot, the win.

[Image: astra-claw improvising an OCR pipeline: failed heredoc and python -c attempts, then finding uv and chaining uv run --with pillow --with rapidocr until the image is read]

Watch the arc: a failed heredoc, a failed bare python -c, raw struct-level poking at PNG bytes. Then it runs where uv, and suddenly it’s installing pillow and rapidocr mid-conversation and reading the image. Thirteen shell calls later, it had extracted the content. Correctly. That’s a real summary of the infographic at the bottom.

A native vision model call would have done this in one step. My agent did it in thirteen, because I never gave it the one capability the task needed, and it routed around the gap instead of giving up. It found a way on its own.

I stared at this for a while. It’s the most impressive thing my agent has ever done, and it’s also a design smell: when your agent burns thirteen tool calls compensating for a missing feature, that emergent behavior is your roadmap telling you what to build next.

But the impressive part isn’t what stuck with me. What stuck with me is what happened during those thirteen calls.

Story 2: And nobody asked permission

My shell tool has a dangerous command approval system. Destructive things like deletes pause and ask me first. I built it, I tested it, I trusted it.

While building its OCR pipeline, the agent ran uv run --with pillow --with rapidocr .... The --with flag installs arbitrary packages from PyPI, and uv run then executes code with them, mid-conversation, on my machine. The approval gate never fired. Later in the week, during a CSV task, it did it again: uv pip install openpyxl, exit 0, no prompt.

> shell  uv pip install openpyxl  (exit 0 - Using Python 3.11.9 environment at: ...)

My approval system works by matching command strings. rm -rf looks dangerous. uv run --with <anything-on-pypi> looks like a normal command, but it’s an arbitrary code execution vector wearing a friendly name.

The lesson: string-matching on commands is the wrong abstraction for safety. What I actually care about is intent, like “this command installs software”. And intent doesn’t live in the string. It lives in what the command does. The narrow fix is adding package install patterns to the approval list, but the real answer is harder, and I don’t fully have it yet.
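The narrow fix can be sketched in a few lines. This is an illustration only, not astra-claw’s actual approval code: the function name installs_software and the rule list INSTALL_PREFIXES are hypothetical, and the list is deliberately short. It tokenizes the command instead of matching raw substrings, then checks for install-style prefixes and for uv run with a --with flag.

```python
import shlex

# Hypothetical rule list for illustration; not astra-claw's real approval list.
# Each entry is a token prefix that means "this command installs packages".
INSTALL_PREFIXES = [
    ("pip", "install"),
    ("pip3", "install"),
    ("uv", "pip", "install"),
    ("uv", "add"),
    ("npm", "install"),
]


def installs_software(command: str) -> bool:
    """Return True if the command looks like it installs packages."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: we cannot parse it, so err on the side of asking.
        return True
    if not tokens:
        return False
    for prefix in INSTALL_PREFIXES:
        if tuple(tokens[:len(prefix)]) == prefix:
            return True
    # "uv run --with <pkg>" installs <pkg> before running anything.
    if tokens[:2] == ["uv", "run"]:
        return any(t == "--with" or t.startswith("--with=") for t in tokens[2:])
    return False


print(installs_software("uv run --with pillow --with rapidocr ocr.py"))
print(installs_software("uv pip install openpyxl"))
print(installs_software("ls -la"))
```

This is still pattern matching, just on tokens, so it shares the core weakness: an install wrapped in sh -c, chained after &&, or done through a tool that is not on the list passes straight through. It closes the two gaps from the log and nothing more.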

Stories 1 and 2 are the same story, really. Missing capability → emergent workaround → safety gap. The agent being resourceful is exactly what pushed it past my guardrails.

Story 3: It remembers my name. It doesn’t remember its mistakes.

Astra-claw has a persistent memory tool, and parts of it work great. Early in the week I mentioned my name and that I prefer Python. The agent saved both to its user profile store without being asked. Next session, it knew. That’s the demo everyone shows.

Then came the real test. A few days after the OCR incident, a fresh session hit a similar image task, and the agent re-solved the entire problem from scratch. Same mistakes, in the same order: the same heredoc syntax error, the same bare python -c failure, before arriving at the same workaround. Thirteen hard-won calls of learning, gone.

It’s not that it can’t save lessons. When I explicitly told it “save what you learned so you don’t repeat the mistake,” it wrote a genuinely good memory entry, specific and actionable. And the next session actually used it: first action was checking the directory, straight to the right command, no fumbling. The storage works. The read path works.

Here’s what makes this interesting: I had already told it to be proactive. My memory tool’s schema literally says “do this proactively, don’t wait to be asked” and lists when to save. And it half works, but look at which half:

  • “My name is Rahat” → saved, unprompted ✓
  • Thirteen-call struggle ending in a non-obvious workaround → never saved ✗

The pattern: it remembers what I tell it, not what it learns. A stated fact sits right there in the conversation, matching the schema’s examples almost word for word. But a lesson earned through struggle exists only in the agent’s own trajectory, and nothing ever prompts it to look back at that trajectory. Declarative facts get saved; experiential lessons don’t. The missing reflex isn’t “use the memory tool.” It’s self-reflection.

Instructions in a tool description can’t fix that. The model reads them when deciding how to call a tool, not as a standing reminder to call it at all. The trigger has to live somewhere else. Options I’m chewing on:

  • Ask the model to self-assess at session end (“did you learn anything non-obvious?”). Simple, but it adds a call per session and risks saving trivia
  • Detect “struggle then success” patterns: repeated tool failures followed by success is a strong signal something was learned
  • Piggyback on context compaction: when the agent summarizes history anyway, that’s a natural moment to extract lessons

I’d start with struggle-then-success, because it targets exactly the case I hit. But I haven’t built it yet, and I suspect the first version will save the wrong things.
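A rough sketch of what a struggle-then-success detector could look like, assuming the session keeps a list of tool calls with their exit codes. The ToolCall record and the find_struggles function are hypothetical names for illustration, not part of astra-claw.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation in a session.
    tool: str
    command: str
    exit_code: int


def find_struggles(calls, min_failures=2):
    """Find runs of consecutive failures that end in a success.

    Returns a list of (failed_calls, winning_call) pairs.
    """
    struggles = []
    failed_run = []
    for call in calls:
        if call.exit_code != 0:
            failed_run.append(call)
            continue
        # A success ends the current run; keep it only if it was long enough.
        if len(failed_run) >= min_failures:
            struggles.append((failed_run, call))
        failed_run = []
    return struggles


session = [
    ToolCall("shell", "python -c '...'", 1),
    ToolCall("shell", "python - <<EOF ...", 2),
    ToolCall("shell", "uv run --with rapidocr ocr.py", 0),
    ToolCall("shell", "ls", 0),
]
for failed, winner in find_struggles(session):
    print(f"{len(failed)} failures before: {winner.command}")
```

Each pair could then be handed to the model with a prompt along the lines of “what would have saved you these failures?”. The strict adjacency is a known weakness of this sketch: any intermediate call that exits 0 resets the run, so a real version would probably need a sliding window or some notion of which calls belong to the same task.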

(One smaller find while digging through the memory files: the user store had a duplicate. The agent re-saved my name and language preference as one merged entry. No dedup on write. Memory that slowly fills with redundant copies of itself is its own kind of forgetting.)
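Dedup on write does not need to be clever to catch this case. The sketch below is a crude heuristic under stated assumptions: memories are plain strings in a list, and is_redundant, save_memory and the 0.8 threshold are all made up for illustration. It rejects a new entry when most of its words already appear somewhere in the store, which is what a merged copy of two existing entries looks like.

```python
import re


def _words(text):
    # Lowercased word set; punctuation and flag dashes are dropped.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def is_redundant(new_entry, existing, threshold=0.8):
    """True if most words of new_entry already appear in the store."""
    new_words = _words(new_entry)
    if not new_words:
        return True  # nothing worth saving
    known = set()
    for entry in existing:
        known |= _words(entry)
    return len(new_words & known) / len(new_words) >= threshold


def save_memory(store, new_entry):
    """Append new_entry unless it is redundant. Returns True if saved."""
    if is_redundant(new_entry, store):
        return False
    store.append(new_entry)
    return True


store = ["User's name is Rahat", "User prefers Python"]
print(save_memory(store, "User's name is Rahat and prefers Python"))
print(save_memory(store, "Use uv run --with rapidocr to OCR PNG files"))
print(len(store))
```

Word overlap will produce false positives once the store is large, since a genuinely new fact can be built entirely from words that are already in there. Comparing against individual entries, or asking the model to merge instead of append, are likely better long-term answers.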

Story 4: Death by a thousand approvals

I was proud of my preview-and-approve gate. Every file write shows a unified diff and waits for y/n/a. Safety, right?

Then I watched myself use it. The agent proposed a 109-line script and the preview cut off at fifty:

… 52 more lines
apply write? [y/n/a]

I approved it blind. Later, one feature arrived as three separate patch prompts in a row. By the third one I wasn’t reading anything. I was just pressing y.

And think about what the three options actually offer a fatigued user: y approves a diff you can’t fully see, n blocks work you probably want, and a (“always”) turns the gate off for the rest of the session. The escape hatch I built for approval fatigue is disabling the protection entirely. The gate’s own UX funnels you toward bypassing it.

A safety gate that users rubber-stamp is not a safety gate. The logic was sound; the UX defeated it. Truncated diffs mean blind approval, repeated prompts train the exact reflex the gate exists to prevent, and the convenience option is surrender. UX is part of the security model. I’d read that before, but it’s different when you watch yourself hammer y past your own guardrail.
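One way to remove the blind spot is to page the diff instead of truncating it, and to offer the approval prompt only after the last page. The sketch below uses Python’s difflib from the standard library; diff_pages, review, the page size and the prompt wording are hypothetical, not astra-claw’s gate.

```python
import difflib


def diff_pages(old_text, new_text, path, page_size=40):
    """Split the full unified diff into pages instead of truncating it."""
    lines = list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",
    ))
    return [lines[i:i + page_size] for i in range(0, len(lines), page_size)]


def review(pages, ask=input, show=print):
    """Show every page; the approval prompt appears only after the last one."""
    for number, page in enumerate(pages, start=1):
        show("\n".join(page))
        if number < len(pages):
            answer = ask(f"page {number}/{len(pages)}: [enter] next, n reject ")
            if answer.strip().lower() == "n":
                return False
    return ask("apply write? [y/n] ").strip().lower() == "y"


if __name__ == "__main__":
    new_script = "\n".join(f"print({i})" for i in range(109))
    approved = review(diff_pages("", new_script, "script.py"))
    print("approved" if approved else "rejected")
```

The ask and show parameters exist so the flow can be driven by a script in tests. The sketch drops the a (“always”) option on purpose. Whether that is the right trade is a judgment call, and paging adds keypresses of its own, so it helps with truncation but does nothing for the three-prompts-in-a-row problem.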

The scoreboard

What I tested                                    | Verdict
Core tool loop (shell, read, write, search)      | ✅ Solid all week
Self-recovery when a dependency was missing      | ✅ Impressive. Found uv, fixed itself
Remembering facts I told it (name, preferences)  | ✅ Saved unprompted
Remembering lessons it learned the hard way      | ❌ Never, unless ordered to
Approval gate on file edits                      | ⚠️ Fires reliably, but truncated diffs = blind approval
Approval gate on package installs                | ❌ Never fired
Reading images                                   | ❌ No support (hence the DIY OCR)

Two of these are cheap: package install approvals and full diff viewing are both small patches. The memory reflex is harder; that one gets a design doc before it gets code. And image support is the real project: it means reworking how messages flow through the whole system, so it lives on the roadmap. The point of the week wasn’t to fix everything. It was to find out which problems were real.

What this week actually taught me

The gap between a demo agent and a usable one is invisible until you live in it. Every flaw above survived weeks of development and testing, then fell out within a few days of honest use.

If you’re building an agent: stop adding features for a week and just use the thing. Keep a log. Don’t fix anything until the week ends. The log will be embarrassing, and it will be the most valuable design document you have.

The code is at github.com/Rahat-Kabir/astra-claw. The friction log is in the repo too, unedited.