
Three of our engineers spent a day at the first ITCORNER Agentic Engineering Hackathon. We won — and the trophy was the least interesting thing that happened. The brief was deliberately strange: the app you ship doesn't matter. The process does. How do you design a software development lifecycle where autonomous agents plan the work, pick it up, implement it, test it, review it, and self-correct — and where do humans stay in the loop versus get out of the way? The app we built was a throwaway: a doctor–patient booking system. What we actually built, and what we want to write about honestly, was the machine that built it.
The story we're not telling
The industry narrative right now is "the agent writes the code, the engineer goes for coffee." That isn't what happened, and the parts where we tried to make it true are exactly the parts that broke. What we built was closer to an autonomous factory line with a lot of guardrails — and the interesting engineering was almost entirely in the guardrails, not the agent. So this is a technical write-up of the architecture, the economics, and the failure modes. We did not figure this out. Almost nobody in the room had. That's why it was worth a day.
The pipeline: from docs to tickets to PRs, with no human typing
The system has two halves. The first is work generation: we feed the agent the project's specification documents (the PRD, an architecture-rules file, and a domain glossary) and it generates a backlog of GitHub issues. Each issue is a vertical slice (a tracer bullet): database → API → contract → UI → tests, all together, never a horizontal "do all the database work first" layer. Each issue carries acceptance criteria and a priority label.
The second half is the Agent Loop, the part that does the building. It's a stateful bash orchestrator (afk.sh) wrapped around a single long instruction file (prompt.md), invoking Claude Code headless:
claude --print --permission-mode acceptEdits \
  --output-format stream-json "$context\n$prompt"
--permission-mode acceptEdits is the line that matters: the agent accepts its own file changes with no human prompt. Each iteration is one full ticket, start to finish. The loop reads a state file (RUNNING / PAUSED / STOPPED) between iterations so a human can pause or stop it gracefully, and it emits a structured log line at every step boundary:
[agent] iteration=1 issue=#42 step=red commit=f91e660
[agent] iteration=1 issue=#42 step=verify result=pass
[agent] iteration=1 issue=#42 step=review approved=true
[agent] iteration=1 issue=#42 step=green commit=8027244
[agent] iteration=1 issue=#42 step=pr url=.../pull/123
GitHub is the queue. The loop picks the highest-priority open issue labeled afk, labels it in-progress so no other agent grabs it, and on completion opens a PR (Closes #42), flips the issue to qa-ready, and comments a summary back onto the issue. The whole audit trail (plan, red/green cycles, review verdict, PR) lives in the issue's comment thread. When we hard-killed a run, a cleanup script stripped in-progress labels off every open issue so nothing deadlocked.
The agent didn't just write code. It managed its own work board.
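To make the control flow concrete, here is a minimal sketch of a loop gated by a state file, with one structured log line per step. It is not the afk.sh from the hackathon: STATE_FILE, PAUSE_POLL_SECONDS, read_state, log_step, run_iteration and run_loop are hypothetical names, the issue numbers are passed in as arguments instead of being fetched from GitHub, and run_iteration is a placeholder where the real orchestrator would invoke the agent.

```shell
#!/usr/bin/env bash
# Sketch only: a state-file-gated loop with structured step logs.
# All names below are hypothetical, not taken from the original afk.sh.

STATE_FILE="${STATE_FILE:-./agent.state}"

# Print the current state; a missing or empty file counts as STOPPED.
read_state() {
  local state=""
  if [ -r "$STATE_FILE" ]; then
    read -r state < "$STATE_FILE" || true
  fi
  printf '%s\n' "${state:-STOPPED}"
}

# Emit one log line: iteration, issue, step, then optional key=value pairs.
log_step() {
  local iteration="$1" issue="$2" step="$3"
  shift 3
  printf '[agent] iteration=%s issue=#%s step=%s' "$iteration" "$issue" "$step"
  if [ "$#" -gt 0 ]; then printf ' %s' "$@"; fi
  printf '\n'
}

# Placeholder for one full ticket; a real loop would call the agent here.
run_iteration() {
  local iteration="$1" issue="$2"
  log_step "$iteration" "$issue" red
  log_step "$iteration" "$issue" verify result=pass
}

# Process the given issue numbers in order. The state file is consulted
# only between tickets, so a pause or stop never interrupts one mid-flight.
run_loop() {
  local i=1
  while [ "$#" -gt 0 ]; do
    case "$(read_state)" in
      RUNNING) run_iteration "$i" "$1"; shift; i=$((i + 1)) ;;
      PAUSED)  sleep "${PAUSE_POLL_SECONDS:-5}" ;;
      *)       return 0 ;;
    esac
  done
}
```

With the state file containing RUNNING, calling run_loop 42 43 would process both tickets in order; writing STOPPED into the file lets the current ticket finish and then ends the loop.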
Red → green → refactor, enforced by a hook, not by hope
The discipline that made this produce mergeable code rather than plausible-looking sludge was strict TDD, and we did not trust the model to follow it voluntarily. A Claude Code PostToolUse hook (tdd-pattern-check.sh) runs after every git commit and inspects the message: any feat(green): or fix(green): commit must be immediately preceded by a test(red): commit on the same branch. If it isn't, the hook exits with code 2, which feeds an error back into the agent's context and forces it to revert and do it properly.
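The pairing rule itself is small enough to sketch. The following is an illustration of the check described above, not the actual tdd-pattern-check.sh: check_tdd_pair and check_head are hypothetical names, and how the hook is wired into Claude Code is left out.

```shell
#!/usr/bin/env bash
# Sketch only: the red-before-green rule as a pure function plus a thin
# git wrapper. Function names are hypothetical.

# Args: subject of the newest commit, subject of the commit right before it.
# Returns 0 if the pair is acceptable, 2 if a green commit lacks a red parent.
check_tdd_pair() {
  local newest="$1" previous="${2:-}"
  case "$newest" in
    "feat(green):"*|"fix(green):"*)
      case "$previous" in
        "test(red):"*) return 0 ;;
        *)
          echo "TDD violation: '$newest' is not preceded by a test(red): commit" >&2
          return 2
          ;;
      esac
      ;;
    *) return 0 ;;
  esac
}

# Apply the rule to the last two commits on the current branch.
check_head() {
  local newest previous
  newest="$(git log -1 --format=%s)"
  previous="$(git log -1 --skip=1 --format=%s)"
  check_tdd_pair "$newest" "$previous"
}
```

A hook script built on this could end with check_head || exit 2, so the message written to stderr is what reaches the agent. Keeping the rule in a function that takes two strings also makes it testable without a repository.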
So every feature lands as a pair: a commit with a failing test, then a commit that makes it pass. The git history reads like a metronome:
test(red): JWT verifyJwt must reject tokens with missing exp or junk role
feat(green): validate JWT payload — reject missing exp and non-role values
test(red): Playwright e2e — login→dashboard→logout for doctor and patient
feat(green): web — login page, patient/doctor dashboards, auth token storage
Before each green commit, the agent also invokes a fast reviewer subagent (reviewer-light) on the staged diff. It returns JSON, not prose, and it blocks on a fixed checklist: no any, no leftover console.log, no unhandled async, no fake tests (expect(true).toBe(true)), no mocking the thing under test, no crossing the architecture's layer boundaries, no domain terms that drift from the glossary. A second hook auto-runs Prettier and ESLint --fix after every edit, so the agent never spends tokens on formatting.
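A few items on that checklist are mechanical enough that a plain text scan can catch them before a model is involved. The sketch below is that kind of pre-filter, not reviewer-light itself, which is a model-backed subagent: scan_diff, the rule names, and the JSON shape are all assumptions made for illustration, and the patterns are deliberately crude (they will also match inside strings and comments).

```shell
#!/usr/bin/env bash
# Sketch only: grep-based pre-filter for three mechanical checklist items.
# scan_diff and its output format are hypothetical.

# Reads a unified diff on stdin, inspects added lines only, and prints a
# JSON verdict. Returns 0 when nothing matched, 1 otherwise.
scan_diff() {
  local added findings=""
  # Keep lines starting with "+" but drop the "+++ b/file" headers.
  added="$(grep -E '^\+' | grep -Ev '^\+\+\+ ' || true)"

  if grep -Eq 'console\.log\(' <<<"$added"; then
    findings="$findings\"console.log\","
  fi
  # Matches ": any" and "as any" when "any" is a whole word.
  if grep -Eq '(:[[:space:]]*|as[[:space:]]+)any([^[:alnum:]_]|$)' <<<"$added"; then
    findings="$findings\"any\","
  fi
  if grep -Fq 'expect(true).toBe(true)' <<<"$added"; then
    findings="$findings\"fake-test\","
  fi

  if [ -z "$findings" ]; then
    printf '{"approved":true,"findings":[]}\n'
    return 0
  fi
  # Strip the trailing comma before closing the array.
  printf '{"approved":false,"findings":[%s]}\n' "${findings%,}"
  return 1
}
```

Something like git diff --cached | scan_diff could run ahead of the reviewer. The judgment-heavy items (layer boundaries, glossary drift, mocking the thing under test) still need a reviewer that understands the code.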
The stack itself was conventional on purpose — a pnpm monorepo with Hono + Drizzle on the API, React + TanStack on the web, and a shared ts-rest + Zod contracts package as the typed source of truth between them, so the agent couldn't drift the front end and back end out of sync. Integration tests ran against a real Postgres, not a mock.
What broke, written down because pretending wouldn't help
- The tasks ran themselves, and that was the problem. The Agent Loop's autonomy was the demo highlight and the operational headache. Once it was running, our control over what it picked next and how it interpreted a loosely-worded ticket dropped sharply. Autonomy and steerability traded off against each other in real time.
- Low control, by construction. acceptEdits plus a GitHub-driven queue means the human isn't in the approval path during a run. Great for throughput, uncomfortable when the agent confidently builds the wrong thing for twenty minutes.
- It reached for e2e tests where unit tests belonged. The TDD hook enforced that a test existed before the code; it couldn't enforce that the test was at the right altitude. The agent often satisfied "write a failing test" with a heavy Playwright e2e flow when a small unit test would have been faster, cheaper, and more precise. The pattern was followed; the judgment behind it wasn't.
- PR conflicts. Running tickets in parallel (we experimented with git worktree-isolated burst agents) produced merge conflicts between PRs. Vertical slices are less prone to this than horizontal work, but "independent" issues turned out to touch shared scaffolding more than our slicing assumed.
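The last failure mode can at least be detected early. As a hedged sketch, and not something we ran on the day, two branches can be compared by the files each one touches before both are allowed to proceed in parallel. overlapping_paths and branches_overlap are hypothetical helpers, and the base branch name main is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: list files changed by both of two branches.
# Helper names and the "main" base branch are assumptions.

# Args: two newline-separated path lists. Prints the paths present in both.
overlapping_paths() {
  comm -12 <(printf '%s\n' "$1" | sort -u) <(printf '%s\n' "$2" | sort -u) |
    sed '/^$/d'
}

# Compare what two branches changed since they diverged from main.
branches_overlap() {
  local a b
  a="$(git diff --name-only "main...$1")"
  b="$(git diff --name-only "main...$2")"
  overlapping_paths "$a" "$b"
}
```

A non-empty result does not guarantee a conflict, and an empty one does not rule out a semantic clash, but shared scaffolding files showing up in the output is a signal that two "independent" tickets should be serialized.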
What worked, and what we'd keep
The artifacts the system produced were genuinely good: a clean breakdown into tickets, one PR per issue, priorities respected, the full progress of each issue narrated in its own comment thread, and complete AI logs for every iteration. As a process scaffold — TDD enforced by hooks, architecture enforced by a reviewing subagent, work queued and audited through GitHub — it held together. The machinery was sound. The thing it kept revealing is that the hard part of agentic engineering isn't the model. It's the orchestration, the feedback loops, and the boundaries you draw around the agent.
Using Claude Code or Cursor every day doesn't mean you're doing agentic engineering. It's a different discipline — closer to designing a CI system that happens to write code than to pair programming with an AI.
The most valuable hours of the day were the ones spent comparing failure modes with other teams instead of competing with them. Everyone hit the same wall in a different place. Nobody in that room had this fully figured out, which is exactly why we need more hands-on, technical events like it, while the way software gets built keeps shifting under our feet.
Thanks to our team — Piotr Nowak, Michał Gąsiorek, Piotr Baranek — for taking the risk of an open-ended experiment and learning out loud, to ITCORNER for organizing, and to every other team that shared the parts of their process that didn't work.


