huecki

Software, AI agents, messy notes and the occasional useful idea.

Currently on the bench

Currently building: Agent Buildprint

Agent Buildprint is my current main project: executable contracts for coding agents, built from phase-flow packets, evidence ledgers, review loops, and replay gates instead of just a prompt and a spec.

ACTIVE BUILD · phase-flow replay + evidence honesty

Agent Buildprint

Agents no longer start from a vague assignment. They bootstrap a selected-buildprint packet, read the phase-flow constitution, write schema-valid runtime evidence, and cannot sell blockers as success.

$ agb start
→ phase before code
→ evidence before trust
→ replay before done

PHASE-FLOW · EVIDENCE · REVIEWS · REPLAY

Open Buildprint registry →

AI Native Engineering

From prompt writer to AI system builder.

A self-paced learning path for developers who want to operate AI features, not just demo them. It covers context budgets, Task Contracts, decomposition, evals, and fallbacks.

01

Tokens & Attention

Context windows, position effects, and lost-in-the-middle as real architecture constraints.

02

Context Engineering

Task Contracts, schemas, and source boundaries instead of longer prompts.

03

Agentic Delivery

Evals, traces, tool gates, and incident playbooks for operable AI features.

68 slides · self-paced · interactive

Becoming LLM-Native

Open the full learning path with the interactive slide deck, context models, Task Contracts, and operable AI-feature patterns.

Open AI Native Engineering →

July 17, 2026 · AI Agent Workflows

Your Agent Harness Needs a Behavior Map

Harness Handbook points at a practical bottleneck in agent engineering: the behavior you want to change is scattered across prompts, state managers, tool calls, policy code, and tests. Build a behavior map before editing the harness.

Read article →

July 17, 2026 · AI Agent Workflows

Your AI Agent Is Not Reflecting. It Is Defending Its First Answer

Asking one agent to reconsider its answer often produces a more confident defense of the same mistake. A bounded challenger-and-judge loop can create real alternatives, but only if disagreement, stopping, and judge bias are engineered explicitly.

Read article →

July 17, 2026 · AI Agent Workflows

Your AI Agent Learned Something. Should It Be Allowed to Remember It?

An agent that writes a lesson into memory, a skill, a prompt, or its own code is deploying behavior into future runs. This guide shows how to put persistent changes through evidence, eval, approval, expiry, and rollback gates.

Read article →

July 15, 2026 · AI Agent Workflows

The Perfect Automated AI Eval Stack Does Not Exist

The reliable eval system is not one automated judge. It is a closed loop that combines portable traces, deterministic invariants, narrow semantic judges, versioned production failures, adversarial tests, and human calibration.

Read article →

July 13, 2026 · AI Agent Workflows

Your Agent Eval Is Too Short

A final pass/fail score hides the part of agent work that matters most: where the run started drifting, whether it noticed, and whether it recovered. The practical replacement is a trajectory eval with checkpoints, failure labels, and recovery metrics.

Read article →

July 10, 2026 · AI Agent Workflows

Stop Asking Which Coding Model Is Best

The useful question is shifting from "which model is best?" to "what can your agent harness change, measure, persist, and roll back?"

Read article →

July 7, 2026 · AI Agent Security

Your Coding Agent Can Be Tricked by Boring Shell Commands

The MOSAIC paper shifts the coding-agent security question from hostile prompts to command traces. The practical move is to audit producer-consumer state across shell commands before generated state crosses into privileged work.

Read article →

July 6, 2026 · AI Agent Workflows

Your Agent Needs an Operating Contract, Not a Bigger Prompt

The serious agent pattern is no longer bigger prompts and more encouragement. It is an operating contract: measurable goal, bounded tools, context sources, verifier evidence, review notes, rollback path, and a skill update when the run teaches you something.

Read article →

July 2, 2026 · AI Agent Workflows

Stop Prompting Your Coding Agent. Give It a Loop.

The useful upgrade from prompt engineering is not a longer instruction block. It is a reusable loop spec: trigger, goal, allowed tools, verifier, terminal states, and memory rules. That is how repeated coding-agent work becomes operational instead of conversational.
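
As a rough illustration of the loop spec named above, here is a minimal Python sketch. The field names mirror the teaser (trigger, goal, allowed tools, verifier, terminal states, memory rules), but `LoopSpec`, `run_loop`, the `"tool"` key, and the three terminal state names are assumptions made for this example, not the article's actual format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopSpec:
    """Hypothetical loop spec; every field name is an assumption."""
    trigger: str                      # what starts the loop, e.g. "ci_failed"
    goal: str                         # human-readable goal
    allowed_tools: frozenset          # tools the step function may use
    verifier: Callable[[dict], bool]  # returns True when the goal is proven
    max_iterations: int = 5           # budget that forces a terminal state
    memory_keys: frozenset = frozenset()  # memory rule: only these keys persist


def run_loop(spec, step, state):
    """Run step until a terminal state; return (terminal_state, memory)."""
    terminal = "budget_exhausted"
    for _ in range(spec.max_iterations):
        state = step(state)
        # A step that used a tool outside the spec ends the loop as blocked.
        if state.get("tool") not in spec.allowed_tools:
            terminal = "blocked"
            break
        # The verifier, not the agent's own prose, decides when work is done.
        if spec.verifier(state):
            terminal = "done"
            break
    # Memory rule: drop everything the spec does not explicitly allow.
    memory = {k: v for k, v in state.items() if k in spec.memory_keys}
    return terminal, memory
```

The loop always ends in exactly one of three terminal states, and scratch data never leaks into memory unless the spec lists its key.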

Read article →

June 26, 2026 · AI Agent Workflows

Better AI Products Need Systems, Not One Agent

Better AI products come from improvement systems around the agent. This guide shows how to build one with deterministic checks, narrow scoring rubrics, private holdouts, calibrated judges, and promotion gates.

Read article →

June 24, 2026 · AI Agent Security

Audit Local LLM Agents Like Runtimes

Local LLM agents can touch shells, files, browsers, credentials, memory, and messaging tools. Treat their runtime layer as source code worth auditing, then turn static findings into a manual review queue instead of automatic verdicts.

Read article →

June 22, 2026 · AI Agent Workflows

Agent Protocols Are Becoming a Stack, Not a Winner-Takes-All Standard

The useful question is not whether MCP, A2A, ACP, agents.json, Agora, ANP, LMOS, or AGNTCY wins. The useful question is which communication boundary you are designing: discovery, tool execution, task delegation, identity, transport, or runtime negotiation.

Read article →

June 17, 2026 · AI Agent Workflows

Your Agent Memory Test Is Probably Measuring the Wrong Thing

Most memory evals ask whether the agent got the final answer right. MemTrace suggests a sharper unit: one durable user fact tested across age, current state, earlier state, trajectory, and contradictory evidence. That turns memory from a vague feature into a small regression suite.

Read article →

June 10, 2026 · AI-first Engineering

Your Agent's Harness Is a Binary Now

Two 2026 papers from the same research lineage quietly retire prompt engineering as a discipline. The agent's system prompt is now a binary you can version, diff, and evolve with a 200-line loop. The four metrics that actually matter are not the ones your dashboard shows.

Read article →

June 8, 2026 · AI-first Engineering

AGENTS.md Is Not Context. It Is a Control Surface.

The surprising lesson from AGENTS.md benchmarks is not that context files are useless. It is that they change agent behavior, sometimes into more expensive and less useful work. Treat them as a control surface, not a repo manual.

Read article →

June 3, 2026 · AI Agent Workflows

The Next Prompt Is Not a Prompt. It’s a Workflow.

Dynamic workflows move agent work from one chat prompt into inspectable orchestration: phases, subagents, evidence, budget, permissions, adversarial review, and stop conditions. The point is not more agents. The point is better control.

Read article →

May 30, 2026 · AI-first Engineering

Put an AI Slop Gate After Tests and Lint

Tests tell you whether behavior still works. Linters tell you whether code is syntactically and stylistically acceptable. An AI-slop gate catches the residue coding agents leave behind: fake comments, swallowed errors, any-casts, duplicated helpers, TODO stubs, and dead code.

Read article →

May 30, 2026 · AI-first Engineering

Debug AI Reward Functions Like Production Incidents

Bad reward functions should not be treated like prompt drafts. Treat them like production incidents: preserve traces, classify the failure, patch only the implicated logic, and rerun against the same controls.

Read article →

May 28, 2026 · AI-first Engineering

Your AI-Built UI Needs a Playtester, Not a Screenshot Review

AI-generated interfaces often look finished before they behave correctly. A GUI playtester loop uses a separate browser agent to interact with the artifact, record screenshots and action logs, turn broken flows into reproducible bug reports, and rerun the same script after repairs.

Read article →

May 28, 2026 · AI-first Engineering

Stop Judging AI Code by the Diff

Better AI coding is not mainly about better prompts. It is about the harness around the model: explicit contracts, separate builder and reviewer roles, evidence requirements, and a loop that turns failures into better specifications.

Read article →

May 26, 2026 · AI Agent Workflows

Agents Don’t Need ‘Keep Going’. They Need Exit Conditions.

The useful lesson behind Claude Code /goal is not that agents can run forever. It is that long-running agent work needs an explicit, observable exit condition: what proves done, what stays in scope, and when to stop and report blocked.

Read article →

May 26, 2026 · AI Agent Workflows

Don’t Benchmark the Model. Benchmark the Agent System.

Agent evals should not only ask whether the final answer looked good. A useful benchmark measures the whole agent system: skill routing, tool policy, evidence, outcomes, hard-fail safety cases, regressions, cost, and production drift.

Read article →

May 25, 2026 · AI Agent Workflows

Give Your Agent Seatbelts, Not a Longer Prompt

When an agent keeps jumping from planning to editing to testing at the wrong time, the fix is not usually another paragraph of system prompt. Put the workflow into explicit states, give each state a tiny tool policy, and make phase changes visible.
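
A minimal Python sketch of that idea, under stated assumptions: the three phases come from the teaser, while the tool names, the transition table, and the `Workflow` class are hypothetical and only meant to show the shape of a per-state tool policy.

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    EDIT = "edit"
    TEST = "test"


# Hypothetical tool names: each phase gets a deliberately tiny policy.
TOOL_POLICY = {
    Phase.PLAN: {"read_file", "search"},
    Phase.EDIT: {"read_file", "write_file"},
    Phase.TEST: {"read_file", "run_tests"},
}

# Assumed legal phase changes; anything else is rejected.
TRANSITIONS = {
    Phase.PLAN: {Phase.EDIT},
    Phase.EDIT: {Phase.TEST},
    Phase.TEST: {Phase.EDIT, Phase.PLAN},
}


class Workflow:
    def __init__(self):
        self.phase = Phase.PLAN
        self.log = []  # visible record of every phase change

    def allows(self, tool):
        """Check a tool call against the current phase's policy."""
        return tool in TOOL_POLICY[self.phase]

    def advance(self, target):
        """Move to target phase, or raise if the jump is not allowed."""
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"illegal transition {self.phase.value} -> {target.value}"
            )
        self.log.append((self.phase.value, target.value))
        self.phase = target
```

In this sketch an agent in the planning phase cannot write files at all, and every phase change lands in `log`, so a reviewer can see when the run moved from planning to editing.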

Read article →

May 24, 2026 · AI-first Engineering

Agent harnesses should be specs, not hidden glue code

Natural-Language Agent Harnesses give a useful name to an important shift: the agent policy should be an inspectable document that a runtime executes, not invisible glue hidden inside controller code.

Read article →

May 23, 2026 · AI-first Engineering

Spec-Driven Context Resets for Coding Agents

Long agent chats rot. A better pattern is to move decisions into small spec files, clear context between layers, and let each coding-agent session read only the artifact it needs.

Read article →

May 21, 2026 · AI Agent Workflows

AI Agents Need Evidence Before They Click

When an agent clicks, sends, pays, deletes, or extracts data, the critical truth cannot live only in model prose. Put a small evidence gate before risky tool calls: predicate, evidence type, source, decision.
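
The four fields named above can be sketched as a small gate in Python. This is an illustration under assumptions: `Evidence`, `gate`, the `holds` flag, and the example predicate and source strings are hypothetical, and the allow/block rule is one possible policy, not the article's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    predicate: str      # the claim that must hold, e.g. "recipient_confirmed"
    evidence_type: str  # kind of proof, e.g. "api_response" or "dom_snapshot"
    source: str         # where the proof came from; empty means unsourced
    holds: bool         # whether the check behind the predicate passed


# Risky verbs taken from the teaser text.
RISKY_ACTIONS = {"click", "send", "pay", "delete", "extract"}


def gate(action, required_predicate, evidence):
    """Return the decision, "allow" or "block", for one tool call."""
    if action not in RISKY_ACTIONS:
        return "allow"
    for item in evidence:
        # Evidence counts only if it matches the predicate, names a source,
        # and the underlying check actually passed.
        if item.predicate == required_predicate and item.source and item.holds:
            return "allow"
    return "block"
```

Under this policy a risky call with no sourced, passing evidence is blocked by default, so the model's own prose can never be the only proof.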

Read article →

May 21, 2026 · AI Agent Workflows

Stop Asking AI to Critically Self-Check

Open-ended instructions like “critically self-check this” accidentally reward the model for producing criticism. The fix is not less review. It is calibrated review: explicit criteria, PASS_NO_CHANGE, evidence per finding, severity thresholds, and a tiny change budget.
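
One way to picture calibrated review is as a filter over reviewer findings. The sketch below is an assumption-laden example: `PASS_NO_CHANGE` comes from the teaser, while `Finding`, `review_verdict`, the severity scale, and the `CHANGES_REQUESTED` label are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Assumed three-level severity scale.
SEVERITY = {"low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class Finding:
    criterion: str  # which explicit review criterion was violated
    severity: str   # "low", "medium", or "high"
    evidence: str   # concrete pointer, e.g. file and behavior; may be empty


def review_verdict(findings, min_severity="medium", change_budget=2):
    """Return (verdict, findings_to_act_on) for one review pass."""
    threshold = SEVERITY[min_severity]
    # Findings without evidence or below the threshold are ignored.
    actionable = [
        f for f in findings
        if f.evidence.strip() and SEVERITY[f.severity] >= threshold
    ]
    if not actionable:
        # Approving without changes is a valid, first-class outcome.
        return "PASS_NO_CHANGE", []
    # Most severe first, then cut to the change budget.
    actionable.sort(key=lambda f: SEVERITY[f.severity], reverse=True)
    return "CHANGES_REQUESTED", actionable[:change_budget]
```

Because `PASS_NO_CHANGE` is an explicit return value, the reviewer is not rewarded for inventing criticism, and the budget caps how much a single pass may change.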

Read article →

May 20, 2026 · AI-first Engineering

Agents Don’t Need Longer Prompts. They Need Harnesses.

The arXiv survey Code as Agent Harness names the next shift in agent engineering: code is not only what agents generate. It is becoming the executable, inspectable, stateful runtime that makes agents reliable.

Read article →

May 20, 2026 · AI-first Engineering

Your Onboarding Is Why Your Team Is Vibe Coding

Teams do not usually start vibe coding because developers became careless. They start because onboarding is broken: docs are stale, harnesses are undocumented, system knowledge lives in people’s heads, and AI turns missing context into plausible code and Markdown.

Read article →

May 19, 2026 · AI-first Engineering

AGENTS.md is not enough: your coding agent needs a harness

A coding agent is not made reliable by one magic prompt. It needs a harness: AGENTS.md, skills, tool permissions, hooks, and evals that catch behavior drift.

Read article →

May 19, 2026 · AI Agent Workflows

give every client project a tiny agent

The useful move is not one mega assistant for all client work. Give each client project a small, isolated agent with its own memory, tasks, preview URL habit, and boring daily standup.

Read article →

May 18, 2026 · AI-first Engineering

Prompt Decomposition: How to Break Down AI Tasks Properly

After context engineering comes decomposition: developers should stop putting everything into one prompt and instead split tasks into direct prompts, subtasks, pipelines, agent loops, or skills.

Read article →

May 15, 2026 · AI-first Engineering

The LLM-native developer needs more than prompts

The next developer skill is not writing clever prompts. It is building the operating system around LLMs: data quality, model versioning, evals, guardrails, incident response, review UX, and repo instructions agents can actually follow.

Read article →

May 15, 2026 · Personal AI Workflows

Voice notes are the best interface for small agent jobs

Voice is not good for everything. But for small agent jobs it is brutally useful: dictate a task while moving, transcribe it locally, let your existing agent handle it, and get only a short answer back.

Read article →

May 12, 2026 · AI-first Engineering

Prompting Is Dead. Context Wins.

In 2026, good prompting is not about one magic sentence. The better approach is to curate context, define tools and schemas, set agent rules, and verify behavior with evals.

Read article →

May 11, 2026 · AI Agent Workflows

Hermes Agent: Self-Review Instead of One-Shot Output

Hermes gets interesting when an agent does not only produce output, but reviews the run: execute, measure, critique, rewrite the skill, and test again. The loop pays off mainly for repeatable workflows.

Read article →

April 29, 2026 · AI-first Engineering

AI-first Architecture: Faster Decisions, Still in Control

AI-first architecture does not mean the model decides. It means AI generates options, finds risks, compresses context, and the team makes a traceable decision.

Read article →