$ startcut cat field-notes/01.md

How an AI-native engineering team actually ships software.

A practical, end-to-end workflow — from the first ambiguous problem to a service humming in production and the docs that keep it there. Ten stages, gated by adversarial review, with clear seams between what humans own, what models own, and where the two meet. No vibes, no hype — just the shape of the working day on a team that treats AI as a first-class collaborator.

01/Principles

Four commitments that hold the workflow together.

An AI-native team isn't a team that uses AI — it's a team whose process assumes AI from the first sketch. These four commitments are the load-bearing pieces; everything downstream falls out of them.

P / 01

Context is the product.

Every team has the same models. The edge is the corpus of briefs, plans, prior reviews, ADRs, and conventions you feed in. Curate context the way you curate a codebase — it compounds the same way.

P / 02

The plan is the contract.

Every unit of work starts with an explicit, model-actionable plan: file changes, test surfaces, checklist, rejected alternatives. Catching a mistake in plan review costs an hour; catching the same mistake after implementation costs a day.

P / 03

Humans own the seams.

Models are excellent at the middle of a problem and weak at its edges — naming what to ship, picking which evidence is good enough, owning the failure. Keep humans at every boundary; let models do the middle.

P / 04

Adversarial gates beat friendly review.

Trust falls out of cross-checking, not goodwill. A second, structurally different reviewer — different model family, different rubric, different prompt — gates every transition. A friendly reviewer rubber-stamps; an adversarial one finds the cracks.

02/The pipeline

Ten stages, one rhythm.

The shape below is the working day on an AI-native team. Each transition between stages runs through an adversarial gate (see §04). The bar under each card shows roughly how much of the work the model carries — the human still drives, but the centre of effort shifts as you move through the pipeline.

Stages run from ambiguous (01) to operational (10):

  • 01 Brief: problem · scope · criteria
  • 02 Research: codebase agent · telemetry
  • 03 Plan: file-change table · checklist
  • 04 Plan review: cross-family reviewer
  • 05 Implementation: coding agents · IDE
  • 06 Verification: deploy · tests · evidence
  • 07 Code review: bot pass · human pass
  • 08 Bot loop: trigger · fix · re-trigger
  • 09 Documentation: living docs · drift checks
  • 10 Retrospective: per-ticket · audit log

Legend: AI-leveraged effort · Human-led effort · Width of accent bar ≈ share of work the model carries
The bottleneck isn't typing. It's deciding what's worth typing.
the through-line of every stage below
03/Stage by stage

The day in ten acts.

Each stage names the human owner, the agent's role, the artefacts that flow out the back, and the failure modes worth watching for. Read it in order the first time; skim it as a checklist after.

Stage 01
Brief.
problem before solution

Every unit of work starts with a written problem definition — not a feature description. Customer, pain, scope, success criteria. The model is a Socratic partner here, not an author.

Engineer / PM owns
  • Naming the customer and their actual pain in one sentence.
  • Drawing the scope line — what's in, what's explicitly out.
  • Writing success criteria a test could actually check.
Model assists with
  • Drafting a problem-statement scaffold from a ticket or chat.
  • Asking the ambiguity questions out loud until they have to be answered.
  • Pulling related prior briefs, postmortems, and ADRs into context.

Inputs

  • Ticket · chat thread · customer signal
  • Indexed corpus of prior briefs & decisions

Outputs

  • A brief in canonical structure: problem · outcome · scope · criteria
  • Signed by a named owner before the next stage starts
Antipattern: Skipping the brief because "everyone knows what we're building." If the next agent doesn't, you'll find out at PR review, when it's expensive.
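The canonical structure named above (problem · outcome · scope · criteria, signed by an owner) can be made checkable. What follows is a minimal sketch, not a prescribed schema: the `Brief` class, its field names, and the `blockers` method are all hypothetical, and the example values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical shape for a brief: problem, outcome, scope, criteria."""
    problem: str
    outcome: str
    in_scope: list[str]
    out_of_scope: list[str]
    criteria: list[str]
    owner: str = ""  # stays empty until a named owner signs

    def blockers(self) -> list[str]:
        """Return the reasons this brief is not ready for the next stage."""
        issues = []
        if not self.problem.strip():
            issues.append("problem statement is empty")
        if not self.outcome.strip():
            issues.append("outcome is empty")
        if not self.in_scope:
            issues.append("nothing is listed as in scope")
        if not self.out_of_scope:
            issues.append("nothing is explicitly out of scope")
        if not self.criteria:
            issues.append("no success criteria")
        if not self.owner.strip():
            issues.append("not signed by a named owner")
        return issues


# Invented example: complete in content, but not yet signed.
draft = Brief(
    problem="Support agents cannot see why a refund was rejected.",
    outcome="Rejection reason is visible on the refund detail page.",
    in_scope=["refund detail page"],
    out_of_scope=["changing the rejection rules"],
    criteria=["rejected refunds show a reason code and message"],
)
print(draft.blockers())
```

An empty list from `blockers()` is the signal that the next stage may start; anything else is the list of questions still to answer.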
Stage 02
Research.
read the code with intent

Find out what is already true — about the codebase, the data, the customer — before committing to what should be. AI compresses days of "where is this used" archaeology into an afternoon you can actually act on.

Engineer owns
  • Choosing which evidence is strong enough to bet on.
  • Tracing the few code paths that actually matter.
  • Talking to the people the model can't.
Model assists with
  • Mapping call graphs, owners, and hotspots in the relevant slice of code.
  • Surfacing prior tickets, ADRs, and retros that touched the same area.
  • Drafting a "what's known / what's not yet" inventory linked to source.

Inputs

  • Repo · telemetry · incident history
  • Interview notes & customer recordings

Outputs

  • A research note the plan can lean on without re-deriving
  • A short risk list — things that must be designed around
Antipattern: Treating a model summary as primary evidence. Synthesis is downstream of facts; link the note back to the code and conversations it came from.
Stage 03
Plan.
highest-leverage artefact

The implementation plan is the contract between a human's intent and every downstream agent. Files that will change, failure modes, test surfaces, an explicit checklist. This is the single highest-leverage artefact in the workflow.

Tech lead owns
  • The architectural calls: what to add, what to leave, what to refactor under.
  • The rejected alternatives — kept in the doc, not deleted.
  • The "small enough to land in one PR" decision.
Model assists with
  • Drafting the plan scaffold from the brief plus the research note.
  • Generating the file-change table, architecture diagram, and test outline.
  • Reviewing for ambiguity: "what would I need to ask to act on this?"

Inputs

  • Brief · research note · existing system contracts
  • Repo conventions & recent plans on adjacent code

Outputs

  • A plan a teammate model could execute without follow-up
  • One ADR per significant choice, with rejected options
Antipattern: A flat backlog. Without a checklist a model can pull from, the agent will improvise and the diff will surprise the reviewer.
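One way to test whether a plan is "model-actionable" is to hold it in a structure and list what an executing agent would otherwise have to guess. The sketch below assumes a plan made of a file-change table, a checklist, and rejected alternatives, as described above; the class names, fields, and the `ambiguity_report` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileChange:
    path: str
    change: str        # e.g. "add", "modify", "delete"
    test_surface: str  # where the change is exercised; "" if none named yet


@dataclass
class Plan:
    title: str
    file_changes: list[FileChange]
    checklist: list[str]
    rejected_alternatives: list[str]


def ambiguity_report(plan: Plan) -> list[str]:
    """List the gaps an executing agent would have to improvise around."""
    gaps = []
    if not plan.checklist:
        gaps.append("no checklist to pull from")
    for fc in plan.file_changes:
        if not fc.test_surface:
            gaps.append(f"{fc.path}: no test surface named")
    if not plan.rejected_alternatives:
        gaps.append("no rejected alternatives recorded")
    return gaps


# Invented example plan; the paths are placeholders.
plan = Plan(
    title="Show refund rejection reason",
    file_changes=[
        FileChange("svc/refunds/api.py", "modify", "tests/test_refunds_api.py"),
        FileChange("web/refund_detail.tsx", "modify", ""),
    ],
    checklist=["expose reason in API", "render reason on detail page"],
    rejected_alternatives=[],
)
for gap in ambiguity_report(plan):
    print(gap)
```

A non-empty report is a cheap prompt for the tech lead before plan review, not a substitute for it.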
Stage 04
Plan review.
cheapest place to argue

A bad plan caught at review costs an hour. The same bad plan caught after implementation costs a day. This is the cheapest place to argue — and the first stage that runs through an adversarial gate.

Reviewer owns
  • Sign-off on scope, risk, and the "is this the right small slice" question.
  • Pushing back when the plan optimises for keystrokes over readability.
  • Deciding what to skip, defer, or split.
Adversarial reviewer does
  • A second, structurally different model attacks the plan: missing edges, weak coverage, broken assumptions.
  • Loops with the planner until verdicts converge — or escalates.
  • Files the disagreement record as part of the plan's history.

Inputs

  • The plan · the same context the planner used
  • Repo conventions & prior review patterns

Outputs

  • An approved plan with residue called out (open questions, known risks)
  • A gate-log entry — verdict, retries, findings
Antipattern: Same model reviewing its own plan. A friendly reviewer rubber-stamps; an adversarial one (different family, different rubric) finds the cracks.
Stage 05
Implementation.
work the checklist

Once the plan is approved, the agent works the checklist. The engineer's job is staging context, steering, and judging — not typing. Status updates run in the foreground; approval-asks should not.

Engineer owns
  • Loading the agent with the right context: plan, neighbouring code, conventions.
  • Reading every diff before it leaves the workstation.
  • Calling the moment to stop iterating and rewrite by hand instead.
Agent does
  • Implements each checklist item against the acceptance criteria.
  • Runs lint, types, and unit tests locally before surfacing a diff.
  • Posts brief status lines per step; does not stop to ask "should I continue?"

Inputs

  • Approved plan with checklist · curated context window
  • Isolated workspace per unit of work

Outputs

  • Small, self-described commits that pass CI on arrival
  • An updated context pack for the next stage
Antipattern: The "mega PR." Long agent sessions accumulate drift; force small, verifiable slices and rebase often. Also: pausing mid-phase to ask permission. The plan was the approval.
Stage 06
Verification.
evidence, not claims

Tests passing on a developer's machine is not verification. A change is verified only when it has been deployed to a real environment and exercised end-to-end. Verification produces an evidence pack, not "looks good to me."

Engineer owns
  • The eval set: cases that matter most to the customer and the business.
  • Reading the failing test first, before reading the fix.
  • The "is this evidence, or is this a story" judgement on the artefact.
Agent does
  • Synthesises unit, integration, and property tests from the plan.
  • Drives the deploy to a dev environment and re-runs the suite there.
  • Captures screenshots, logs, and responses as the evidence pack.

Inputs

  • Implemented branch · curated eval set
  • Reachable deploy target with realistic data

Outputs

  • An evidence pack attached to the change
  • Regressions filed back to the eval set, not silently fixed
Antipattern: Green CI quoted as proof. CI is necessary but not sufficient; deploy-and-exercise supplies the evidence CI cannot. Trust evidence over reassurance.
Stage 07
Code review.
two passes · one human

A model does the wide pass — style, missed tests, naming, neighbouring code. A human does the narrow pass — intent, taste, the things the model can't see. Both are required; neither is sufficient alone.

Reviewer owns
  • Does this match the plan's intent — not just its letter?
  • Will this be operable at 3am by an on-call who didn't write it?
  • Does this make the codebase a place people want to keep working in?
Model does
  • Mechanical checks: lint, types, dead branches, missing tests.
  • Plan-vs-diff alignment: did the agent stay on the checklist?
  • Drafts the PR description; the human confirms or edits.

Inputs

  • The PR · the linked plan · the evidence pack
  • Repo conventions & recent review history

Outputs

  • A PR ready for the adversarial bot loop
  • Review notes captured back into the context corpus
Antipattern: Skipping the human pass because "the bot said LGTM." The bot pass is necessary; it is not sufficient.
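The plan-vs-diff alignment check in the model's wide pass reduces, at its simplest, to comparing two sets of paths. This sketch assumes the plan's file-change table and the PR's changed-file list are both available as sets of path strings; the function name and the example paths are hypothetical.

```python
def plan_vs_diff(planned: set[str], changed: set[str]) -> dict[str, list[str]]:
    """Compare the files a plan promised to touch with the files a diff touched."""
    return {
        "unplanned": sorted(changed - planned),  # in the diff, not in the plan
        "untouched": sorted(planned - changed),  # in the plan, not in the diff
    }


report = plan_vs_diff(
    planned={"svc/refunds/api.py", "web/refund_detail.tsx"},
    changed={"svc/refunds/api.py", "svc/refunds/models.py"},
)
print(report)
```

Neither list is automatically a defect. An unplanned file may be a legitimate discovery, which is exactly the kind of call left to the human's narrow pass.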
Stage 08
Bot review loop.
loop until clean

After the human pass, a separate automated reviewer runs in a tight loop — different model family, structured verdicts, fixes applied, re-runs until the bot has nothing left to say. Loops that don't converge surface to a human.

Engineer owns
  • Deciding when a finding the bot keeps raising is a tradeoff, not a bug.
  • Killing the loop when it's chasing its tail.
  • The "this finding is a feature request, not a regression" call.
Bot reviewer does
  • Triggers a structured review; reads findings; applies fixes; re-triggers.
  • Detects loops — same finding twice — and surfaces them.
  • Writes each iteration to the append-only audit log on the PR.

Inputs

  • Human-approved PR · bot reviewer with a structured-verdict schema
  • Repo's prior gate history for context

Outputs

  • A PR that's been adversarially reviewed past a fixed quality bar
  • An audit trail attached to the PR, in source control
Antipattern: Silently merging when the bot can't make progress. If the loop doesn't converge, that's a signal, not a nuisance. Surface it.
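The trigger, fix, re-trigger cycle can be sketched as a small driver. Everything here is an assumption made for illustration: `review` and `fix` are hypothetical callables standing in for the bot reviewer and the fixing agent, findings are reduced to stable string fingerprints, "same finding twice" is read as a fingerprint appearing in two consecutive reviews, and the `max_iterations` budget with its `escalate` fallback is an added safeguard rather than something the workflow specifies.

```python
from typing import Callable

Finding = str  # hypothetical: a stable fingerprint such as "rule-id:path"


def run_bot_loop(
    review: Callable[[], set[Finding]],
    fix: Callable[[set[Finding]], None],
    max_iterations: int = 5,
) -> list[dict]:
    """Trigger a review, apply fixes, re-trigger; stop on clean, repeat, or budget."""
    log: list[dict] = []
    previous: set[Finding] = set()
    for retry in range(max_iterations):
        findings = review()
        if not findings:
            verdict = "approved"
        elif findings & previous:
            verdict = "loop-detected"  # a finding survived the last fix
        else:
            verdict = "needs-revision"
        log.append({"verdict": verdict, "findings": len(findings), "retry": retry})
        if verdict != "needs-revision":
            return log
        fix(findings)
        previous = findings
    # Budget exhausted without a clean pass: hand over to a human.
    log.append({"verdict": "escalate", "findings": len(previous), "retry": max_iterations})
    return log
```

The returned list is one record per iteration, which is the shape an append-only audit log needs.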
Stage 09
Documentation.
the compounding asset

On most teams docs are a chore. On an AI-native team they're the substrate the next agent runs on. Treat documentation as a living index — kept fresh by the same agents that consume it.

Owner owns
  • Deciding what's canonical vs. archival, and where it lives.
  • Approving doc diffs the agent proposes — docs are not free-write.
  • The taste of the docs: tone, ordering, what gets a diagram.
Agent does
  • On every merge, drafts diffs to README, ADRs, and the system map.
  • Detects drift between code and docs; opens PRs to close it.
  • Maintains the searchable index every other stage reads from.

Inputs

  • Merged code · ADRs · evidence pack · prior docs
  • Existing doc tree & ownership map

Outputs

  • Always-current system map, contracts, runbooks
  • An indexed corpus that makes every other stage faster
Antipattern: A wiki of model-written prose nobody reads. Optimise docs for the next agent and the on-call human, not for length.
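Drift detection can start very simply: a doc has drifted if any code it covers changed after the doc did. The sketch below assumes three inputs that would come from elsewhere (for example, version-control history and the ownership map): last-updated dates for docs, last-updated dates for code paths, and a mapping from each doc to the paths it covers. All names and dates are hypothetical.

```python
from datetime import date


def stale_docs(
    doc_updated: dict[str, date],
    covers: dict[str, list[str]],
    code_updated: dict[str, date],
) -> dict[str, int]:
    """Map each drifted doc to the days it lags behind the newest code it covers."""
    lag: dict[str, int] = {}
    for doc, paths in covers.items():
        newest = max(
            (code_updated[p] for p in paths if p in code_updated),
            default=None,
        )
        if newest is not None and newest > doc_updated[doc]:
            lag[doc] = (newest - doc_updated[doc]).days
    return lag


# Invented example data.
drift = stale_docs(
    doc_updated={"docs/runbook.md": date(2026, 5, 1), "README.md": date(2026, 5, 20)},
    covers={
        "docs/runbook.md": ["svc/api.py"],
        "README.md": ["svc/api.py", "svc/db.py"],
    },
    code_updated={"svc/api.py": date(2026, 5, 11), "svc/db.py": date(2026, 5, 3)},
)
print(drift)
```

Timestamps only say that a doc might be stale. Whether the content actually diverged is still a review question for the doc diff the agent proposes.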
Stage 10
Retrospective.
per ticket · not per quarter

The cheapest place to compound is the end of the last ticket. A short, structured retrospective after every meaningful change — not a quarterly ceremony — is what bends the workflow over time.

Engineer owns
  • Naming what would have made this easier; what to lift into the standard process.
  • Calling out gaps in plans, evidence, or context the bot missed.
  • Updating the playbook when the same lesson lands twice.
Agent does
  • Reads the brief, plan, diff, reviews, and bot-loop log; drafts a structured retro.
  • Surfaces recurring failure modes across recent retros.
  • Files proposed changes to the process docs as PRs.

Inputs

  • The full ticket history · prior retros · process docs
  • Gate-log entries across the run

Outputs

  • A short retro — what worked, what didn't, what changes
  • Zero or more proposed process-doc edits as PRs
Antipattern: Saving retros for the end of the quarter. The lesson cools off; the proposal never lands. Run it now, while the run is still warm.
Approve phases, not steps. Once the phase is on, the checklist runs.
the carry-on rule, AI-native edition
04/Gates

Trust is what survives an adversarial review.

Every transition between stages runs through a structured gate — a second, structurally different reviewer (different model family, different prompt, different rubric) that emits a verdict, not a vibe. Loops continue until verdicts converge — or surface to a human.

Verdict · 01

Approved.

Reviewer found nothing material. Advance the phase, append the verdict to the audit log, move on. The loop is over.

Verdict · 02

Needs revision.

Findings with severity and a fix. Apply, re-run the gate. Repeat until clean. Each retry increments a counter; counters are part of the record.

Verdict · 03

Escalate.

The reviewer can't reach a verdict — ambiguity, missing context, contested judgement. Stops the loop; pages a human. Don't auto-decide what wasn't decided.

Verdict · 04

Loop detected.

Same finding survives two consecutive fixes. The disagreement is the signal. Advance with documented residue and surface to a human — don't grind.
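The four verdicts above form a closed set, so they can be modelled as an enum with one routing function. This is a sketch under assumptions: the string values follow the gate-log spelling (`escalate` is assumed by analogy), and `next_step` with its three flags is a hypothetical encoding of the rules described in the verdict cards.

```python
from enum import Enum


class Verdict(Enum):
    APPROVED = "approved"
    NEEDS_REVISION = "needs-revision"
    ESCALATE = "escalate"
    LOOP_DETECTED = "loop-detected"


def next_step(verdict: Verdict) -> dict[str, bool]:
    """Translate a verdict into what the pipeline does next."""
    return {
        # Loop-detected advances too, but with documented residue.
        "advance": verdict in (Verdict.APPROVED, Verdict.LOOP_DETECTED),
        "rerun_gate": verdict is Verdict.NEEDS_REVISION,
        "page_human": verdict in (Verdict.ESCALATE, Verdict.LOOP_DETECTED),
    }


for v in Verdict:
    print(v.value, next_step(v))
```

Keeping the mapping in one place makes it harder for a gate to quietly auto-decide something that should have reached a human.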

tail -f gate.log   ·   one JSON line per gate run, committed alongside the code

{ "ts": "2026-05-21T14:22Z", "phase": "plan-review", "verdict": "needs-revision", "findings": 3, "retry": 0 }
{ "ts": "2026-05-21T14:31Z", "phase": "plan-review", "verdict": "needs-revision", "findings": 1, "retry": 1 }
{ "ts": "2026-05-21T14:38Z", "phase": "plan-review", "verdict": "approved",        "findings": 0, "retry": 2 }
{ "ts": "2026-05-21T15:04Z", "phase": "impl-review", "verdict": "approved",        "findings": 0, "retry": 0 }
{ "ts": "2026-05-21T15:51Z", "phase": "bot-loop",    "verdict": "loop-detected",   "findings": 1, "retry": 2 }
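A log like the one above needs very little machinery. The sketch below assumes the five fields shown (`ts`, `phase`, `verdict`, `findings`, `retry`) and writes strict JSON with quoted keys, one object per line; the function names and the file path are hypothetical. Opening the file in append mode is what keeps earlier entries untouched.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_gate_run(
    log_path: Path, phase: str, verdict: str, findings: int, retry: int
) -> dict:
    """Append one gate run as a single JSON line; never rewrites earlier lines."""
    entry = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ"),
        "phase": phase,
        "verdict": verdict,
        "findings": findings,
        "retry": retry,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_gate_log(log_path: Path) -> list[dict]:
    """Read the log back as a list of dicts, skipping blank lines."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Typical use would be a call such as `append_gate_run(Path("gate.log"), "plan-review", "needs-revision", 3, 0)` at the end of each gate run, with the file committed alongside the change.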
05/Per-ticket rituals

The rhythms that keep the pipeline honest.

Tools change the work; rituals change the team. The cadence on an AI-native team is not weekly — it's per ticket. These four rituals run on every meaningful change, even small ones. They're the smallest set that keeps a pipeline from drifting into either over-trust or theatre.

per ticket · 10 min

Write the brief.

Problem before solution. Customer, scope, criteria — in canonical structure, signed by an owner. Even a one-line bug fix gets a one-paragraph brief. It's the seed the rest of the pipeline grows on.

per phase · ongoing

Status, not standstill.

Approve the phase once; expect short status lines while it runs, not approval-asks. Mid-phase pauses are a smell. The plan was the approval; the checklist is the contract.

per transition · 1 line

Append to the log.

Every gate run, every retry, every verdict — appended, never edited. The log is in source control. It's how you measure overlap between reviewers, retry counts, and rubric drift over time.

per ticket · 5 min

Run the retro now.

Five minutes, structured, at the end of every ticket. Not a quarterly ceremony. The lesson is hottest the moment after; that's also the only moment a process-doc PR actually lands.

06/Metrics

Measure the right things; ignore the rest.

Counting accepted suggestions or generated tokens is theatre. The six below track outcomes that actually matter — and the first three only become legible because you're running structured gates with an audit log.

M / 01

Cross-reviewer overlap (ω).

The share of the adversarial reviewer's findings that the first reviewer missed, i.e. the part of the second review that does not overlap with the first. High ω means your gate is earning its keep; low ω means you're paying twice for one opinion.

instead of: accepted suggestions · track: ω · per phase
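Under the definition above, ω is a set ratio: the adversarial reviewer's findings that the first reviewer did not raise, divided by all of the adversarial reviewer's findings. The sketch assumes findings from both reviewers can be matched by a shared fingerprint string, which is the hard part in practice and is not addressed here; the function name is hypothetical.

```python
def omega(first: set[str], adversarial: set[str]) -> float:
    """Share of the adversarial reviewer's findings that the first reviewer missed."""
    if not adversarial:
        return 0.0  # nothing surfaced, so nothing new was surfaced
    return len(adversarial - first) / len(adversarial)


# Invented fingerprints: the second reviewer raised four findings, three of them new.
print(omega(first={"F1", "F2"}, adversarial={"F2", "F3", "F4", "F5"}))
```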
M / 02

Loop convergence time.

Average number of gate retries needed to reach an approved verdict, per phase. Trending up means the upstream artefact (the plan, the PR) is getting worse, or the rubric just shifted.

instead of: PRs / engineer · track: retries · per phase
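Given gate-log entries with `phase`, `verdict`, and `retry` fields, this metric is a grouped average over the approved entries. The sketch below is one possible reading: it takes the `retry` value recorded on each approved entry as the number of retries that run needed. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_retries_to_approved(entries: list[dict]) -> dict[str, float]:
    """Average the retry counter of approved gate runs, grouped by phase."""
    by_phase: dict[str, list[int]] = defaultdict(list)
    for entry in entries:
        if entry["verdict"] == "approved":
            by_phase[entry["phase"]].append(entry["retry"])
    return {phase: float(mean(retries)) for phase, retries in by_phase.items()}


entries = [
    {"phase": "plan-review", "verdict": "needs-revision", "retry": 0},
    {"phase": "plan-review", "verdict": "needs-revision", "retry": 1},
    {"phase": "plan-review", "verdict": "approved", "retry": 2},
    {"phase": "impl-review", "verdict": "approved", "retry": 0},
    {"phase": "bot-loop", "verdict": "loop-detected", "retry": 2},
]
print(mean_retries_to_approved(entries))
```

Phases that never reached an approved verdict are absent from the result rather than reported as zero, so a missing key is itself worth a look.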
M / 03

Escalation rate.

Share of gates that surface to a human (escalate or loop-detected). Should be small and stable — spikes mean a rubric or context-corpus problem, not an engineer problem.

instead of: bot-found defects · track: escalations · per week
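Escalation rate can be read off the same log. The choice of denominator is an assumption here: this sketch counts only terminal verdicts (everything except `needs-revision`), so a gate that retried three times before approval counts once. The function name is hypothetical.

```python
def escalation_rate(entries: list[dict]) -> float:
    """Share of finished gates that surfaced to a human."""
    terminal = [e for e in entries if e["verdict"] != "needs-revision"]
    if not terminal:
        return 0.0
    surfaced = sum(
        1 for e in terminal if e["verdict"] in ("escalate", "loop-detected")
    )
    return surfaced / len(terminal)
```

Computed over a rolling week of entries, the value to watch is the trend and any sudden spike, not the absolute number.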
M / 04

Lead time, brief to first user.

How long between a signed brief and a real customer touching the change behind a flag. The truest measure that the pipeline is flowing end-to-end.

instead of: lines of AI code · track: time-to-first-user
M / 05

Eval-set health.

Does the eval set grow with real failures and shrink with archived edges? A living eval beats a static 90%. Health is a portfolio decision, not a number.

instead of: coverage % · track: eval deltas / week
M / 06

Doc-to-code drift.

For every doc the pipeline reads from, how stale is it on average? Drift is a leading indicator of every other quality slip in the workflow.

instead of: wiki page count · track: mean doc age · days
07/Getting started

If you only do six things this quarter.

A pragmatic order of operations for a team going from "we use Copilot sometimes" to a workflow that clears the bar above. None of these require new headcount; all of them require taste and time.

Pick one product surface.

Not "the company." One service, one team, one quarter. The workflow earns trust by working somewhere before it gets adopted everywhere.

Stand up a context corpus.

Index your briefs, plans, ADRs, prior retros, runbooks, and a slice of the codebase. Make it the canonical thing both humans and agents read from. Nothing else works without this.

Rewrite one plan template.

Make it explicit enough that a model can act on it: file-change table, test surfaces, checklist, rejected alternatives. If a teammate model can't execute it, neither can a new hire.

Add a second reviewer to one phase.

Pick the phase where mistakes are cheapest to catch (usually plan review). Wire a structurally different reviewer in. Watch the overlap (ω). Expand only when ω earns it.

Start the audit log on day one.

Append-only, JSON, committed to source. One line per gate run. Boring to set up; load-bearing within a quarter — it's the only way to measure the workflow honestly.

Retro every ticket, no exceptions.

Five minutes, structured, even on small bugs. The retro is what bends the workflow over time — it's also the only ritual that catches your own anti-patterns before they calcify.

Post 01 of 06 in the series · startcut.dev/field-notes/ai-native-teams
Engineering teams are about to be more leveraged. Most won't notice — because their workflow won't let them.
About this doc

A living playbook. Edits welcome via PR — retro after every meaningful change, even this one. Treat anything older than a quarter with suspicion.

Versioning

v2.0 · may 2026

© StartCut · field-notes/01