P.S. I took the images on my widescreen monitor while working on a real feature. If there’s interest, I’m happy to put together smaller screenshots or a walkthrough video.
TL;DR
Our workflow today looks roughly like this:
- Kick off an agent review locally via `/grove_*_review`
- The agent reviews the diff, runs best-practice checks, and spins up the system locally
- Full E2E flows run automatically (happy, sad, chaotic paths)
- Failures are diagnosed with context instead of just "test failed"
- A PR sweep catches regressions and tech-debt risks
- Humans review mission-critical logic manually
- A final PR description is generated automatically

The key idea: Leverage tools off the shelf, semi (not completely) automate things specific to your domain, and do as much of it locally as possible.
Code Reviews are ~~the new~~ still the Bottleneck
It seems like everyone is talking about how code reviews are the new bottleneck in the era of agentic software development.
There is some truth to it, but if you’ve been around long enough, you know it’s not new. The problem just looks a little different now. The approach and the solutions are evolving. They’re not the same, but they rhyme.
There are lots of approaches, tools, and companies tackling this problem.
We’ve tried or looked into Claude CodeReview, CodeRabbit, and PropelCode. They’re all good and will get you at least halfway there.
However, the other half is the hard one.
It’s the part that’s specific to your domain, your product, your tech stack, your culture, and your taste.
This is just a quick show-and-tell of our lightweight workflow around Claude Code. It’s simple, custom-tailored, and easy to use.
Most importantly, it meets developers where they already are.
The Stack
Three repos:
- Grove API (Backend)
- Grove App (Frontend)
- Grove Extension (Chrome Extension)
The majority of these codebases were developed in an agent-first environment.
The code in the frontend and extension is what some would refer to as "vibe-coded".
The backend is a bit more mature and reviewed in depth because it deals with fund management. I still review the core logic line-by-line, but I haven’t written a single line of it myself.
CI vs Local Automation
We still run traditional CI, but the role has changed.
Traditional CI only runs linting and unit tests.
The heavier work happens locally and integrates (rather than replaces) the human. The agent spins up the stack, runs E2E flows, and performs the review before a PR even exists.
This approach is similar to the direction described by DHH when moving CI back to developer machines.
Our end-to-end tests also double as production smoke tests that run on GitHub after deployment.
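As a rough illustration, a post-deployment smoke-test hook on GitHub Actions could look like the sketch below. The workflow name, job layout, and `npm run e2e` script are assumptions; only the `deployment_status` trigger and its `environment_url` field come from GitHub's standard event payload.

```yaml
# .github/workflows/smoke.yml (hypothetical sketch)
name: post-deploy-smoke
on:
  deployment_status: {}
jobs:
  smoke:
    # Only run once the deployment has actually succeeded
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run E2E suite against the deployed environment
        run: npm run e2e -- --target "$TARGET_URL"   # assumed script name
        env:
          TARGET_URL: ${{ github.event.deployment_status.environment_url }}
```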
Step 1: Kick off the review via /grove_*_review
Each repo has a custom local slash command that kicks off the review:
- /grove_app_review
- /grove_api_review
- /grove_extension_review
The command does a handful of things:
- runs `git diff` against the default branch
- builds context for the agent
- checks cosmetic changes
- validates best practices
- prepares E2E testing
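For reference, Claude Code custom slash commands live as markdown files under `.claude/commands/`. A trimmed, hypothetical sketch of what one of these review commands might contain (the file name and step wording are assumptions based on the list above):

```markdown
<!-- .claude/commands/grove_api_review.md (hypothetical sketch) -->
Review the current branch before a PR exists.

1. Run `git diff` against the default branch and summarize the changes.
2. Build context: read the touched files and any related docs.
3. Flag cosmetic changes separately from behavioral ones.
4. Validate the diff against our best-practice checklist.
5. Prepare the E2E test plan (happy, sad, and chaotic paths).

Report findings as: a report, a summary, and a list of actionables.
```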
Let’s assume you just vibe-~~coded~~ engineered a big new feature with the help of agents, and it’s time to start reviewing your work.
Note: Like any AGENTS.md file, this is not one-and-done. I use /session-commit to keep it updated.
Documentation is a living thing. It’s the responsibility of both the human and the agent to update and review these commands and files regularly whenever something new is learned, a pattern emerges, or a change happens.

Step 2: DO ALL THE THINGS
Once the initial review completes, it returns a report, a summary, and a list of actionables.
The actionables cover many things we’ve learned we need to manage:
- updating docs or flows
- fixing terminology
- adding unit/integration/e2e tests
- adding logs
- following existing patterns
- linting
Running end-to-end tests is one of the most important steps.
Usually I tell it to DO ALL THE THINGS, but it really depends on the situation.

Step 3: Review the results
Here the human actually reviews the results and jumps in where needed.
In this particular case, Docker wasn’t running, so the end-to-end tests couldn’t start. Not a big deal, but something I prefer not to delegate to an agent.

Step 4: E2E Test Results
This is one of my favorite parts of the API.
It spins up a database, starts the server, and runs realistic end-to-end flows against it.
Happy paths, sad paths, chaotic paths.
Everything.
If something fails, the agent tries to diagnose the root cause before surfacing it.
No "test failed" without context. It tells you why.
Claude also fixes some of them (if trivial) along the way, and is instructed not to fix anything where the business logic change is questionable or requires another opinion.
The E2E tests report back with a clear pass/fail matrix.
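The runner behind that matrix can be sketched roughly as below. This is a minimal, self-contained illustration, not Grove's actual harness: the flow names and the canned outcomes are invented stand-ins for real E2E flows hitting the locally spun-up API, but the shape of the output (every failure carries a diagnosis, never a bare "test failed") matches the idea above.

```python
"""Minimal sketch of an E2E runner that reports a pass/fail matrix.

All flow names and outcomes here are hypothetical stand-ins for
real end-to-end flows against a locally running stack.
"""


def run_flow(name: str) -> tuple[bool, str]:
    """Pretend to run one E2E flow; return (passed, diagnosis)."""
    # Stand-in logic: the real runner would exercise the live API.
    outcomes = {
        "signup_happy_path": (True, ""),
        "signup_invalid_email": (True, ""),
        "transfer_db_connection_drop": (False, "DB pool exhausted after reconnect"),
    }
    return outcomes.get(name, (False, "unknown flow"))


def run_suite(flows: list[str]) -> dict[str, dict]:
    """Run every flow and attach a diagnosis to each failure,
    so no result is just 'test failed' without context."""
    matrix: dict[str, dict] = {}
    for name in flows:
        passed, diagnosis = run_flow(name)
        matrix[name] = {"passed": passed, "diagnosis": diagnosis or None}
    return matrix


if __name__ == "__main__":
    results = run_suite(
        ["signup_happy_path", "signup_invalid_email", "transfer_db_connection_drop"]
    )
    for flow, r in results.items():
        status = "PASS" if r["passed"] else f"FAIL ({r['diagnosis']})"
        print(f"{flow:35s} {status}")
```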

Step 5: Pull Request Sweep
From here, I use my personal agent skills before moving into manual review.
In particular, I’ve found that cmd-pr-sweep from Olshansk/agent-skills is great at catching major regressions, bugs, and limiting how much tech debt we’re taking on.
If you’re curious, the skills can be installed with:
npx add skills olshansk/agent-skills

I review it manually and decide what to fix.
If something feels too big or out of scope for the work, I ask the agents to add TODOs with a lot of context.
Having that inline is a great way to give future agents context about the tech debt.
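As a hypothetical example of what such a context-rich TODO might look like (every name and number below is invented for illustration):

```
# TODO(tech-debt): Payout batching still loads all pending transfers
# into memory. Out of scope for this PR, which only touches scheduling.
# Context for a future agent: see collect_pending() in the payouts
# service; switch to cursor-based paging once the transfers table
# grows past roughly 100k rows.
```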
This is the last gate before a human looks at the PR.

Step 6: Prepare and upload the PR
Once everything passes, I commit, push, and generate a PR description using /cmd-pr-description.
The skill reads the full diff, commit history, and review findings, then drafts a structured description with a summary, feature diff table, and technical details.
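The exact output varies per PR, but a skeleton of the kind of structured description it drafts (section names are assumptions based on the list above) might look like:

```markdown
## Summary
One-paragraph description of what changed and why.

## Feature Diff
| Area      | Before         | After          |
|-----------|----------------|----------------|
| (feature) | (old behavior) | (new behavior) |

## Technical Details
- Notable implementation decisions
- Review findings that were addressed
- Known follow-ups / TODOs left in the code
```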

Here is what it looks like on GitHub:

Other Special Mentions
This isn’t intended to be a "how I code" post, but I wanted to call out some of the patterns I’ve found useful lately:
**Cross-referencing with other models:** Depending on the size and complexity, I like to cross-reference plans and bugs with Gemini and Codex. Gemini tends to be very idiomatic and strong at frontend work. Codex is great at architecture and challenging requirements. Not every change needs this, but for architectural decisions or tricky edge cases, it’s useful to get a second or third opinion from another model.
**Manual review:** I don’t review frontend code manually, but I still review mission-critical business logic line-by-line on GitHub. I leave comments (locally or remotely) and have agents pick them up. I focus heavily on reducing code surface area, regression testing important edge cases, naming, and ensuring TODOs with explanations are in place.
**/session-commit:** I use this frequently to keep AGENTS.md updated based on learnings from the most recent agent session.
Depending on how much work happened during this stage of the review, I might go back and start the process again.