
A Practical AI Code Reviewer for Real-World PRs

Here are my insights into how an AI code reviewer can streamline the development process.

By Ronnie Kilsbo, Senior Back End Engineer.

TL;DR

Predyktable hackathons are one-day events we run every two months, where employees can build anything. During one of them, I built an AI code reviewer CLI and it was production-ready enough to provide real value the same day.

Since then, we’ve iterated and improved it, but the core idea stayed simple: it reads pull request diffs and posts advisory review comments. Today it runs in CI (currently in Bitbucket Pipelines across a growing set of repos) and typically produces a review in 5–10 minutes without blocking builds.

The Problem: Reviewers Are Busy

In a smaller team, code review usually isn’t blocked by process; it’s blocked by attention. PRs can wait longer than necessary simply because it’s hard to get someone’s focus at the right moment.

What I wanted was a quick, consistent “first pass” reviewer that can:

  • Reduce time-to-first-feedback
  • Catch common pitfalls reliably
  • Help humans spend their time on the *interesting* parts of review

The First “wow”: Catching a Real Firestore Migration Risk

Early versions did what you’d expect: housekeeping and obvious warnings.

The first real “wow moment” came after improving the prompts. The reviewer spotted a deeper issue: we were changing a model in a way that meant existing Firestore documents would no longer validate, implying we’d need a database migration.

That’s when it started to feel less like “a bot that comments on code” and more like “a colleague who follows consequences”.

What We Built: A CLI That Runs Locally or in CI Pipelines

The implementation is a CLI that:

  • Authenticates with the OpenAI API (model: GPT-5)
  • Authenticates with the code hosting platform to post comments (today: Bitbucket)
  • Runs in CI to review PRs
  • Can also run locally and print structured output to the terminal

In platform mode it posts inline comments (with the relevant line centered) and includes concrete findings plus suggested code examples. In local mode it prints the same structured review in the terminal.
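To make the two modes concrete, here’s a minimal sketch of how such a CLI entry point could be shaped. The flags and helper names are illustrative assumptions, not the actual tool’s interface:

```python
# Minimal sketch of the two modes; flags and helper names are illustrative only.
import argparse
import json
import sys


def run_review(pr_id=None, local=False):
    """Placeholder for the real flow: fetch diffs, call the model, build findings."""
    return {"summary": "", "findings": []}


def post_inline_comments(pr_id, findings):
    """Placeholder for posting inline comments via the platform API (Bitbucket)."""


def main() -> int:
    parser = argparse.ArgumentParser(description="Advisory AI code reviewer")
    parser.add_argument("--pr-id", help="Pull request to review (platform/CI mode)")
    parser.add_argument("--local", action="store_true",
                        help="Review the current branch and print to the terminal")
    args = parser.parse_args()

    review = run_review(pr_id=args.pr_id, local=args.local)
    if args.local:
        print(json.dumps(review, indent=2))                    # local mode: structured terminal output
    else:
        post_inline_comments(args.pr_id, review["findings"])   # platform mode: inline PR comments
    return 0


if __name__ == "__main__":
    sys.exit(main())
```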

A review comment posted to a PR in Bitbucket (screenshot).

A review comment from running the CLI locally on a branch (terminal screenshot).

How It Works (Without the Internals)

At a high level, the reviewer is diff-first but not diff-only.

We give it the PR’s changed file list and let it fetch diffs and pull additional context when needed (e.g., reading related files or searching for usages). For backend services, we encourage it to trace “call chains” from an entry point down to the I/O and data layers, because that’s often where the bigger issues emerge.
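For illustration, that kind of context fetching could be exposed to the model as tool definitions it can call on demand. The names and parameters below are assumptions about how such tools might look, not the tool’s actual internals:

```python
# Illustrative tool definitions (OpenAI function-calling format) the reviewer
# could expose so the model can pull extra context on demand. Names are assumptions.
def _tool(name: str, description: str, arg: str) -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {arg: {"type": "string"}},
                "required": [arg],
            },
        },
    }


CONTEXT_TOOLS = [
    _tool("get_diff", "Return the unified diff for one changed file in the PR.", "path"),
    _tool("read_file", "Read a related file, e.g. to follow a call chain down to the data layer.", "path"),
    _tool("search_usages", "Search the repository for usages of a symbol.", "symbol"),
]
```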

Keeping It Reliable: Validated Structured Output

One small trick made the tool much more robust: instead of requiring perfect JSON as a final response, the model submits the review through a function call (e.g. `submit_review(…)`) whose arguments are validated against a schema.

If validation fails, the tool returns what failed and the model retries until the review passes. Only then does it finish.
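Here is a rough sketch of the validation side of that loop, using `jsonschema` as an assumed validation library (the real schema and model wiring aren’t shown): the arguments of the model’s `submit_review(…)` call are checked against a schema, and any errors are returned as the tool result so the model can retry.

```python
# Sketch of the "validated structured output" loop. The schema shape and helper
# name are assumptions; only the retry-on-validation-failure idea is the point.
import json
from jsonschema import Draft202012Validator  # assumed validation library

REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "line": {"type": "integer"},
                    "severity": {"enum": ["info", "warning", "critical"]},
                    "message": {"type": "string"},
                },
                "required": ["path", "line", "message"],
            },
        },
    },
    "required": ["summary", "findings"],
}


def handle_submit_review(arguments_json: str) -> dict:
    """Validate the model's submit_review(...) arguments; report errors if it fails."""
    payload = json.loads(arguments_json)
    errors = [e.message for e in Draft202012Validator(REVIEW_SCHEMA).iter_errors(payload)]
    if errors:
        # Returned to the model as the tool result so it can fix and retry.
        return {"ok": False, "errors": errors}
    return {"ok": True, "review": payload}
```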

Guardrails

This tool is advisory by design:

  • It does not approve PRs
  • It does not fail builds
  • It does not block pipelines

If it fails, it fails gracefully and we keep CI deterministic.
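In practice that guardrail can be enforced at the very top of the entry point. A minimal sketch, assuming a Python CLI; the helper name is a placeholder:

```python
# Sketch of the "advisory by design" guardrail: whatever happens, exit 0 so the
# pipeline step can never fail or block a build. Names are illustrative.
import logging
import sys


def run_review_and_post_comments() -> None:
    """Placeholder for the actual review flow."""


def safe_main() -> int:
    try:
        run_review_and_post_comments()
    except Exception:
        # Fail gracefully: log and move on, keeping CI deterministic.
        logging.exception("AI review failed; skipping without affecting the build")
    return 0  # never a non-zero exit code


if __name__ == "__main__":
    sys.exit(safe_main())
```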

We also avoid repeat noise on PR updates by ensuring it doesn’t comment on the same line in the same file again for that PR.
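The de-duplication itself can be as simple as keying on the file and line of the comments already present on the PR; a sketch, with assumed shapes for findings and comments:

```python
# Sketch of the de-duplication: skip findings that already have an inline comment
# on the same file and line in this PR. The dict shapes are assumptions;
# existing_comments would come from the platform API (Bitbucket) for the PR.
def dedupe_findings(findings: list[dict], existing_comments: list[dict]) -> list[dict]:
    """Drop findings whose (path, line) already carries a comment."""
    seen = {(c["path"], c["line"]) for c in existing_comments}
    return [f for f in findings if (f["path"], f["line"]) not in seen]
```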

What It’s Good At (and Where It’s Not)

In practice, it’s been especially useful at:

  • Edge-case validation (variables with unexpected values, dynamic inputs)
  • Missing error handling that could bite in production
  • Security concerns (e.g., injection risks)
  • Suggesting more appropriate data representations for a use case (for example, date handling in data stores)

Where it can be less helpful:

  • Business logic that spans multiple services (a single repo rarely contains the full story)
  • Suggestions that assume older versions of libraries, when you’re on newer versions (model training data is old)

A simple mitigation for the second point: document important library versions prominently in the repo so the reviewer has less room to guess.
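As a purely illustrative extension of that idea for a Python repo, the pinned dependencies could also be read and handed to the reviewer as part of its context; the path and format here are assumptions:

```python
# Possible way to surface library versions to the reviewer: read the pinned
# dependencies and prepend them to the review context. Path/format are assumptions.
from pathlib import Path


def library_versions_context(requirements_path: str = "requirements.txt") -> str:
    """Build a short 'these are the versions in use' preamble for the prompt."""
    lines = Path(requirements_path).read_text().splitlines()
    pins = [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]
    return "Key library versions in this repo:\n" + "\n".join(pins)
```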

Build vs Buy: Cost and Flexibility

There are off-the-shelf tools in this space, but many are priced around $20–$40 per user per month.

Our internal usage has been much lower:

  • Monthly OpenAI usage: $5–$18 over the past few months
  • Cost per PR review: typically ~$0.20, with the highest seen at ~$0.50 (often on new repos)

Lesson Learned: Don’t Overcomplicate Model Choice

Making one solution “work on any model” sounds nice, but in practice it’s fragile:

  • Pick a single model and build for its quirks
  • If you add a second model, do it intentionally (light/heavy) and expect two slightly different behaviours
  • Treat model changes as real migrations: prompts, tests, and glue code will move

Reliability comes from guardrails, not clever abstraction. Build for a specific model that matches your expectations for outcome, cost, and speed.

Looking Ahead: Making the Reviewer Interactive

A natural next step is to make the reviewer part of the conversation, not just a one-off commenter.

If someone replies to a reviewer comment, we can capture that (via webhooks), pass the thread context to the model, and let it decide whether a reply would be helpful. For example:

  • Clarify why it flagged a risk
  • Propose an alternative fix when the author explains constraints
  • Acknowledge when the human reviewer’s judgment supersedes the suggestion
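As a speculative sketch of what that webhook flow might look like (FastAPI and every name below are assumptions, not the current implementation):

```python
# Speculative sketch of the interactive flow: a webhook receives a reply to one of
# the reviewer's comments, and the model decides whether a response would help.
from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/webhooks/pr-comment")
async def on_pr_comment(request: Request):
    event = await request.json()
    thread = extract_thread_context(event)           # placeholder: gather the comment thread
    decision = ask_model_whether_to_reply(thread)    # placeholder: model call with thread context
    if decision.get("should_reply"):
        post_reply(event, decision["reply"])         # placeholder: post via the platform API
    return {"status": "ok"}  # always acknowledge; any reply stays advisory


def extract_thread_context(event: dict) -> list[dict]:
    """Placeholder: pull the reviewer comment and subsequent replies from the event."""
    return []


def ask_model_whether_to_reply(thread: list[dict]) -> dict:
    """Placeholder: pass the thread to the model and let it decide."""
    return {"should_reply": False}


def post_reply(event: dict, reply: str) -> None:
    """Placeholder: post the reply back to the PR thread."""
```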

This would remain advisory. The goal isn’t automation for its own sake; it’s reducing needless back-and-forth when the tool can add specific, useful context.

