Analysing GitHub Pull Request review times with SQLite and Go
In Improving Team Efficiency By Measuring and Improving Code Review Cycle Time, I mentioned that one thing we can do to understand if code review is causing delays is to measure it.
Since then, I've also worked on building this at Deliveroo with one of my colleagues, just before we started using PluralSight Flow, which didn't quite give us some of the metrics we wanted.
With a bit of free time between jobs, I thought I'd at least blogument the data fetching and parsing approach that I've found works, so if anyone else goes to do this, they've got something to start with.
Unlike previous attempts, I wanted to:
- build it as an Open Source project
- write it with Go
- use SQLite as the underlying datasource
- try building a CLI using urfave/cli
The project can be found at gitlab.com/tanna.dev/ghprstats, which has some docs on how to get started with it.
I'm not sure I'm quite happy with how I've internally implemented it, but I'm happy it's done and I can iterate on it.
How it works
At its core, we are using the following four APIs from GitHub to retrieve the data about a given Pull Request:
- List reviews for a pull request
- List review comments on a pull request
- List issue comments
- List timeline events for an issue
These are then used to look at who is interacting with the changes, as well as using the timeline events to determine the state(s) the PR is in over time.
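To illustrate the timeline-event logic, here's a minimal sketch (with hypothetical types, not the project's actual internals): if the earliest `ready_for_review` or `convert_to_draft` event on a PR is `ready_for_review`, the PR must have started life as a draft; otherwise it started ready.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// TimelineEvent is a hypothetical, trimmed-down representation of a
// GitHub timeline event - just enough to derive the PR's initial state.
type TimelineEvent struct {
	Event     string
	CreatedAt time.Time
}

// initialState works out whether a PR started as a draft or as
// ready-for-review, based on the earliest relevant timeline event.
func initialState(events []TimelineEvent) string {
	relevant := make([]TimelineEvent, 0, len(events))
	for _, e := range events {
		if e.Event == "ready_for_review" || e.Event == "convert_to_draft" {
			relevant = append(relevant, e)
		}
	}
	if len(relevant) == 0 {
		// no draft/ready transitions recorded, so the PR was opened ready
		return "ready"
	}
	sort.Slice(relevant, func(i, j int) bool {
		return relevant[i].CreatedAt.Before(relevant[j].CreatedAt)
	})
	if relevant[0].Event == "ready_for_review" {
		// it was marked ready at some point, so it must have started as a draft
		return "draft"
	}
	return "ready"
}

func main() {
	events := []TimelineEvent{
		{Event: "ready_for_review", CreatedAt: time.Date(2022, 11, 1, 10, 0, 0, 0, time.UTC)},
	}
	fmt.Println(initialState(events)) // draft
}
```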
We fetch the data up-front and then sync it to the SQLite database, after which it can be queried or more easily distributed.
What it looks like
For a few PRs' worth of data, running ghprstats report cycle-time
results in the following (converted to an HTML table for ease of viewing):
| PR | PR Title | Changes | Initial State | # Reviews | # Comments | First Commenter | Time to first comment | First Reviewer | Time to first review | Second Commenter | Time to second comment | Second Reviewer | Time to second review | Time to close | Merged? | PR link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| snarfed/oauth-dropins#284 | Migrate IndieAuth to full authorization_code grant | +82,-36 lines changed across 3 files | ready | 2 | 13 | snarfed | 4768 | | 7150 | | | | | | true | https://github.com/snarfed/oauth-dropins/pull/284 |
| cucumber/common#2024 | Add further examples for valid Cucumber files to pretty-print | +108,-0 lines changed across 1 files | ready | 0 | 6 | | | | | | | | | | false | https://github.com/cucumber/common/pull/2024 |
| deepmap/oapi-codegen#648 | Generate anonymous objects referenced in schemas | +1303,-50 lines changed across 35 files | draft | 8 | 14 | | | | | | | | | | false | https://github.com/deepmap/oapi-codegen/pull/648 |
| endoflife-date/endoflife.date#3330 | Remove Gorilla toolkit archiving | +4,-3 lines changed across 1 files | ready | 1 | 4 | marcwrobel | 187 | | 188 | | | | | | true | https://github.com/endoflife-date/endoflife.date/pull/3330 |
With SQL access to the raw data, we can also run queries like those below to better understand it.
For instance, how can we work out the most frequent approvers?
select
login,
count(author_id) as num_reviews
from
reviews
inner join users on author_id = users.id
where state = 'APPROVED'
group by
author_id
order by
num_reviews desc
limit 10
Or how can we find out which PRs started as a draft vs ready for review?
select
pulls.owner,
pulls.repo,
pulls.id,
title,
(
case
when i.initial_state is null then 'ready'
else i.initial_state
end
) as initial_state
from
pulls
left join (
select
(
case
when event = 'ready_for_review' then 'draft'
else 'ready'
end
) as initial_state,
-- SQLite takes the other columns from the row containing min(created_at),
-- so initial_state reflects the earliest event for each PR
min(timeline.created_at) as created_at,
timeline.owner,
timeline.repo,
pull_id
from
timeline
where
event in ('ready_for_review', 'convert_to_draft')
group by
timeline.owner,
timeline.repo,
pull_id
) i on pulls.owner = i.owner
and pulls.repo = i.repo
and pulls.id = i.pull_id;