Some useful SQL(ite) tips I've learned
I've recently spent some time working towards presenting some of the findings from dependency-management-data at work, which has required writing some fun queries, and then using Google Sheets to better visualise that data.
Not only have I taken advantage of some fairly straightforward queries, written off-the-cuff, or relied upon some pre-written queries in the DMD codebase, I've also found myself needing to branch out and write some much more complex SQL, as I didn't want to add any logic into the Google Sheet and have that only as a presentation layer, which could then be consumed by a Google Slides presentation.
In particular, because I haven't looked at presenting this data directly via Google Sheets visualisations, I eneded up searching for things like sqlite buckets
or how to make rows columns vice versa
, to mixed luck.
Instead of this being a spatter of separate posts, I thought I'd collect them here for lower discoverability, but higher density.
Splitting data into buckets
This is a big one that took me many attempts to find an article / StackOverflow post that would explain quite right for what I wanted to do.
For instance, let's say we have the following data (limited for brevity):
platform | organisation | repo | package_name | version | current_version | package_manager | package_file_path | dep_types | level | advisory_type | description | supported_until | eol_from |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
github | stretchr | testify | go | 1.11 | 1.23.3 | gomod | _codegen/go.mod | ["golang"] | ERROR | UNMAINTAINED | go 1.11 has been End-of-Life for 1897 days | 2019-09-03 | |
github | wiremock | wiremock | gradle | 4.5.1 | 4.5.1 | gradle-wrapper | perf-test/gradle/wrapper/gradle-wrapper.properties | [] | ERROR | UNMAINTAINED | gradle 4 has been End-of-Life for 2178 days | 2018-11-26 | 2018-11-26 |
github | wiremock | wiremock-resilience-examples | gradle | 4.5.1 | 4.5.1 | gradle-wrapper | gradle/wrapper/gradle-wrapper.properties | [] | ERROR | UNMAINTAINED | gradle 4 has been End-of-Life for 2178 days | 2018-11-26 | 2018-11-26 |
gitlab | gitlab-org | gitlab | node | 4.2.2 | 4.2.2 | gitlabci | lib/gitlab/ci/templates/Pages/Metalsmith.gitlab-ci.yml | ["image"] | ERROR | UNMAINTAINED | nodejs 4 has been End-of-Life for 2388 days | 2017-04-01 | 2018-04-30 |
If we wanted to get a high-level view of what packages (via the package_name
) were in the state of being UNMAINTAINED
or DEPRECATED
, how would we do that, so we can i.e. get the following output?
package_name | bucket | count(*) |
---|---|---|
go | unmaintained | 1 |
gradle | unmaintained | 2 |
node | deprecated | 2 |
node | unmaintained | 10 |
python | deprecated | 1 |
rails | unmaintained | 2 |
ruby | unmaintained | 1 |
To do this, we can use a case
/ when
statement to group each type of data into a named bucket, like so:
select
package_name,
(
case
when abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
then 'unmaintained'
when abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
then 'deprecated'
else
'THIS SHOULD NOT BE HIT'
end
) as bucket,
count(*)
from advisories
where
(
abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
)
or
(
abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
)
group by package_name, bucket
order by package_name, bucket
This is very useful when trying to then graph the resulting data, and I've ended up using this bucketing quite a few times after finding out how to do it ππΌ
ORDER BY
doesn't need to reference a column
Something interesting is the fact that it's possible to use an ORDER BY
with an arbitrary statement, for instance:
select
visibility,
count(*)
from
repository_metadata
group by
visibility
order by
(
case
visibility
when 'PUBLIC' then 0
when 'INTERNAL' then 1
when 'PRIVATE' then 2
else 3
end
)
Previously, I had been creating a separate column called ord
and then was using this, but this is much simpler.
Querying JSON
I've written about this more in a separate post but one thing to note that I've recently been doing is relying on the shorthand "arrow functions" that are inbuilt to SQLite, and are familiar if you're coming from Postgres.
For instance, let's say we have the following:
select
id,
json_extract(data.json, '$.name') name
from
data
We can actually further simplify this with an arrow operator, ->>
:
select
id,
- json_extract(data.json, '$.name') name
+ data.json ->> '$.name' name
from
data
Concatenating strings
Something I've done a bit before, but for posterity:
-- ...
datasource || ': ' || update_type as label,
-- ^^ this concatenates between multiple strings
-- ...
Very useful when trying to add useful labels for graphs.
Providing default values for NULL
s
In the case that we're retrieving a nullable column from database, but want to provide a default value, we can use coalesce
.
For instance:
select
coalesce(current_version, version) as ver
from
renovate
Or:
select
coalesce(metadata, '{}') as m
-- ...
Date calculations
Something I've been doing since very early on with DMD is calculating "days between" dates, i.e. to given an indication of "you've been running Node 12.x for 900 days past its End-of-Life date".
For instance, if we want to take a column like:
supported_until |
---|
2024-10-22 |
And work out how many days it's been since that date, we could write a query using julianday
, such as:
select
supported_until,
-- it's more useful to show the absolute (aka always positive) value to humans
abs(cast(julianday(supported_until) - julianday('now') as integer)) as since
from
advisories
where
supported_until is not null
and
-- this number will be negative if it's in the past
cast(julianday(supported_until) - julianday('now') as integer) < 0
Using subqueries as a way to produce column-focussed data (for presenting in Google Sheets)
Within the DMD codebase, I'm using the excellent sqlc
for my database queries.
One thing it doesn't really support - which is fair - is heavily nested queries, including those with subqueries, so I don't generally lean on them as it's hard to use them with sqlc
.
However, when preparing this presentation, I found that as I wasn't tied to sqlc
while preparing the data, I could write the most cursed and nested subqueries I wanted π
This allowed me to think about how to handle getting the right columns in the resulting output, at the cost of it generally being a less efficient query, but working for what I needed.
For instance, the below query:
select
(
select
count(*)
from
(
select
distinct package_manager, package_name
from
renovate
)
) as num_deps,
(
select
count(*)
from advisories
where advisory_type = 'DEPRECATED'
) as total_deprecated,
(
select
count(*)
from advisories
where advisory_type = 'UNMAINTAINED'
) as total_unmaintained
This isn't the most complex example I could share, but out of an abundance of shame I won't go into it too much π«£