Some useful SQL(ite) tips I've learned

Featured image for sharing metadata for article

I've recently spent some time working towards presenting some of the findings from dependency-management-data at work, which has required writing some fun queries, and then using Google Sheets to better visualise that data.

Not only have I taken advantage of some fairly straightforward queries, written off-the-cuff, or relied upon some pre-written queries in the DMD codebase, I've also found myself needing to branch out and write some much more complex SQL, as I didn't want to add any logic into the Google Sheet and have that only as a presentation layer, which could then be consumed by a Google Slides presentation.

In particular, because I haven't looked at presenting this data directly via Google Sheets visualisations, I eneded up searching for things like sqlite buckets or how to make rows columns vice versa, to mixed luck.

Instead of this being a spatter of separate posts, I thought I'd collect them here for lower discoverability, but higher density.

Splitting data into buckets

This is a big one that took me many attempts to find an article / StackOverflow post that would explain quite right for what I wanted to do.

For instance, let's say we have the following data (limited for brevity):

platformorganisationrepopackage_nameversioncurrent_versionpackage_managerpackage_file_pathdep_typesleveladvisory_typedescriptionsupported_untileol_from
githubstretchrtestifygo1.111.23.3gomod_codegen/go.mod["golang"]ERRORUNMAINTAINEDgo 1.11 has been End-of-Life for 1897 days 2019-09-03
githubwiremockwiremockgradle4.5.14.5.1gradle-wrapperperf-test/gradle/wrapper/gradle-wrapper.properties[]ERRORUNMAINTAINEDgradle 4 has been End-of-Life for 2178 days2018-11-262018-11-26
githubwiremockwiremock-resilience-examplesgradle4.5.14.5.1gradle-wrappergradle/wrapper/gradle-wrapper.properties[]ERRORUNMAINTAINEDgradle 4 has been End-of-Life for 2178 days2018-11-262018-11-26
gitlabgitlab-orggitlabnode4.2.24.2.2gitlabcilib/gitlab/ci/templates/Pages/Metalsmith.gitlab-ci.yml["image"]ERRORUNMAINTAINEDnodejs 4 has been End-of-Life for 2388 days2017-04-012018-04-30

If we wanted to get a high-level view of what packages (via the package_name) were in the state of being UNMAINTAINED or DEPRECATED, how would we do that, so we can i.e. get the following output?

package_namebucketcount(*)
gounmaintained1
gradleunmaintained2
nodedeprecated2
nodeunmaintained10
pythondeprecated1
railsunmaintained2
rubyunmaintained1

To do this, we can use a case / when statement to group each type of data into a named bucket, like so:

select
package_name,
(
	case
	when abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
		then 'unmaintained'
	when abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
		then 'deprecated'
	else
		'THIS SHOULD NOT BE HIT'
	end
) as bucket,
count(*)
from advisories
where
  (
    abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
  )
  or
  (
    abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
  )
group by package_name, bucket
order by package_name, bucket

This is very useful when trying to then graph the resulting data, and I've ended up using this bucketing quite a few times after finding out how to do it πŸ™ŒπŸΌ

ORDER BY doesn't need to reference a column

Something interesting is the fact that it's possible to use an ORDER BY with an arbitrary statement, for instance:

select
  visibility,
  count(*)
from
  repository_metadata
group by
  visibility
order by
  (
    case
      visibility
      when 'PUBLIC' then 0
      when 'INTERNAL' then 1
      when 'PRIVATE' then 2
      else 3
    end
  )

Previously, I had been creating a separate column called ord and then was using this, but this is much simpler.

Querying JSON

I've written about this more in a separate post but one thing to note that I've recently been doing is relying on the shorthand "arrow functions" that are inbuilt to SQLite, and are familiar if you're coming from Postgres.

For instance, let's say we have the following:

select
  id,
  json_extract(data.json, '$.name') name
from
  data

We can actually further simplify this with an arrow operator, ->>:

 select
   id,
-  json_extract(data.json, '$.name') name
+  data.json ->> '$.name' name
 from
   data

Concatenating strings

Something I've done a bit before, but for posterity:

  -- ...
  datasource || ': ' || update_type as label,
  --         ^^ this concatenates between multiple strings
  -- ...

Very useful when trying to add useful labels for graphs.

Providing default values for NULLs

In the case that we're retrieving a nullable column from database, but want to provide a default value, we can use coalesce.

For instance:

select
  coalesce(current_version, version) as ver
from
  renovate

Or:

select
  coalesce(metadata, '{}') as m
-- ...

Date calculations

Something I've been doing since very early on with DMD is calculating "days between" dates, i.e. to given an indication of "you've been running Node 12.x for 900 days past its End-of-Life date".

For instance, if we want to take a column like:

supported_until
2024-10-22

And work out how many days it's been since that date, we could write a query using julianday, such as:

select
  supported_until,
  -- it's more useful to show the absolute (aka always positive) value to humans
  abs(cast(julianday(supported_until) - julianday('now') as integer)) as since
from
  advisories
where
  supported_until is not null
  and
  -- this number will be negative if it's in the past
  cast(julianday(supported_until) - julianday('now') as integer) < 0

Using subqueries as a way to produce column-focussed data (for presenting in Google Sheets)

Within the DMD codebase, I'm using the excellent sqlc for my database queries.

One thing it doesn't really support - which is fair - is heavily nested queries, including those with subqueries, so I don't generally lean on them as it's hard to use them with sqlc.

However, when preparing this presentation, I found that as I wasn't tied to sqlc while preparing the data, I could write the most cursed and nested subqueries I wanted 😈

This allowed me to think about how to handle getting the right columns in the resulting output, at the cost of it generally being a less efficient query, but working for what I needed.

For instance, the below query:

select
(
	select
	count(*)
	from
	(
		select
		distinct package_manager, package_name
		from
		renovate
	)
) as num_deps,
(
	select
	count(*)
	from advisories
	where advisory_type = 'DEPRECATED'
) as total_deprecated,
(
	select
	count(*)
	from advisories
	where advisory_type = 'UNMAINTAINED'
) as total_unmaintained

This isn't the most complex example I could share, but out of an abundance of shame I won't go into it too much 🫣

Written by Jamie Tanna's profile image Jamie Tanna on , and last updated on .

Content for this article is shared under the terms of the Creative Commons Attribution Non Commercial Share Alike 4.0 International, and code is shared under the Apache License 2.0.

#blogumentation #sqlite.

This post was filed under articles.

Interactions with this post

Interactions with this post

Below you can find the interactions that this page has had using WebMention.

Have you written a response to this post? Let me know the URL:

Do you not have a website set up with WebMention capabilities? You can use Comment Parade.