Quantifying your reliance on Open Source software (State of Open Con version)
This is a writeup of my talk at State of Open Con 2024, about the dependency-management-data project. The talk abstract can be found on my talks site.
This is an updated version of the original talk writeup, Quantifying your reliance on Open Source software, reflecting significant new features and improved documentation, as well as refreshing the content to fit the conference.
This talk is now also available as a recording on YouTube if you'd prefer to watch it.
whoami
I'm Jamie, a Senior Engineer with an interest in solving engineering-facing problems and making folks more effective in their roles, as well as an avid blogger (on this website). I've been thinking about the problem of understanding your Open Source dependency tree in this form since as early as 2021, and more generally since ~2019.
Timeline of events
- 2024-02: This talk!
- 2023-07: First public talk
- 2023-02: Created the dependency-management-data project (which has just celebrated its first anniversary)
- 2022-08: First iteration with Dependabot
- 2019: "Formally" considering it
- 2017: Hacking around
Why is this important?
As I wrote in the post Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest (CC-BY-SA-4.0):
In recent years, it has become unavoidable to build software on top of Open Source. This is absolutely a great thing, and allows developers to focus on as few areas of domain specialisation as possible, as well as allowing a much wider range of users to pick up on defects and bring new features to our tools.
However, with events such as the Log4Shell security vulnerability, and times when maintainers have removed their libraries from package and source repositories - sometimes in political protest - it's understandable that businesses are somewhat hesitant about the sustainability of projects.
Open Source projects need support, love and positive feedback from their communities, and with the increasing demands of organisations on their software supply chain, it's important to fully appreciate the depth of your dependencies.
Being able to understand how your business uses Open Source is really important for a few other key reasons (but this list is by no means exhaustive!):
- How am I affected by that dependency migrating away from Open Source?
- Usages of unwanted libraries
- Understand usage of libraries and frameworks, and their versions
- Discovering unmaintained, deprecated or vulnerable software
As well as focussing on Open Source, we can also ask these same questions about how your business uses internal software:
- Usages of unwanted libraries
- Understand usage of libraries and frameworks, and their versions
- Discovering unmaintained, deprecated or vulnerable software
There are additional insights we can discover about our dependencies, such as:
- How maintained does the dependency appear to be?
- What do the dependency's supply chain security practices look like? (via OpenSSF Security Scorecards)
- How many dependencies are actively seeking financial support?
That all sounds great, so how do we do that?
It's very likely that you've been sitting on a call with a vendor who's pitching you this idyllic view of your software estate. Or maybe you've recently been told that if you had Software Bill of Materials (SBOMs) for all your applications, this would magically solve the world's problems.
Instead, I'm going to show you that you don't need to pay a lot of money to gain some key insights into your use of Open Source and proprietary software, but can instead build this with Free and Open Source software, and Open APIs (not OpenAPI, which I often talk about!).
What is dependency-management-data?
Dependency Management Data (DMD) is a set of Open Source tooling I've built from the ground up as a means to gain insights into your dependencies. It provides a means to look at the Open Source and proprietary dependencies that your organisation or projects use, producing an interface that allows further querying, filtering, and reporting.
DMD consists of:
- The outputted SQLite database
- The command-line tool `dmd`
- The web application `dmd-web` and GraphQL-only web application `dmd-graph`
- (Your SQLite browser of choice)
The SQLite database
Arguably the most important part of dependency-management-data is the resulting SQLite database that's produced from dependency data.
One key design decision for DMD was to utilise SQLite as the database engine. SQLite has recently seen a resurgence in popularity and for me was the perfect choice as I wanted to make it convenient to share the data between people, at least early on when I was manually updating the data and building the database.
With SQLite, there is a single file that can be shared around - for instance as part of the result of a GitHub Actions or GitLab pipeline - which would have performed any operations necessary to produce a "ready to use" dataset, and then allow folks to perform their own queries on top of it.
SQLite also works well whether you're working locally or hosting it elsewhere, as SQLite is a single-file database that can be distributed much more conveniently than other database engines.
Another key design decision was that the database should be the source of truth for all data and querying. Instead of locking you into using the `dmd` CLI to interact with the database, all data gets synced to the database, and can be browsed with any database browser.
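For instance, because it's just SQLite, you can poke at the database with nothing but the stock `sqlite3` shell. A minimal sketch, assuming you've already built a `dmd.db` database (the `renovate` table is the one produced by `dmd import renovate`, as shown later):

```sh
# list the tables DMD has created
sqlite3 dmd.db '.tables'
# peek at the raw rows from the renovate datasource
sqlite3 dmd.db 'select * from renovate limit 5'
```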
You may be interested in reading the design decisions, which discuss some of this in more depth.
CLI (`dmd`)
However, to get the SQLite database, we first need to use `dmd` to create it.
The `dmd` CLI contains the functionality to build the SQLite database, consuming different types of dependency data (also known as "datasources"), and can be optionally enriched with data such as "advisories".
You can use this dependency data as-is, or you can use the command-line tool to enrich the database with additional data ("advisories"), such as being able to get insight into which dependencies are running end-of-life versions, as well as interrogate the database for specific data ("reports").
As well as data that can be gleaned for public dependencies, you can also write your own custom advisories or "policies" to provide your own organisation-specific insights into the data.
With the raw dependency data discovered as well as any additional data added via advisories or dependency health, you can discover some pretty interesting things about your usage and answer all of the questions posed earlier in the talk, and more!
Web Applications (`dmd-web` and `dmd-graph`)
DMD also contains an inbuilt web server, `dmd-web`, which allows serving the database using a pre-configured integration with Datasette's excellent SQLite UI.
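Running it locally could look something like the following sketch - note that the `--db` flag here is an assumption, mirroring the `dmd` CLI's flag, so check `dmd-web`'s own help output for the exact invocation:

```sh
# serve the previously built database over HTTP
# (--db is an assumption - check dmd-web's help for the exact flags)
dmd-web --db dmd.db
```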
One of the great things about having this as a web UI is that you can share URLs to previously run queries, allowing you to easily collaborate with colleagues on the data without copy-pasting results, as well as giving you a central place for teams to access the data.
The `dmd-web` application contains an inbuilt GraphQL API for additional querying, which makes it possible to query for specific data without needing to write SQL directly. It's expected that you'll reverse-proxy `dmd-web` behind an OAuth2 proxy, so there's a separate `dmd-graph` application that can be deployed on its own to expose only the GraphQL API.
Data sourced through the example project can be found hosted by `dmd-web` on Fly.io.
How did it come to be?
This project is something that has been bubbling away in the back of my mind for a few years.
In the post Idea for Open Source/Startup: monetising the supply chain, I discussed how having access to dependency trees can be handy for a multitude of reasons, including financially supporting your supply chain:
While at Capital One, one of my colleagues was working on a side project to look at dependencies we were using, as a means to better understand our dependency trees and to make it easier to determine when we needed to do dependency upgrades.
It'd got to a pretty great place, just as we'd started to adopt WhiteSource Renovate (now called Mend Renovate), so we were discussing other options for it, as it was now redundant for that original purpose.
Among other options raised, I suggested using it as a way to understand what libraries we were using, across our software estate, and use it to more appropriately distribute (financial) support to our projects.
Before this post, I'd worked on something similar at Capital One to gauge the usage and spread of libraries across repositories in my team or around our shared libraries community, which required awkward scripts of `grep` and `sed` to achieve the same, as there wasn't an easier way.
Fast forward a few months from that post to Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest:
Coming up to Hacktoberfest [in 2022] - my first Hacktoberfest since joining Deliveroo - I wanted to spread the love and see if I could give a similar experience to other folks, as well as to try and get us to contribute to some of the projects that power the business.
A few months ago, I wrote about an idea on my personal blog about programmatically determining how (Open Source) libraries are used and, in that case, contributing financially, but the concept still works for contributing in other ways. I decided that I wanted to use the same dependency analysis approach, using the dependency tracking functionality we have available through GitHub Advanced security. Deliveroo is a data-driven company, so being able to bring some data to teams, to highlight commonly used libraries that may be good candidates for contributions, was really important.
As part of this, I had the opportunity to really dig into the data and find out how to use it to determine our most used direct/transitive packages.
As we had recently got GitHub Advanced Security's Dependabot APIs enabled across Deliveroo, this gave me a great starting point for this data. Although Dependabot APIs only supported a subset of the languages and tools that we used, it supported much more than my hacky shell scripts could have in the past.
At the time, this was purely looking at the names of dependencies to understand usage, but as time went on, I started using it more and more for understanding our ecosystem.
This fed into some work in early January around our Production Engineering teams' need to understand the usage of Datadog client versions, and started to prove the value of having this data available.
This was a little awkward, hampered by the way that GitHub's Dependabot APIs were structured, as we were missing information about the currently discovered version of a dependency. In most cases, GitHub's data would provide the version constraint specified in the `Gemfile` or `go.mod`, which would need further sanitisation to discover the exact version; if you were lucky, a separate JSON object would exist in the response if there was a lockfile understood by Dependabot.
Update 2023-10-14 - as noted in Prefer using the GitHub Software Bill of Materials (SBOMs) API over the Dependency Graph GraphQL API, these issues mostly disappear when using the SBOM API, which `dependabot-graph` has used since v0.2.0.
As we were starting to use Renovate more, I discovered that Renovate had some pretty great data as well as supporting a much wider set of package ecosystems that we could use to our advantage. It wasn't immediately straightforward to get the dependency data out of Renovate, so I created a slim Open Source package called renovate-graph which would wrap around Renovate and allow outputting the full dependency tree in a JSON format. In hindsight, the "graph" is a bit of a misnomer, as it doesn't provide the full graph.
Using Renovate as the datasource for dependency data opened us up to more of the ecosystems we used, like Scala Build Tool (sbt) and CircleCI, as well as including the exact version number a dependency was resolved to. With this available, I was able to start building some internal tooling for checking end-of-life details using endoflife.date, which provides an API to query the dates at which certain types of software become end-of-life, such as Node.js, Go, Apache Tomcat, etc.
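For a flavour of what that looks like, you can query endoflife.date's public API directly - for instance, for Go's release cycles:

```sh
# list all Go release cycles, including their end-of-life dates, as JSON
curl -s https://endoflife.date/api/go.json
```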
While doing this, I realised that my cobbled-together database schema would be better thought through in a more structured way. Up until now, all the code was internal to Deliveroo, but I found that it didn't need to be, as this was a problem others could benefit from having a solution for, especially as I'd proven some value of this inside the org.
I decided to set about working on a clean-room implementation from the ground up which would make it more generic than Deliveroo's internal setup, and as it was an evenings and weekends project, it naturally fit in my personal organisation rather than my employer's.
How does it work?
DMD is first and foremost a command-line tool, `dmd`, which aims to pull dependency data from different datasources and construct an SQLite database for further processing.
To start using DMD, a user needs to run three fairly straightforward commands - one to retrieve some data, and two to ingest it:
```sh
# produce some data that DMD can import, for instance via renovate-graph
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN your-org/repo
# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# optionally, generate advisories
dmd db generate advisories --db dmd.db
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'
```
Datasources
As mentioned above, DMD doesn't know how to get the dependency data itself, so it requires you to provide data through the following tools:
- renovate-graph (using the package data known about by Renovate)
- Note that this doesn't require you to be using Renovate for your dependency updates; it's simply used as a means to discover dependency data, and has much better quality data than any other scanner I've tried, especially compared to Dependabot
- dependabot-graph (using the package data known by GitHub's Dependabot API)
- Software Bill of Materials (SPDX, CycloneDX)
- endoflife-checker (supports various types of AWS infrastructure)
DMD has an underlying data model that it translates each of the above datasources into, which is imported into the database schema.
From there, DMD then uses its own understanding of those data formats for performing reporting or enriching the data.
Once ingested, it's possible to write SQL queries to your heart's content, for instance to ask:
- "which repos use a vulnerable version of Log4J"
- "how many repos are using a version of the Datadog SDK that's older than ..."
- "what is our most used direct/transitive dependency?"
Reports
As well as having raw access to the data and being able to query it yourself, there are some common queries that folks may be interested in.
As of writing, there are several reports available:
```
$ dmd report --help
  advisories                   Report advisories that are available for packages or dependencies in use
  dependenton                  Report usage of a given dependency
  golangCILint                 Query usages of golangci-lint, tracked as a source-based dependency
  infrastructure-advisories    Report infrastructure advisories
  licenses                     Report license information for package dependencies
  mostPopularDockerImages      Query the most popular Docker registries, namespaces and images in use
  mostPopularPackageManagers   Query the most popular package managers in use
  policy-violations            Report policy violations that are found for packages or dependencies
```
An example of these reports can be found on the example web app.
Some of these operate on the raw data, but some of them require pre-enriching the data with advisories data.
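Putting those two together, running a report might look like the following sketch - the `--db` flag on `dmd report` is an assumption, mirroring the flag used elsewhere in the CLI:

```sh
# some reports rely on advisory data, so enrich the database first
dmd db generate advisories --db dmd.db
# then run one of the built-in reports against the enriched database
dmd report advisories --db dmd.db
```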
As we'll see in the case studies later, a few of these came off the back of events in the Open Source ecosystem.
Advisories
Being able to query the dependency data for your projects is really powerful, and makes it possible to start answering questions like "what Terraform modules and versions are being used across the org" and "which teams are using the Gin web framework".
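As a sketch, the second of those questions could start as a query like the one below, later joined against ownership data (covered further down). The `package_name` and `version` column names are assumptions based on DMD's documented schema; `github.com/gin-gonic/gin` is Gin's Go module path:

```sql
-- find repos that depend on the Gin web framework
select distinct
  platform,
  organisation,
  repo,
  version
from renovate
where package_name = 'github.com/gin-gonic/gin';
```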
These questions are too specific to your organisation to be made generic in the form of a report, but what if you wanted to ask questions like "which software am I running that needs an upgrade soon?"
This concept is known as "advisories", and it provides a means to surface other information about your dependencies, such as whether a dependency is deprecated/end-of-life or unmaintained, has a security issue, or has some other noteworthy property.
As mentioned before, to start with I found that it was useful to have end-of-life checking through endoflife.date, which gave us visibility over which of our libraries were running end-of-life versions. Over time, I've also added integrations with deps.dev for vulnerability and licensing data and Ecosyste.ms for dependency health data.
This end-of-life checking doesn't just work for package data, but also includes AWS infrastructure checking through endoflife-checker, making it possible to answer questions like "how much time should my team(s) be planning in the next quarter to upgrade their AWS infrastructure".
These are useful, but sometimes you will want to define your own rules or advisories, which can be done by creating custom advisories. I find this to be a particularly useful feature, as it allows you to teach the tooling what works best for your organisation.
To do this, you can add an advisory to the `custom_advisories` table, which allows you to define your own rules about packages. This lends itself well to defining, for example, a security or maintenance issue with your own internal libraries, or flagging cases where you're using libraries you would prefer not to.
An example of what advisories data looks like can be found on the example web app.
Additionally, there are community-sourced advisories through the "contrib" project, which provides a means to share common advisories for the good of the community. For instance:
```sql
INSERT INTO custom_advisories (
  package_pattern,
  package_manager,
  version,
  version_match_strategy,
  advisory_type,
  description
) VALUES (
  'github.com/golang/mock',
  'gomod',
  NULL,
  NULL,
  'UNMAINTAINED',
  'golang/mock is no longer maintained, and active development has been moved to github.com/uber/mock'
);
```
If there are any other sources you'd find useful for advisories, please contribute them! If you're unable to - for instance if it takes information from an internal database - then you could create a new table and provide a means to sync the data into it, so you can add it to custom queries.
Policies
As noted in the Turning complex policies into custom Advisories using Open Policy Agent cookbook, the ability to write policies using the Open Policy Agent's policy language, Rego, allows writing much more complex custom advisories, which are also called "policies".
It's worth reading through the in-depth examples in the cookbook linked above for how policies work, in particular how you can use them to flag specific versions of, for instance, Bytedance's Open Source libraries.
Ownership
An additional opt-in feature is the ability to manage ownership for repositories, which can be really great for trying to work out who you need to get in touch with about an advisory.
For instance, let's say that we've found which of our projects are using a Go library that we're no longer recommending. How would we let the owners know that this is deprecated? Do we know who the owner even is?
DMD contains a `dmd owners` subcommand that allows us to manage ownership through a separate `owners` table, which allows `JOIN`ing in queries.
This could for instance be synced with some internal tooling for managing ownership of services and projects, such as your Service Catalog(ue).
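Populating it by hand might look like the following sketch - the column names match those used in the `JOIN` below, but the values are hypothetical:

```sql
-- record that a (hypothetical) team owns a given repo
INSERT INTO owners (platform, organisation, repo, owner)
VALUES ('github', 'my-org', 'api-service', 'team-payments');
```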
Once the ownership data is present, you can then perform a query such as:
```sql
select
  distinct
  renovate.platform,
  renovate.organisation,
  renovate.repo,
  owner
from
  renovate
left join owners
  on renovate.platform = owners.platform
  and renovate.organisation = owners.organisation
  and renovate.repo = owners.repo
```
This would allow you to see all repos and their respective ownership, and works well when performing other queries against this data.
It's also worth checking out the cookbook for how to use the ownership data.
Repo metadata
In addition to ownership of given repositories, there is the ability to store additional metadata around the repository.
The repository metadata capability allows you to introduce other insight into a given source code repository, such as:
- is this a monorepo? (a step towards better supporting monorepos in the future)
- is this a fork?
- what type of repository is this, e.g. `SERVICE`, `EXAMPLE_CODE`, `CLOUDFRONT_LAMBDA`?
- is the repo public/internal/private?
- what other key-value metadata is relevant?
This is fully documented in the database schema.
This allows us to fill the `repository_metadata` table with data that describes our repositories, fed in via our internal Service Catalog(ue) as well as other internal datasources, producing data such as:
| repo             | repository_type | repository_usage  | additional_metadata          |
|------------------|-----------------|-------------------|------------------------------|
| api-service      | SERVICE         | API JAVA POSTGRES | {"customer_facing": "true"}  |
| examples         | EXAMPLE_CODE    | SDK WORKSHOPS     |                              |
| business-service | SERVICE         | FRONTEND          | {"customer_facing": "false"} |
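To sketch how a row like the first one could be inserted: the `platform` and `organisation` key columns are assumed from the `JOIN` below, and whether `repository_usage` is stored as a single column is also an assumption - check the database schema documentation for the exact shape:

```sql
-- a hypothetical row for a customer-facing service
INSERT INTO repository_metadata
  (platform, organisation, repo, repository_type, repository_usage, additional_metadata)
VALUES
  ('github', 'my-org', 'api-service', 'SERVICE', 'API JAVA POSTGRES',
   '{"customer_facing": "true"}');
```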
If we wanted to find the number of dependencies that customer-facing services had, we could write a query such as:
```sql
select
  sboms.platform,
  sboms.organisation,
  sboms.repo,
  (case json_extract(repository_metadata.additional_metadata,
      '$.customer_facing') when 'true' then true
    else false
  end) as is_customer_facing,
  count(*) as total_deps
from
  sboms
left join repository_metadata
  on sboms.platform = repository_metadata.platform
  and sboms.organisation = repository_metadata.organisation
  and sboms.repo = repository_metadata.repo
-- where ...
group by
  sboms.platform,
  sboms.organisation,
  sboms.repo
```
Example project
Another key piece of functionality in the DMD ecosystem is a separate example project, which pulls from various real-world public repositories.
Although not a core part of the DMD project itself, it's an important offering to provide prospective users an idea of what the data could be used for, as well as being part of the integration tests that run as part of contributions to DMD, to ensure that there aren't any regressions introduced.
Contrib project
As mentioned before, there is the "contrib" project which provides a space to manage community-sourced contributions.
Right now it only has support for custom advisories, but it's been set up in an extensible way, to allow sharing other community-sourced data that doesn't make sense to sit in DMD's repo.
Case Studies
To give more of an indication of some of the things that can be done with DMD, let's take a look at some practical applications of this tooling, based on areas this data has previously been used.
In the talk, I walk through the following case studies:
- Deliveroo and a potential race condition with a Kafka sidecar
- Responding to the Log4shell incident
- Determining the effect of the Gorilla Toolkit archiving
- Determining how the Docker Free Tier sunset affects you
I'd recommend reading through these, as well as any other case studies that you find interesting. They go into more depth than was possible during my talk, both for the context of what the end goal was and why, as well as showing what data was available and how we queried it.
Getting started
To get more of a feel for some real-world example data, it's worth checking out the Getting Started (with the example data) cookbook, which digs into the example project, and pulls data from various Open Source repositories across GitHub and GitLab.com.
There are also screencasts of various parts of the DMD tooling's functionality, using the example project, which can be found on the dmd website.
As well as this, you can also check out the Getting Started cookbook, which takes you step-by-step through getting started with the project - against your own organisation's data - and taking your first steps towards understanding how your organisation uses different dependencies.
There's even a TL;DR section that serves as a concise getting-started guide, which may be more convenient to share with colleagues.
This TL;DR translates to the following three-command setup:
```sh
# produce some data that DMD can import, for instance via renovate-graph
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN your-org/repo another-org/repo
# or for GitLab
env RENOVATE_PLATFORM=gitlab npx @jamietanna/renovate-graph@latest --token $GITLAB_TOKEN your-org/repo another-org/nested/repo
# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'
```
What's next?
I've got a lot of features, tweaks, and some bug fixes that I'd like to work my way through, and would appreciate insight from users about what may be useful for you.
I'd love to hear how you find the cookbooks for getting started and doing some common things, as well as to get more folks using it, sharing their own use-cases, and suggesting functionality that would make this more effective.
I'm super passionate about this, and it's arguably been a bit of a game changer in the way I can approach problems, both as an engineer working on shared tooling and at a team level when considering what work is required to close off advisories.