How to use Dependency Management Data to discover which dependencies are participating in Hacktoberfest
As I've mentioned before, the fact that it's September means that it's almost October, and October primarily means one thing for me: Hacktoberfest!
Two years ago, the precursor to dependency-management-data was created as part of the blog post Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest, which has a more fleshed out history of inception if you're interested.
As this is the first full year since the project was started, following its official birthday in February, I wanted to take this opportunity to consider how I would do the same thing in 2024, given that I now have a much better understanding of how I use Open Source, thanks to dependency-management-data and the data it understands.
I'd hoped to finish this for September 1st, but I didn't end up doing so (as you can see from the publish date), and in the last few hours I noticed that the new Hacktoberfest website and branding are live, so this is a perfect time to ride the coattails of the hype and get this post out.
One of the key things with dependency-management-data is that once you've ingested your dependency data, you can then start querying it, for instance using the pre-built "reports".
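The database dmd produces is plain SQLite, so if the pre-built reports don't cover what you need, you can also poke at the data directly. As a minimal sketch, with the caveat that the table names depend on which data sources you've ingested, so list them first:

```sh
# the dmd database is a regular SQLite file, so any SQLite tooling works
sqlite3 dmd.db

# then, inside the sqlite3 shell, you can explore what was ingested:
#   .tables                              (list the tables dmd has populated)
#   .schema some_table                   (inspect a given table's columns)
#   select * from some_table limit 5;    (sample whichever table holds your dependencies)
```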
To make it possible to query for which of your dependencies' repositories are participating in Hacktoberfest, there's now a new report in dependency-management-data v0.106.0, which allows you to run a command such as:
```sh
# linebreaks added for readability
dmd report hacktoberfest --db dmd.db \
  --perform-external-lookup \
  --platform gitlab \
  --organisation tanna.dev \
  --repo ghprstats
```
This then provides you a view of which dependencies, if any, are participating in Hacktoberfest, by using the `hacktoberfest` topic on their GitHub or GitLab repos.
Notice that when calling the report, we need to explicitly specify the repo key (the platform, organisation and repo) to query a specific repository in our database.
This is because there are a few external lookups, illustrated below:

- discover the URL for the repository, via Ecosystems
- if it's a GitLab repo, call out to gitlab.com's APIs to check the repository topics
- if a `GITHUB_TOKEN` environment variable is set, call out to github.com's APIs to check the repository topics
- otherwise (if there's no `GITHUB_TOKEN`, or if it's a non-GitHub or non-GitLab repo), we use the data straight from packages.ecosyste.ms
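As a rough illustration of what those topic checks involve, here's how you could do the equivalent by hand with curl. This isn't dmd's actual code, and the specific endpoints are my assumption of the relevant ones (the ORG/REPO/OWNER placeholders are yours to fill in):

```sh
# GitLab: the Projects API includes the repository's topics
# (the project path is URL-encoded, e.g. ORG%2FREPO)
curl -s "https://gitlab.com/api/v4/projects/ORG%2FREPO" | jq '.topics'

# GitHub, when GITHUB_TOKEN is set: the repository topics API
curl -s \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/OWNER/REPO/topics" | jq '.names'

# otherwise, dmd falls back to the metadata that ecosyste.ms already holds for
# the repository, so no additional forge API call is made
```

dmd handles this branching for you as part of `--perform-external-lookup`; the above is only to show where the topic data comes from.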
To avoid a significant amount of outbound traffic, I've made the choice, for now, to only support looking up one repo at a time, as a couple of datasets I've queried this with have tens of thousands of dependencies to look up, which would be quite a lot of requests 🫣
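If you do want to check a handful of repositories, one option is to loop over them yourself, pausing between lookups. A minimal sketch, assuming the repos all live under the same GitLab organisation (the second repo name is purely illustrative):

```sh
# look up a few repositories one at a time, being kind to the upstream APIs
# ("another-repo" is a placeholder: swap in your own repo keys)
for repo in ghprstats another-repo; do
  dmd report hacktoberfest --db dmd.db \
    --perform-external-lookup \
    --platform gitlab \
    --organisation tanna.dev \
    --repo "$repo"
  sleep 5
done
```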
So what does this look like?
For instance, on my `ghprstats` project, which does have some repos participating:
And for a repository that doesn't have any dependencies that are participating, like my `readme-generator` project:
This will hopefully be useful for folks who are looking to take their existing pre-built dataset of dependencies and work out where they can find dependencies that are of high value to the organisation (either because they're used in a large number of repositories across the org, or because they make up a high percentage of the dependencies in use at the org) and that would be a good opportunity to give back to the community.
What other information do you think would be useful to add here? Should I add dependency-management-data to projects participating in Hacktoberfest?