Things I've learned about building + delivering software for other engineers while working in Engineering Productivity

Roughly 18 months ago, I joined Elastic and started working in Engineering Productivity for the first time in my career. It's a role I've been gravitating towards throughout my career, and something I've done unofficially several times, so it's been very enjoyable to work on it as part of my "day job".

For the last ~12 months I've been in the DevFlow team, where we own internal + external tooling that targets "left of the main branch" to make teams more productive within Elastic's Platform organisation - think automated builds + CI, supply chain security, dependency management, development environments, test tooling, code quality and security, and many more things!

In that time, I've learned a few lessons about building software for developers, especially when those developers are in your organisation, and thought they'd be worth sharing.

Get out of the way of the user, as soon as you can

Although we'd love to craft ✨ beautiful, seamless experiences that are a joy to work with and brighten everyone's days 🌈, that's (unfortunately) not often the case.

While we (as Engineering Productivity, Developer Experience, Platform teams, or insert-job-function-here) are often building tooling, services or platforms that we're proud of, and that are for the most part built with the best intentions, we're also going to be the cause of frustration for engineers.

We need to remember that each unfixed bug or piece of friction can lead to an engineer who's already having a bad day having an even worse day. This compounds when the tools we're building are on the "critical path" (and therefore mandated), meaning that folks likely can't introduce workarounds to avoid dealing with our tools.

With that in mind, we should make sure our tools actively avoid causing strife, and look at how we can "get out of the way" as soon as possible - focusing on providing a well-scoped set of functionality that users can pick up and get on with.

For instance, this could mean providing an "escape hatch": users get the default functionality out of the box, but can say "I'm an expert user, I've got this" and configure things themselves, with the trade-off that they may be less able to get support when things go wrong.
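To make that concrete, here's a minimal sketch of what an escape hatch could look like in Go - the Config type, its fields and the pipeline.json file are all hypothetical, purely for illustration, not anything we actually ship:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config is a hypothetical configuration for a platform-owned CI tool -
// the field names and defaults are purely illustrative.
type Config struct {
	// GoVersion is a managed default, maintained by the platform team.
	GoVersion string `json:"goVersion"`
	// CustomPipeline is the escape hatch: when set, the user's own
	// pipeline definition replaces the managed one entirely.
	CustomPipeline *string `json:"customPipeline,omitempty"`
}

// load layers the user's config (if any) over sensible defaults.
func load(path string) (Config, error) {
	cfg := Config{GoVersion: "1.22"}
	raw, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	return cfg, json.Unmarshal(raw, &cfg)
}

func main() {
	cfg, err := load("pipeline.json")
	if err != nil {
		fmt.Println("no config found; running with managed defaults")
	}
	if cfg.CustomPipeline != nil {
		// Expert mode: get out of the user's way, but be explicit
		// that support from here on is best-effort.
		fmt.Println("escape hatch in use - support is best-effort")
		return
	}
	fmt.Println("running managed pipeline with Go", cfg.GoVersion)
}
```

The important property is that the default path needs zero configuration, and the escape hatch is a deliberate, visible opt-out rather than something folks stumble into.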

Building empathy

It's been said of me at times that I'm quite an empathetic engineer.

I'd like to agree with this (slightly humbly), especially when I compare myself to others I've worked with in the past, while very much understanding I still have a lot I can do to improve.

One of the most underappreciated things I see in engineering is the understanding that empathy for your users really matters.

It's important to remember that there are real human people - who, if you're working in an internal Engineering Productivity team, are your colleagues and teammates - who you are impacting with decisions you're making.

I also remember far too well times I've worked on product engineering teams that were constantly blocked by changes from the "Platform" or "DevEx" teams, and the deep frustration (and, inevitably, burnout) it can cause.

It's important to remember that when an internal customer comes to us with feedback, we should be thankful they've engaged with us at all - they could instead have thought "eh, nothing will change, so I may as well not raise a ticket or Slack them" - and that there may be something we can do to (significantly) improve the state of things for them.

One of the things we do in Platform Engineering Productivity at Elastic is a "Host Of The Week" rota, which replaces on-call with in-hours internal customer support. Our internal customers, who comprise most of our engineering function, regularly hit different bugs, need help configuring things, or have feature requests that need to be triaged, all of which are served by the "Host Of The Week".

By putting our on-call engineer in a more customer-facing position, where they interact directly with users of our software, they build up empathy much more quickly, and learn a lot more about the shared tech stack we have, as they need to do some triage + documentation + runbook-following to answer customer questions, if they can.

I already had a fair bit of empathy for our customers, but it's been interesting watching over the last year how some things that were very regularly (constructively) complained about now don't come up much. This is hopefully because we've ironed out a number of the quirks (or better documented them), but it's also likely that people are now just used to them 😅

Something I recommend doing is making sure that you're "drinking your own champagne": have different members of the team periodically go through the onboarding docs for e.g. your new service creation tooling, and see if you can spot gaps or areas to improve, making sure to only follow the docs and not use your implicit knowledge. It's also worth taking "early adopters" from your customer base who can help you iterate and improve before things open up for general availability.

Documentation is ace - keep doing it

As you may know - being someone who blogs a fair bit - I quite like writing (documentation).

But why do I enjoy writing documentation? It's because I find writing in an empathy-first way a very valuable experience.

Think of your documentation from the point of view of your users: "what is someone who's coming to our documentation looking to solve?", or, when releasing a new service or feature, "what would I want to read as a customer to learn how to use this new functionality?", or even "what do I wish I knew when I first started fighting this darned system?"

As a more tangible example, in the v2.2.0 release of oapi-codegen I completely rewrote the documentation for the project, adding copious documentation to the README, adding a JSON Schema for the configuration file (for autocomplete + validation), and adding exhaustive examples for each of the key pieces of functionality (where they were lacking).

If you compare the documentation in the README from v2.1.0 of oapi-codegen to v2.2.0 of oapi-codegen, you can see a significant difference, and I've had a lot of positive feedback off the back of it.

More recently, I've done some similar, but less in-depth, work on renovate-graph, and although there's still work to do, I'm happy with the improvements.

It takes time, and a skill I feel I at least partly have, to write good documentation, but I find it so worthwhile for engineers to learn how to do it and to keep investing in it - it makes for more well-rounded engineers.

Documentation is a waste of time

Unfortunately, it can be wasted effort.

As the age-old saying goes:

If there is documentation written in the wiki, but no one can find that page, has it truly been written?

If your customers aren't able to find that documentation, then what's the point of having written it?

Is it what the user needs?

And then, even if they do find it - is it what they need?

The character Kevin Malone from "The Office (US)", saying "Why waste time say lot word when few word do trick?"

I'll hold up my hand and say that I can definitely be a little on the verbose side when writing, and may provide a little too much context at times for why rather than how the user needs to solve the issue.

Although I've not formally taken on the Diátaxis framework in anything I've done, I would recommend reading up on it as a way to consider the different types of information users need, and how you can better serve them what they need.

Are they going to read it?

And even if it is what they need, are they going to read it? 😅

Let's be honest: not everyone reads the documentation. They'll skim it, or ask an AI to summarise the section, because they don't want to read the hand-crafted words you've written.

There's always more you could be doing

This one is a fun two-for-one learning - there's always more you can be doing.

The backlog scales infinitely

There will always be more feature requests, more bugs to triage and more planned work than can feasibly fit into the roadmap.

Something I've seen across my career is that, like the hydra, you can close one ticket off and two more will appear in the backlog. I feel this is further exacerbated by internal customer requests to triage and prioritise - it's no bad thing that they're actively feeding into our work, just that it's an additional backlog to work through.

A (disappointingly, not moving) GIF of the myth of Sisyphus, where Sisyphus is attempting to roll the boulder up the mountain, for it to inevitably roll down before he is finished

One option you have is to try and outsource some of the work. At a previous large enterprise I worked at, there was a heavy focus on "innersourcing": getting your customers to send the contributions they want accepted, after which "all" you need to do is "just" review the changes as they come in.

This works great in theory, but in practice, not so much.

As I and many others know from being an Open Source maintainer, one reason people don't have commit access to a project by default is that they'll need a hand with making their changes work. Even if your users can raise PRs to fix their issues, they'll still need help from your team to get them over the line - often it can take more time to coach someone through making the change than to do it yourself.

Although in Open Source it can be useful to have users learn more about the codebase and become regular contributors, in my experience this happens less frequently with innersourcing in corporate environments.

You'll never ship the perfect solution, anyway

Secondly, you could always be doing more anyway, as you are very unlikely to be building a solution that works for 100% of the people with 100% of the use-cases every time.

Brian Fantana from Anchorman saying "60% of the time, it works every time"

Instead, we need to make sure that we're prioritising the requests as they come in, thinking about e.g. whether a feature will unlock behaviour for everyone, or just a subset of users - or whether it's even something that should be supported.

Friction can be good

Something I learned in my time in Deliveroo's Customer Care organisation was that there's a fine balance in finding the "right amount" of friction to introduce for users.

In Deliveroo's case, for customers looking to request compensation, "too much friction" risked fines (for making it too hard for a customer to complain and request compensation), whereas "too little friction" meant increased costs for paying out compensation and staffing the Care organisation. It was a fine balance: making it just difficult enough to put in a claim that payouts reduced, but not so difficult that it resulted in complaints or less loyal customers.

Goldilocks stands on the front decking of the bears' house, proclaiming "Your porridge ain't good". The bears look aghast, and both Goldilocks' and the bears' eyes and mouth are replaced with human eyes and mouths, looking really quite unsettling

Let's say that you're noticing customers are asking a lot more questions recently, and there doesn't seem to be anything that's led to it.

It could be that they've realised that it's easier asking the question in a public channel and then going away to do something else than it is to find + read the docs.

In this case, could we introduce friction to reduce our responsiveness to questions that we feel are already answered in our documentation, allowing our customers to maybe self-serve in the meantime?

There's a fine balance here, as it may lead to your customers feeling that they won't get a timely response when they have genuine concerns that need addressing, or time-sensitive issues like a broken release pipeline.

Golden paths should be paved with building blocks

One of the things that "Platform" teams, or those in the Engineering Productivity space, will do is work on building "golden paths", which provide a really great experience for e.g. taking a service from a Pull Request/Merge Request and shipping it all the way to production, making it straightforward to operate through out-of-the-box monitoring/alerting and other great operations tooling.

This is an awesome way of reducing the things that teams need to think about + set up when building e.g. a new web API, and is something we did really well at Deliveroo with our Go service template, which provided a high-quality "state of the art" for best practices that we continually updated as we improved key services internally, or shipped new patterns for e.g. batch processing with SQS.

These golden paths should be the default choice for teams when building software, to the point that someone not using them should raise eyebrows.

But to call back to earlier - this shouldn't block teams who don't necessarily fit within the bounds we originally defined, and we should still provide escape hatches that allow flexibility for teams who don't necessarily want to follow the whole path.

Ideally these golden paths would be built on top of common building blocks that can be assembled into the golden path, in a way that is more maintainable long-term and also allows for a more composable/pluggable architecture.
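As a rough sketch of what I mean, in Go - and entirely hypothetical, as the Step interface and these particular blocks aren't how any of our tooling is actually built - the golden path is just the default composition of blocks that teams could also assemble themselves:

```go
package main

import "fmt"

// Step is a single building block of the path (build, scan, deploy, ...),
// usable standalone or composed into a larger path.
type Step interface {
	Name() string
	Run() error
}

type step struct {
	name string
	run  func() error
}

func (s step) Name() string { return s.name }
func (s step) Run() error   { return s.run() }

// Individual blocks that teams can adopt piecemeal.
var (
	Build    = step{"build", func() error { fmt.Println("building"); return nil }}
	ScanCVEs = step{"scan-cves", func() error { fmt.Println("scanning dependencies"); return nil }}
	Deploy   = step{"deploy", func() error { fmt.Println("deploying"); return nil }}
)

// GoldenPath is simply the default composition of the blocks; teams
// who don't fit the path can assemble their own, swapping blocks out.
func GoldenPath() []Step { return []Step{Build, ScanCVEs, Deploy} }

func main() {
	for _, s := range GoldenPath() {
		if err := s.Run(); err != nil {
			fmt.Printf("step %s failed: %v\n", s.Name(), err)
			return
		}
	}
}
```

The nice side effect is that a team taking the escape hatch can swap out a single block, rather than forking the whole path.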

For instance, at Elastic, we require teams to keep on top of security updates + CVE fixes for their dependencies. Teams' repositories and artifacts are scanned at several points, and teams are informed of any issues that need to be remediated.

There are a few options for keeping software patched for CVEs, and our golden path recommendation is Renovate, which is pre-configured to access and manage a number of internal dependencies that aren't supported out-of-the-box in other tooling.

Teams are well within their rights to take another route and swap Renovate out for one of the many alternatives, but at the end of the day, as long as teams are ticking the box for security upgrades, that's the important thing.

Scope creep

Scope creep is a very common risk across engineering in general, but I'd say there are a few flavours of it that are particularly pertinent to Engineering Productivity teams.

Do you own too much?

One of the troubles of working in an area of more general platform capabilities is that there can be a lot of disparate services, components, tools and platforms that you own - with groups of loosely or tightly coupled pieces.

For instance, one of the things my team owns is "CI", which is often described as a single line item in spreadsheets, but comprises roughly two dozen distinct pieces and ~5 SaaS platforms to make things work.

It's worth taking stock and working out if there's anything you can consolidate or retire, or whether you should reprioritise your focus.

For each component/service/platform/product you own, look not only at how important it is for business impact, but also at how many teams use it, whether there's active or planned development on it, and whether you can assign different weights to the level of investment each should receive.
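If it helps to make that concrete, here's a hedged sketch of a weighted scoring pass over what you own - the fields and weights are completely arbitrary, and you'd want to tune them for your own portfolio:

```go
package main

import (
	"fmt"
	"sort"
)

// Component is a hypothetical record of something the team owns.
type Component struct {
	Name           string
	BusinessImpact int // 0-5, how critical is it to the business?
	TeamsUsing     int // how many teams depend on it?
	ActiveDev      bool
}

// score turns the signals into an investment priority; the weights
// are arbitrary and would need tuning for a real portfolio review.
func score(c Component) int {
	s := c.BusinessImpact*3 + c.TeamsUsing
	if c.ActiveDev {
		s += 2
	}
	return s
}

func main() {
	owned := []Component{
		{"ci-pipelines", 5, 40, true},
		{"legacy-docs-site", 1, 3, false},
		{"dep-scanner", 4, 25, true},
	}
	sort.Slice(owned, func(i, j int) bool { return score(owned[i]) > score(owned[j]) })
	for _, c := range owned {
		fmt.Printf("%-18s score=%d\n", c.Name, score(c))
	}
}
```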

As another example, think about how polyglot your team is - can you sufficiently operate a number of JVM-based services, Python services, and Go command-line tools, all deployed into a mix of cloud providers?

Does it need a custom solution?

In a similar vein, do you need to build it from scratch?

As an engineer, the answer is likely "yes!" - that's the ideal - but perhaps not as the on-call engineer who needs to debug it at 0200, when no one's touched it in years and no one understands Python 2 any more.

Instead, could you pay a vendor to host it for you? Or is there a SaaS offering that may be a reasonable alternative?

If you do build it in-house, do your internal platforms provide any primitives to simplify the work, rather than needing to build each-and-every bit individually?

Communicating upcoming changes and reducing upgrade friction

More generally, it's really important to find a way to manage your communication with your customers, trying to find the balance between "we don't ever know what's going on" and "I auto-filter your emails, as you're too noisy".

There's definitely something to be said for brevity (not always my strong suit, ~3500 words into this post) and for making sure there's a "TL;DR" at the top so folks can get a high-level view of what's going on.

There's a balance between giving folks lots of notice (and dragging out the changes you're trying to make) and not giving anyone any time to prepare or object.

If you can, work really hard on backwards compatibility, so you don't need to give large amounts of prior warning.

But if you have to force people to migrate over, overcommunicate, provide reasonable timelines, and see if you can make the work easier - can you provide script(s) to automagically migrate? If so, can you raise the PRs for them?
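Sketching that last idea out: the loop below clones each affected repo, runs a codemod and raises a PR via the GitHub CLI. Note that migrate-config is a stand-in for whatever script you'd actually ship, and the repo names are made up:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes a command in dir, streaming its output - just a helper.
func run(dir, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Dir = dir
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical list of repos needing the migration; in practice
	// you'd discover these from your dependency data or code search.
	repos := []string{"example-org/service-a", "example-org/service-b"}
	for _, repo := range repos {
		dir := "/tmp/migrations/" + repo
		if err := run(".", "git", "clone", "https://github.com/"+repo, dir); err != nil {
			fmt.Println("skipping", repo, "-", err)
			continue
		}
		// `migrate-config` stands in for whatever codemod/script you ship.
		for _, args := range [][]string{
			{"git", "checkout", "-b", "platform/migrate-v3"},
			{"migrate-config", "--to", "v3"},
			{"git", "commit", "-am", "Migrate platform config to v3"},
			{"git", "push", "-u", "origin", "platform/migrate-v3"},
			{"gh", "pr", "create", "--fill"}, // the GitHub CLI raises the PR for the team
		} {
			if err := run(dir, args[0], args[1:]...); err != nil {
				fmt.Println("migration failed for", repo, "-", err)
				break
			}
		}
	}
}
```

Even a rough version of this turns "please do this chore" into "please review this PR", which is a much easier ask of your customers.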

Understand what teams are using

This is maybe a subtle pitch for dependency-management-data, a project I've been building for a few years, but at the companies where it's deployed, it's been making a significant impact for folks in this sort of role.

By understanding how teams are using our internal tooling, we can better serve them, as well as find out where we'd expect our tooling to be used but it isn't.

For instance, finding out that a given tool is on the critical path for the Elastic Cloud offering could inform whether there are additional requirements, or investments we need to make in the tool, to bolster its maintainability or security.

Alternatively, if you're planning on releasing v3 of your "batteries included" Terraform modules for service deployment, but find that ~60% of the organisation is on v1 and 30% is on v0, maybe the work needs to go into working out why teams aren't upgrading, and then helping them with that migration.

It also helps to be able to understand how you're impacted by supply chain security attacks.

Additionally, if there's anything you can put in, telemetry-wise, to understand how different features are being used - in a privacy-protecting way - that's another great way to understand uptake of features and work out e.g. where you may be able to retire functionality.
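For example, a privacy-protecting usage event could be as coarse as the sketch below - the telemetry endpoint, tool name and fields are all hypothetical:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// usageEvent is deliberately coarse: which tool, which feature, which
// version - and nothing that could identify a person, host or repository.
type usageEvent struct {
	Tool    string `json:"tool"`
	Feature string `json:"feature"`
	Version string `json:"version"`
}

// reportUsage sends a best-effort counter to a hypothetical internal
// endpoint; errors are swallowed so telemetry can never break the tool,
// and the short timeout keeps the tool feeling snappy.
func reportUsage(feature string) {
	body, err := json.Marshal(usageEvent{Tool: "my-cli", Feature: feature, Version: "1.4.0"})
	if err != nil {
		return
	}
	client := &http.Client{Timeout: 500 * time.Millisecond}
	resp, err := client.Post("https://telemetry.internal.example/v1/events",
		"application/json", bytes.NewReader(body))
	if err == nil {
		resp.Body.Close()
	}
}

func main() {
	reportUsage("escape-hatch")
	fmt.Println("...then get on with the actual work")
}
```

The key design choice is what's not in the event: no usernames, hostnames or repository names - just enough to count feature uptake.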

But regardless of tooling, you should also be talking to your users to understand how they're using your software: whether they're actually using it, what feature(s) they're relying on, how they're finding it, and what they wish could be changed.

"Are we pushing the needle?"

Having worked in teams with very strong product management folks, I hugely value having a Product Manager involved in thinking through the changes we're making, the strategy for our platforms, and whether we're making the "right" decisions.

In particular, if you've got a mixed bag of products and services you're offering internally, you may find that having someone with a good product background help unify the approach and strategy across the different types of work you're doing would be really impactful.

But I'd also say that, more importantly, you should try to work out whether you're actually focussing on the right things.

One way to do this is to measure metrics - for instance, DORA's - or you can leverage the hard work companies like DX are doing to understand, through well-researched-and-written surveys, how folks are doing.

I just want to be able to tell if I'm making a difference, y'all 😅

It takes a lot to uniformly deliver great experiences

The last thing I want to stress is that it takes a lot of time and effort to polish each of the tools, products or services that we're offering, and sometimes that cost isn't worth it.

Similar to the questions above about "do we own too much?", we should also accept that some tools or products will have a more premium feel to them, and that some won't quite get the same level of polish.

In particular, there may be some "legacy" products that are too hard to get up to the same level of polish - make sure you're intentional about it, and clarify to yourselves and your customers that you're deciding not to invest as much.

I would definitely look at some of my Open Source projects and note that there are differing levels of polish across them, as well as finding that my definition of quality has changed over time, which can mean a long-running project has differing levels of quality within it. Can you honestly upgrade it all in one go? Likely not, so you need to do it piece-by-piece, while also delivering other features.

Remember that building great experiences takes time, effort and intentionality - you can do it, but you need to set yourselves up for success.

Want to chat more?

I'd be interested in chatting more about any of these, especially if anything I've been working on in or outside of work is of interest.

Drop me an email or hit me up where I may be elsewhere.

Written by Jamie Tanna.

Content for this article is shared under the terms of the Creative Commons Attribution Non Commercial Share Alike 4.0 International, and code is shared under the Apache License 2.0.

#developer-experience #elastic #deliveroo #capital-one #platform-engineering.

This post was filed under articles.
