I read this Tweet today before an early morning flight, so I felt a bit pumped to write down some notes on things I’ve learned this year.
Disclaimer: This is mostly unedited and it focuses heavily on software/programming-related things – I could spend days talking about obscure details of house dance, but I guess we can leave that for later.
————————-
A lot of the mind share of programming goes into writing software, but effectively running software in production is equally important. Over the years I’ve read countless articles/books on design patterns and unit testing, but few on incident response, designing fault-tolerant systems, dealing with large DB tables, etc.
Here are a few things that have occupied my mental cycles this year in this respect:
-
Incident response – mitigation is priority #1. Communicating effectively during the incident is important, and you should leverage the collective experience of the available engineers. Post-mortems are crucial for learning. Google has written some of the best content on the topic.
-
Observability – it is invaluable to be able to infer the health/behavior of the system/product just by looking at a few dashboards. You also need alerts to fire when a service degradation occurs. For this, a white-box monitoring stack like statsd/Datadog/InfluxDB/Grafana is perfect. I’ve heard good things about Prometheus but I don’t have experience with it. The SRE book has some great content on what/how to monitor.
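To make that concrete, here is a minimal sketch of what white-box instrumentation can look like with the statsd-ruby gem; the metric names and the payment example are made up:

```ruby
require 'statsd' # from the statsd-ruby gem

# Point the client at your statsd agent (localhost:8125 is the usual default).
statsd = Statsd.new('localhost', 8125)

# Count an event so a dashboard can graph its rate over time.
statsd.increment('checkout.completed')

# Time a critical code path so you can alert when latency degrades.
statsd.time('checkout.payment_capture') do
  sleep(0.05) # stand-in for the real payment capture call
end
```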
-
The hierarchy of signal-to-noise ratio – for issues in production my hierarchy goes a bit like this: Datadog > New Relic > Bugsnag > logs > code. I try to spend most of my time in graphs/metrics and add instrumentation for the most common or domain-specific issues. But sometimes you can’t avoid having to dig into the codebase to understand why something is happening – this is time-consuming and requires context, which makes things difficult for on-call engineers responding to incidents. Every new issue in prod is an opportunity to learn about things that need better instrumentation.
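One cheap way to add that kind of instrumentation in a Rails codebase is ActiveSupport::Notifications; a rough sketch, with a made-up event name and payload:

```ruby
require 'active_support/notifications'

# Subscribe once (e.g. in an initializer) and forward measurements to your
# metrics backend; here we just print the timing for brevity.
ActiveSupport::Notifications.subscribe('orders.cheese_reorder') do |_name, start, finish, _id, payload|
  duration_ms = (finish - start) * 1000
  puts "orders.cheese_reorder supplier=#{payload[:supplier_id]} took #{duration_ms.round(1)}ms"
end

# Wrap the code path you keep having to debug in a domain-specific event.
ActiveSupport::Notifications.instrument('orders.cheese_reorder', supplier_id: 42) do
  sleep(0.1) # stand-in for the actual work
end
```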
-
Your monitoring is your 3rd party service status page – your monitoring will always detect outages in 3rd party services faster than they will communicate them (sometimes you will even catch service degradations that they never detect). And while you are at it, add timeouts to all those HTTP calls and a circuit breaker.
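As a sketch of those two ideas together, here is a naive client with explicit Net::HTTP timeouts and a hand-rolled breaker; the thresholds and endpoint are made up, and in a real app a gem such as circuitbox would be a better fit:

```ruby
require 'net/http'

class CheeseProviderClient
  FAILURE_THRESHOLD = 5   # consecutive failures before we open the circuit
  COOL_OFF_SECONDS  = 30  # how long we fail fast before retrying

  def initialize
    @failures  = 0
    @opened_at = nil
  end

  def fetch_prices
    raise 'circuit open, failing fast' if circuit_open?

    response = Net::HTTP.start('api.cheese.example', 443,
                               use_ssl: true,
                               open_timeout: 2,  # seconds to establish the connection
                               read_timeout: 5) do |http|
      http.get('/v1/prices')
    end
    @failures = 0 # a success closes the circuit again
    response.body
  rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError
    @failures += 1
    @opened_at = Time.now if @failures >= FAILURE_THRESHOLD
    raise
  end

  private

  def circuit_open?
    return false unless @opened_at
    return true if Time.now - @opened_at < COOL_OFF_SECONDS

    # Cool-off elapsed: allow a trial request through (naive half-open state).
    @opened_at = nil
    @failures  = 0
    false
  end
end
```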
-
Big SQL DB tables – I am not that well read on this topic, so take my comments with a grain of salt. But big tables (at the application level) are problematic – some queries might take forever even with indexes, migrating them takes hours, and they create migration anxiety. Trim your tables if you can (do you really need all that data from 10 years ago in your application DB, or should it rather live in the data warehouse?). Learn the consequences of migrating those tables – does the table have high read/write activity? Will clients be blocked while you are migrating it? Learn how your online schema change tool works and what happens when things go wrong. Let me know if you know a good book on this topic!
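For the trimming part, batching keeps each DELETE short so it never holds locks for long; a sketch assuming Rails 5+ (for in_batches) and a hypothetical Event model:

```ruby
# Trim old rows in small batches so no single statement locks the table or
# bloats a transaction for long. The Event model and cutoff are hypothetical.
cutoff = 10.years.ago

Event.where('created_at < ?', cutoff).in_batches(of: 1_000) do |batch|
  batch.delete_all # one short DELETE per batch
  sleep(0.1)       # small pause to ease pressure on replication
end
```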
-
Rails gives super powers to programmers – I think the ability to move between Rails projects and quickly get productive, even with little domain expertise, is severely underrated. I changed jobs this year and worked in different Rails applications within the same company, and I always felt at home. There was always something familiar, and I always knew where to look for things even as I was just dipping my toes in new waters. I work at a place with a relatively small engineering team given the size of the business, and I think Rails is a contributor to that productivity boost. DHH is right. Rails is likely not the framework of the day (nor is Ruby the language of the day), there are likely good alternatives in other languages, and it might not be the best choice for some types of problems/companies, but Rails has gotten this right.
-
Sidekiq Pro/Enterprise is worth every single penny. I wish there were more open source projects with this type of business model.
-
Concurrent programming in Ruby is actually good. concurrent-ruby, Celluloid, and the book Working With Ruby Threads are all amazing resources. And things are going to improve.
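As a taste, here is a small concurrent-ruby sketch that fans slow calls out over a thread pool instead of running them serially; the supplier list and the fetch_price stub are made up:

```ruby
require 'concurrent'

# Stand-in for a slow network call (one with timeouts, as noted above).
def fetch_price(supplier)
  sleep(0.2)
  { supplier => rand(10..20) }
end

suppliers = %w[gouda_inc brie_llc cheddar_co]

# Each Future runs on concurrent-ruby's global thread pool.
futures = suppliers.map do |supplier|
  Concurrent::Future.execute { fetch_price(supplier) }
end

# #value blocks until the future resolves; it returns nil on failure,
# in which case #reason holds the exception.
prices = futures.map(&:value)
puts prices.inspect
```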
-
Some of the highest-leverage work I do on a daily basis revolves around tactically thinking about how to approach a problem so we can achieve the intended goal, stay within the agreed timeline, and keep short feedback loops and many iteration cycles. When the thinking is done, programming feels like the easiest task. Key skills here are: being able to clearly understand what is important and what isn’t, being able to understand the business goals, being able to clearly communicate why I chose a certain path, being able to build consensus or get feedback about a particular choice, being able to foresee the unintended consequences of our choices, and being able to deal with uncertainty and open-ended questions. I find it odd that some of the highest-leverage work I do is something we’re rarely evaluated for in interviews in the tech industry.
-
If you integrate with any 3rd party provider where there is any potential for a dispute, you should store the payloads and make them easily accessible through an admin interface. For example, if your decentralized supermarket integrates with a cheese provider, and the cheese provider could claim that you ordered more cheese than you did or that you incorrectly calculated taxes, then store the payloads – they will be useful. You can thank me later.
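A minimal sketch of that idea, with made-up model, column, and method names: persist the raw request and response right next to the domain action so a dispute can be settled from an admin page.

```ruby
# Hypothetical ActiveRecord model with columns: provider (string),
# direction (string), body (text or jsonb), exchanged_at (datetime).
class ProviderPayload < ApplicationRecord
end

def order_cheese(provider, request_body)
  ProviderPayload.create!(provider: provider.name, direction: 'request',
                          body: request_body, exchanged_at: Time.current)

  response = provider.post_order(request_body) # hypothetical client call

  ProviderPayload.create!(provider: provider.name, direction: 'response',
                          body: response.body, exchanged_at: Time.current)
  response
end
```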
-
Most technical interviews aren’t great, but I know you can do better. Define a goal for what you want to assess, write a checklist that meets those goals, devise a set of exercises/questions that address that checklist, and use the same evaluation criteria across candidates/interviewers. Learn to ask questions, listen, and ask follow-up questions. A ton of digital ink has been spent on this topic, but this remains one of my favorite posts on the matter.