John Obelenus
Solving Problems & Saving Time through Software and Crushing Entropy
When working on complex and distributed systems my goal is to make things as simple as possible. Take a look at how some of the biggest architecture projects out there, like Kafka, are designed to be high-performance and fault tolerant: partitions (that is, sharding the data that passes through the system) and an append-only log. Those are two of the key architectural principles that make Kafka work.
It is hard to reason about these systems, and it is hard to be confident when identifying exactly what is happening. Ensuring that your operational data (99% of the time this is your database) is an append-only log of events is my go-to strategy.
Sometimes that means I have system-versioned tables (whether supported out of the box or enforced with triggers). Sometimes that means there is a process that runs at a regular interval to “make sense” of the collection of records I have, because that collection is ever-changing.
But if you have an important piece of data that can be mutated by many different processes, and you’ve got UPDATE statements running over one another, you are quickly going to get confused about how you got to your final state. I don’t care how many logging statements you put in there.
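To make that concrete, here is a minimal sketch (in Python, with made-up field names rather than anything from a real system) of what append-only buys you: every process records an event instead of overwriting a row, current state is derived by folding the log, and the question of how you got there is answered by the data itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Every change is appended as an immutable event; nothing is UPDATEd in place.
@dataclass(frozen=True)
class Event:
    entity_id: str
    field: str
    value: object
    recorded_at: datetime
    source: str  # which process made the change

event_log: list[Event] = []

def record(entity_id: str, field: str, value: object, source: str) -> None:
    event_log.append(Event(entity_id, field, value, datetime.now(timezone.utc), source))

def current_state(entity_id: str) -> dict:
    """Fold the log to get the latest value of each field; the history stays intact."""
    state: dict = {}
    for event in event_log:
        if event.entity_id == entity_id:
            state[event.field] = event.value
    return state

def history(entity_id: str) -> list[Event]:
    """Answer 'how did we get here?' without any extra logging statements."""
    return [e for e in event_log if e.entity_id == entity_id]
```

A call like record("invoice-42", "status", "paid", "billing-service") then shows up both in current_state("invoice-42") and, crucially, in its full history.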
There is a reason these amazing projects use these techniques. They are tried and true. They are simple. They can be reasoned about.
You may not be building the next great piece of internet or software plumbing. But that doesn’t mean you can’t use the same techniques.
Read more: The Value of Append-Only
Your biggest cost is the network cost of communication and coordination between humans. Every time. The more humans, the bigger the cost, because there are more connections between them. Some of the biggest advances in the technology industry have been social ones, not purely technical ones.
Micro-services are a social advance. They remove communication nodes by black-boxing certain services, and thus the people responsible for those services. If a service functions well and doesn’t need any changing, the people can move on somewhere else, leaving that service to live an undisturbed life.
GraphQL is also a social advance. It lets back-end folks and front-end folks get things done separately, without either one needing to dive deeply into the other’s area. GraphQL allows front-end engineers to ask for new and unique combinations of data without anyone hand-writing a new query. It allows back-end engineers to create performant queries and caching that can be re-used, without having to worry about someone making a new query that is neither optimized nor cached and that destroys the system.
Wikis are a social advance. They allowed users to edit and link webpages, but within a social construct and a set of relationships.
Slack, Discord, and Teams are all social advances. Functionally they are not that different from what came before; rather, they compose better with the modern web and are easier to use (and set up) than IRC ever was.
These are all tools we are trying to use to combat the biggest network cost of all—communication between humans that are working together (and not just working together “at work”).
If you are not focused on improving communication within your organization, this cost will eat you alive. You need both broadcast communication and strategic communication. Strategic communication is one-to-one. That doesn’t scale, but it’s necessary to come to decisions. Broadcast communication is one-to-many. That scales.
You can prevent unnecessary strategic communications by using broadcast communications to send out decisions that have already been made. Be careful when you’ve missed something in your decision or haven’t included enough people, because your broadcast will spawn many strategic communications to address those gaps.
When you don’t spread information and decisions widely enough, you end up with people in your organization whose calendars are a double-booked mess. They’re having repeated strategic communications because something wasn’t broadcast widely enough.
Read more: Your Biggest Network Cost
Mentoring has nothing to do with managing. There is a dire lack of mentoring in the software industry. The pace of change is amazingly fast. So many conferences and talks are available to watch and learn from. Articles on how to do X, Y, and Z are constantly written. Documentation is improving because folks who write developer tools and platforms know that if your docs are bad, no one will be able to learn how to use your amazing service. And all of this is free; I haven’t even mentioned paying for bootcamps or Udemy courses (though really, stop advertising classes at me on YouTube).
But none of that comes close to mentorship. I don’t have a mentor. I wish I did (any volunteers?). I don’t have anyone junior to me that I am truly mentoring (Bueller?). The statistics tell us that older programmers are being passed over for jobs, while the problems we are solving are the same ones we’ve been working on for decades. They have so much knowledge, and we are watching it walk out the door. Will we ever learn everything they learned? How much will be forgotten, and re-solved with a worse method?
Plenty of companies add some element to their job descriptions that sounds like “mentoring”. My current gig has it. I bet if you took a poll, fewer than 10% of people would say they are actively being mentored.
I do know that people a few rungs up the ladder have mentors. And they are usually long-lasting relationships.
The job of mentoring someone is, in my opinion, antithetical to being their manager. Once a person has the power to promote or fire you, the relationship to that person fundamentally changes. That change puts in place a barrier to effective mentorship.
I can only tell you what others have told me. A mentor is a person you can use as a sounding board. You can tell them what you’re thinking and feeling around your work. You can tell them what you’re trying to do about it, and whether it’s working. And they will use their experience to keep you on track. To use a musical metaphor, they are your sound check, making sure everything is working and in tune.
The content is necessarily specific to the individuals and situations involved. But it is a professional relationship between two people that is not hierarchical. Often these people don’t work at the same company.
I can definitively say that mentoring is not “I can’t figure this out, can you help me?” Helping someone you work with is something we should all be doing, but it is transactional in nature. Nor is mentoring venting about your work situation with someone else behind closed doors.
I believe if we had more effective mentoring we might realize several things:
These are all things our industry desperately needs.
Read more: More Mentoring
Tech Debt won’t kill you. I firmly believe that. However, I realize that people believe certain things are “technical debt” when they really aren’t. This happens because they don’t have another term to capture what is happening. So they turn to “tech debt” as an umbrella term.
Let’s stop doing that. Let’s offer up another phrase to more accurately capture what is happening.
Technical debt is made up of decisions that will eventually need to be revisited after a certain amount of time. On a long enough timeline, every single thing is technical debt. Eventually even computers will change so fundamentally that you’ll need to change even the simplest code.
But some decisions escalate risk. That is not technical debt. Even if the decision is technical in nature, it is not debt. It won’t eventually become a problem; it is a problem the very moment you’ve made the decision.
This is why technical debt will not kill you. Escalating risk will absolutely destroy you, your product, and your company. Here are a few things I’ve experienced that escalate risk.
Sometimes it isn’t even about code. All business systems include the humans that use those systems. A lack of communication between the people changing the system (engineers) and the people using the system is one of the riskiest propositions of all. That goes both ways. Failing to manage changes explicitly, whether that is how people are going to use the system differently or what changes are being proposed, is going to cause chaos.
I have seen corner-cases ignored with the explicit statement of “that won’t happen”. I guarantee you’re wrong, and I guarantee that it will happen. Once it does happen, you had better have an answer. There is a clock running from the moment you deploy until that corner-case is hit, and you have zero control over how fast, and how randomly, that clock ticks. The bigger your system and the more people are on it, the faster it ticks. It’s one thing not to address a corner-case, but there is one surefire way to make it worse: not even alerting within the system that the corner-case was hit.
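As a small, hedged illustration (Python, with an invented discount example rather than anything from a real system), the antidote is to give the “won’t happen” branch an explicit outcome and an explicit alert, so that the first time the clock does tick you actually hear about it:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")  # hypothetical subsystem name

def apply_discount(order_total: float, discount: float) -> float:
    # The "that won't happen" branch: a discount larger than the order itself.
    if discount > order_total:
        # Don't silently fall through; make the corner-case visible the moment
        # it is hit, with enough context attached to act on it.
        logger.error(
            "corner case hit: discount exceeds order total",
            extra={"order_total": order_total, "discount": discount},
        )
        return 0.0  # an explicit, decided-upon answer, not an accident
    return order_total - discount
```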
Those are some higher-risk examples. But let me give you a lower-risk example: not enough logging. Once a system is even close to something we consider “complex” there will be unexpected errors. Errors that don’t seem to be possible. Errors you cannot possibly hope to reproduce in dev. Errors that won’t stop. If you don’t have all the data you need to solve it at hand, you’ve added just a little bit of risk. If you’re experiencing this problem now I highly recommend looking into honeycomb.io (Disclaimer: I do not work there, they are just that awesome).
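In the same hedged spirit, here is a sketch of the logging point using only Python’s standard logging (nothing Honeycomb-specific, and the payment example is made up): when the “impossible” error finally arrives, the log line should already carry the data you would need to reproduce it.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")  # hypothetical service name

def process(payload: dict) -> None:
    # Stand-in for the real work; raises the kind of error you "can't" reproduce.
    if payload.get("amount", 0) < 0:
        raise ValueError("negative amount")

def handle_payment(payload: dict) -> None:
    try:
        process(payload)
    except Exception:
        # A bare "payment failed" message leaves you guessing in production.
        # Attach the context you would need to debug it: inputs, identifiers,
        # and anything that cannot be reconstructed after the fact.
        logger.exception("payment failed, context=%s", json.dumps({
            "customer_id": payload.get("customer_id"),
            "amount": payload.get("amount"),
            "payload_keys": sorted(payload.keys()),
        }))
        raise
```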
Read more: Not Everything Is Tech Debt
There are two major upsides to duplicating data. Make no mistake, when I say “duplicated” I mean it. I mean non-normalized data.
I can already hear the chorus: “One Source Of Truth”. And I agree, there should be one source of truth in most cases. But sometimes that doesn’t apply.
Let me define the situation for which this does not apply: a complex, concurrent, multi-service, and multi-user system. If you’ve got one of those, keep on reading.
I liken this to double-entry bookkeeping. There is not one source of truth. You only get to the truth when you’re comparing two (or more) records.
This notion of double-entry bookkeeping is the first major upside to “duplicating” data. Or at the very least, storing it in multiple forms, at multiple stages along its transformation. Why? Because when something goes wrong you have a breadcrumb trail to help figure out at which stage it went wrong. When you only have one source of truth that is constantly being overwritten—and something goes wrong, because something always goes wrong—you’ll never have any insight into what the state of the data was before the error. It’s gone. Overwritten. Inaccessible.
The second upside is optimization. Often when creating the One Source Of Truth it is designed strictly and ontologically. All the relationships conform to the platonic ideal according to the essence and relationships of the object (sorry, I studied philosophy). This design values ontology higher than the pragmatism of a running process. Duplicating data flies in the face of this thinking, and instead optimizes for access patterns: namely, how you’re going to read and write this data.
My recommendation is that your initial source of truth is write-optimized. Make it easy and fast to write your data. The faster and more compact your writes are in a distributed system the lower the probability of losing (much) data when problems arise.
Duplicate and transform your data into as many different shapes as you need in order to read fast.
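Here is a rough sketch of that shape (Python, with hypothetical order fields): one compact, append-only write into the source of truth, and the same data fanned out into whatever read-optimized copies your access patterns call for.

```python
from collections import defaultdict

# Write-optimized source of truth: a flat, append-only list of events.
events: list[dict] = []

# Read-optimized duplicates ("projections"), each shaped for one access pattern.
totals_by_customer: dict[str, float] = defaultdict(float)
orders_by_status: dict[str, list[str]] = defaultdict(list)

def write(event: dict) -> None:
    events.append(event)  # the one cheap, compact write
    project(event)        # fan out into the read-optimized copies

def project(event: dict) -> None:
    # In a real system this would likely run asynchronously off the log
    # (Kafka-style consumers); it is inline here to keep the sketch small.
    totals_by_customer[event["customer_id"]] += event["amount"]
    orders_by_status[event["status"]].append(event["order_id"])

write({"order_id": "o-1", "customer_id": "c-9", "amount": 40.0, "status": "paid"})
print(totals_by_customer["c-9"])  # reads hit the pre-shaped copy, not the log
```

If a projection ever looks wrong, the event list is still there to rebuild it from, which is exactly the double-entry point above.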
If you ever want to keep a system performing reliably and fast you have two options: keep it really really small, or embrace the CAP theorem and data duplication.
Read more: Against the Purists