Book Review: "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems", by Martin Kleppmann

I've worked at fairly small companies for my entire career so far, and I don't have much visibility into how larger companies operate. So on the recommendation of my friend Jordan, I decided to purchase a copy of "Designing Data-Intensive Applications" as a tasting menu of sorts for the design decisions, tradeoffs, and implementations behind larger-scale software systems.

This was a tough read, and that's because Martin packed this book chock full of new insights and ideas. I read for at least an hour and a half every day, and it still took me two full months. Not every chapter introduced a new idea, but the ones that didn't revisited an earlier idea to hammer home its importance.

Here are three takeaways I got from the book:

There are few things more difficult than managing persistent state: Whether it's database replication and guaranteeing that replicas stay correct, an offline-capable client application, or a materialized view / persistent caching layer, state designed to persist over a long time is difficult to get right, because a single instance of data corruption can cause behavior to deviate over time. Designing for correctness often comes at a cost to performance (e.g. through two-phase commit) and perhaps to other business-critical metrics. Finally, low-level persistent state often promotes service coupling (e.g. by forcing you to dive into the behavior of the underlying hardware/firmware). Yet state is important for a number of reasons: a database is the source of truth for the rest of the application stack; a caching layer matters because poor performance there causes latency amplification throughout the rest of the stack; and an offline-capable application may simply be demanded by business requirements.

It seems to me that there are three points where developers relate to state along the Dunning-Kruger curve: the first, where stateful code is written because one doesn't know any better; the second, where stateless code is desired to avoid the pitfalls of stateful code; and the third, where state is unavoidable and solutions are carefully designed and thought through.

Sometimes, there is no definitive truth: I really enjoyed the section on synchronization services/consensus algorithms discussing ZooKeeper/Zab and etcd/Raft. The lesson I got from this section was that some truths in a system (e.g. does a new leader in a single-leader replication system need to be elected?) can simply be whatever the majority decides (e.g. if the leader is alive but the other nodes cannot reach it over the network, it's assumed to be dead). This can cause problems in practice; for example, if there are intermittent network connection problems, Raft may continually elect new leaders to the detriment of getting any actual work done. To the distributed system, the truth keeps shifting: because of hardware limitations, there cannot be a fixed truth baked into the software.
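As a rough illustration of the "majority decides" idea, here is a minimal sketch in Python. It is not Raft or any real consensus protocol; the function name, the node names, and the idea of passing in each follower's view as a dictionary are all assumptions made up for this example.

```python
from typing import Dict


def leader_presumed_dead(can_reach_leader: Dict[str, bool], cluster_size: int) -> bool:
    """Return True if a strict majority of the cluster cannot reach the leader.

    can_reach_leader maps a (hypothetical) follower name to whether that
    follower can currently reach the leader over the network. cluster_size
    counts every node, including the leader itself.
    """
    suspicious = sum(1 for reachable in can_reach_leader.values() if not reachable)
    return suspicious > cluster_size // 2


# The leader process is healthy, but a network partition cuts it off from
# three of its four followers. From the cluster's point of view it is dead.
views = {"n2": False, "n3": False, "n4": False, "n5": True}
print(leader_presumed_dead(views, cluster_size=5))  # True
```

Nothing in the function checks whether the leader is actually running. The only "truth" available is what most nodes observe, which is why flaky networking can flip the answer back and forth.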

Other systems, like supercomputers, take a much simpler approach: colocate the hardware and ensure networking is robust (e.g. InfiniBand rather than Ethernet), then run jobs as a fully reproducible batch process and crash upon errors. This is preferable from a software perspective, but it eats into margins and may not be commercially practical.

Good software is defined in terms of properties: Martin hammers home reliability, maintainability, and scalability. These are then broken down into sub-properties, like idempotency (e.g. can this operation be retried without incurring additional side effects), linearizability (e.g. does the system behave as if there were a single copy of the data, with every operation falling into one total order), commutativity (e.g. can we merge state together at the interface level out of order and still get the desired result), and others.
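Two of these properties are easy to show in a few lines. The sketch below is only an illustration under stated assumptions: the `Account` class, its `request_id` deduplication, and the max-per-key `merge` function (in the style of a grow-only counter) are hypothetical names invented for this example, not code from the book.

```python
from typing import Dict, Set


class Account:
    """Deposits are idempotent: retrying the same request has no extra effect."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen: Set[str] = set()  # request IDs that were already applied

    def deposit(self, request_id: str, amount: int) -> None:
        if request_id in self._seen:
            return  # a retry of an already-applied request is ignored
        self._seen.add(request_id)
        self.balance += amount


def merge(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Commutative merge of per-node counters: keep the max seen for each node."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


acct = Account()
acct.deposit("req-1", 50)
acct.deposit("req-1", 50)  # e.g. the client timed out and retried
print(acct.balance)  # 50

x = {"node-a": 3, "node-b": 1}
y = {"node-b": 4, "node-c": 2}
print(merge(x, y) == merge(y, x))  # True
```

The retry leaves the balance unchanged, and the merge gives the same result whichever replica's state arrives first, so neither operation depends on the network delivering things exactly once or in order.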

I'm guessing that at this point, a lot of the other, lower-level issues with code, like code quality or robustness, have been smoothed out with CI/CD pipelines, testing, linting, and other general best practices. These kinds of things need to be enforced or flagged at compile time, or caught during code review.

At the end of the book, Martin proposes an extension of the UNIX philosophy from a single-machine paradigm to a multi-machine paradigm, with streaming, pub/sub, and dataflow programming. I found this interesting and worth paying attention to, because thinking and executing at this level is a serious differentiator for large companies, where technical leadership may be removed from the front lines. It separates the Googles that custom-build end-to-end systems with these principles in mind from the companies that cobble together Apache projects and realize the result works well for a prototype but seriously lacks in the last 20% that counts (e.g. security).

Although I don't know if I would wish to dig so deep into these weeds if I can help it, I would highly recommend this book to any software engineer, since it gives a very good bird's-eye view of the engineering problems across an organization.