My Quixotic Quest for Digital Security

It’s no secret the internet has gotten a lot scarier in recent years. When crime in your area goes up noticeably, it’s rational to invest in a new security system. So why not do the same with our digital lives? This impulse began my attempts at hardening my digital security.

The first step I took back in 2018 was to transition away from Gmail toward a private, end-to-end encrypted email solution at Protonmail. I also decided to buy their bundled VPN service and began using it constantly on all of my devices. The ProtonVPN app is decent enough 99% of the time, but it has a lot of problems, mainly when it has to reconnect. This is an especially salient problem for mobile devices, which are constantly connecting to and disconnecting from different WiFi networks or falling back to 4G as you move them around. But even for home WiFi connections, the ProtonVPN app is just not that great, even though some major stability improvements have been made in the last few years, like the ‘kill switch’ feature that terminates your connection if the VPN disconnects (which is unfortunately common).

This led me to seek out a custom firewall for my home network that could route all my home network traffic through the VPN. I landed on pfSense, a FreeBSD-based OS for firewalls. It purports to offer “enterprise”-grade security. Next came the question of what to run it on. A blog I follow suggested that a little firewall device from Protectli would be more than sufficient for a home network.

I picked out my Protectli device for about $300 and began setting it up. I managed to get pfSense installed and saw some packets come through the firewall. I then installed an OpenVPN client with my ProtonVPN credentials and verified everything was getting routed through the VPN tunnel. Looking good so far! I then went out and bought a new WiFi router to put behind the firewall (replacing my ISP’s router; lord knows what that thing’s running) and voilà, I now had my own secured home network with no need to run the ProtonVPN app on my devices. Everything behind that network would be VPN’d to another IP, and the WiFi router itself was protected by pfSense running on the firewall.

All was well and good until I had to restart the firewall. I was moving some equipment around my living room and had to unplug it. When I started it up again, I couldn’t get it to route traffic through the firewall, even after making a fresh pfSense install!

Right now I’m stuck trying to figure out this issue. I don’t know if it’s a hardware failure or some issue with how I’ve got it set up. The Protectli guys have been nice in communicating via email to see if we can figure out the problem. For now I’m going to have to keep the dream alive in my heart.

Holistic Software Engineering

All organizational problems in software development can be traced back to the erroneous assumption that developing software systems is like developing other large, complex things, like buildings or cars.

In a company producing cars, for example, the process is divided into two roles, broadly speaking. There are people who design the cars, creating blueprints and specifications as to how the engine components and various features will fit together functionally — let’s call them designers — and then there are the people who actually assemble the cars — let’s call them assemblers.

The assemblers need specialized skills to operate the various tools used to put together the components at the assembly plant. But they don’t have to know how cars work. They certainly don’t need a mental model of how a modern vehicle is designed. They are mostly concerned with the actual assembly process, how to make it more efficient, how to avoid costly accidents and delays in the assembly process, and so on.

The designers, on the other hand, have a completely different skill set. They have a deep knowledge of what kind of engine layouts work, what effect changing certain components has on the overall performance of the car, and so on. They talk to executive leadership about what kind of designs are selling, learn about performance improvements that can be made, and respond to problems in the design by producing new designs.

We also have a middle tier of managers that serve as connective tissue between these groups. They help to keep communication working smoothly, resolve issues with employees, and identify needs for new hires as the production line grows.

(OK, this is probably not exactly how automotive design works, but bear with me!)

Divergent Levels of Understanding

It’s tempting to think that building something like a mobile app works the same way. Perhaps you’re shaking your head now as you read this article, thinking to yourself, “Of course that’s how it works!”

In software, it’s quite common to have the assembler and designer roles separate, especially in corporate environments. From a hiring standpoint, this makes sense.

Let’s say we are developing a new mobile app front end to our back end system. We can hire an offshore team of contractors to put together an iOS app, and throw in a senior back-end engineer from the main office to help direct them on Zoom calls once or twice a week. This might help us keep our personnel costs down, right?

Already we’ve split the task of building software into two roles. Now we have a senior engineer who’s really doing the design phase of the project and a group of assemblers who are actually doing the programming.

Let’s say we also have a product manager to oversee this project. The PM knows how the mobile app needs to work.

Notice how divergent the different levels of understanding of the project have become. We have different humans handling three distinct levels of understanding:

  1. What is the project supposed to do?
  2. What is the best approach to meet these requirements?
  3. How are we going to implement this best approach?

Roles Within Software

The people who understand the product and what it needs to be able to do in the long term have titles like product manager or software engineering manager. They usually don’t write code or implement the design. Sometimes they have development experience, but not always. These roles have a level one understanding of the project.

Even software architects frequently spend little time implementing designs. Usually, they’re concerned with making sure the team has the right approach, and they have to justify that approach. Sometimes a similar role is taken by lead engineers who are responsible for guiding the team toward an implementation. These are your level twos.

And, of course, there are often several engineers working on implementing the design. Usually when we say engineer we mean someone writing application code to be deployed either to a server somewhere or delivered to a browser in a bundle of compressed JavaScript. These folks are deep in level three.

But there are also those tasked with ensuring the environment where the application actually runs is healthy. For maddening reasons, we call them devops engineers. In a past life, these brave souls might have been called sys admins. If they’re working in a cloud provider, like Amazon Web Services, we’ve sometimes taken to calling them cloud engineers because their job is to provision cloud resources for the team. These ops roles are also level three.

Split Brain

In our zeal to make software development more efficient, I think we’ve erroneously taken this metaphor of the assembly line and applied it to software.

Instead of making things more efficient, we split the complete understanding of the project, so when we make changes in the system, we now have to begin at level one, and translate the requirements all the way down to level three.

Dare I ask, is there a better way? Let’s let our imaginations go wild for a moment. I give you the holistic software engineer.

A Holistic Software Engineer

A holistic software engineer is capable of:

  • Designing a system from beginning to end, taking into account various tradeoffs.
  • Administering a working system, be it in the cloud or on-premises.
  • Communicating these concepts to non-technical staff outside the team, e.g. executives or sales.
  • Understanding the need that the project fills for the user, whether it be an internal user like an executive or an external customer.

In other words, the holistic software engineer has a complete understanding of the project — from its purpose right down to its implementation details.

They can speak to the various decisions that went into designing the project. They can quickly iterate on new designs when the implementation in the “assembly” of the software runs into performance problems. They can also understand how decisions impact users of the system, and what aspects of the system’s performance would be most important to its users.

The most useful contributors on software projects are the ones that have this holistic understanding. They can bridge the gap between the different levels that the organizational strategy has created.

They could come from either direction. They might be a technical person who has gone out of their way to understand the product side of things.

Or they could be a product person who has done a lot of face-to-face work with engineers to understand how the system was designed and what its limitations are.

Whichever direction the holistic software engineer comes from, they’re useful because they’ve started to re-compose the complete understanding that was lost when we split roles along with levels of understanding.

A Challenge

I challenge you to think about your own role in these terms and try to take on a more holistic role.

Are you working at level one, with a solid understanding of the product but no idea how to design or implement it? Try learning some technical skills so you can understand what the engineers mean when they say that service X needs to call service Y.

Or are you working at level two, where you sketch out designs of systems but cannot be bothered to iterate on those designs when your offshore team runs into problems? You’d be a better architect if you dove into the code and had a solid grasp of where the product is heading.

Perhaps you’re deep in level three, heads down in code, but not really aware of how the project might change and grow in the future. I would challenge you to think of yourself not as a “coder” but as someone with a role in structuring the implementation to allow the design to evolve to meet future product needs. Talk to product and ask questions to fill in the gaps in your understanding.


We can all improve in our craft if we have a little more understanding of the concerns of the people we interface with at work. I present this concept of holistic engineering as a way of getting at the kind of empathetic approach that I find works best in software teams.

I think that the Agile movement was an attempt to re-constitute this holistic engineer. Processes like point estimation, storyboards, and so on are just formalized ways of communicating technical challenges to management and communicating product requirements to engineers.

Whatever system we use, the end objective is the same: engineers and product experts working together toward a complete understanding of the project, so that progress can be made. Try to understand the whole picture and you will be more useful as a contributor, no matter what your role is.

Why Everyone Should Learn Functional Programming Today

In the world of programming languages, trends come and go. One trend that deserves consideration is the interest in functional programming that began earlier this decade. Functional programming is a style that emphasizes immutable data, functional primitives, and avoidance of state.

I know what you’re thinking. You wrote some Lisp in college and dreaded it. Or, you had to manage some awful Scala code at your last job and you’d rather not deal with that again.

I know, I know. But hear me out.

Functional programming is more than a trend. Understanding its concepts and appeal goes a long way toward understanding the problems facing software engineers in 2019 and on into the next decade.

In fact, it helps to understand the current state of the world, as data mining and Machine Learning algorithms become an issue of public concern.

Even if you don’t work in a functional language, the solutions offered by the functional way of thinking can help you solve difficult problems and understand the world of computing.

Imperative Style

Most programming languages in wide use today are Von Neumann languages. These are languages that mirror a Von Neumann computer architecture, in which memory, storage, control flow instructions, and I/O are parts of the language. A programmer creates a variable in memory, sets its value, deletes it, and controls what the next command will be.

Everyone who has written a program is familiar with these concepts. Indeed, all the most popular languages in use are Von Neumann family languages: Java, C++, Python, Ruby, Go.
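To make that concrete, here is a minimal sketch of the imperative style in Java. The class and method names are my own, made up for illustration. Notice how the code creates a variable, overwrites it repeatedly, and spells out exactly which instruction runs next.

```java
// Hypothetical example: sums the squares of the even numbers in an array,
// written in the imperative (Von Neumann) style.
class ImperativeSum {
    static int sumOfEvenSquares(int[] numbers) {
        int total = 0; // create a variable in memory
        for (int i = 0; i < numbers.length; i++) { // decide which instruction runs next
            if (numbers[i] % 2 == 0) {
                total = total + numbers[i] * numbers[i]; // overwrite the stored value
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares(new int[] {1, 2, 3, 4})); // prints 20
    }
}
```

Nothing is wrong with this code, but the reader has to simulate the loop in their head to see what it computes.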

Enter Functional Style

In August 1978, computer scientist John Backus published an article in the Communications of the ACM. Backus accused conventional Von Neumann style languages of being “fat and flabby.” He bemoaned the complexity of new generations of languages that required enormous manuals to understand. Each successive generation added more features that looked like enhancements but considerably degraded the language by adding complexity.

Furthermore, programs written in these languages couldn’t be composed into new programs because their components weren’t created in generic forms.

A sorry state of affairs, indeed.

Backus asked why we can’t create programs that are structured more like mathematical formulas. In such a language, data could be manipulated as in algebra. He proposed that this “functional style of programming” would be more correct, simpler, and composable. Backus also stressed the importance of “clarity and conceptual usefulness” of programs.

It has been four decades since this paper was written, but I think we can all relate to this!

Languages like Java, Python, and JavaScript add new features intended to clarify syntax, but the overall trend of these languages is toward increasing complexity. Object-Oriented Programming (OOP) at least gives us modularity, but inheritance hierarchies lead to well-known design problems.

Models of Computing

The blame for all this complexity, according to Backus, goes back to the Von Neumann computer architecture itself. It served us well in the 1940s and ‘50s, but by 1978, it had begun to show its age. He defines several conceptual models to demonstrate the limitations of Von Neumann’s ubiquitous model.

Turing machines and automata

These are conceptual models of computers used by computer scientists. They meet all the requirements for computing, but they’re too unwieldy for human beings tasked with designing software programs.

The dreaded Von Neumann model

Backus calls the Von Neumann model, exemplified by most of the conventional languages we use today, “complex, bulky, not useful.”

Backus concedes that Von Neumann languages can be “moderately clear,” but he calls out their lack of conceptual usefulness.

Indeed, how many of us have stared cross-eyed at a 1,000-line block of Python or Java, trying to suss out what all these loops and conditional statements are trying to do? And with multiple contributors, it can be a nightmare to understand highly procedural code.

Backus also notes that the Von Neumann model is designed for a single CPU machine. Instructions are executed one at a time.

The functional model

Here, Backus identifies the lambda calculus, the Lisp language, and his own concept of “functional style programming” as the third category.

Programs written in this model have no state. Instead of setting variables directly, we bind values to symbols. Instead of looping, we transform collections. The result is programs that are concise and clear, as well as conceptually useful.

Another way to say it might be to say that functional style is obvious.

Indeed, a program written in a functional style language is often quite short, but its concise definition makes it easier to understand than its non-functional equivalent.
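Here is a small sketch of what that looks like in practice, using Java’s stream API as a stand-in for a functional language (Java is not a purely functional language, and the class name is made up for this example). The task is to sum the squares of the even numbers in an array. There is no loop counter and no accumulator being overwritten, just a collection being transformed step by step.

```java
import java.util.stream.IntStream;

// Hypothetical example in a functional style: no mutable variables, no explicit loop.
class FunctionalSum {
    static int sumOfEvenSquares(int[] numbers) {
        return IntStream.of(numbers)
                .filter(n -> n % 2 == 0) // keep the even numbers
                .map(n -> n * n)         // square each one
                .sum();                  // combine the results
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares(new int[] {1, 2, 3, 4})); // prints 20
    }
}
```

Each line reads as a description of the result rather than as an instruction to the machine.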

Why Should I Care?

OK, so maybe we could make better programs if we all dropped Python and Java and started writing Haskell. Uh-huh. OK. Sure.

But who’s going to do that? And why? How are we going to train developers fresh out of college in these languages that they don’t know? More importantly, why? Certainly, there has been a lot of quality software written in existing languages, and as C++ creator Bjarne Stroustrup once said:

“There are only two kinds of languages: the ones people complain about and the ones nobody uses.”

The reason we should care about all this beyond an academic exercise is that the present movement toward “Big Data”-driven products has led to problems in computing that the functional model is uniquely good at solving.


As Backus noted in 1978, the Von Neumann model is really oriented around simple computers that execute one instruction at a time. The flow of a Von Neumann style program puts the control of every instruction into the hands of the programmer.

Unfortunately, it didn’t take long before our computers became more complex. We now have computers with many CPUs, executing many instructions at the same time. Popular languages like Python and Java weren’t built from the ground up to take advantage of this. These languages have bolted on threading APIs to allow programmers to take advantage of multiple processors. Others rely on process forking, essentially pushing the problem down to the operating system.

Multi-threaded programs are hard to write correctly, and even very experienced programmers can make serious errors. Writing multi-threaded programs is so complex that there are entire books dedicated to doing it correctly.
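A classic illustration is the shared counter. The sketch below is my own toy example (the class and method names are hypothetical): two threads increment the same counter. The unsynchronized version can silently lose updates, while the version built on AtomicInteger from the standard library always arrives at the right total.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the names are made up for this sketch.
class CounterRace {
    static int unsafeCount = 0; // plain field with no synchronization

    // Starts two threads that each run `increment` perThread times, then waits for both.
    static void runTwoThreads(Runnable increment, int perThread) {
        Runnable loop = () -> {
            for (int i = 0; i < perThread; i++) {
                increment.run();
            }
        };
        Thread a = new Thread(loop);
        Thread b = new Thread(loop);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // unsafeCount++ is a read, an add, and a write. Two threads can interleave
    // those steps and overwrite each other's result, so updates may be lost.
    static int unsafeTotal(int perThread) {
        unsafeCount = 0;
        runTwoThreads(() -> unsafeCount++, perThread);
        return unsafeCount;
    }

    // AtomicInteger performs the whole increment as one indivisible step.
    static int safeTotal(int perThread) {
        AtomicInteger counter = new AtomicInteger();
        runTwoThreads(counter::incrementAndGet, perThread);
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("unsafe: " + unsafeTotal(100_000)); // may be less than 200000
        System.out.println("safe:   " + safeTotal(100_000));   // always 200000
    }
}
```

The bug in the unsafe version doesn’t show up on every run, which is exactly what makes this kind of code so hard to get right.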

What would our programming languages look like if computers with many CPUs had been commonplace in the 1940s? Would we choose to program each thread individually, or would we come up with different concepts for achieving the same goal?

Distributed Systems

Ten years ago, most software was written to run on an operating system on a customer’s PC. Any operations that the software needed to do were processed using the customer’s CPU, memory, and local disk.

But the early success of Gmail and other web-based tools proved that a sophisticated software system could be run over the internet.

Today’s commercial software doesn’t just run on a customer’s PC. It runs in the cloud, across perhaps hundreds of machines. Software-as-a-Service (SaaS) products are now commonplace, used by individuals and enterprises.

With the data taken off of the customer’s PC and sent over the wire to our data center in the cloud, we now have a situation where we can look at the data for all customers in aggregate form. That aggregate view lets us identify trends, for example detecting fraud in bank transactions.

But these systems are hard to write. Instead of running in a single-threaded computer with local memory and disk access like the Von Neumann model presupposes, our programs now have to run across potentially hundreds of machines with many CPUs. Even worse, we’re now processing way, way more data than we could ever hope to store on a single machine. And we need to be working on this data. It can’t just be shoveled into a data warehouse and queried later.

A Naïve Solution

One approach is to keep using the threading or process-forking models we have been given to write our code, and then build a fleet of machines to scale it. Those machines will then process data and push that data somewhere (a database?) to keep it from filling up the local disk.

As you might guess, this solution is very operationally complex. We have to manually shard the data somehow — i.e., split our data set evenly across our n processing machines — and write the glue code for all these machines to talk to one another and perform some sort of leader election to figure out whose job it is to coordinate all of this.

In practical programmer terms, it also means we’re going to have to write, maintain, and version the following in code:

  • Complex multi-threaded code written in Java, for example.
  • A bunch of bash scripts to deploy and update this code on our n machines in the cloud.
  • Code to scale up and down our solution to more machines as our data volume grows or shrinks.
  • Some kind of scheduler and coordination system to get all these operations to work together and join their result somewhere.

Now imagine debugging and maintaining this system. Doesn’t sound fun, does it? Certainly, the resulting solution in code will not be obvious.
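Even the simplest item on that to-do list, the manual sharding, is code we’d have to own. Here is a minimal sketch of one common approach: hash each record’s key and take the remainder modulo the number of machines. The class and method names are hypothetical, and hashing only spreads keys roughly evenly, not perfectly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for splitting a data set across n machines by key.
class ManualSharding {
    // Maps a record key to a machine index in [0, machines).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int shardFor(String key, int machines) {
        return Math.floorMod(key.hashCode(), machines);
    }

    // Splits the keys into one bucket per machine.
    static List<List<String>> shard(List<String> keys, int machines) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < machines; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(shardFor(key, machines)).add(key);
        }
        return buckets;
    }
}
```

And this is the easy part. It says nothing about what happens when a machine dies or when n changes and every key has to move.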

An Elegant Solution

In 2013, Berkeley’s AMPLab donated the Spark project to the Apache Software Foundation. Over the years, Spark has become one of the favored ‘big data’ cluster programming platforms, supplanting a variety of systems built by engineers at Twitter and Google.

Spark is written in the Scala language, a functional programming language that runs in the Java Virtual Machine (JVM). I won’t get into the gory details of how Spark works or write any code here. You can find plenty of examples online for that.

Instead, I’ll present the Spark conceptual framework and show how the functional model of computing is crucial to its elegant solution.

What is “the program?”

Ask yourself this question. In our hypothetical distributed system described above, what is “the program?”

Is it the concurrent Java code that we wrote? Or is it the bash scripts that deploy that code out to the various machines in our “cluster?” Is it the scheduling algorithm?

I’d argue that all of these components put together contain pieces of “the program.” The program is the instructions for transforming the data. Details like thread management and managing resources are incidental to this goal.

Think of it this way. Say we have a dozen machines in the cloud with 4 CPUs and 16 GB of memory. Throw all those machines together into a big “cluster” of computing resources. Now we have one big “computer” with 4 * 12 = 48 CPUs, 16 * 12 = 192 GB memory, and a certain amount of disk storage.

Now, imagine we write our data transformations in the functional style described by Backus. Each transformation is written like a mathematical function. There’s an input and an output. No state. All data is immutable, stored in stages on disk on each machine, and deleted when it’s no longer needed.

We could now have a scheduler that knows about the structure of our cluster. In other words, it knows it has 12 machines with 4 CPUs and 16 GB memory. The scheduler dispatches a portion of the data along with the data transformation function we’ve defined.

In fact, if we write our data transformation “program” in a purely functional style, the scheduler can dispatch many of these transformations at the same time, as many as can be fit in the cluster with its limited resources. That allows us to process our data in an efficient manner.
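We can get a feel for this on a single machine. In the sketch below, Java’s parallel streams play the role of the scheduler and a list of lists plays the role of the partitioned data set. This is not Spark code, and the names are made up. The point is that because the transformation is a pure function, the runtime can run it on the partitions in any order, or all at once, and the answer comes out the same.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical single-machine stand-in for dispatching a pure function to partitions.
class PartitionedTransform {
    // A pure transformation: the output depends only on the input partition.
    static int sumOfSquares(List<Integer> partition) {
        return partition.stream().mapToInt(n -> n * n).sum();
    }

    // Applies the transformation to every partition, sequentially or in parallel.
    // With no shared state involved, both modes produce the same list of results.
    static List<Integer> transformAll(List<List<Integer>> partitions, boolean parallel) {
        return (parallel ? partitions.parallelStream() : partitions.stream())
                .map(PartitionedTransform::sumOfSquares)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3, 4), Arrays.asList(5));
        System.out.println(transformAll(partitions, true)); // prints [5, 25, 25]
    }
}
```

Notice that we never created a thread or took a lock. We described the transformation and left the “how” to the runtime.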

Programming the Cluster in Functional Style

I’m not going to promote Spark as the end-all, be-all of cluster computing. Perhaps we’ll come up with something better in the future, and Spark isn’t good for every distributed system. It’s optimized for data processing and streaming, and not serving up live requests, for example.

But I want to emphasize the shift in perspective that allows this type of system to be built, namely functional programming style. And indeed, when we enter the realm of ‘big data,’ we tend to find that most solutions rely on the functional model of computing.

Spark offers a Scala, Java, and Python API. Whatever language you choose, you’re going to be writing your Spark program in a functional style.

We also tend to find that the separation of transformation code from resource management is a theme. Apache Spark’s solution separates out the resource management aspects of our distributed system, leaving us to work with the data. Data transformation rules are clear and require no complex multithreaded code.

It seems that distributed systems are finally freeing us from the limitations of the Von Neumann model.


Functional programming languages may be falling out of favor as a popular replacement for languages like Java or Python. As a drop-in replacement for simple use cases, like a small web application, Scala or Haskell may be overkill.

But the functional model of computing has not gone away by a long shot. If anything, it’s more ascendant than ever. It’s hiding behind the scenes, powering the Machine Learning algorithms, business intelligence, and analytics engines that provide insights to modern organizations.

Software engineers and managers would do well to learn these concepts and understand why so many projects that run at the heart of the biggest tech companies rely on functional style projects like Apache Spark.

Functional style allows us to separate the “how” of computing resource management from the “what” of a program. It frees us from burdensome and complex multithreading APIs bolted on to languages that are based on a model of a simple computer conceived of in the 1940s.

The functional model is uniquely well-adapted to the data-rich world that we’re entering. It’s an indispensable tool for any software engineer working today.

The Surprisingly Simple Solution for Streaming Data

Previously, I’ve covered why functional programming provides the conceptual foundation for the big-data problems at the heart of software engineering today.

Now, let’s take a look at how these functional concepts have been applied to building a type of big-data data structure called a stream.

What Do We Mean by Stream?

What is a stream, exactly? It’s an ordered sequence of structured events in your data.

These could be actual events, like mouse clicks or page views, or they could be something more abstract, like customer orders, bank transactions, or sensor readings. Typically, though, an event is not a fully rendered view of a large data model but rather something small and measurable that changes over time.

Each event is a point of data, and we expect to get hundreds — or even millions — of these events per second. All of these events taken together in sequence form our stream.

How can we store this kind of data? We could write it to a database table, but if we’re doing millions of row insertions every second, our database will quickly fall over.

So traditional relational databases are out.

Enter the Message Broker

To handle streaming data, we use a special piece of data infrastructure called a message broker.

Message brokers are uniquely adapted to the challenges of event streams. They provide no indexing on data and are designed for quick insertions. On the other end, we can quickly pick up the latest event, look at it, and move on to the next one.

The two sides of this system — inserts on the one end and reads on the other — are referred to as the producer and the consumer, respectively.

We’re going to produce data into our stream — and then consume data out of it. You might also recognize this design from its elemental data structure, the queue.
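The producer and consumer roles are easy to sketch with a plain in-process queue from the Java standard library. This is only a toy model of the idea, not how any real message broker is built, and the class name and event names are made up. One thread produces events, and another consumes them in the order they arrived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical toy model of a producer and a consumer sharing a queue.
class QueueDemo {
    static List<String> produceAndConsume(int count) {
        BlockingQueue<String> stream = new LinkedBlockingQueue<>();

        // Producer: appends events to the end of the queue.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                stream.add("click-" + i);
            }
        });
        producer.start();

        // Consumer: takes events from the front, waiting if none has arrived yet.
        List<String> seen = new ArrayList<>();
        try {
            for (int i = 0; i < count; i++) {
                seen.add(stream.take());
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(produceAndConsume(3)); // prints [click-0, click-1, click-2]
    }
}
```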

An In-Memory Buffer

So now we know how we’re going to insert and read data. But how is it going to be stored in-between?

One option is to keep it all in memory. An insert would add a new event to an internal queue in memory. A consumer reading data would remove the event from memory. We could then keep a set of pointers to the front and end of the queue.

But memory is expensive, and we don’t always have a lot of it. What happens when we run out of memory? Our message broker will have to go offline, flush its memory to disk, or otherwise interrupt its operation.
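Here is a minimal sketch of that in-memory design, with a fixed number of slots standing in for limited memory. The class name and its methods are hypothetical. The head and tail fields are the two pointers, and insert starts failing as soon as the buffer is full.

```java
// Hypothetical in-memory event buffer with a fixed capacity.
class InMemoryBuffer {
    private final String[] slots;
    private int head = 0; // index of the oldest unread event
    private int tail = 0; // index where the next event will be written
    private int size = 0; // number of events currently stored

    InMemoryBuffer(int capacity) {
        slots = new String[capacity];
    }

    // Returns false when every slot is in use, i.e. we have run out of memory.
    boolean insert(String event) {
        if (size == slots.length) {
            return false;
        }
        slots[tail] = event;
        tail = (tail + 1) % slots.length;
        size++;
        return true;
    }

    // Removes and returns the oldest event, or null if the buffer is empty.
    String read() {
        if (size == 0) {
            return null;
        }
        String event = slots[head];
        slots[head] = null; // free the slot
        head = (head + 1) % slots.length;
        size--;
        return event;
    }
}
```

If the consumer falls behind, the producer has nowhere to put new events, and that is the failure mode we want to avoid.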

An On-Disk Buffer

Another option is to write data to the local disk. You might be accustomed to thinking of the disk as being slow. It certainly can be. But disk access today with modern SSDs or a virtualized disk — like Amazon’s EBS (Elastic Block Store) — is fast enough for our purposes.

Now our data can scale with the size of the SSD we can slap on our server. Or even better, if we’re in a cloud provider, we can add a virtualized disk to scale as much as we need.

Aging Out of Data

But wait a minute. We’re going to be shoveling millions of events into our message broker. Aren’t we going to run out of disk space rather quickly?

That’s why we have a time to live (TTL) for the data. Our data will age out of storage. This setting is usually configurable. Let’s say we set it to one hour. Events in our stream will then only be stored for one hour, and after that, they’re gone forever.

Another way of looking at it is to think of the stream on disk as a circular buffer. The message broker only buffers the last hour of data, which means that the consumer of this data has to be at most one hour behind.
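A rough sketch of the aging-out logic follows. It is a toy model with made-up names, and it assumes events are appended in timestamp order. It keeps everything in memory for brevity and passes the clock in explicitly so the behavior is easy to follow. It is not a description of how any particular broker implements retention.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical buffer that forgets events older than a configurable TTL.
class RetentionBuffer {
    private static final class Entry {
        final long timestampMillis;
        final String event;

        Entry(long timestampMillis, String event) {
            this.timestampMillis = timestampMillis;
            this.event = event;
        }
    }

    private final long ttlMillis;
    private final Deque<Entry> entries = new ArrayDeque<>();

    RetentionBuffer(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Assumes callers append with non-decreasing timestamps.
    void append(String event, long nowMillis) {
        entries.addLast(new Entry(nowMillis, event));
    }

    // Drops every event older than the TTL and returns how many were dropped.
    int expire(long nowMillis) {
        int dropped = 0;
        while (!entries.isEmpty()
                && nowMillis - entries.peekFirst().timestampMillis > ttlMillis) {
            entries.removeFirst();
            dropped++;
        }
        return dropped;
    }

    // Returns the events still retained, oldest first.
    List<String> events() {
        List<String> out = new ArrayList<>();
        for (Entry e : entries) {
            out.add(e.event);
        }
        return out;
    }
}
```

With a one-hour TTL, an event appended at time zero is gone as soon as the clock passes the one-hour mark, whether or not anyone has read it.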

Introducing Your New Friend, Apache Kafka

In fact, the system I’ve just described is exactly how Apache Kafka works. Kafka is one of the more popular big-data solutions and the best open-source system for streaming available today.

Here’s how we create a producer and use it to write data to Kafka using Java.

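import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> kafkaProducer = new KafkaProducer<>(properties);
kafkaProducer.send(new ProducerRecord<>("mytopic", "0", "test message")); // Push a message with key "0" to topic "mytopic"
kafkaProducer.close(); // Flush any buffered messages before exiting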

Now on the other side we have our consumer, which is going to read that message.

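import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("group.id", "mygroup"); // Name of the consumer group this consumer belongs to
KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);
kafkaConsumer.subscribe(Collections.singletonList("mytopic")); // Subscribe to "mytopic"
ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofSeconds(1)); // Wait up to one second for the next batch of records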

There are some details here that are specific to Kafka’s jargon. First of all, we have to point our consumer/producer code to the Kafka broker and configure how we want it to transfer data back and forth out of the topic. Then, we have to tell it what sort of data to fetch by specifying a topic.

A topic is essentially the name of the stream we’re reading from and writing to. When we produce data to Kafka, we have to specify which topic we’re writing to. Likewise, when we create a consumer, we must subscribe to at least one topic.

Notice we don’t have any commands in this API to modify data. All we can do is push a ProducerRecord and get the data back as ConsumerRecords.

That’s all well and good, but what happens in-between?

It’s All About the Log

Kafka’s basic data structure is the log.

You’re familiar with logs, right? If you want to know what’s happening on a server, you look at the system log. You don’t query a log — you just read it from beginning to end.

And on servers with a lot of log data, the data is often rotated so older logs are discarded, leaving only the recent events you’re most likely interested in.

It’s the same thing with Kafka data. Our stream in Kafka is stored in rotating log files. (Actually, a single topic will be split among a bunch of log files, depending on how we’ve partitioned the topic.)

So how does our consumer of the data know where it left off?

It simply saves an offset value that represents its place in the stream. Think of this as a bookmark. The offset lets the consumer recover if it shuts down and has to resume reading where it left off.
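The bookmark idea fits in a few lines. The sketch below is a toy model with made-up class names, not Kafka’s client API: an append-only list stands in for the log, and the consumer carries a single number that says where to read next. Save that number somewhere durable and a restarted consumer can pick up exactly where the old one stopped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only log: records are added at the end and never changed.
class OffsetLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was stored at.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns up to max records starting at the given offset.
    List<String> read(long offset, int max) {
        int from = (int) Math.min(offset, records.size());
        int to = Math.min(from + max, records.size());
        return new ArrayList<>(records.subList(from, to));
    }
}

// Hypothetical consumer whose only state is its offset, the bookmark.
class BookmarkedConsumer {
    private long offset; // offset of the next record to read

    BookmarkedConsumer(long savedOffset) {
        this.offset = savedOffset;
    }

    // Reads the next batch and moves the bookmark past it.
    List<String> poll(OffsetLog log, int max) {
        List<String> batch = log.read(offset, max);
        offset += batch.size();
        return batch;
    }

    long offset() {
        return offset;
    }
}
```

A consumer that reads two records, saves offset 2, and crashes can be replaced by a new consumer started from offset 2, and that new consumer will see the third record next.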

Now We Have Live Data

Now that we have our data in a live stream, we can perform analysis on it in realtime. The details of what we do next will have to be left for another article.

Suffice it to say, once we have the data in a stream, we can start using a stream processor to transform the data, aggregate it, and even query it.

It’s Functional

As we’ll see is the case in many big-data systems, Kafka uses the functional model of computation in its design.

Note that the data in Kafka is immutable. We never go into Kafka data and modify it. All we can do is insert and read the data. And even then, our reading is limited to sequential access.

Sequential access is cheap because Kafka stores the data together on disk. So it’s able to provide efficient access of blocks of data, even with millions of events being inserted every second.

But wait a minute. If the data is immutable, then how do we update a value in Kafka?

Quite simply, we make another insertion. Kafka has the concept of a key for a message. If we push the same key twice and if we enable a setting called log compaction, then Kafka will ensure older values for the same key are deleted. This is all done automagically by Kafka — we never manually set a value, just push the updated value. We can even push a null value to delete a record.
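The effect of compaction can be sketched in a few lines of plain Java. This is only a model of the end result, not how Kafka implements it (Kafka compacts older log segments in the background, so stale values can remain visible for a while). The class and method names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CompactionSketch {
    // One keyed message; a null value acts as a delete marker.
    static final class Message {
        final String key;
        final String value;

        Message(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Walk the log in order and keep only the latest value for each key.
    static Map<String, String> compact(List<Message> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Message m : log) {
            if (m.value == null) {
                latest.remove(m.key);       // null value: drop the key entirely
            } else {
                latest.put(m.key, m.value); // newer value replaces the older one
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Message> log = new ArrayList<>();
        log.add(new Message("user-1", "alice@old.example"));
        log.add(new Message("user-2", "bob@example.com"));
        log.add(new Message("user-1", "alice@new.example")); // an "update" is just another insert
        log.add(new Message("user-2", null));                // a "delete" is a null value
        System.out.println(compact(log)); // prints {user-1=alice@new.example}
    }
}
```

Note that nothing in the log was modified in place. The producer only ever appended, and the compacted view falls out of reading the messages in order.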

By avoiding mutable data structures, Kafka allows our streaming system to scale to ridiculous heights.

Why Not Data Warehousing?

At first glance, a stream might seem like a clumsy solution. Couldn’t we just design a database that can handle our write volume — something like a data warehouse — and then query it later when processing?

In some cases, yes, data warehousing is good enough. And for certain types of uses, it might even be preferable. If we know we don’t need our data to be live, then it might be more expensive to maintain the infrastructure for streaming.

The turnaround time for processing our data out of a data warehouse will be slow, delayed by hours or perhaps days, but maybe we’re happy with a daily report. Let’s call these solutions batch processing systems.

Batch processing is more commonly found in extract-transform-load (ETL) systems and in business-intelligence departments.

The Limitations of Batch Processing

There are lots of cases where batch processing isn’t good enough.

Consider the case of a bank processing transactions. We have a stream of transactions coming in, showing us how our customers are using their credit cards in real time.

Now let’s say we want to detect fraudulent transactions in the data. Could we do this with a data-warehousing system?

Probably not. If our query takes hours to run, we won’t know if a transaction was fraudulent until it’s too late.

Where Can I Use This?

Streaming isn’t the solution for every problem. For certain types of problems that are becoming increasingly relevant in the modern world, these are critical concepts, but not everyone is building a realtime system.

But Kafka is useful beyond these niche cases.

Message brokers are crucially important in scaling any system. A common pattern with a service-oriented architecture is to allow services to talk to one another via Kafka, as opposed to HTTP calls. Reading from Kafka is inherently asynchronous, which makes it perfect for a generic messaging tier.

Look into using Kafka or a similar message broker (such as Amazon’s Kinesis) for streaming your data any time you have a large write volume that can be processed asynchronously.

A messaging tier might seem like overkill if you’re a small company, but if you have any intention of growing, it’ll pay dividends to get this solution in place before the growing pains start to hurt.


As we’ve seen, functional-programming concepts have made their way into infrastructure components in the world of big data. Kafka is a prime example of a project that uses a very basic immutable data structure — the log — to great effect.

But these components aren’t just niche systems used in cutting-edge machine-learning companies. They’re basic tools that can provide scaling advantages for everyone, from large enterprises to quickly growing startups.

The Limitations Of Automated Testing And What To Do About It

When you’re developing a software system, you know you should write automated tests. That’s a given.

Why should we write tests? Some common answers given:

  • Tests will catch bugs.
  • Tests improve the quality of the code.
  • Our senior architect said we should.

In this article, I’d like to shift the perspective a bit. I’ll look at automated testing from a pragmatic angle, analyzing the different types of tests we might write, and also highlight some of the limitations inherent in automated testing.

Why Test At All?

Why do we bother to test our system at all? The motivation is quite simple.

We want to identify and locate defects in our system.

This statement sounds glib, but it’s actually crucial. We assume there are defects in our system. We do not write tests to reassure ourselves that we are great programmers and to pat ourselves on the back for being so wonderful. We write tests to find the bugs we know are hiding in our system, despite our best efforts.

Defects can be syntax errors, or they can be incorrect implementations of a vendored service or a database client, or brain twister mistakes in multi-threaded programs. The list goes on and on.

In any case, a defect is a case where the program doesn’t do what the programmer intended.

Naturally, we want to identify these cases in our system. How can we best go about that?

Code Quality As A Metric

Organizations often lean toward metrics like code coverage as a way of proactively encouraging testing and maintaining a certain standard across teams.

Code coverage tools work by parsing code into decision trees, then running through it alongside the test code and generating a numerical rating based on how much of the decision tree was visited during the test.

We end up with a number, like 78.98%. That looks nice to put in reports and on slides.

But what does it mean? How do we know if we have “enough” code coverage? Is 80% too low? What about 90%? Does it depend on the project?

And what about the quality of the tests? Is it possible that we’re visiting every logical branch in the code but checking something trivial or unimportant at each step of the way?

It’s a well-meaning effort, but I don’t like code coverage metrics. I think it prevents us from seeing the forest for the trees.

Tests Are Just More Code

Automated tests are just code. And like all code, they can contain bugs and mistakes. Buggy tests may conceal real bugs in the code we are testing.

For example, say we test a simple algorithm. When we call it with a certain value x=10, it should return 100. But it actually returns 99.

But when we wrote the test, we were confused. So we tested that the algorithm should return 99 when we call it with x=10.

Congratulations! We now have 100% test coverage of this module.

But our test is wrong. Not only is the test wrong, it is hiding the bug, which is the opposite of what we want it to do.
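Here is roughly what that situation looks like in Java. The text doesn’t say what the algorithm is, so assume for illustration that it is supposed to square its input; the off-by-one is the defect, and the class and method names are invented.

```java
class SquareExample {
    // Intended behavior: return x squared. The "- 1" is the defect.
    static int square(int x) {
        return x * x - 1;
    }

    // The confused test: it encodes the buggy result as the expectation.
    static boolean confusedTestPasses() {
        return square(10) == 99;
    }

    // What the test should have checked, based on the intended behavior.
    static boolean correctTestPasses() {
        return square(10) == 100;
    }

    public static void main(String[] args) {
        System.out.println("confused test passes: " + confusedTestPasses()); // prints true
        System.out.println("correct test passes: " + correctTestPasses());   // prints false
    }
}
```

Both versions of the test execute exactly the same line of feature code, so a coverage tool would score them identically. Only one of them finds the bug.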

Writing Tests Requires Context

I do not recommend having QA or test engineers write tests instead of developers. I have seen this practiced in some organizations and it never works out.

Only the developer who wrote the feature knows the nuances and potential snares in the system. They alone are equipped to test it properly.

Furthermore, separating the task of writing feature code from writing test code results in test engineers having to rewrite code produced by developers, or else accept that certain modules are untestable. This is because test code often needs to stub a function return value, inject a class, or do other runtime substitutions which require the feature code to be properly structured.
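As a small sketch of what “properly structured” can mean, here is feature code that receives its dependency through the constructor so a test can substitute a stub. RateSource and PriceConverter are hypothetical names, not taken from any real project.

```java
// Hypothetical dependency, e.g. a client for an external rates service.
interface RateSource {
    double currentRate(String currency);
}

class PriceConverter {
    private final RateSource rates;

    // The dependency is injected, so a test can pass in a stub
    // instead of a client that makes real network calls.
    PriceConverter(RateSource rates) {
        this.rates = rates;
    }

    double toCurrency(double amount, String currency) {
        return amount * rates.currentRate(currency);
    }
}

class PriceConverterDemo {
    public static void main(String[] args) {
        // In a test, the stub is a one-line lambda with a fixed return value.
        PriceConverter converter = new PriceConverter(currency -> 2.0);
        System.out.println(converter.toCurrency(10.0, "EUR")); // prints 20.0
    }
}
```

If PriceConverter had constructed its own client internally, there would be no seam for the stub, and whoever wrote the test would have to rewrite the feature code first.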

We wouldn’t want the person building the space shuttle to have no hand in testing its operation, would we? And we wouldn’t want the testers to go into the engine room and rewire the oxygen to test it. So why do this with our software systems?

Time Investment

Writing tests takes time.

In some projects I’ve worked on, I spent up to 50% of my time writing tests. This ratio is not evenly distributed. Sometimes a single line of code resulted in the addition of 20–30 lines of test code. On the other hand, sometimes a block of code resulted in only one line of test code.

Time is a zero sum game. Every hour that we spend writing test code is an hour we could have been writing product feature code.

And vice versa. If we slow down our progress on churning out new features, we have more time for writing tests.

Whether to write tests or not is always a trade-off. Tests are competing for a limited resource — our time.

So we have to make sure we’re testing the right thing.

A Thought Experiment

If you’ll humor me, I want you to try the following thought experiment.

Imagine, for a minute, that there is no such thing as an automated test. Somehow, in the history of computing, no one has ever thought to write programs to check other programs. Your only choice when testing your system is to do it by hand.

On the plus side, you have six weeks to test every aspect of the project. In this fictional world, all programmers are also manual testers. It’s part of your job description. It’s not optional. You must do it before the project launches, fixing bugs along the way.

My question to you now is simple.

What do you test?

Do you try to test every line of code to ensure that it operates exactly the way you think it does? Do you suspiciously check that database client to make sure it really returns what it says it returns? Do you test every method? Every class? What about the ones that don’t really do anything except call something else?

Do you check functional features of the system? For example, if you’re building a system with an API, you might want to load a bunch of test data into the database and then make a series of API requests and make sure their responses match up with what you expect. Is that good enough? Will you know where the bug is if your test fails, or will it take you hours of debugging?

Are there any areas of your system that you suspect are hiding bugs? Very often these are the most complex parts of your system because defects hide in complexity. Intuitively, we all know where our bugs are probably going to come from.

Most products have some algorithmic component somewhere. Do you spend a lot of your time testing that this algorithm does what you think it does? Do you try giving it unexpected input just to see what happens?

Whatever you choose to invest most of your time in testing in this scenario, that is where you should invest your time when writing automated tests.

Automated tests are just programs that test what you would otherwise have to test by hand.

Not All Tests Are Created Equal

The fact that we write tests doesn’t necessarily mean we are writing the best tests.

Consider the following:

  1. A unit test that checks that an input to complex algorithmic code returns the correct value.
  2. A unit test that checks that a method in a class calls a method in another class.

Assuming both unit tests take the same time to write, and similar effort to maintain, which one is more valuable?

Hopefully you agree that the first is more valuable. We get more “bang for our buck” with tests that target complex code, where we are more likely to have bugs. The second test is fine, and it may be valuable, but it’s providing relatively little value compared to the first one.

We must be careful not to invest too much time in writing tests that provide little value.

The Fabled End-to-End Test

If our system is simple, we probably can get by entirely on simple tests. But when systems grow, they sprout extra limbs. Suddenly our simple web application with a RDBMS attached actually needs ten different components to operate. That adds complexity.

And remember, defects hide in complexity.

In every company I’ve worked at, there came a time when there was a complicated bug across multiple systems that was so terrible, and so devastating to our confidence in our system, that it convinced us to drop everything and look for a way to plug this hole in our testing strategy.

We had unit tests. We had functional service-level tests. But we hadn’t tested the interactions between all these systems together.

So we began to quest for the fabled “end-to-end” test.

And much like the lost city of El Dorado, we never quite found it.

End-to-end tests look something like this:

  • Push a message into queue.
  • Wait for Service A to pick up the message and insert it into the database.
  • Check that message is in the database.
  • Wait for Service B to notice the row in the database and push a message to another queue.
  • Check that the other queue contains a message.

And so on.

On the surface this seems like a perfectly valid way to test our complex system. Indeed, it comes closer to describing how the system actually works in production.

But I have never seen this style of test last longer than a few months without succumbing to the same fate.

The Fate Of Too-Big Tests

Inevitably, end-to-end tests fall into disrepair and neglect. This is primarily because they yield false-negatives — i.e., failing when there is not really a bug found — so often that their results are frequently ignored. This problem is even worse if they block the operation of a deploy pipeline.

False-negative tests become “the boy who cried wolf.” When we see them failing, we don’t take them seriously.

Thus the test code rots. Failing tests will be marked as ignored or commented out. Someone will eventually propose we take them out of the deploy pipeline and run the end-to-end tests only once in a while, basically admitting that they aren’t important.

What Went Wrong?

The problem with end-to-end tests is that they’re huge.

They cannot be easily booted up and run without some significant infrastructure. That might mean a special QA environment just for the tests. It might mean provisioning some cloud resource (like a SQS queue). It might mean reworking some part of the system to be more amenable to testing.

All of this requires more code, more work, and someone to maintain it going forward. Developers are typically more focused on completing features and not on maintaining test infrastructure, so tests with complex resource requirements are rarely a priority.

And why are these tests so unreliable anyway?

End-to-end tests often contain a lot of waiting for asynchronous tasks to complete — for example, checking if a row has been updated in a database. The only solution for this in most cases is polling with a timeout. Given that we are going to tear down and boot up the whole test environment on every run, this is basically begging for tests to fail for operational reasons, not because of bugs. That’s how we get our false-negatives.
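For reference, polling with a timeout usually boils down to a helper like the one below. This is a generic sketch with made-up names, not code from any particular test framework.

```java
import java.util.function.BooleanSupplier;

class Poller {
    // Checks the condition repeatedly until it is true or the timeout elapses.
    // Returns true if the condition was met, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the test cannot tell "slow" from "broken"
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The weak spot is the timeout value. Set it too low and a slow environment fails the test with no bug present; set it too high and every genuine failure stalls the pipeline for that long.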

Too-Big Tests Aren’t Useful Even When They Work

But it gets worse.

Even if the end-to-end test worked perfectly, it’s still not that valuable. What does it mean if a test that does everything fails?

If the “do everything” test fails, developers have to spend significant time debugging to identify where in the test the defect is hiding. Remember, the whole point of testing is to identify and locate defects. If our tests don’t do that, they’re not good tests.

Whether you call your test a unit test, a functional test, or a service test, tests are always more valuable when they are smaller. It is better to have several small tests than to have one large test, assuming that tradeoff is possible.

Smaller tests are better because when they fail they tell the developer where the bug is. They say, “hey, there’s a bug in the Foo service’s run endpoint!”

A big, unwieldy test mumbles under its breath, “hey, uh, I think there’s a bug somewhere. But I don’t know where.”

A Better Large Test

Tests that span service boundaries can be great. But we have to narrow in on what they are testing and what they are not testing.

For example, service-level tests are great for checking a schema. If we have a contract in Service A that an endpoint should return schema X, then we can have another test for Service B that assumes this schema (for example, using a stubbed function call). Do we really need to test that the two services can talk to each other over HTTP? Probably not. HTTP is reliable enough.
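One way this split could look, sketched in Java with invented names and an invented two-field schema: Service A’s tests check that its responses contain the agreed fields, and Service B’s tests run its logic against a stubbed response of the same shape. No network is involved on either side.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ContractSketch {
    // The agreed "schema X": fields the endpoint's response must contain (hypothetical).
    static final Set<String> REQUIRED_FIELDS = new HashSet<>(Arrays.asList("id", "status"));

    // Service A's side: does a response honor the contract?
    static boolean matchesContract(Map<String, Object> response) {
        return response.keySet().containsAll(REQUIRED_FIELDS);
    }

    // Service B's side: logic that assumes the contract holds.
    static boolean isShipped(Map<String, Object> response) {
        return "SHIPPED".equals(response.get("status"));
    }

    // Stub standing in for the HTTP call to Service A.
    static Map<String, Object> stubbedResponse() {
        Map<String, Object> response = new HashMap<>();
        response.put("id", 42);
        response.put("status", "SHIPPED");
        return response;
    }

    public static void main(String[] args) {
        System.out.println(matchesContract(stubbedResponse())); // prints true
        System.out.println(isShipped(stubbedResponse()));       // prints true
    }
}
```

When one of these checks fails, it points at one side of the boundary: either Service A broke the contract or Service B mishandles a valid response.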

By taking the operational components out of the test, we can get the value of a cross-service boundary test without all that complexity.

Now when the tests fail, we know what they’re trying to communicate to us!

The Ideal Test

Taking all of this together, we can say a few things generally about “good” tests.

  1. They don’t report false-negatives. When they fail, they identify the presence of a defect.
  2. They are small and bounded. When they fail, they tell the developers where the defect is hiding.
  3. They are testing some code that we wrote, not third-party code.
  4. They are things we would prioritize testing by hand if we had the time.
  5. They are targeting complex parts of our system where bugs are likely to hide.

Note that all of these points help us to target our goal of identifying and locating defects in the system.

Of course, terms like “small” are entirely subjective, and there are cases where bigger or smaller tests are appropriate. You may adjust the definition of “small” according to circumstances, and the principles still hold.

A Common Theme

A common theme tying together all these observations from my experiences is that testing is often approached with an overzealous mindset.

We have grand visions of a testing approach that tests everything that could go wrong in the entire system. Every line of code is “covered” by our testing framework. Our end-to-end test spans the entire system because defects might occur anywhere.

This shows that we are focusing on the idea of testing rather than the practical outcome that we want. Remember, we want to identify and locate defects.

The Limitations Of Testing

I’m not saying we shouldn’t write tests. We should. We must. We shall.

But tests are just one tool in the arsenal of software engineers to manage the quality of our systems. And they are a fallible and limited tool.

Automated tests carry a significant maintenance cost. Someone has to keep our testing infrastructure up and running. Tests are just code, and we will have to change them as our system evolves. They cost real hours of effort that could be devoted to project work. And test code itself will inevitably contain bugs that may conceal defects in our working system.

Tests also must be limited in their scope or they lose their usefulness in locating defects. So we can’t simply write giant system-spanning tests in incredible detail and expect that our developers will find this useful.

So what can we do? Is it possible that there is something better than testing that we should be doing?

As it so happens, yes. Yes, there is.

Code Review: Better Than Testing

Nothing is better than code review for finding issues in software systems. A code reviewer is not a program. They are a real human being, possessing intelligence and context for the whole project, who will look for problems we never even considered.

A good code reviewer will find complex multi-threading issues. They will find slow database queries. They will find design mistakes. They will even identify bad automated tests that are disguising defects.

Even better, they will suggest improvements in the code that go beyond finding defects!

So why is there such emphasis on testing, with innumerable frameworks and tools released, and so little on code review?

Perhaps it is because code review is a nebulous process that depends in large part on the efforts of individual contributors. And we in the software industry have a nagging tendency to downplay the importance of human intelligence and ingenuity in development of our systems.


As I will explore further in future articles, I believe that the software industry often suffers from overemphasizing technological solutions over solutions that rely on human intelligence.

Technological solutions are great for amplifying human intelligence and ingenuity. Automated tests, for example, save us the trouble of manually testing our entire product. Thank goodness we don’t live in that hypothetical world without them!

But we need those intelligent, thoughtful, and conscientious engineers at our side to look for problems in our work and to gently challenge us to perfect our designs. Indeed, this is often the easiest way to catch serious defects in our systems.

Automated testing should always be a complement to our collaboration with colleagues, not a replacement for it.

Why you should build your new startup on Kubernetes

During the last interview cycle I did, back in early 2019, I spoke to a handful of startups. I always ask startups about their deployment pipeline because it gives me an idea of what stage of technical complexity the company is at. Some businesses can go really far on a simple PHP web app deployed with scp. Others hit limits and have to rework the system into several services with infrastructure components like Redis or Kafka used to communicate between them.

When they saw Kubernetes on my resume, interviewers often asked about it. There was a lot of interest in dipping toes into Kubernetes, but also some anxiety about whether it was appropriate for the particular use case. How did we use it at my last company? Was it difficult to learn? What was the development teams’ experience of working with it? Sometimes there were horror stories of bad implementations and fears that moving to Kubernetes was a mistake. Often I heard some very reasonable skepticism, a desire to keep deployments simple, and a hesitation to jump on the bandwagon.

So I will jump to the punchline here. If I were starting my own startup from scratch today, I would very likely start with Kubernetes. I’ve used it in two very different companies now and this is my (admittedly subjective) conclusion.

Quite simply, the few negatives are so greatly outweighed by the positives that I think it’s worth the investment for many startups. Not all startups. Not necessarily your startup. But a lot of them. Let’s take a look at the reasons why.

What is Kubernetes?

In short, Kubernetes is an open source container orchestration system, originally developed by Google. It has been contributed back to the community, with lots of new libraries and plugins (called “operators”) contributed by third parties.

Kubernetes is not a cloud platform like AWS (Amazon Web Services) or GCP (Google Cloud Platform). In fact, you could run and deploy Kubernetes on your own hardware in your own data center, if you were so inclined, though I don’t recommend this for startups.

Think of it more like a language that we can use to describe a working system. Once we describe the system in enough detail, Kubernetes can then go and use its compute resources (“nodes” in Kubernetes parlance, also known as, you know, computers) to run containers that execute our system. 

The big benefit for startups is that this process of “describing the working system” serves as documentation and a centralized location in code for defining infrastructure.

Kubernetes pays for itself

I’m not going to lie. EKS (the managed Kubernetes solution provided by Amazon) is expensive. It will cost you an overhead of $0.20 per hour, or $1,753.20 a year, on top of your EC2 costs. It’s not free.

But consider what you would pay to have an engineer manually bring up nodes. The time lost to these purely infrastructure changes is time taken away from developing your product. If you’re a startup trying to just hit your next goal, you should be happy to pay a (reasonable) overhead to magically erase an error-prone and time-consuming process from your team.

With the Terraform tool at your hand, you can also create a Kubernetes cluster that can be scaled with a simple one line change. In my last team, our cluster grew from 2 to 4 nodes with a Git commit that changed a 2 to a 4. It was literally a one line change. After the nodes were added, Kubernetes automagically moved resources onto the new nodes and no further work was required. Then you can move on to solving real problems.

Deployments are easy

A traditional Linux production system typically looks like this. You have some code written in Java, Python, or Ruby. The application code is often written by people that don’t know servers very well, or at least aren’t practiced in them. You have a machine, let’s say in Amazon EC2, which is managed by someone in your ops team, who doesn’t know the application code very well. When the application team completes some work, they want to be able to deploy those changes. The ops team wants to ensure those changes don’t break anything.

You also don’t want the system to go offline during a deploy. You want to be able to “roll back” to a previous version of the code if something goes wrong. And what if your deploy process, from uploading assets to starting the server, takes 30 minutes? Will you take your system offline for 30 minutes?

Probably not. You’ll likely come up with some system for keeping version n-1 running until version n starts up, at which point you’ll switch to point to the new version.

But boy does that seem complicated. It’s a lot to remember, and a lot that can go wrong. Those deploy rules will be written in a series of scripts that need to be versioned and maintained, and could very well contain bugs themselves. And when we’re expanding the company into separate teams, all of them trying to deploy multiple times a day, it starts to feel scary. Ops team members start to get overwhelmed with the amount of churn in the system. Deploys start to take longer and longer, as the process becomes laden with more and more complexity.

Does this story sound familiar?

Kubernetes does away with much of that complexity. To deploy a new version of a service, we can simply update the container image to point to the new version of code. We can also define a health check that will be performed before declaring that the new version is working. If it doesn’t pass, the old version of code keeps running.

We can define a service using an internal-only DNS name, like order-service, which will automatically load-balance to running replicas. Nobody has to maintain a list of running instances.

And if we find a problem after the deploy, a simple “roll back” command looks up the previous container image and applies it. Often this can take just a few seconds, and then we’re back to running the last known stable version of our software.

Doesn’t that sound nice?

You don’t need a separate ops team that does everything

Kubernetes itself is a complex beast. But using it is achievable for any seasoned developer.

That’s because instead of using a complex series of bash scripts, special deploy tools, and so on, Kubernetes deployments are managed with simple declarative YAML files.

That’s right. The simple XML replacement championed by Ruby enthusiasts is all you need to know to work with Kubernetes.

Using nothing but YAML, we can define a whole working system with auto-scaling, replication, and service resolution. Then using the kubectl CLI tool, we can ask the cluster to run our configuration. We never directly tell Kubernetes to do anything. Rather, it reads our declarative YAML and interprets what needs to be done.

Do you think your developers can figure out how to write YAML? I do!

I’ve worked on some complex systems that required the person managing the deploy to understand a) Python, b) bash, c) some minor intricacies of the OS version we were running, d) JVM flags (God help you), e) scp commands (can you write a valid scp command without looking at docs?)… and so on.

There is also an organizational overhead. Often the deploy scripts and infrastructure code are managed by the ops team. But developers often need to make changes in the deploy code – for example, to set a flag on startup – and to scale up the system. That creates a tension between developers and ops, since the two groups create demands on one another, but are often beholden to different objectives.

All that complexity and overhead adds a tax to everything you do in your startup. If you want to develop new features quickly and have the ability to easily jump from project to project, then you really want to keep that friction as low as possible. Kubernetes abstracts away a lot of the pain, leaving you to focus on the product.

Situations where you probably don’t need Kubernetes

Of course, there is no silver bullet, and there are cases where something like Kubernetes is overkill.

Simple WordPress sites, CMSes, etc.

If you’re just running WordPress, you don’t need Kubernetes. If you’re running a CMS that never really gets deployed except once in a while to upgrade libraries or install a plugin, you don’t need Kubernetes. It’s really optimized for managing large, changing systems.

Embedded systems, anything needing access to a real OS

Obviously, if you’re writing low-level embedded systems or software that needs to interface with the Linux kernel, Kubernetes is not for you. That goes for any containerization solution.

Your product is primarily a database

Kubernetes does have a resource type called a “StatefulSet” intended for running things like databases and message brokers that manage state. In theory, a StatefulSet could allow you to run multiple replicas and scale them up and down, and attach and grow storage.

But doing so always makes me a little nervous. With an application service, I want to make it easy for developers to tweak settings and deploy without trouble. With databases, I want just the opposite. It should be hard to accidentally change a setting or upgrade the system to a new version. I also don’t want my database competing for CPU and memory within the cluster.

I’m especially prone to not use Kubernetes for databases if I’m using AWS and have access to RDS. RDS or its equivalent in your cloud provider of choice is going to be a lot easier for managing automatic backups, scaling, and monitoring.


Kubernetes is perfect for any project that needs to scale and grow over time.

If you’re a startup, you almost certainly fall into that category. You might be small right now, but you want to grow. It’s what you tell your investors and it’s the reason you’re hiring so many developers. Your system is going to change and expand quickly, and you want to build it in a way that allows this with the least amount of added cost and friction possible.

For that reason alone, I think it makes a lot of sense for any ecommerce, SaaS, or similar company to invest in Kubernetes early on. Even if you’re just deploying a single simple web application within the cluster, planning for the future means building your infrastructure carefully to enable your team to move quickly a year or three down the line.

Thanks for reading and best of luck!

Want a decentralized web? Start a blog

There is a lot of talk about the decentralized web now. I’m very interested to see where IPFS takes us, and I’m even working on my own little IPFS backed project in my spare time.

So what is decentralization? It’s a fancy way of talking about peer-to-peer networks. P2P (a common shortening of “peer-to-peer”) represents the potential future of a distributed, decentralized internet.

But to understand the difference between that and the “normal” internet, you have to understand a bit about networking. When I load up Facebook’s homepage, I am essentially plugging into Facebook’s servers via my browser software. These are thousands upon thousands of computers wired together in a data center somewhere. When you enter your profile information, or post a snarky quip, it gets stored on Facebook’s servers. And Facebook owns that information. They can harvest your data however they want, with some limits in Europe thanks to GDPR and the soon-to-come CCPA in California. Unless you’re in one of those jurisdictions, your data is owned by Facebook. And Facebook can profit from it however they choose.

If you were to upload the same data to IPFS, the situation would be different. IPFS stores data across multiple computers around the world called peers. Unfortunately, accessing something like a whole webpage in such a system can be slow, because we have to stitch together data from across the network. It’s not all within a single data center controlled by one company. IPFS also keeps tabs on that data by its cryptographic hash, an idea it shares with block chains, which makes it tricky to “search” for your data unless you know its precise address in this vast network.
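To make the “precise location” idea a bit more concrete, here is a rough Scala sketch of content addressing, where a piece of data is looked up by a hash of its contents rather than by the server it lives on. This is a toy under my own assumptions: the names ContentAddressingSketch, address, put, and get are made up for illustration, a plain SHA-256 hex string stands in for a real IPFS identifier, and an in-memory Map stands in for the whole network of peers.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object ContentAddressingSketch {
  // Hex-encoded SHA-256 of the content. A stand-in for a real IPFS
  // identifier, which uses a richer format than a bare hex digest.
  def address(content: String): String = {
    val digest = MessageDigest
      .getInstance("SHA-256")
      .digest(content.getBytes(StandardCharsets.UTF_8))
    digest.map(b => f"${b & 0xff}%02x").mkString
  }

  // Store the content under its own hash and return the updated store
  // together with the address needed to find the content again.
  def put(store: Map[String, String], content: String): (Map[String, String], String) = {
    val key = address(content)
    (store + (key -> content), key)
  }

  // Lookup only works if you already know the address.
  def get(store: Map[String, String], key: String): Option[String] =
    store.get(key)
}
```

Because the address is derived from the content, changing even one character produces a different address. That is where the tamper resistance comes from, and it is also why you can’t find anything without knowing its hash up front.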

If you are at all politically minded, you can see why the decentralized model is getting so much traction in certain circles. It offers an alternative to the world that emerged in the late 2000s where a handful of private entities control vast swaths of the world’s data.

Eventually, so the theory goes, people would come up with a decentralized version of sites like Facebook. It would be powered by something like IPFS. In fact, Diaspora (which predates IPFS by quite a lot) and Mastodon have already achieved something like this, though they do it with federated servers rather than IPFS.

That said, technology like IPFS has a few glaring practical challenges. For one thing, storing all of that data will take up a ton of space and processing power. Ordinary citizens would have to decide to host an IPFS instance (called a node), pay the electric bill, keep a high speed internet connection running to it, and so on. Some probably would do so out of passion for the project. But it’s almost inevitable that eventually IPFS nodes will be largely running on Amazon or Google hosted cloud servers. Perhaps even one of these big companies would offer “free” IPFS hosting in exchange for an opportunity to packet-sniff everything coming in and out. So the big companies might end up effectively re-centralizing this system in practice if not in principle. Sure, the block chain-style* content hashing makes tampering with the actual data detectable, but it does kind of raise the question of why we are running this complex peer-to-peer system if it tends to become centralized anyway.

So what is a boy to do? I am a fan of simple solutions. And IPFS, though intriguing, is far from simple. A simple solution for owning your own information online?

Host a blog.

At this blog, I have an opportunity to present myself to the world exactly as I choose. I run WordPress and manage it myself. WordPress isn’t trying to monetize my content or manipulate my audience with algorithms to increase conversion rate on ads. I can post my own profile online here with as much or as little information about me as I want to share with the world.

If you want to see the internet do better than Facebook and Twitter, give blogging a try! All you need is an AWS account, a little bit of technical know-how, and something you want to say.

*I hope to write a more thorough article about blockchain at a later date. It’s a complicated enough idea to warrant its own article.

Functional programming jargon decoded

Over the last year I’ve learned Scala to the point of literacy. In the course of doing so, I also acquired some hands-on understanding of functional programming patterns. In this post I’m going to attempt to define some common terms that get thrown around by Scala people, with the hope of providing a sort of glossary.

Type constructors

If you’ve worked with a language with generics support like Java, you’re surely familiar with common generic types like List. List itself is not a complete type. Setting aside Java’s legacy raw types, we can’t declare something as just a List. Rather, we have to refer to a concrete instantiation like List<Integer> or List<String>. We can think of List as a thing which, given some parameter (which is itself a type), produces a type.

In Scala this is called a type constructor. (It often gets mixed up with a type class, which is a different idea: a trait describing behavior that can be implemented for many types, such as the monoid below.) Back to our Java example, try to think of List as a special function that knows how to build a type given another type.
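Here is a small Scala sketch of the “function that builds a type” idea. The names TypeConstructorSketch, Ints, Strings, and Sized are my own inventions for illustration and don’t come from any library.

```scala
object TypeConstructorSketch {
  // List on its own is not a usable type; applying it to Int or String
  // produces one.
  type Ints    = List[Int]
  type Strings = List[String]

  // F[_] declares a parameter that is itself a type constructor, so we
  // can plug in List or Option here, but not Int or List[Int].
  trait Sized[F[_]] {
    def size[A](fa: F[A]): Int
  }

  val listSized: Sized[List] = new Sized[List] {
    def size[A](fa: List[A]): Int = fa.length
  }

  val optionSized: Sized[Option] = new Sized[Option] {
    def size[A](fa: Option[A]): Int = if (fa.isDefined) 1 else 0
  }
}
```

Sized only cares about the shape of the container, not about what is inside it, which is exactly what abstracting over a type constructor buys you.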


Monoids

The term “monoid” sounds complicated, but without even getting into the abstract algebra behind it, we can define it in simple practical terms.

Simply put, a monoid is a type that can be combined. A nicer name might be addable. To define a monoid, we need two things:

– The definition of “zero” for this type. (E.g., the number 0, an empty string, an empty list.)
– An operator that defines how to combine two instances of the type.

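“Combining” in this sense can mean many things (adding or multiplying numbers, concatenating strings or lists) as long as it is associative and combining with the “zero” leaves the other value unchanged. Subtraction, for example, does not qualify: (5 - 3) - 1 is 1, but 5 - (3 - 1) is 3.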
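The two ingredients above translate almost directly into Scala. This is a minimal sketch with names of my own choosing (MonoidSketch, Monoid, zero, combine, combineAll); libraries such as Cats define their own version with different names, so treat this as an illustration rather than anyone’s official API.

```scala
object MonoidSketch {
  // A monoid for type A: a "zero" value plus a way to combine two values.
  trait Monoid[A] {
    def zero: A
    def combine(x: A, y: A): A
  }

  // Integers under addition, with 0 as the zero.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    val zero = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Strings under concatenation, with the empty string as the zero.
  val stringConcat: Monoid[String] = new Monoid[String] {
    val zero = ""
    def combine(x: String, y: String): String = x + y
  }

  // Lists under concatenation, with the empty list as the zero.
  def listConcat[A]: Monoid[List[A]] = new Monoid[List[A]] {
    def zero: List[A] = Nil
    def combine(x: List[A], y: List[A]): List[A] = x ++ y
  }

  // Collapse a whole list using nothing but the monoid's two pieces.
  def combineAll[A](items: List[A], m: Monoid[A]): A =
    items.foldLeft(m.zero)(m.combine)
}
```

The nice part is that combineAll is written once and works for any type that has a monoid, and the zero gives it a sensible answer even for an empty list.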


Monads

Another complicated-sounding name for something powerful and expressive, monads are the cornerstone of functional programming. The monad is a design pattern that allows purely functional programs to be written in an imperative style.

I like to think of monads as mappables. Whereas the monoid defines a thing that can be added, the monad defines a thing that can be transformed. Monads are useful for modeling side-effects in a purely functional way.

List is a simple example of a monad. Given a List<String> (or List[String] in Scala), we could define a function which, given a String, produces another String. This might do something like transform some text into its uppercase form. Or we could transform the String to an Integer, e.g. by finding the length of the string.

Note that the transformation is defined in terms of the contents of the List, but the transformation operation occurs on the List itself. In other words, our transforming function is:

String => Integer

But the transformation is:

List[String] => List[Integer]

More generally, we write this as:

F[A] => F[B]

where the transforming function is:

A => B

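The transformation operation is called map in functional terminology, and F in the above example is our monad. Strictly speaking, anything that supports map is a functor; a monad additionally supports flatMap, which takes a function A => F[B] and flattens the result back into a single F[B], plus a way to wrap a plain value in F. It’s a fancy name for a powerful design pattern that allows us to generically compose statically typed programs with no sucky side-effects.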
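Here is roughly what this looks like with Scala’s built-in List and Option. The object name MapSketch and the helper parseInt are mine, made up for the example; map, flatMap, and Try are from the standard library.

```scala
import scala.util.Try

object MapSketch {
  val words: List[String] = List("monad", "monoid", "functor")

  // String => String, applied across the list: List[String] => List[String]
  val shouted: List[String] = words.map(_.toUpperCase)

  // String => Int, applied across the list: List[String] => List[Int]
  val lengths: List[Int] = words.map(_.length)

  // The same idea works for any F that has map, for example Option.
  val maybeLength: Option[Int] = Option("scala").map(_.length)

  // A function of shape A => F[B]: parsing can fail, so it returns Option.
  def parseInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // flatMap chains such functions without producing Option[Option[Int]].
  val parsed: Option[Int] = Option("42").flatMap(parseInt)
  val failed: Option[Int] = Option("forty-two").flatMap(parseInt)
}
```

Notice that none of the transforming functions know anything about List or Option. They only deal with the contents, and map or flatMap takes care of the container.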