The Limitations Of Automated Testing And What To Do About It

When you’re developing a software system, you know you should write automated tests. That’s a given.

Why should we write tests? Some common answers given:

  • Tests will catch bugs.
  • Tests improve the quality of the code.
  • Our senior architect said we should.

In this article, I’d like to shift the perspective a bit. I’ll look at automated testing from a pragmatic angle, analyzing the different types of tests we might write, and also highlight some of the limitations inherent in automated testing.

Why Test At All?

Why do we bother to test our system at all? The motivation is quite simple.

We want to identify and locate defects in our system.

This statement sounds glib, but it’s actually crucial. We assume there are defects in our system. We do not write tests to reassure ourselves that we are great programmers and to pat ourselves on the back for being so wonderful. We write tests to find the bugs we know are hiding in our system, despite our best efforts.

Defects can be syntax errors, incorrect integrations with a vendored service or a database client, or head-scratching mistakes in multi-threaded code. The list goes on and on.

In any case, a defect is a case where the program doesn’t do what the programmer intended.

Naturally, we want to identify these cases in our system. How can we best go about that?

Code Quality As A Metric

Organizations often lean toward metrics like code coverage as a way of proactively encouraging testing and maintaining a certain standard across teams.

Code coverage tools instrument the code, run the test suite against it, and report a numerical rating based on how much of the code (its lines, branches, or paths) the tests actually exercised.

We end up with a number, like 78.98%. That looks nice to put in reports and on slides.

But what does it mean? How do we know if we have “enough” code coverage? Is 80% too low? What about 90%? Does it depend on the project?

And what about the quality of the tests? Is it possible that we’re visiting every logical branch in the code but checking something trivial or unimportant at each step of the way?
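
To make that last question concrete, here is a minimal sketch in Python (the classify function and the test are hypothetical) of a test that exercises every branch while asserting nothing useful:

    # A hypothetical function with one decision point.
    def classify(n):
        if n < 0:
            return "negative"
        return "non-negative"

    # This test visits both branches, so a coverage tool reports 100%...
    def test_classify_runs_without_crashing():
        classify(-1)
        classify(1)
        # ...but it asserts nothing about the return values, so a defect in
        # either branch would sail straight through.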

It’s a well-meaning effort, but I don’t like code coverage metrics. I think they prevent us from seeing the forest for the trees.

Tests Are Just More Code

Automated tests are just code. And like all code, they can contain bugs and mistakes. Buggy tests may conceal real bugs in the code we are testing.

For example, say we test a simple algorithm. When we call it with a certain value x=10, it should return 100. But it actually returns 99.

But when we wrote the test, we were confused. So we tested that the algorithm should return 99 when we call it with x=10.

Congratulations! We now have 100% test coverage of this module.

But our test is wrong. Not only is the test wrong, it is hiding the bug, which is the opposite of what we want it to do.
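
As a minimal sketch of that scenario, suppose the “simple algorithm” is a hypothetical square function with an off-by-one defect:

    # The intent is that square(10) returns 100, but the implementation is buggy.
    def square(x):
        return x * x - 1  # defect: off by one

    # The test was written from the observed (buggy) output instead of the
    # specification, so it passes, coverage reads 100%, and the defect stays hidden.
    def test_square_of_ten():
        assert square(10) == 99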

Writing Tests Requires Context

I do not recommend having QA or test engineers write tests instead of developers. I have seen this practiced in some organizations and it never works out.

Only the developer who wrote the feature knows the nuances and potential snares in the system. They alone are equipped to test it properly.

Furthermore, separating the task of writing feature code from writing test code results in test engineers having to rewrite code produced by developers, or else accept that certain modules are untestable. This is because test code often needs to stub a function return value, inject a class, or do other runtime substitutions which require the feature code to be properly structured.
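
As a rough sketch of what “properly structured” means here (all of the names below are hypothetical), compare a function that constructs its own database client with one that accepts the client as an argument:

    class DatabaseClient:
        def get_user(self, user_id):
            raise NotImplementedError("real network call in production")

    # Hard to test: the client is created inside the function, so a test
    # cannot substitute a stub without patching module internals.
    def fetch_user_name_hardwired(user_id):
        client = DatabaseClient()
        return client.get_user(user_id)["name"]

    # Easier to test: the client is injected, so a test can pass in a stub.
    def fetch_user_name(user_id, client):
        return client.get_user(user_id)["name"]

    class StubClient:
        def get_user(self, user_id):
            return {"id": user_id, "name": "test user"}

    def test_fetch_user_name():
        assert fetch_user_name(42, StubClient()) == "test user"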

We wouldn’t want the person building the space shuttle to have no hand in testing its operation, would we? And we wouldn’t want the testers to go into the engine room and rewire the oxygen supply just to test it. So why do we do this with our software systems?

Time Investment

Writing tests takes time.

In some projects I’ve worked on, I spent up to 50% of my time writing tests. This ratio is not evenly distributed. Sometimes a single line of code resulted in the addition of 20–30 lines of test code. On the other hand, sometimes a block of code resulted in only one line of test code.

Time is a zero-sum game. Every hour that we spend writing test code is an hour we could have spent writing product feature code.

And vice versa. If we slow down our progress on churning out new features, we have more time for writing tests.

Whether to write tests or not is always a trade-off. Tests are competing for a limited resource — our time.

So we have to make sure we’re testing the right thing.

A Thought Experiment

If you’ll humor me, I want you to try the following thought experiment.

Imagine, for a minute, that there is no such thing as an automated test. Somehow, in the history of computing, no one has ever thought to write programs to check other programs. Your only choice when testing your system is to do it by hand.

On the plus side, you have six weeks to test every aspect of the project. In this fictional world, all programmers are also manual testers. It’s part of your job description. It’s not optional. You must do it before the project launches, fixing bugs along the way.

My question to you now is simple.

What do you test?

Do you try to test every line of code to ensure that it operates exactly the way you think it does? Do you suspiciously check that database client to make sure it really returns what it says it returns? Do you test every method? Every class? What about the ones that don’t really do anything except call something else?

Do you check functional features of the system? For example, if you’re building a system with an API, you might want to load a bunch of test data into the database and then make a series of API requests and make sure their responses match up with what you expect. Is that good enough? Will you know where the bug is if your test fails, or will it take you hours of debugging?

Are there any areas of your system that you suspect are hiding bugs? Very often these are the most complex parts of your system because defects hide in complexity. Intuitively, we all know where our bugs are probably going to come from.

Most products have some algorithmic component somewhere. Do you spend a lot of your time testing that this algorithm does what you think it does? Do you try giving it unexpected input just to see what happens?

Whatever you would choose to spend most of your time testing in this scenario is exactly where you should invest your time when writing automated tests.

Automated tests are just programs that test what you would otherwise have to test by hand.

Not All Tests Are Created Equal

The fact that we write tests doesn’t necessarily mean we are writing the best tests.

Consider the following:

  1. A unit test that checks that an input to complex algorithmic code returns the correct value.
  2. A unit test that checks that a method in a class calls a method in another class.

Assuming both unit tests take the same time to write, and similar effort to maintain, which one is more valuable?

Hopefully you agree that the first is more valuable. We get more “bang for our buck” with tests that target complex code, where bugs are more likely to hide. The second test is fine, and it may catch something one day, but it provides relatively little value compared to the first.
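
To make the comparison concrete, here is a rough sketch of both kinds of test in Python; the pricing function, the OrderService, and the notifier are all hypothetical:

    from unittest.mock import Mock

    # Hypothetical algorithmic code: pricing with a bulk-discount rule.
    def calculate_price(quantity, unit_price):
        total = quantity * unit_price
        if quantity >= 100:
            total *= 0.9  # 10% bulk discount
        return total

    # Hypothetical service that simply delegates to a collaborator.
    class OrderService:
        def __init__(self, notifier):
            self.notifier = notifier

        def place_order(self, order_id):
            self.notifier.send(order_id)

    # 1. Targets the complex code: if the discount rule is wrong, this fails
    #    and points straight at the defect.
    def test_bulk_discount_applied():
        assert calculate_price(quantity=120, unit_price=10.0) == 1080.0

    # 2. Only verifies wiring: a method in one class calls a method in another.
    def test_place_order_notifies():
        notifier = Mock()
        OrderService(notifier).place_order("order-1")
        notifier.send.assert_called_once_with("order-1")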

We must be careful not to invest too much time in writing tests that provide little value.

The Fabled End-to-End Test

If our system is simple, we can probably get by entirely on simple tests. But when systems grow, they sprout extra limbs. Suddenly our simple web application with an RDBMS attached actually needs ten different components to operate. That adds complexity.

And remember, defects hide in complexity.

In every company I’ve worked at, there came a time when there was a complicated bug across multiple systems that was so terrible, and so devastating to our confidence in our system, that it convinced us to drop everything and look for a way to plug this hole in our testing strategy.

We had unit tests. We had functional service-level tests. But we hadn’t tested the interactions between all these systems together.

So we began to quest for the fabled “end-to-end” test.

And much like the lost city of El Dorado, we never quite found it.

End-to-end tests look something like this:

  • Push a message into a queue.
  • Wait for Service A to pick up the message and insert it into the database.
  • Check that the message is in the database.
  • Wait for Service B to notice the row in the database and push a message to another queue.
  • Check that the other queue contains a message.

And so on.
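
Written out as code, such a test might look roughly like the sketch below. The queue and database clients are assumed to be pre-configured for a shared test environment, and poll_until is a polling helper of the kind sketched a little later in this article; every name here is hypothetical:

    # A hypothetical end-to-end test against real (test) infrastructure.
    def test_message_flows_through_both_services(queue_a, queue_b, db):
        queue_a.push({"id": "msg-1", "payload": "hello"})

        # Wait for Service A to consume the message and write it to the database.
        row = poll_until(lambda: db.find_message("msg-1"), timeout=60)
        assert row["payload"] == "hello"

        # Wait for Service B to notice the row and publish to the second queue.
        forwarded = poll_until(lambda: queue_b.receive(), timeout=60)
        assert forwarded["id"] == "msg-1"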

On the surface this seems like a perfectly valid way to test our complex system. Indeed, it comes closer to describing how the system actually works in production.

But I have never seen this style of test last longer than a few months without succumbing to the same fate.

The Fate Of Too-Big Tests

Inevitably, end-to-end tests fall into disrepair and neglect. This is primarily because they raise false alarms, failing when there is no real bug, so often that their results are frequently ignored. The problem is even worse if they block a deploy pipeline.

Tests that raise false alarms become “the boy who cried wolf.” When we see them failing, we don’t take them seriously.

Thus the test code rots. Failing tests will be marked as ignored or commented out. Someone will eventually propose we take them out of the deploy pipeline and run the end-to-end tests only once in a while, basically admitting that they aren’t important.

What Went Wrong?

The problem with end-to-end tests is that they’re huge.

They cannot easily be booted up and run without significant infrastructure. That might mean a special QA environment just for the tests. It might mean provisioning some cloud resource (like an SQS queue). It might mean reworking some part of the system to be more amenable to testing.

All of this requires more code, more work, and someone to maintain it going forward. Developers are typically more focused on completing features and not on maintaining test infrastructure, so tests with complex resource requirements are rarely a priority.

And why are these tests so unreliable anyway?

End-to-end tests often involve a lot of waiting for asynchronous tasks to complete, for example, waiting for a row to be updated in a database. In most cases the only solution is polling with a timeout. Given that we tear down and boot up the whole test environment on every run, this is practically begging for tests to fail for operational reasons rather than because of bugs. That’s how we get our false alarms.
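
A typical polling helper looks something like the minimal sketch below. Note that a timeout here fails the test in exactly the same way whether the cause is a defect or just a slow, half-warmed-up environment:

    import time

    def poll_until(condition, timeout=60.0, interval=1.0):
        """Call condition() repeatedly until it returns a truthy value.

        Raises TimeoutError if that doesn't happen within `timeout` seconds.
        A genuine defect and a sluggish test environment both surface as the
        same timeout, which is how operational noise turns into false alarms.
        """
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = condition()
            if result:
                return result
            time.sleep(interval)
        raise TimeoutError(f"condition not met within {timeout} seconds")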

Too-Big Tests Aren’t Useful Even When They Work

But it gets worse.

Even if the end-to-end test worked perfectly, it’s still not that valuable. What does it mean if a test that does everything fails?

If the “do everything” test fails, developers have to spend significant time debugging to identify where in the test the defect is hiding. Remember, the whole point of testing is to identify and locate defects. If our tests don’t do that, they’re not good tests.

Whether you call your test a unit test, a functional test, or a service test, tests are always more valuable when they are smaller. It is better to have several small tests than to have one large test, assuming that tradeoff is possible.

Smaller tests are better because when they fail they tell the developer where the bug is. They say, “hey, there’s a bug in the Foo service’s run endpoint!”

A big, unwieldy test mumbles under its breath, “hey, uh, I think there’s a bug somewhere. But I don’t know where.”

A Better Large Test

Tests that span service boundaries can be great. But we have to zero in on what they are testing and what they are not testing.

For example, service level tests are great for checking a schema. If we have a contract in Service A that an endpoint should return schema X, then we can have another test for Service B that assumes this schema (for example, using a stubbed function call). Do we really need to test that the two services can talk to each other over HTTP? Probably not. HTTP is reliable enough.

By taking the operational components out of the test, we can get the value of a test that spans service boundaries without all that complexity.
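
Here is a rough sketch of that split; the schema, the Service A response builder, and the Service B handler are all hypothetical stand-ins:

    # The shared contract: Service A's status endpoint returns this shape.
    STATUS_SCHEMA = {"id": str, "state": str}

    def matches_schema(payload, schema):
        return set(payload) == set(schema) and all(
            isinstance(payload[key], expected) for key, expected in schema.items()
        )

    # Hypothetical Service A code that builds the endpoint's response body.
    def build_status_response(job_id):
        return {"id": job_id, "state": "done"}

    # Hypothetical Service B code that consumes such a response.
    def handle_status(payload):
        return payload["state"]

    # Service A side: the response it produces honors the contract.
    def test_service_a_response_matches_schema():
        assert matches_schema(build_status_response("job-1"), STATUS_SCHEMA)

    # Service B side: its handler works against a stubbed response that follows
    # the same contract. No real HTTP between the services is involved.
    def test_service_b_handles_stubbed_response():
        stubbed = {"id": "job-1", "state": "done"}
        assert matches_schema(stubbed, STATUS_SCHEMA)
        assert handle_status(stubbed) == "done"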

Now when the tests fail, we know what they’re trying to communicate to us!

The Ideal Test

Taking all of this together, we can say a few things generally about “good” tests.

  1. They don’t raise false alarms. When they fail, they identify the presence of a real defect.
  2. They are small and bounded. When they fail, they tell the developers where the defect is hiding.
  3. They are testing some code that we wrote, not third-party code.
  4. They are things we would prioritize testing by hand if we had the time.
  5. They are targeting complex parts of our system where bugs are likely to hide.

Note that all of these points help us to target our goal of identifying and locating defects in the system.

Of course, terms like “small” are entirely subjective, and there are cases where bigger or smaller tests are appropriate. You may adjust the definition of “small” according to circumstances, and the principles still hold.

A Common Theme

A common theme tying together all of these observations from my experience is that testing is often approached with an overzealous mindset.

We have grand visions of a testing approach that tests everything that could go wrong in the entire system. Every line of code is “covered” by our testing framework. Our end-to-end test spans the entire system because defects might occur anywhere.

This shows that we are focusing on the idea of testing rather than the practical outcome that we want. Remember, we want to identify and locate defects.

The Limitations Of Testing

I’m not saying we shouldn’t write tests. We should. We must. We shall.

But tests are just one tool in the arsenal of software engineers to manage the quality of our systems. And they are a fallible and limited tool.

Automated tests carry a significant maintenance cost. Someone has to keep our testing infrastructure up and running. Tests are just code, and we will have to change them as our system evolves. They cost real hours of effort that could be devoted to project work. And test code itself will inevitably contain bugs that may conceal defects in our working system.

Tests must also be limited in scope, or they lose their usefulness in locating defects. So we can’t simply write giant, system-spanning tests in incredible detail and expect that our developers will find them useful.

So what can we do? Is it possible that there is something better than testing that we should be doing?

As it so happens, yes. Yes, there is.

Code Review: Better Than Testing

Nothing is better than code review for finding issues in software systems. A code reviewer is not a program. They are a real human being, possessing intelligence and context for the whole project, who will look for problems we never even considered.

A good code reviewer will find complex multi-threading issues. They will find slow database queries. They will find design mistakes. They will even identify bad automated tests that are disguising defects.

Even better, they will suggest improvements in the code that go beyond finding defects!

So why is there such emphasis on testing, with innumerable frameworks and tools released, and so little on code review?

Perhaps it is because code review is a nebulous process that depends in large part on the efforts of individual contributors. And we in the software industry have a nagging tendency to downplay the importance of human intelligence and ingenuity in the development of our systems.

Conclusion

As I will explore further in future articles, I believe that the software industry often suffers from overemphasizing technological solutions over solutions that rely on human intelligence.

Technological solutions are great for amplifying human intelligence and ingenuity. Automated tests, for example, save us the trouble of manually testing our entire product. Thank goodness we don’t live in that hypothetical world without them!

But we need those intelligent, thoughtful, and conscientious engineers at our side to look for problems in our work and to gently challenge us to perfect our designs. Indeed, this is often the easiest way to catch serious defects in our systems.

Automated testing should always be a complement to our collaboration with colleagues, not a replacement for it.