Automated Test Coverage


Automated testing appears to be essential for delivering high-quality, robust code, especially in an agile environment. It's pretty clear that high-performing teams rely heavily on automated testing, and low-performing teams don't.

However, it's not clear that testing is the cause of robust software. High-performing teams would probably deliver good code even in the absence of automated testing, and many teams write a lot of automated tests yet don't produce good products.

So, what gives? What do we really know about testing, and how can that inform our testing strategies?

First, consider what automated testing doesn't do.

Automated tests don't...

Code coverage doesn't equate to execution-path coverage. This point is completely overlooked by teams that pursue 100% code coverage: 100% coverage only means that every line is executed at some point; it says nothing about how many paths through the code have been tested.

Testing doesn't find errors in code paths that aren't tested. This is a corollary to the first observation: a different execution path can put the program into a state where a line that already has test coverage fails. Having 100% test coverage and passing all tests doesn't mean that no bugs exist.
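
To make both points concrete, here is a minimal, hypothetical Python sketch: a single test executes every line of normalize(), so a coverage tool reports 100%, yet the path where the input sums to zero is never exercised and hides a crash.

    # Hypothetical example: one test yields 100% line coverage of normalize(),
    # but the path where the values sum to zero is never exercised.

    def normalize(values):
        total = sum(values)                  # executed by the test below
        return [v / total for v in values]   # also executed -> 100% line coverage


    def test_normalize_happy_path():
        # Every line of normalize() runs, so line coverage reports 100%...
        assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]

    # ...yet normalize([]) or normalize([0, 0]) still raises ZeroDivisionError,
    # because that execution path was never tested.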

Testing doesn't substitute for good technical design. Good design results in systems that are cleanly composed of distinct, single-purpose components. Testing a well-designed system is easier because single-purpose components are easier to unit test. Testing a well-composed subsystem is easier because clean composition is easier to reason about.
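
As a small, hypothetical illustration of why that matters for testing: when a pricing rule is a pure, single-purpose function, its unit test needs no database, network, or mocks.

    # Hypothetical example: a pure, single-purpose function is trivial to
    # unit test because it has no I/O or hidden state.

    def order_total(prices, tax_rate):
        """Return the total cost of an order including tax."""
        subtotal = sum(prices)
        return round(subtotal * (1 + tax_rate), 2)


    def test_order_total_applies_tax():
        assert order_total([10.00, 5.00], tax_rate=0.10) == 16.50

    # If the same arithmetic were buried in a function that also read the cart
    # from a database and wrote an invoice, the test would need fixtures and
    # mocks before it could check one pricing rule.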

Testing doesn't substitute for good product design. If the product doesn't meet the customer's needs (whether that customer is internal or external), the automated tests are irrelevant. There is no additional value in claiming that software no one uses works just as it's supposed to.

Now, consider what automated testing does do.

Automated tests do...

Document expected behavior. When developers look at source code, it's not uncommon to have questions about what the code is intended to do. Having test code for reference goes a long way to answering those questions, and having a test to kick off a debugging session is super useful. Working tests are much better than out-of-date documentation.
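
As a hypothetical sketch, a well-named test can answer "what is this code supposed to do?" faster than a wiki page:

    # Hypothetical example: the test reads as a statement of intended behavior,
    # so a developer browsing the code learns the rule without extra docs.

    def apply_discount(price, loyalty_years):
        """5% off per loyalty year, capped at 25%."""
        discount = min(loyalty_years * 0.05, 0.25)
        return round(price * (1 - discount), 2)


    def test_discount_grows_with_loyalty_but_caps_at_25_percent():
        assert apply_discount(100.0, 1) == 95.0   # 5% per year...
        assert apply_discount(100.0, 10) == 75.0  # ...capped at 25%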

Confirm that test operations are repeatable. This should be obvious, but developers frequently perform manual checks on systems. The problem with manual checks is that they aren't reliably repeatable: humans err, get sick, or forget, and manual skills don't transfer. There's really nothing as fun on a holiday as a support call where someone on vacation tries to explain to someone on-call which commands to run to test something.
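
A hedged sketch of the same idea: the manual "hit the health endpoint and eyeball the response" ritual becomes a check anyone (or any scheduler) can run. The URL and the expected payload are assumptions.

    # Hypothetical example: an automated version of a manual health check.
    # The URL and expected payload are placeholders for your own system.
    import json
    import urllib.request


    def check_service_health(url="https://example.internal/healthz"):
        """Return True if the service reports itself healthy."""
        with urllib.request.urlopen(url, timeout=5) as response:
            payload = json.load(response)
            healthy = response.status == 200 and payload.get("status") == "ok"
        return healthy


    def test_service_reports_healthy():
        # Runs the same way on a laptop, in CI, or from a scheduler --
        # no one has to remember the right commands on a holiday.
        assert check_service_health()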

Establish a snapshot of functionality. Once a set of passing tests exists, you have a snapshot of the current baseline of functionality. As changes are made, the developer can reliably validate that the baseline is still met. This makes adding features and refactoring much more reliable and less time-consuming.
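
One common way to capture such a baseline is a characterization (or "golden file") test. The sketch below assumes a hypothetical render_report() function and a reviewed, checked-in expected output.

    # Hypothetical characterization test: pin the current output of
    # render_report() to a checked-in "golden" file so that any behavior
    # change during refactoring shows up as a diff.
    from pathlib import Path


    def render_report(data):
        return "\n".join(f"{name}: {value}" for name, value in sorted(data.items()))


    def test_report_matches_golden_file():
        golden = Path(__file__).parent / "golden" / "report.txt"
        actual = render_report({"errors": 0, "requests": 1042})
        # First run: write and review the golden file by hand.
        # Later runs: any difference means the baseline has changed.
        assert actual == golden.read_text()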

Your tests should...

Given the prior observations about what testing does and doesn't accomplish, what goals should your testing have? Three goals are provided below.

Exercise key behaviors. Not all code in a repository is equal. Start by testing the most central functionality first and expand from there; don't simply list methods or functions to test and work down the list. When you feel that a test is busy-work as opposed to providing useful information, you shouldn't write it. If that means your test coverage is low, consider removing code before adding more tests.

Provide developer confidence. The meta point about testing is that it enhances developer productivity in two ways: it provides observational understanding of the code, and it prevents regressions as the code is changed. This additional productivity can then be reinvested in improving both the code and the product. If you have code coverage but low confidence in code behavior, add more tests.

Provide operational confidence. The end state is that systems perform reliably in production. Developers often partition testing (unit tests and integration tests) from monitoring; monitoring is what the SREs do. But monitoring is really just testing production. If the system is unreliable, everyone will be miserable and nothing else will matter. When allocating time to the team for testing, ensure that monitoring production behavior is given its full due and that telemetry acquisition is built into the system.
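
In that spirit, a monitoring probe can be written like any other test, just pointed at production and run on a schedule. This is a sketch only; the endpoint, SLO threshold, and metric sink are all assumptions.

    # Hypothetical synthetic probe: "a test that runs against production."
    # The endpoint, SLO threshold, and metric sink are placeholders.
    import time
    import urllib.request


    def probe_search_endpoint(url="https://example.com/api/search?q=ping"):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as response:
            ok = response.status == 200
        latency_ms = (time.monotonic() - start) * 1000
        return ok, latency_ms


    def emit_metric(name, value):
        print(f"{name}={value}")  # stand-in for a real telemetry client


    if __name__ == "__main__":
        ok, latency_ms = probe_search_endpoint()
        emit_metric("search.probe.ok", int(ok))
        emit_metric("search.probe.latency_ms", round(latency_ms, 1))
        # An alert fires when ok drops to 0 or latency_ms exceeds the agreed SLO.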

For the quants…

There are those who mistakenly believe that if you don't assign numbers you can't make progress. Then there is the more realistic issue of needing a number to prove you're done, so that you can get paid or tell a pointy-haired boss to go away for a while. In these cases, what numbers do you assign?

For code coverage, the numbers Google uses are pretty good, with a caveat (and they offer plenty of their own). The Google numbers are 60% as "acceptable", 75% as "commendable", and 90% as "exemplary." The key caveat is that none of the tests should be the result of busy-work; without that, people will pursue 90% just to claim to be exemplary. And to reiterate, if you have 60% code coverage and genuinely feel that you are getting all you need from your tests, you should refactor out the 40% of code that's just lying around, apparently without value.
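
If you do adopt a number, enforce it mechanically rather than by inspection. A minimal sketch, assuming pytest with the pytest-cov plugin and a package named myproject (both names are placeholders):

    # Hypothetical CI gate: fail the build below the "acceptable" 60% line,
    # assuming pytest and the pytest-cov plugin are installed.
    import subprocess
    import sys

    result = subprocess.run(
        ["pytest", "--cov=myproject", "--cov-fail-under=60"]
    )
    sys.exit(result.returncode)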

For monitoring, coverage should be 24x7, and you should initially pursue the top 10% to 30% of system interactions, depending on your usage statistics. System usage tends to follow power-law distributions (like the 80/20 Pareto principle). If you have an API, it's likely that 90% of the traffic comes from 10% of the endpoints, or something even more skewed, like 98% of traffic from 2% of the endpoints. For most organizations, once about 95% of system activity is monitored, additional monitoring work only breaks even. At larger scales, marginal coverage is obviously more valuable. Either way, you should estimate the likelihood of an outage and its expected duration to compute a cost of outage, and balance that against the cost of additional monitoring work to generate a break-even point for monitoring coverage.
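
A back-of-the-envelope sketch of that break-even calculation; every number below is an illustrative placeholder, not a benchmark.

    # Hypothetical break-even estimate: is another increment of monitoring
    # coverage worth its cost? All inputs are illustrative placeholders.

    def expected_outage_cost(outages_per_year, avg_hours_per_outage, cost_per_hour):
        return outages_per_year * avg_hours_per_outage * cost_per_hour


    def worth_adding_monitoring(cost_reduction_fraction, monitoring_cost_per_year,
                                outages_per_year=4, avg_hours_per_outage=2,
                                cost_per_hour=10_000):
        """True if the expected savings exceed the added monitoring cost."""
        baseline = expected_outage_cost(outages_per_year, avg_hours_per_outage,
                                        cost_per_hour)
        savings = baseline * cost_reduction_fraction
        return savings > monitoring_cost_per_year


    # 4 outages/yr x 2 hrs x $10k/hr = $80k expected annual outage cost.
    # If better monitoring should cut that by 25% ($20k) but costs $30k/yr
    # to build and run, it doesn't pay for itself yet.
    assert worth_adding_monitoring(0.25, 30_000) is False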

For integrations, perspective matters. If you are providing a service, your coverage is going to look like monitoring coverage. If you are consuming the service, your coverage will look like typical code coverage, except that you want every integration point to have some coverage, which will likely push coverage of the integration code a bit higher than your overall codebase number.

For UIs, it's a bigger challenge. UIs are the perfect example of the disconnect between coverage and behavior: it's often easy to achieve 100% coverage for a component and still have visual breakage at CSS breakpoints. UI components should behave correctly at mobile, tablet, and desktop breakpoints for their most common uses, i.e., treat them like any other system service. Automated visual verification has definite limitations, but even with its limitations, it can give you confidence that everything is working pretty well.
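
A hedged sketch of breakpoint checks using Playwright's Python API; the URL, viewport widths, and file names are assumptions.

    # Hypothetical visual check: render the same page at mobile, tablet, and
    # desktop widths and capture screenshots for comparison against approved
    # baselines. The URL and widths are placeholders.
    from playwright.sync_api import sync_playwright

    BREAKPOINTS = {"mobile": 375, "tablet": 768, "desktop": 1280}

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for name, width in BREAKPOINTS.items():
            page.set_viewport_size({"width": width, "height": 900})
            page.goto("https://example.com/checkout")
            page.screenshot(path=f"checkout-{name}.png", full_page=True)
        browser.close()
    # A visual-diff step (or a human) then compares these against approved baselines.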

Additional Resources

Google: https://testing.googleblog.com/2020/08/code-coverage-best-practices.html

StackOverflow: https://stackoverflow.com/a/2609575/2673504

Microsoft (.NET): https://docs.microsoft.com/en-us/dotnet/core/testing/unit-testing-best-practices

DatadogHQ: https://www.datadoghq.com/blog/test-creation-best-practices/