Testing Habit - Right Wrong Answer
Your tests can fail in all sorts of ways. In all likelihood, they can fail in far more ways than they can pass. Our goal when writing our tests is to gather enough confidence that our production code does what we say it does. As much as confidence is eroded when tests don’t catch errors, it erodes too when tests fail for unexpected reasons.
Sometimes, these failures are completely out of our control. But other times, it’s the way we wrote the test or designed the system (or failed to design it) that causes the surprise failures. That is, failures may come from over-assertion (testing too much) or from indirect module coupling (an architectural smell).
This is why it’s really important to “Read the error” when a test fails. Ensure the outcome is what you expected before proceeding, else you may “fix” the wrong thing. Further, “calling your shot” can help make sure a test is passing or failing correctly. When your tests are failing, make sure they’re failing with the right (correct) wrong answer.
Examples
Database ordering
This situation is so common, it warrants its own callout. If your product doesn’t care about item order from some interface or API, don’t assert on it. Nota bene: your product probably does care about it, so conversely, make sure ordering is defined.
Let’s say we’ve got an API endpoint that returns widgets.
class WidgetRepo {
  save(props: { name: string }) {
    // db insert logic; returns the saved record (including its id)
  }

  findAllMatchingName(name: string | undefined) {
    if (name) {
      return db.query("SELECT * FROM widgets WHERE name ILIKE ?", `%${name}%`);
    } else {
      return db.query("SELECT * FROM widgets");
    }
  }
}

app.get("/widgets", async (req, res) => {
  const widgets = await widgetRepo.findAllMatchingName(req.query.name);
  return res.json({ widgets });
});
You may decide to write a test for this that asserts on the exact widgets returned, maybe by ID. We can start by inserting some widgets, making the API call, and asserting on the response.
test("returns widgets matching the name", async () => {
const whizbang = await widgetRepo.save({ name: "whizbang" });
const sizzlewhoop = await widgetRepo.save({ name: "sizzlewhoop" });
const _benderfloss = await widgetRepo.save({ name: "benderfloss" });
const resp = await(await fetch('/widgets?name=wh')).json()
expect(resp.widgets.map(widget => widget.id)).toEqual([whizbang.id, sizzlewhoop.id]);
})
This test would likely pass many times. Probably enough times to make it into production. But, eventually, it will fail. Can you spot the error?
That’s it: the queries aren’t specifying an order. Eventually, the failure will show that we got the correct IDs, just flip-flopped. This isn’t telling you the endpoint is “broken”, but rather giving you a clue about your feature design - this is the right wrong answer.
This usually happens due to database ordering being non-deterministic, or at least distant from the test.
For instance, you might have an implicit order on id, which works great for numeric IDs, but less so for UUIDs.
Regardless of the underlying technical “flakiness”, the test is failing correctly because the implementation hasn’t specified a deterministic answer.
There are two primary fixes, as we called out at the beginning. The right one depends on your use case, but likely, you actually do care about the order. In that situation, it’s actually awesome that your tests failed correctly and gave you product feedback! The necessary change, then, would be to fix the implementation to guarantee the sort order. For our example, that might mean sorting alphabetically, on name. This also requires updating the assertion to match the new product requirement.
class WidgetRepo {
  // elided for brevity
  findAllMatchingName(name: string | undefined) {
    // ...
    return db.query("SELECT * FROM widgets WHERE name ILIKE ? ORDER BY name", `%${name}%`);
    // ...
  }
}
// then in the test
expect(resp.widgets.map(widget => widget.id)).toEqual([sizzlewhoop.id, whizbang.id])
Alternatively, you can make your test resilient to database ordering instead. That looks like picking a deterministic order inside the test itself to ensure predictability.
// in the test
const byName = (a, b) => a.name.localeCompare(b.name);
const widgets = [...resp.widgets].sort(byName);
expect(widgets.map(widget => widget.id)).toEqual([sizzlewhoop.id, whizbang.id]);
There also exist libraries and tools that provide this kind of matching for you.
For example, vitest supports this via the assert.sameMembers API from chai.
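For our example, that could look something like this (a sketch, assuming the same response shape as above):
import { assert } from "vitest"; // vitest re-exports chai's assert

// passes as long as both lists contain the same IDs, in any order
assert.sameMembers(
  resp.widgets.map(widget => widget.id),
  [whizbang.id, sizzlewhoop.id]
);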
Either way you decide to fix this issue, the test has successfully forced a conversation. Change the test, make it fail correctly, and you’re on your way.
It’s worth mentioning that while this example uses raw SQL (like it or not, people do it), most of the times I’ve seen this crop up, it’s been with an ORM.
Mock function call assertions
Riffing on the theme of incorrect list-ish assertions, let’s take a quick look at mocks. Another place where tests fail in surprising ways is when verifying mock class/function calls.
Oftentimes, I’ll see tests that make unnecessary assertions on the number of calls to a mock. That could look something like this:
from unittest.mock import Mock

def test_foo_service_calls_api():
    api = Mock()
    api.fetch_users = Mock(return_value=[{"id": 1}, {"id": 2}])
    foo_service = FooService(api=api)

    assert foo_service.user_count() == 2
    assert api.fetch_users.call_count == 1
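For context, here’s a hypothetical sketch of what FooService might look like (the real implementation is whatever your codebase has):
# hypothetical implementation, just to ground the example
class FooService:
    def __init__(self, api):
        self.api = api

    def user_count(self) -> int:
        return len(self.api.fetch_users())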
The last assertion (call_count == 1) is superfluous - it offers no value (in this test at least).
It’s purely an implementation detail that can cause this test to fail for unexpected reasons.
Additionally, these kinds of tests tend to be copy-pasted, further increasing the brittleness of the test suite.
In most cases, the input/output of this test is good enough - just test that the result is correct.
Imagine what happens when a developer needs to tweak the FooService#user_count method to make multiple API calls.
This could happen for all sorts of reasons.
Maybe user_count now takes a name_like parameter.
If the API doesn’t expose a direct mapping between name, first_name, & last_name, you may have to fetch_users multiple times to satisfy the requirement.
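For illustration, a hypothetical version of that change (the field names and the query-both-fields strategy are invented):
# hypothetical change: the API can only filter one name field at a time
class FooService:
    def __init__(self, api):
        self.api = api

    def user_count(self, name_like=None) -> int:
        if name_like is None:
            return len(self.api.fetch_users())
        first = self.api.fetch_users(first_name=name_like)
        last = self.api.fetch_users(last_name=name_like)
        # de-duplicate users that match on both fields
        return len({user["id"] for user in first + last})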
Regardless of why the change is needed, when the developer makes it, the call_count assertion will fail - correctly, but for reasons that don’t matter. The test never cared how many API calls were made, so it’s now giving a “technically correct” failure with no value. Removing the assertion makes the test more robust, clear, and focused.
Suppose, though, you needed this assertion to guard against excessive API usage or rate-limit consumption. I would argue it’s better to have a separate test that explicitly calls out that case. This achieves several goals:
- If the implementation changes to need more than one API call, only one test fails
- The test has a descriptive name, communicating failures as specification breaches
A reworked version of the test would look like this:
import pytest
from unittest.mock import Mock

@pytest.fixture
def api() -> Api:
    api = Mock()
    api.fetch_users = Mock(return_value=[])
    return api

def test_foo_service_returns_user_count(api: Api):
    api.fetch_users = Mock(return_value=[{"id": 1}, {"id": 2}])
    foo_service = FooService(api=api)

    assert foo_service.user_count() == 2

def test_foo_service_makes_minimal_api_calls(api: Api):
    foo_service = FooService(api=api)

    foo_service.user_count()

    assert api.fetch_users.call_count == 1
These tests are cheap enough that having multiple, even though they do something similar, is worth it. The test descriptions serve as granular documentation of system behavior (i.e. specification).
Additive changes should (rarely) break tests
Another smell that tweaks my nose is when a test breaks because I added something to a return value. This is a violation of the Robustness principle, also known as Postel’s law.
be conservative in what you send, be liberal in what you accept
For our tests, the general idea is that additive changes to our systems shouldn’t break existing tests.
For this one, let’s say we’re building a small statistics utility.
It accepts an array of numbers and returns various figures describing the series (average, mode, etc) as a dict.
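A minimal sketch of what such a stats module might contain (hypothetical; the real implementation doesn’t matter for the point):
# stats.py (hypothetical implementation backing the tests below)
from statistics import mean, multimode

def compute(series: list[float]) -> dict:
    if not series:
        return {"mode": None, "average": None}
    return {"mode": multimode(series), "average": mean(series)}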
Tests for this can be quite straightforward.
def test_statistics_returns_average_and_mode():
    assert stats.compute([1, 2, 3, 3, 4, 5]) == {"mode": [3], "average": 3}

def test_statistics_with_empty_returns_None_average_and_None_mode():
    assert stats.compute([]) == {"mode": None, "average": None}
But now say we want to add a new field to the stats dict, median.
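Continuing the hypothetical sketch from above, the change itself is tiny:
# stats.py, updated to include the median
from statistics import mean, median, multimode

def compute(series: list[float]) -> dict:
    if not series:
        return {"mode": None, "average": None, "median": None}
    return {"mode": multimode(series), "average": mean(series), "median": median(series)}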
That will break all our tests!
AssertionError: { "mode": [3], "average": 3 } != { "mode": [3], "average": 3, "median": 3.0 }
This failure highlights a wrong wrong answer. The tests have shown you a “technically” right answer, but to the wrong question - “are these objects identical?” If we change the question, we can get closer to what we intended - “are the statistic calculations correct?”
In general, the same idea applies here as in the mock example above - think of it as the Single Responsibility Principle (SRP) for a test. If you don’t care about the full result structure, don’t assert on it. If you do care about the full result structure, test the structure once.
The same argument can apply to testing the mode and average. Currently, every test verifies both stats - SRP might lead you to separate those out, too.
def test_statistics_returns_stats_dict():
    result = stats.compute([])

    assert result.keys() == {"average", "median", "mode"}

def test_statistics_computes_average():
    average = stats.compute([1, 2, 3])["average"]

    assert average == 2.0

def test_statistics_computes_mode():
    mode = stats.compute([1, 2, 3, 3])["mode"]

    assert mode == [3]
If your structure is big enough, or your test is for something like an API response, it may be valuable to use a more robust verification mechanism, like JSON schema.
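A quick sketch of what that could look like, assuming the third-party jsonschema package is available:
from jsonschema import validate  # third-party package, assumed installed

STATS_SCHEMA = {
    "type": "object",
    "properties": {
        "average": {"type": ["number", "null"]},
        "median": {"type": ["number", "null"]},
        "mode": {"type": ["array", "null"]},
    },
    "required": ["average", "median", "mode"],
    "additionalProperties": False,
}

def test_statistics_response_matches_schema():
    # raises ValidationError on mismatch, which fails the test
    validate(instance=stats.compute([1, 2, 3]), schema=STATS_SCHEMA)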
Conclusion
No matter what you’re testing, be on the lookout for tests failing for the wrong reason. When you see tests failing (or passing) with the incorrect result, they’re trying to tell you something. What would need to be true for your tests to fail correctly? To make them fail with the right wrong answer?
Listen to the feedback your tests are giving you. Your architecture, coworkers, and future self will thank you.