Hacker News
The argument against clearing the database between tests (2020) (calpaterson.com)
86 points by stackbutterflow on Jan 2, 2022 | hide | past | favorite | 86 comments


Truncating the DB between every test is indeed horrifically slow. However it's much faster to wrap the test in a transaction and roll it back at the end. Transaction based cleaning also allows parallel testing. That mostly leaves the argument of not writing tests that rely on the state of the database being clean. I have mixed feelings on this one.

Just last week I opened a PR to fix some tests that should not have been passing but were due to an issue along these lines. The tests were making assertions about id columns from different tables and despite the code being incorrect the tests were passing because the sequence generators were clean and thus in sync. The order in which the test records were created happened to line up in the right way that an id in one table matched the id in another table.

So, I get the pain. But I'm not yet convinced it's worth a change.

Another option that I think isn't a bad approach is the default testing setup that Rails uses. Every test runs in a transaction but the test database is also initially seeded with a bunch of realistic data (fixtures in Rails lingo). This makes it impossible to write a test that assumes a clean database while also starting every test with a known state.


You must have this backwards.

Truncating a table is extremely fast. Rolling back a transaction is very slow. If you're not seeing this then there's something wrong with your setup.


(Not grandparent commenter) I think you're usually right but I doubt it makes a difference at the scale of 2-5 objects created in a test case. The big game changers IME are in-memory dbs (SQLite) or parallel execution of tests.

This idea of "transaction rollback in test teardown because performance" has a life of its own. The recommended base class for django unit tests (informally recommended, via code comments, not actual docs) uses transaction rollbacks instead of table truncation [0].

On top of this, I think, db truncation gets mixed up with table truncation sometimes too. For example, from OP:

> The time taken to clean the database is usually proportional to the number of tables

... only if you're truncating the whole db and re-initializing the schema, no?

And people sometimes actually do clear the whole db between tests! One unfortunate reason being functionally necessary data migrations that are mixed up with schema-producing migrations, meaning truncating tables doesn't take you back to "zero".

[0]: https://docs.djangoproject.com/en/2.2/_modules/django/test/t...


> ... only if you're truncating the whole db and re-initializing the schema, no?

Nope. In PostgreSQL the cost of truncating tables is proportional to the number of tables while doing a rollback is constant time (and a low constant at that, less than a commit for example).

In other databases like MySQL I believe truncating data is still proportional to the number of tables while rollback is proportional to the amount of data inserted or updated. So which is cheaper depends on your application.


This is definitely not true in PostgreSQL. Rolling back a transaction just requires writing a couple of bytes, whereas truncating requires taking a lock on every table you want to truncate, probably writing more data per table than the rollback would, and then committing, which is itself more expensive than a rollback.


Not with PG. A couple of weeks ago I was working on a project that used truncation cleaning; the test suite took 3m40s. I switched it to cleaning with transactions and the test suite ran in 5.8s.

Truncation cleaning is extremely slow, not only because the cleaning is slower but because you actually have to commit everything your test code does.


I used to effectively do this in postgres with rsync on a known fixture snapshot of the data files. It would usually take under two seconds to reset the state and restart the servers, which was easily fast enough to do effective TDD.

I had a few other ideas to speed it up, also.


There's also creating a template database that exists in a known good state, and using it as the template for CREATE DATABASE.

https://www.postgresql.org/docs/current/manage-ag-templatedb...


Unless the test is checking performance, another option might be to start up multiple instances of the database, and run many tests concurrently each on identically prepopulated separate databases.


> Truncating the DB between every test is indeed horrifically slow

Using a database at all in unit tests is horrifically slow - one of the (many) reasons you shouldn’t.


They’re for integration tests, not unit tests. Although purists frequently treat the distinction as something meaningful, I only use it to distinguish conceptually how many complex layers are being stacked, since both usually run under “unit test frameworks” for reporting and assertion purposes. I view mocking as usually an anti-pattern. Careful DI usually gets you far enough and is easier to work with. You want the code under test to resemble what’s happening in production as much as possible. The more “extra” you have, the more time you’re wasting maintaining the test infrastructure itself, which is generally negative value (your users don’t care that the feature was late because you were refactoring the codebase to make each function easier to test in isolation).

Empty databases should generally start quickly unless there’s some distributed consensus that’s happening (and even then, it’s all on a local machine…). You also don’t even need to tear it down all the way - just drop all tables.


Ultimately what matters is the entire application, database queries and all, works. I think calling out to the DB in tests is important for ensuring the entire app works.

And tests hitting the database can be fast: https://www.brandur.org/nanoglyphs/029-path-of-madness


This is true. I think however this is not relevant to unit tests. If you choose not to do unit tests, because they're not valuable in your software compared to automating end to end tests, that's fair enough, but on the topic of unit tests, talking to a db isn't really a thing.


Transactions / savepoints and parallelism make a huge difference. I have an app using Ecto and PostgreSQL, and running its ~550 tests takes under 5 seconds. Almost all of them hit the DB many times. The DB is empty and each test starts from a blank slate, inserting any fixture it needs.

An important trick when doing this is to respect unique constraints in fixtures. For instance, if you have a users table with an email column as primary key, make the user fixture/factory generate a unique email each time ("user-1@example.com", "user-2@example.com", ...). Then you don't get slowdowns or deadlocks when many tests run in parallel.
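A minimal sketch of such a counter-based factory (the names here are made up; libraries like factory_bot or ExMachina give you the same thing with sequences):

```python
import itertools

# Module-level counter shared by every call to the factory, so no two
# fixtures in the same test run ever collide on the unique email column.
_seq = itertools.count(1)

def user_fixture(**overrides):
    n = next(_seq)
    user = {"email": f"user-{n}@example.com", "name": f"User {n}"}
    user.update(overrides)  # let a test pin specific fields when it cares
    return user

a, b = user_fixture(), user_fixture()
assert a["email"] != b["email"]  # never violates the unique constraint
```

For truly parallel OS processes (rather than threads/transactions in one runner) you would fold a worker id into the counter, since a plain module-level counter is only unique within one process.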


One supposes "horrifically slow" might be a bit subjective.

I notice in a VM on my laptop establishing the initial connection to postgres seems to take 2-3ms, and running a trivial query takes 300-1000us.

I routinely involve the database in unit tests, it is certainly slower but my primary concern is the correct behavior of production code which uses real databases.


If testing using the db is slowing you down that means the test has discovered slow code, and worked, not that you should get rid of the test.


It depends on what is under test. If you're testing a model file that is highly coupled to the database, and whose entire purpose is more or less to function as an interface to the DB, tests need to include the DB almost by necessity. The alternative is to mock so much out that you're essentially testing your mocked code more than the unit under test.


What is the purpose of automated testing? Is it to ensure the code works correctly or is it to "run fast"?


To be able to say to your bosses that you have 100% code coverage.

No, I agree. Hitting the database is slower but not that much slower (at least if you use PostgreSQL and do rollback after every test). And since the goal is correctness I think that this performance hit is small enough to be worth taking.


> to ensure the code works correctly

It's to ensure the code works correctly and indicate where the problem is, whenever it doesn't work correctly. Querying a live database during unit tests fails on both accounts. It doesn't tell you whether or not the code works correctly - it tells you either that the code didn't work correctly or that the database wasn't available at the time the test ran.


Well both things are problems which is nice to know about so you can fix them. Certainly better than not knowing about either problem.


Most tests should test only one aspect of the system and that aspect is most often correctness. Dirty databases make the tests non-deterministic.

You can use a pre-seeded database for testing. SQLite can be used sometimes for instance.

Create load tests to test the performance.

Data can and sometimes should be reused for a specific set of tests. This is needed for browser tests for instance.

As a person who has worked on systems with thousands of tests, I cannot stress enough that there should never be any randomness added to automated tests. Most times it is better to add more tests than to make a single test do many things.


> As a person who has worked on systems with thousands of tests, I cannot stress enough that there should never be any randomness added to automated tests.

To take a counter position, I'm a person who has also worked on many systems with many thousands of tests. Every single test suite I've seen pulling static fixtures as test data was brittle and near worthless. Introducing randomness to the data you test against is the only way to actually test your product with near-realistic inputs. To this end, packages like Faker[0] are worth their weight in gold.

[0] https://github.com/joke2k/faker


I just had this discussion yesterday on Slack.

I think that your core test suite should be completely, 100% deterministic.

Property-based testing/random input testing/chaosmonkey/profiling/etc. are all a different type of testing to me. Still useful, but separate. Still should be run, but not every time.

You do not want non-deterministic tests. (And if you use random testing, ALWAYS record the seed so you can reproduce it!)

If the core functionality of your code is fundamentally random (I once tested a maze generator this way https://github.com/pmarreck/elixir-snippets/blob/master/maze...), use seeded random data.
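A minimal sketch of the "always record the seed" advice: take the seed from the environment when replaying a failure, otherwise pick one fresh and print it so it lands in the test log (TEST_SEED is a made-up variable name):

```python
import os
import random

# Replay a failure with: TEST_SEED=<logged value> python test.py
seed = int(os.environ.get("TEST_SEED", random.randrange(2**32)))
print(f"TEST_SEED={seed}")  # record the seed so any failure is reproducible

rng = random.Random(seed)  # all randomness flows through this one generator
data = [rng.randint(0, 1000) for _ in range(100)]

# Example property under test: sorting yields a non-decreasing sequence.
out = sorted(data)
assert all(out[i] <= out[i + 1] for i in range(len(out) - 1))
```

The key point is that every random choice goes through the seeded generator; a stray call to the global `random` module (or to the database's own defaults) silently breaks reproducibility.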


I’ve spent more time than I wanted fixing tests that used a factory for test data where the factory didn’t understand things like “all the results have to be unique”, and so would fail once in a while.

When you repeat “once in a while” 100 times, now they are failing a couple times a week.

The whole point of tests is 1) to have confidence in the code you’re shipping, and 2) to only demand attention when something is wrong. The latter is why tests have to be fast, and deterministic. Random failures and thumb twiddling both cost you things you are paying a huge chunk of theoretical developer output not to have to deal with.

It’s often been the people who don’t like tests who write the least effective ones, and sometimes it’s hard to discern lack of experience from wish fulfillment. If these were acceptable tests, I’d hate unit testing too.


Came here to say this same thing. Create a deterministic suite of tests that runs on every commit. Save the exploratory testing for another time / place. Keep them separate.


Well, the question is what you do when a test starts failing non-deterministically.

If your team respond to a random test failure by finding the root cause and fixing it, uncertain test inputs could be right for you.

On the other hand, if your team responds by hitting the 'rerun tests' button until it passes? In that case random failures are just slowing you down.


I'm not entirely following here. What is the point of even having tests if there are teams out there who disregard their outputs?

Freezing a set of parameters that always pass, in case you don't run into the failing edge case, defeats the entire point of testing in the first place. If it's one test that is failing because the test itself is defective, and if that test is not for a crucial part of the product, then skip the test in the interim and come back to it later. But giving it a happy path for success sounds really strange to me.


The correct response is to "freeze" the parameters of the failing test into a regular test which fails consistently.


With random in this case, I mean data that changes between test runs. I often use generated data sets. If you can find errors by feeding random data into your tests, the tests do not cover the code in the first place.


Automatic tests should be brittle. Their job is not to robustly complete their apparent task, regardless of how the software has broken.


You're both right.

You're just talking about different things.


> Dirty databases make the tests non-deterministic.

If this is the case, is the test really testing something useful? Tests should verify that the code works as it's intended to work. If pre-existing (but presumably valid) data can break the test, then surely there's states where whatever behaviour is being tested won't work in production?

Provided the test is creating and verifying its own data, it's not really non-deterministic except in a very technical sense.


> If pre-existing (but presumably valid) data can break the test, then surely there's states where whatever behaviour is being tested won't work in production?

Imagine I'm testing a web page that shows a list of all customers who have placed at least one order.

If I set up the test by wiping the database and adding ten customers, I can assert that the list of customers should contain those ten and no others, giving it ten rows.

If I don't wipe the database, the web page may show more than ten customers - and I can't tell if the extra entries indicate a fault in my code or not.


Sure, but in that case I wouldn't test that the list contains exactly the 10 customers I expect, since that's not what the code is supposed to do. That's a shortcut you introduced while testing. I usually approach things like this with two assertions - that the size of the list under test increased by X, and that the newly created record(s) exist in the list.

After all, the code doesn't show the 10 most recent customers that purchased an item, it shows ALL customers. You could hide a pagination bug where only the first page of results functions if you wipe the database first.
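The delta-and-membership style of assertion described above can be sketched like this (a plain list stands in for the query result; a real test would run the same two queries before and after):

```python
# Dirty state left behind by earlier tests -- the test must not assume
# the "table" starts empty.
customers = ["pre-existing@example.com"]

def create_customer(email):
    """Stand-in for the code under test that inserts a customer."""
    customers.append(email)

before = len(customers)
new = [f"c{i}@example.com" for i in range(10)]
for email in new:
    create_customer(email)

# Assert on the delta and on membership, not on the absolute count:
assert len(customers) == before + 10
assert all(email in customers for email in new)
```

The same test passes against a clean or a dirty database, because it never claims the ten new customers are the *only* rows.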


But then how do you deal with unique constraints or things like upserts? If I want to add 10 entries, then I need to know that all have been added and not pre-existing, or else the code for adding new entries doesn't actually get tested. Or you end up with spurious test errors if insertion fails due to duplicates.


> But then how do you deal with unique constraints or things like upserts? If I want to add 10 entries, then I need to know that all have been added and not pre-existing, or else the code for adding new entries doesn't actually get tested.

Then your test code needs to take into account the current state of the database when creating input data for the code under test. It is more complicated than with a known start state where you can pick fixed data, sure, but it's doable.


> Then your test code needs to take into account the current state ... [ etc ]

... and you're on your way to writing an application, not a test!

Tests should be as simple as possible, and a clean slate goes a long way toward that.

A test should be riddled with as few "needs to do" responsibilities as possible while still achieving the intended coverage.


If you want to test that “it adds X entries” then write the test to add X entries from a clean state, then add another X entries and check it is done correctly. You don’t need to rely on a dirty start state to do that.


I personally use tests for two things:

- in TDD, to make sure my code works in expected scenarios
- in CI, to make sure new code doesn't break old features

In both cases, I need to control my testing environment.

Randomizing inputs could be used in a third scenario for extensive testing, a bit like fuzzing. But it's yet another different usage.


I guess the disconnect is that I don't think of the state of the database as inputs, per se, but more the environment the test is running in. I wouldn't for instance necessarily need to set my clock to a fixed time (provided I'm not specifically testing code where that is relevant), or wipe out the hard drive that the code is running on for a test.


What if the test thinks it is creating its own data, but that has failed, and the old data is mistaken for new data?


Some tests simply cannot be deterministic because the underlying system is not deterministic (e.g. testing something that has multiple threads, perhaps some of which live in far-flung parts of the system, etc).

A dirty database is what I would call “avoidable non determinism” because you can easily clean it and you probably do not want to test handling bad databases (maybe you do though). Thread scheduling is not something you would want to work around, because you want to test a wide variety of actual production scheduling patterns to make sure there is not something like a deadlock (which also means, to get full value from the test, you probably need to run it hundreds of times and/or turn on sanitizers).

One perhaps better word you are looking for re determinism is hermeticism. A dirty database with junk from previous runs makes the test non-hermetic (not self contained). You can have a hermetic but non-deterministic test also, like telling a bunch of threads to print the ABC’s.


> As a person who has worked on systems with thousands of tests, I cannot stress enough that there should never be any randomness added to automated tests. Most times it is better to add more tests than to make a single test do many things.

Completely agreed.

To put a real cost to this: I worked at AWS on a major service. The test suite ran nightly (it required a shared environment and the tests took almost 10 hours; an argument for the importance of independence and test speed for sure, but that's a different discussion). It was also filled with spurious failures; on a good day, each team might "only" have to deal with 10 random failures, while a "bad" day might have seen 20-30.

As such, a process was put into place: each team had a dedicated "QA on-call" resource that rotated every week. The QA on-call's role was to, every day, look at the test logs and investigate the failures assigned to their team, providing an explanation of why the test failed, and certifying that it wasn't actually a problem. Due to the nature of the system, this generally took all day regardless of whether it was a good or bad day, though on bad days it might even require pulling in an extra person on your team to help.

So with 10 teams and let's call an average AWS engineer's salary $250k, you're looking at $2.5 million/year spent just due to flaky tests requiring manual investigation.

In fairness, these tests were flaky because they were integration suites using real networking, hardware, and other AWS services, so flakiness was somewhat inevitable. But intentionally introducing nondeterminism and random failures in tests would lead to the same thing.


> Dirty databases make the tests non-deterministic.

If this breaks your tests, you have a concurrency bug, pure and simple. Either in your code, or in your test assertions.

Your test assertions (like your code) should never assume that what you just put into the database is the only thing in there which will match, or the last thing that was put in there.

In my case (working in Elixir/Phoenix), it runs all the tests concurrently. This will cause any concurrency issue to crop up real quick, and that's exactly how I like it.


In my experience such tests quickly become unmaintainable: the correctness of a test suddenly depends on the order the tests are run in, which prevents you from running subsets or running them in parallel, and the suite becomes extremely flaky (since the shape of the data is random on each machine it's run on). The article discusses this but dismisses it; tests need to be as maintainable as the code itself.

I tried this on a codebase I came into that had maybe 10 end-to-end integration tests. It was a bit worse because each test relied on the one declared before it having run. I refactored it so that we spun up a unique s3rver and tore it down for each test. Sure, this means only the test cases we write are tested. That's a feature, not a bug. If you want coverage of weird states, then use proper testing tools (e.g. fuzzing and property testing) to generate those complex test cases. Doing so formally is a much better recipe for success than hoping incidental randomness stumbles onto them. I went from spending a day and a half failing to add one new unit test of some functionality to a day of pretty straightforward test-case writing once I refactored each test to be standalone. Make it easy to write and debug tests, not harder, and let your testing frameworks handle the heavy lifting of finding corner cases.

Also, converting serial tests to parallel isn’t as complicated as the author makes it sound. You would start piecemeal and let it grow, by enforcing policies in code review about what new tests look like while you work through the backlog of converting the broken ones (running the parallel pieces in parallel and the serial pieces serially).


> suddenly the correctness of the test becomes dependent on the order tests are run in preventing you from running subsets

I once wrote a test library where you could annotate tests with "TestA requires TestB" (which in turn might require TestC), and it would automatically run all the requirements for each test.

This seemed to work quite nicely.

> tests need to be as maintainable as the code itself.

Yes, if not more so. Tests that are a hassle to use won't get used, and so are useless.


In the Java ecosystem, you can do that with TestNG


If we could wave a medium-sized magic wand to make the database fast for tests (as least as fast as the proposed alternative), would we still use a single shared database for all tests? I would use a separate database per test since I don't find the arguments about realistic data shape convincing.

Back in the real world: since the database isn't fast, we can either make the test infrastructure faster or hoist the problem into the application domain by using tricks like the OP's. I'm a platform engineer at heart, so I lean towards making the infra faster in order to accelerate app development.

My hot take is that developers are too comfortable with slow test databases and we're due for an "esbuild moment". With appropriate elbow grease, some databases, like Postgres, can be quite speedy (~300ms to start a cluster for a suite and ~40ms per test). The main tricks:

- Avoid Docker. Postgres is cross platform and Docker on macOS is sloooow.

- Create a single database cluster (Postgres lingo). Initialize a fresh database in the cluster with the seed data. Use that database as a template for all future tests. Creating from a template was 12x faster than creating from scratch in a benchmark I ran.

- Use a RAM-backed file system. Using /tmp is probably good enough since it's backed by tmpfs, which should use RAM for small files.

- Disable fsync for the database.

I wrote up a more detailed take for Postgres here: https://news.ycombinator.com/item?id=28472062.
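Assuming a local Postgres install, the recipe above might be sketched like this; the commands are listed as data rather than executed here, and the data dir, port, and database names are all illustrative:

```python
# Hypothetical sketch of the fast-Postgres test setup described above.
# Running it for real requires initdb/pg_ctl/createdb on PATH.

DATA_DIR = "/tmp/pgtest"  # tmpfs-backed on distros where /tmp is tmpfs

setup = [
    ["initdb", "-D", DATA_DIR, "--no-sync"],            # skip fsync at init
    ["pg_ctl", "-D", DATA_DIR, "-o", "-F -p 54321", "start"],  # -F: no fsync
    ["createdb", "-p", "54321", "test_template"],
    # ...run migrations and load seed data into test_template here...
]

def per_test_db(n):
    # Cloning from a template was ~12x faster than creating from scratch
    # in the benchmark mentioned above.
    return f'CREATE DATABASE test_{n} TEMPLATE "test_template"'

for cmd in setup:
    print(" ".join(cmd))
print(per_test_db(1))
```

Each test (or test worker) then gets its own `test_N` database cloned from the template, so tests never share state and never need truncation.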


I'm not sure what distro you use but /tmp isn't generally tmpfs (like on Ubuntu for example), so you might want to double check before relying on that.


Nice catch, that was a neat rabbit hole [1]! I run an Ubuntu derivative so you're correct that /tmp is not mounted using tmpfs. Looks like Arch and Fedora use tmpfs and the Debian flavors do not.

[1]: https://lwn.net/Articles/499410/


At a high level: yep, completely agreed. In my experience, not cleaning the database has consistently led to more realistic tests, near-trivial parallelism, and fewer errors. The extreme few exceptions (e.g. testing db-wide behavior, like reports) are easy to put in a special folder... and people will quickly grow annoyed with how slow they are to run in comparison, and will avoid adding them. It's a nice self-preserving feedback loop.

My only recommendation / near-requirement for doing this: randomize your test order (prevents cross-test dependencies from leaking in), and consider running them twice in CI. Once with a clean DB (ensures you can bootstrap your test environments - it's easy to lose this unless you have something checking it), and once without cleaning (helps ensure you don't have imprecise tests, and just rerunning is a trivial way to get a decently-strong assertion, without having to resort to truly persistent shared databases). Or just do a weekly test with a clean db or something - bootstrapping problems are usually easy to identify and fix, it's fine if they leak temporarily.

---

If you don't want to do this, or want a magic performance hack for db-bound tests while improving things: throw your test database onto a ramdisk. It's particularly easy with a dockerized DB: just use the ramdisk as the DB's volume.


> consider running them twice in CI

that's an interesting suggestion that i have not thought about before!

And it makes sense - running a test twice doesn't take _that_ much extra time, but it can catch problems with the test itself!


Yes, this is slow because of how many database roundtrips the author is doing:

  @before_test
  def clean_session():
      for table in all_tables():
          db.session.execute("truncate %s;" % table.name)
      db.session.commit()
      return db.session
But if you instead modify your before_suite (not before_test) flow to dynamically query the table names and subsequently create/update a stored procedure in the database that has your truncate queries in it:

  delimiter //
  create procedure truncate_tables()
    begin
      truncate table1;
      truncate table2;
      truncate table3;
      truncate table4;
      ...etc...
    end //
  delimiter ;
and then calling `db.session.execute("call truncate_tables()")` ONCE in your before_test instead, it will be orders of magnitudes faster.

I improved the test suite speed 15x at my previous job with this approach. We had a lot of tests and a lot of tables. O(n^2) to O(n)... :)
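Sketched in Python, the before_suite step just builds that procedure body from the live table list; `build_truncate_proc` and the table names below are illustrative, and the generated SQL would be executed once via the session, as described above:

```python
def build_truncate_proc(table_names):
    """Generate the stored procedure that truncates every listed table."""
    body = "\n".join(f"    truncate {t};" for t in table_names)
    return (
        "create procedure truncate_tables()\n"
        "  begin\n"
        f"{body}\n"
        "  end"
    )

# In before_suite, table_names would come from information_schema.tables;
# a fixed list keeps this sketch self-contained.
sql = build_truncate_proc(["users", "orders"])
print(sql)
```

After that, each before_test is the single round trip `call truncate_tables()`, instead of one round trip per table.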


There is a much better way to reset the database: Postgres has template databases. You can clone a new database for each test like this:

CREATE DATABASE test_run_1234 TEMPLATE "vanilla_database";

It's fast. Also, truncating tables doesn't reset the database; there is a lot of "static" data initialized in my migration scripts. Rerunning the migrations would be very slow.

In CI, my template database is initialized with migration scripts at the beginning of each suite run. Locally, I reset the template database (and purge the generated test databases) by running a script manually.
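A hypothetical sketch of cloning a per-test database from the template; the SQL is real Postgres syntax, but actually executing it needs a driver (psycopg2 etc.), which is assumed and not shown:

```python
import itertools

# Each test run gets a uniquely named database cloned from the template.
_run = itertools.count(1)

def clone_from_template(template="vanilla_database"):
    name = f"test_run_{next(_run)}"
    return name, f'CREATE DATABASE {name} TEMPLATE "{template}"'

name, sql = clone_from_template()
print(sql)  # CREATE DATABASE test_run_1 TEMPLATE "vanilla_database"
```

Note that CREATE DATABASE cannot run inside a transaction block, so the clone statement has to be issued on an autocommit connection.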


Wow, that's a fantastic feature for development! I remember once we explored various approaches to resetting the DB, because resetting the DB between tests was about half the time of the test suite.

In our case the static data wasn't huge so just truncating every table then re-adding the static data was fast enough. I think originally it was doing something like a mysqldump restore.


Good article. Data & state are hugely important for testing, and it's good to see others engaging with the reality of these.

I had key responsibility for a very large software system (10M LOC) which was highly configurable; everything was data-dependent, both for product configuration and the actual accounts being tested.

Others had tried building test datasets using in-memory DBs, but the cost of maintaining these test datasets had already exceeded their value without even achieving 10% test coverage.

Rather than avoiding the database we took the architectural choice to acknowledge it, leverage the already-available standard configuration datasets, manage & minimize the degree of changes committed (test in transactions & rollback wherever possible), and manage & minimize dependence on pre-existing account state.

Within six months we were able to achieve a level of coverage & reliability where >60% of bugs were detected promptly on being introduced. Prior to this they had taken 3-5 days before end-to-end testing had confirmed detection.

Use of a database in testing is not completely easy, but sometimes it can be the most efficient solution to the problem at hand.


>assert db_session.query(User).count() == 1

>assert db_session.query(User).get(user.user_id) is not None

These aren't testing the same thing. With database clearing you are observing the count going from 0 to 1, but without clearing the database the test may still pass if the count goes from 1 to 1 instead of 1 to 2.

If you don't clear your database, the database state effectively becomes an input to the test, and it may be needed to reproduce a test failure.


Unit tests should be small, specific enough tests that they should not ever require using the database.

End-to-end tests, behavioral tests, integration tests however can involve the database as you can actually test how the application responds to requests.

I don’t know if it’s laziness, IDGAF, ignorance, stubbornness, or what, but the above test types seem to have merged recently, where we now have “unit tests” that are actually behavior tests.


Because people hear “unit test”, “integration test”, and “behavior test” and bend over backwards to DI and mock every single little thing and organize everything around that, which is a huge anti-pattern in my experience. You should DI/mock only when you have no other choice (e.g. you need to corrupt the data returned by an I/O call to test some corner condition). Generally, the less DI/mocking/supporting code you have purely for testing purposes, the simpler and, ironically, more testable your codebase actually is. The reason is that you’re actually testing the behavior of each function in a wider variety of test cases than just the ones written for that one function. You use fuzzers or property tests to generate a wide variety of test cases. The purists will argue that this makes tests brittle, because changing a core routine can suddenly cause many tests to fail or need updating. In practice I’ve observed that such updates are typically trivial and mechanical, while with mocks you end up doing the same thing anyway (change an interface and suddenly you have a bunch of mocks that need updating).

Mocks and DI can be useful. Don’t get me wrong. But use them sparingly and judiciously when they’re the right tool for the job. Recognize that they’re usually not.


Agreed. If you have a function that executes an SQL query then your unit test(s) of this function should also execute the query. Otherwise, what exactly are you testing? Just because the database runs as a separate process does not mean that this is an "integration" test.

Imagine if you had a function that did some floating point math and then you wanted to test it and I told you "hang on, this function actually utilizes the FPU which is a separate piece of hardware and therefore if you test it then you're doing an integration test.. if you want a proper unit test then you must mock the FPU and use dependency injection". Madness. Executing a database query is no different from executing a CPU instruction or calling a library function.


Very insightful. I've found the same about "using DI only to force errors / corrupt data on IO".


> Unit tests should be small, specific enough tests that they should not ever require using the database.

That's somewhat architecture dependent; in a microservice architecture, it can be quite reasonable to see the service boundary as the unit boundary, in which case a DB-involved “unit test” makes some sense.

You probably still also want fairly complete DB-independent tests that you can run automatically as you code, but the idea that the latter fully encompass unit tests is, I think, an artifact of predominantly monolithic architecture of the time the language developed, where you’d have multiple logical units in monolithic app that would then connect to a database (which even the monolith might not fully own).


The “unit” in “unit testing” is unambiguous across the industry, referring to a function or a class that you want to verify the implementation of. Trying to impose another meaning onto the word is a sure way to complicate knowledge exchange with unnecessary misunderstanding (I certainly would think twice about joining a team that invents its own bespoke language to reference common concepts).

Testing a unit of any complexity is already a can of worms; with a multitude of factors like hardware conditions, memory state, or phases of the moon causing Heisenbugs, what we need is more predictability, not less.

So if some runtime quirk at the database server, unavailable connection, etc. has the power to fail your tests, call those tests “integration”, “component”, “service”, “behavior”, “e2e”, anything else—the available vocabulary is rich enough that it doesn’t really call for overloading existing designations.


> Unit tests should be small, specific enough tests that they should not ever require using the database.

What if the code you're testing is an ORM, or something else that intrinsically implies a database?

It seems to me that tests good enough to catch a decent proportion of errors would involve a database here.

Of course, you might designate such tests "integration tests" rather than "unit tests", but I think the important thing about tests is not what you call them, it's whether they catch bugs.


What you call them is also important. Otherwise people won't know what you're talking about. You might say "it's better to do integration tests with SQLite than unit tests", but that's a different question.


> Otherwise people won't know what you're talking about.

That's a good point!


> Unit tests should be small, specific enough tests that they should not ever require using the database. > I don’t know if it’s laziness, IDGAF, ignorance, stubbornness, [...]

My impression is that most people who think along these lines just lack the technical ability to make tests that involve the database fast (or maybe even robust), or just parrot something they heard. If you want to test a service that requires DB access for most things it does as part of its core logic, why either exclusively rely on more difficult to debug end-to-end tests, or put in extra effort and bloat for worse tests with nonsense like mocking when you can just give it a database in a way that adds basically no discernible overhead to your test run?
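
One way to make such tests fast (a sketch, assuming Python and stdlib sqlite3; the schema and test are made up) is to hand every test its own throwaway in-memory database, so setup costs microseconds and tests never share state:

```python
import sqlite3
from contextlib import contextmanager

SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"

@contextmanager
def fresh_db():
    # A brand-new in-memory database per test: nothing to truncate,
    # nothing to roll back, no cross-test pollution.
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    try:
        yield conn
    finally:
        conn.close()

def test_insert_user():
    with fresh_db() as db:
        db.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
        assert db.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1

test_insert_user()
```

The same shape works with a real server-backed database by creating a schema or template database per test run, at somewhat higher cost.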


I respectfully disagree. I have found great value out of database tests that did not require the overhead of end to end tests, which I've found can be brittle and slow - especially if your service depends in some way on another application or outside piece of infrastructure.

As an aside, anytime I hear "not ever" from somebody wrt design patterns, my doubt sensors jump to alert.


As usual, "it depends." For every person complaining about unit tests using databases, there's another complaining about all the annoying mocks they need to maintain, when they could just use the real thing.


"Unit test" has become the common generic term these days, used to refer to any type of test: integration, end-to-end, or isolation.

Most people will view you as pedantic and old school for insisting on the older definition. That's exactly the feedback I kept running into, so I changed my stance on the subject to fit what people expected. At least I got them to write more tests.


Lumping a mix of test types into the category of unit tests is incredibly wrong and misguided. "Unit test" is a formal term that has been around for a long time and refers to a test that is extremely limited in scope and runs in isolation.


> So Unit test is most commonly used to refer to any type of test, integration, end to end, or isolation. Most people will view you as being pedantic and old school for insisting on using an older definition.

I understand that anything can be categorized as a unit, even a functional test. That's fine. Transactions for integration tests. I'll suffer the disdain of people who would rather produce problematic (eventually buggy) software than understand categoric differences and purpose, leading to these kinds of "questions". If that's "most people", so be it.


The old definition is still how we learned it in undergrad software engineering (taught by the youngest professor in the department, no less) in 2021. Integrating into a database, much less one with simulated real-world data, would be integration or end-to-end testing, definitely not unit. The name unit implies testing a single unit of code, not how everything runs in real world conditions.


Yeah, I wondered about that too. Lumping it all under one term is imprecise. I don't like that there are three different types of non-unit tests; personally I'd pick one and run with it, so you'd say we have unit tests and behavior tests, for example.


I'd like to follow this advice, but it falls apart when you have a service that interacts with, or worse, relies on state in, external services. Best case, you can create new test data in those services on each run as part of your tests, if they expose an API. Far more often, though, you need to prearrange test data in those external services to cover the various possible states.


Why not stub or double the external service?

For truly end-to-end tests that may be too fake, but it could help with less rigorous integration tests. Or it could be an option for frequent ongoing testing, then be turned off in favor of real 3P services for final pre-deployment tests.
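
A stub for an external service can be as small as a class honoring the same interface as the real client. Everything here (the payment gateway, `checkout`) is hypothetical, just to illustrate the shape:

```python
class StubPaymentGateway:
    """Test double that records calls instead of hitting the network."""
    def __init__(self, outcome="approved"):
        self.outcome = outcome
        self.charges = []

    def charge(self, amount_cents, card_token):
        self.charges.append((amount_cents, card_token))
        return {"status": self.outcome}

def checkout(gateway, amount_cents, card_token):
    # Code under test: depends only on the gateway's interface,
    # not on which implementation is behind it.
    result = gateway.charge(amount_cents, card_token)
    return result["status"] == "approved"

def test_checkout_declined_card():
    gateway = StubPaymentGateway(outcome="declined")
    assert checkout(gateway, 500, "tok_test") is False
    # The stub also lets us assert exactly what was sent to the third party.
    assert gateway.charges == [(500, "tok_test")]

test_checkout_declined_card()
```

A scheduled standalone test against the real service can then verify that the stub's assumptions still hold.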


I wouldn't want my main test suite to be dependent on anything external.

Ensuring a third party API is still doing what you think it's doing is a legitimate concern, but that test should be standalone and scheduled.


Test data only has a "realistic shape" in some applications; applications with short small transactions and low cardinality relationships. I agree keeping the data isn't a bad idea in those cases. It's probably normal in B2C apps.

Some B2B apps have large background transactions with 100k to 10s of millions of rows inserted or updated (with application controlled queuing rather than only relying on database locking), and you won't get realistically shaped data long term on unit tests designed to run in reasonable time. It's just as easy to end up with a performance problem in test code that you never see in production. Except when you have a customer who tests your service remotely on an automated basis. That can be annoying.


I follow the same testing philosophy. No mocking, except for external dependencies (mainly 3rd party APIs). No cleanups, except for problematic cases. RDBMS cleanups, when needed, are done by enclosing tests with transactions and explicit rollbacks.


Nobody has mentioned the biggest disadvantage, which is losing determinism. Good luck debugging tests that fail in CI but pass locally because some other PR added some quirky row to the db.

Rollbacks are not perfect because sometimes you need data to be visible outside of the tx.

Tests must be simple and as close to the real thing as possible. The main reason we have unit tests instead of integration tests is that they are simpler to exercise and maintain, thus giving quicker feedback in the long run.


  def test_adding_a_user_2(db_session):
    user = make_user()
    db_session.add(user)
    # this is safe, doesn't assume no other users exist
    assert db_session.query(User).get(user.user_id) is not None
If the database is dirty, and that user ID happens to exist in it, this test will continue to pass even if db_session.add(user) has suffered a regression such that it does nothing.


People keep re-discovering these things over and over again. I wonder if this is due to microservices and the desire to avoid "big" frameworks?

As for loading test data, you could go hybrid by loading all of it once, up front, and then using database transactions.


I can't even get past the first paragraph. It's not a "unit test" if it touches the database, by definition. It should say integration tests, or just stick with "tests."


I wonder if "xUnit Test Patterns" (http://xunitpatterns.com/) is still relevant.


I never clear the database and I use fuzzy tests, and combined that is VERY hard in practice. The reason you do it is mostly DEVELOPER HAPPINESS and QUICK TEST WRITING/DEBUGGING, and I can't overstate how important those are.

So, integration tests. I have a REST API which I test via PowerShell and Pester, so no transactions are possible. There are a few options:

1. The database is dropped before each run and recreated via migrations and seed

2. The database is never dropped

3. A hybrid is also possible: run thousands of times, then drop the database. This helps with randomization exhaustion of a certain limited set of objects.

The first option takes from 20s to several minutes depending on the size of the db, so it's not that bad during a CI/CD run, but it's BAD during development because you can't just run a single test 10 times quickly; you need to wait for DB cleaning in between. That clearly sucks. Oracle is the worst here.

Another thing that sucks is that once the run takes longer than 5 minutes, you again block development AND deployment. That sucks big time. Now you want to parallelize, which is a mega problem, since it's ONE THING to run a single test on a database and ANOTHER THING to run multiple tests on a single database in parallel. It can be done, but it's better to forget that IMO and just parallelize across different databases (I did this: one test run on 10 servers, each using its own db and taking a set of test suites).

Now, WRITING TESTS LIKE THIS REQUIRES LOTS OF THOUGHT and REFACTORING OF BOTH TESTS AND BACKENDS, particularly when combined with FUZZY tests (where, in essence, you randomize the test input on each run).

Fuzzy tests with the non-dropping database are absolutely astonishing, both in their testing ability and their potential to produce flaky tests. Out of 10 failed tests, in my experience 1 is due to a bug and 9 are due to the flakiness of randomization and database state.

And all of it can be fixed. But it requires infrastructure, tools, and methodology.

1. I log each test run to InfluxDB and dashboard the results in Grafana, so I can compare months of runs and see which tests are flaky and whether there are any other patterns.

2. I require developers NOT to drop the database, and to randomize almost ALL attributes of any input.

3. I ask them to run any single test serially thousands of times without ANY error (to prevent flaky tests). Then I ask them to run it in combination with all the other tests (so tests don't become interdependent).

4. I have tools like a complete db export when tests fail, verbose details of communication with any touched REST endpoint, the option to quickly run a single test or a set of tests in a loop, and the option to change tests before a run without any particular toolset (i.e. notepad/vim).

This is VERY hard to do, honestly. But it totally rocks if you happen to do it right. We almost never had a bug found by a user.



