The query is now more readable as it avoids repetition. However, the performance dropped and the query now executes in 13 seconds. Since the query returns the same result as the one before, no automated tests will find an issue. We won't catch the performance problems with our typical set of unit tests or integration tests.
Schema Changes
Another issue is schema management. There are generally three ways of modifying the schema: we can add something (table, column, index, etc.), remove something, or modify something. Let's see how each of these may cause trouble.
Recommended reading: Common Challenges in Schema Migration & How To Overcome Them
Adding a column seems safe. We just add a new column to a table, test all the queries, and typically nothing breaks. However, adding a column may take a lot of time. The database engine may need to rewrite the table: copy the data on the side, modify the table schema, and then copy the data back.
This may happen when we add a column between other columns and there is a clustered index on the table, or when we add a very wide column and there isn't enough room in the pages. Our typical database performance testing suite will not find these issues because the schema modification runs outside of the tests. Also, when we modify the schema locally, we typically don't see any performance issues because our local databases are small. However, rewriting a table may take minutes or even hours when done in production with enough rows. Deploying such a change may even take the application down. Even worse, we may not be able to stop the deployment once we realize it's taking too long, because of the locks already in place on the table.
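As a rough illustration, the following sketch times a migration against a table with a realistic number of rows before it ever reaches production. It uses Python's bundled sqlite3 as a lightweight stand-in for a production engine; the `users` table, the `bio` column, and the row count are all made up for the example:

```python
import sqlite3
import time

# Hypothetical guardrail sketch: rehearse the migration against a table with
# production-like row counts and measure how long it takes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"user{i}",) for i in range(100_000)])
conn.commit()

start = time.perf_counter()
# The migration under test; a real engine may rewrite the whole table here.
conn.execute("ALTER TABLE users ADD COLUMN bio TEXT")
elapsed = time.perf_counter() - start
print(f"migration took {elapsed:.3f}s")  # compare against an agreed time budget
```

A local run like this won't reproduce production locking behavior, but timing the migration against realistic data volumes at least surfaces slow table rewrites before deployment.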
Adding an index also seems safe. However, indexes need to be maintained over time. They generally improve read performance because they help us find rows much faster. At the same time, they decrease write performance: every INSERT, UPDATE, or DELETE must now modify data in yet another place. This may lead to performance degradation. Just like before, this is not something we'll easily capture with our test suite. Our unit tests will still be green because we don't change the correctness of the algorithms, only their performance.
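The write-side cost of an index can be observed even with a toy setup. Here is a minimal sketch using Python's sqlite3 as a stand-in; the table name and row count are invented, and absolute timings will vary by machine:

```python
import sqlite3
import time

def time_inserts(with_index: bool) -> float:
    """Measure how long a batch of INSERTs takes with or without an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    if with_index:
        # Every INSERT now has to update this index as well.
        conn.execute("CREATE INDEX idx_events_payload ON events(payload)")
    rows = [(f"payload-{i}",) for i in range(50_000)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO events (payload) VALUES (?)", rows)
    conn.commit()
    return time.perf_counter() - start

print(f"without index: {time_inserts(False):.3f}s")
print(f"with index:    {time_inserts(True):.3f}s")
```

On a single machine the difference may be small, but in production, with many indexes and concurrent writers, the maintenance cost compounds.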
Removing is dangerous for similar reasons. However, it can also lead to disaster when we do a rolling deployment or run a heterogeneous system. During a rolling deployment, some nodes are already running the next version of the code, while others still run the previous one. Since the previous version may still rely on the thing we removed (index, column, table, etc.), it may start failing. To avoid that, we need to maintain backward compatibility between version updates. However, our tests typically run against only one version of the code. They take the latest code (after changes) and verify it, so they won't capture compatibility issues.
A similar case occurs when we run a heterogeneous system. We need to update all the applications running in our server farm, and only then can we remove things from the database. Since our unit tests typically test just one application, it is much harder to identify compatibility issues.
In addition, that removal may lead to data loss. Any change that results in dropping something from the database needs to be carefully reviewed to make sure that nothing is lost. Restoring the database from a backup often takes hours and nearly always leads to a system outage.
Finally, modifying the schema is also dangerous. If we change a column type, we may get wrong results or decreased performance. In some cases our unit tests will catch this, but often these issues are missed. Imagine that you change the collation of a column (the way characters in a string are ordered). This is an ordinary setting for the database; however, most programming languages can't reflect it easily. We typically have just one string type in our language, and we can't represent varying requirements for comparing lowercase characters with uppercase ones or for the order of national characters in the alphabet. This may also be very difficult to spot if we use non-production components in our tests, like an H2 database instead of the production-grade one. Changes in collation may simply go unnoticed.
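The mismatch is easy to show in application code. Python, for instance, has a single `str` type, and the "collation" is whatever key function we remember to pass at each call site. A small sketch (the sample words are arbitrary):

```python
# One string type, many possible orderings - the language cannot attach a
# collation to the type the way a database column can.
words = ["Zebra", "apple", "Ärger"]

# Default ordering compares Unicode code points: uppercase letters sort before
# lowercase ones, and non-ASCII letters sort after 'z'.
code_point_order = sorted(words)
print(code_point_order)  # ['Zebra', 'apple', 'Ärger']

# A case-insensitive "collation" produces a different order entirely.
case_insensitive_order = sorted(words, key=str.casefold)
print(case_insensitive_order)  # ['apple', 'Zebra', 'Ärger']
```

If the database column's collation changes, none of this application code changes with it, and nothing fails visibly.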
Execution Changes
Yet another area of issues is related to the way we execute queries. Databases prepare a so-called execution plan of the query. Whenever a statement is sent to the database, the engine analyzes indexes, data distribution, and statistics of the tables' content to figure out the fastest way of running the query. Such an execution plan heavily depends on the content of our database and running configuration. The execution plan dictates what join strategy to use when joining tables (nested loop join, merge join, hash join, or maybe something else), which indexes to scan (or tables instead), and when to sort and materialize the results.
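As a small illustration of inspecting a plan programmatically, here is a sketch using SQLite's `EXPLAIN QUERY PLAN` via Python's sqlite3 module. The schema and index names are invented, and other engines expose plans differently (for instance, `EXPLAIN` in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_user ON orders(user_id)")

# The engine decides between a full table scan and an index lookup; the plan
# shows which strategy was chosen for this query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42"
).fetchall()
for row in plan:
    print(row)  # typically mentions idx_orders_user when the index is chosen
```

The key point is that the plan is a property of the database state, not of our source code, so the same query can execute very differently as data and statistics change.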
We can affect the execution plan by providing query hints. Inside our SQL statements, we can specify what join strategy to use or what locks to acquire. The database may use these hints to improve the performance, but may also disregard them and execute things differently. However, we may not know whether the database used them or not.
Things get worse over time. Indexes may change after the deployment, data distribution may depend on the day of the week, and the database load may differ significantly between countries when we regionalize our application. Query hints that we provided half a year ago may not be relevant anymore, but our tests won't catch that. Unit tests verify the correctness of our queries, and the queries may still return exactly the same results even when not performing optimally in production. We simply have no way of identifying these changes automatically.
ORMs and the Lack of Context
Another source of trouble comes from the tools we use. Developers may use some kind of ORM to run their queries against the database. However, ORMs hide lots of details from us.
We already mentioned the N+1 problem. One way of fixing it is to change the ORM configuration to eagerly load the entities. However, based on the application code alone, we have no way of telling whether the N+1 problem will appear or not. This is typically hidden somewhere in the ORM configuration. What's worse, just changing the configuration doesn't solve the problem. Just like the example we saw earlier that resulted in sending three hundred thousand rows to the application, some queries may require eager loading, some may need lazy loading, and others may even need to be rewritten entirely. But we won't identify that just by looking at the application code in unit tests or during code review.
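The pattern itself is simple to demonstrate without any ORM. A sketch using Python's sqlite3 (the tables and data are invented) contrasts the N+1 access pattern with a single JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO books VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
""")

# N+1: one query for the authors, then one extra query per author - what a
# lazily configured ORM silently does behind the scenes.
query_count = 1
authors = conn.execute("SELECT id, name FROM authors").fetchall()
for author_id, _name in authors:
    conn.execute(
        "SELECT title FROM books WHERE author_id = ?", (author_id,)
    ).fetchall()
    query_count += 1
print(query_count)  # 3 queries for 2 authors

# Eager alternative: a single JOIN fetches the same data in one round trip.
joined = conn.execute(
    "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
).fetchall()
print(len(joined))  # 3 rows, 1 query
```

With two authors the overhead is invisible; with a hundred thousand, the N+1 version issues a hundred thousand and one queries while the JOIN still issues one.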
Another problem can come up around transactions. Do you know what your default transaction isolation level is? Do you know where it is configured? Do you even know whether you can nest transactions and who decides whether to commit them or roll them back? Those are questions that the ORM needs to answer, and the answers are typically hidden from us and scattered around the code base.
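Even a minimal driver makes the point. With Python's sqlite3 module, for example, transaction behavior is decided by connection attributes rather than by anything visible in the SQL. A sketch (the table is invented, and other drivers or ORMs hide this in yet other places):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (balance INTEGER)")
conn.commit()

# The default isolation_level of '' makes the driver open a deferred
# transaction implicitly before the first data-modifying statement.
print(repr(conn.isolation_level))

conn.execute("INSERT INTO accounts VALUES (100)")
print(conn.in_transaction)  # True: a transaction we never explicitly began

conn.rollback()  # the uncommitted insert silently disappears
print(conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0])  # 0
```

Nothing in the INSERT statement itself hints that it runs inside an implicit transaction; the behavior lives entirely in the driver's configuration.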
Some ORMs may also modify the database schema, especially when they find a drift. This may not be a problem if the given ORM is the only user of the database, but it may be much worse when we run multiple applications or have a heterogeneous cluster. Each ORM may have its own way of tracking the schema: some store tracking data in an additional table in the database, while others just run plain SQL scripts. Maintaining these mechanisms is something we typically don't test with unit or integration tests. We just apply the schema changes.
There are many more issues that ORM can cause. Caching, connection pooling, transaction scope, unsupported data types, and polymorphic execution with virtual functions are just examples. ORMs are great, but they also cause a lot of headaches.
The Testing Pyramid and Why It's Not Enough
Until now, we discussed things that may break when working with databases. Our goal is to prevent bad changes from reaching production. Let's see why our current approach won't work.
Let's examine the test pyramid. Martin Fowler explains that it is a metaphor that tells us to group software tests into buckets of different granularity. His article goes into detail about building proper tests and where we should focus the most. Let's see the pyramid and why it's not enough for our needs.
Unit Tests
Martin Fowler describes that unit tests make sure that a certain unit (your subject under test) of your codebase works as intended. Unit tests are supposed to verify the correctness of the application, check the public interfaces, and make sure that the business logic does what we expect it to do. We may argue about whether we should test trivial code, mock dependencies, or test private methods. However, no matter what our approach is, unit tests do not verify how things execute. They won't capture issues around N+1 queries, won't verify whether the database uses indexes, and won't notice when the database engine decides to join tables differently. First, unit tests typically don't even access the database because we mock it out. Second, even if we use an actual database in unit tests, we don't assert how it executes the operations. Unit tests won't protect our databases.
Integration Tests
Martin Fowler then moves to integration tests, which test the integration of your application with all the parts that live outside of it. They are supposed to verify that our application interoperates well with the surrounding infrastructure: web servers, load balancers, and databases. There are multiple issues, though.
First, integration tests don't focus on how things execute. With integration tests, we take the same approach as with unit tests - we load the data, run queries, and verify whether we got the expected entities or results in general. Integration tests check whether the contracts between components are correct, not whether the components work properly. To test the latter, we rely much more on unit tests. Integration tests simply have different goals.
Second, integration tests use different databases than the production ones. We typically take one of two approaches: create new infrastructure or reuse an existing one. When we recreate the infrastructure from scratch using Infrastructure as Code (IaC), we spin up a new database cluster, fill it with some data, and then run our queries. While this sounds good enough, such a process doesn't prepare the database the same way we run it in production. We most likely won't have enough data in the database (because loading the data takes a lot of time), statistics won't reflect the production scenarios, and we won't have the same extensions, configuration, or even database edition. Even Martin Fowler shows how he uses an H2 database that is unfortunately not close to the actual components he uses in production. The other approach is to reuse some existing database, probably prepared for the team environment and shared by multiple developers. This may be even worse because the testing database may drift from the production settings and make our experience different. We probably don't want to pay as much for the testing database as for the production one, so the testing database will have different hardware or editions. It will often be filled and cleared, so the statistics will differ significantly. We may use different runtime configurations or extensions. We will have different partitions, and the schema will often differ.
Third, integration tests do not react to changing production environments. They will test our code with some configuration obtained at a given point in time. However, if we later reconfigure our database, we most likely won't rerun the tests with the new configuration. Therefore, changes in the production database won't be reflected in the integration tests when these changes are applied. They will be taken into account when a new change is being tested, but that may be very late.
Fourth, integration tests assume the database is there. The database is created somehow, and the test uses the database to verify correctness. However, the test doesn't check how the database was created, whether the migration was fast enough, or whether the configuration is the one that we expect. If the database works, then the test will just work as well.
Other Tests
Martin Fowler mentions other test types in the pyramid. They can focus on contracts, UI, and interoperation of all the elements. These tests run on different levels, but they all test with the same approach - prepare some input, run the test, and verify if the output is correct. These tests check if things work, but they don't focus on how things work.
Interestingly enough, Martin Fowler doesn't focus much on load tests. They are actually the only tests that may effectively capture at least some of the issues we mentioned before.
Why Load Tests Are Not the Right Tool
Load tests are supposed to test the performance of our application. They seem to be exactly what we need to avoid the issues we listed before. However, they are not good enough either. Let's see why.
Drawbacks of Load Tests
Load tests are expensive. To run a load test for a big web application, we may need to spin up a fleet for hours. Depending on the type of our application, we may need to configure other services like caches, databases, load balancers, and credential managers. Paying for yet another fleet of services just to run load tests may be something we are not willing to do. This is even harder for ML-enabled applications that utilize multiple GPU instances, or for applications using custom hardware. It may simply be too expensive.
Load tests are slow. Running a load test can easily take 12 hours or more. Developers typically let these tests run overnight and check the results the following morning. This forces a context switch and a much slower feedback loop. If we need to run multiple load tests before production (because of syntax errors, wrong configuration, or any other reason), this approach can easily delay our deployments by a week. It also increases frustration among developers who need to spend so much time doing the same thing over and over again.
Load tests may not reflect the actual data distribution. If we deploy the application in the US but test it using data from Europe, then the results we get won't be reliable. Similarly, if we use testing data from a big country, we may not notice performance issues affecting small countries that contribute a single-digit percentage of traffic to the platform. Getting the right data distribution is hard, and maintaining that data is even harder. While we can capture a proper dataset once, it's difficult to maintain it over time. Sometimes we can just replay the production traffic directly to the non-production environment, but that's not always possible if the application is stateful or due to legal implications.
Load tests need to use data safely. We can't just take the production requests and run them locally because they contain confidential data. We need to adhere to GDPR, CCPA, and other local regulations. We need to redact the secret values, sometimes change the distribution, and sometimes even remove some requests entirely. We need to pay attention to logs, databases, cache keys, configuration keys, and other places where confidential information may be stored. This is effectively yet another system that we need to maintain, which takes time and is expensive.
However, the most important drawback of load tests is that they happen way too late. Most of the time, developers run them after merging their code to the main branch and pushing it through the pipeline. Developers then move on to different tasks and focus on a new assignment. Load tests take hours to complete, and developers get the feedback days after they merge the code. The code was already reviewed, verified, and tested, and now developers need to rewrite it or even start from scratch because the approach they took just won't work. This hurts velocity and developer productivity significantly, and we can't let that happen. The feedback loop must be much shorter; ideally, developers should get feedback as they write the code.
Which Issues Will Not Be Captured
Even if we run load tests, we still won't capture all the issues. Load tests run against an already deployed application. They don't check the deployment per se. Therefore, slow schema migrations, changes in configuration, or even the deployments themselves won't be tested by load tests. Developers may test these things explicitly; however, our automated pipelines typically don't verify them at all. Tests will fail if the migration fails, and they will carry on if the migration succeeds. However, tests will not check how long it took to run the migration, add an index, or change the configuration.
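One way to close this gap is to make the pipeline itself enforce a time budget on deployment steps. A hypothetical sketch (the budget value, the function name, and the sleeping stand-in for the real migration are all made up for illustration):

```python
import time

MIGRATION_BUDGET_SECONDS = 5.0  # hypothetical budget agreed with the team

def run_with_budget(step, budget=MIGRATION_BUDGET_SECONDS):
    """Run a deployment step and fail loudly if it exceeds its time budget."""
    start = time.perf_counter()
    step()
    elapsed = time.perf_counter() - start
    if elapsed > budget:
        raise RuntimeError(f"step exceeded budget: {elapsed:.1f}s > {budget}s")
    return elapsed

# Stand-in for the real migration (e.g. applying an ALTER TABLE script).
elapsed = run_with_budget(lambda: time.sleep(0.01))
print(f"migration finished in {elapsed:.3f}s")
```

A check like this turns "the migration succeeded" into "the migration succeeded fast enough," which is exactly the dimension load tests ignore.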
Database Guardrails - The Ultimate Solution
Based on what we said above, we need a new approach. Whether we run a small product or a big Fortune 500 company, we need a novel way of dealing with our databases. Developers need to own their databases and have all the means to do it well. We need database guardrails - a novel approach that:
- Prevents the bad code from reaching production,
- Monitors all moving pieces to build a meaningful context for the developer,
- And significantly reduces the time to identify the root cause and troubleshoot the issues, so the developer gets direct and actionable insights
We can't let ourselves go blind anymore. We need to have tools and systems that will help us change the way we interact with databases, avoid performance issues, and troubleshoot problems as soon as they appear in production. Let's see how we can build such a system.
What We Need for Database Guardrails
There are three things that we need to capture to build successful database guardrails. Let's walk through them.
Database Internals
Each database provides enough details about the way it executes the query. These details are typically captured in the execution plan that explains what join strategies were used, which tables and indexes were scanned, or what data was sorted.
To get the execution plan, we can typically use the EXPLAIN keyword. For instance, if we take the following PostgreSQL query: