Good write up. Another excellent resource straight out of the UC Berkeley Database Group that I keep close by is "Architecture of a Database System"[1] by three researchers in the field. It is very readable.
Honestly for this one I do not remember since it was two years ago, but I'd start over at http://arxiv.org/, which is a treasure. Granted most are research papers, but even those will link to overview papers, of which the one I link to is an example.
Good overview of traditional OLTP architectures. As complicated as they look from the article, it is just scratching the surface of a sophisticated implementation. There are many internals common to more advanced designs that are not even mentioned, and the article is already quite long!
The thing I love most about database engines is that there is probably more hardcore computer science per line of code than any other software system of similar scope. It is a very rich ecosystem for an algorithms and data structures geek.
Be careful with theoretical asymptotic complexity (big O) for execution time. If your algorithm is O(1) but internally calls a function with higher complexity, e.g. a malloc() implemented in O(log n), then your algorithm's real time complexity is O(log n), not O(1). It can be even worse: an algorithm that is constant time on average or in the typical case can in reality be O(n), e.g. when a hash table has to rehash. (That's why many big data structures that need real-time behavior, including most SQL databases, are implemented as trees, or tree hierarchies/divisions/clusterings, instead of big hash tables.)
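A toy sketch of that last point (my illustration, not from the comment above): a growable array's append is "O(1) amortized," but individual appends occasionally pay O(n) to copy everything into a bigger buffer, exactly like a hash table rehash. Counting the element copies makes the hidden work visible.

```python
# Instrumented growable array: counts element copies so the hidden
# O(n) resizes behind an "O(1) append" become visible.
class GrowableArray:
    def __init__(self):
        self._buf = [None]              # capacity 1
        self._len = 0
        self.copies = 0                 # total work done by resizes

    def append(self, x):
        if self._len == len(self._buf):
            new_buf = [None] * (2 * len(self._buf))  # double capacity
            for i in range(self._len):               # O(n) copy
                new_buf[i] = self._buf[i]
            self.copies += self._len
            self._buf = new_buf
        self._buf[self._len] = x
        self._len += 1

arr = GrowableArray()
for i in range(1024):
    arr.append(i)

# Resizes copied 1 + 2 + 4 + ... + 512 = 1023 elements in total: about
# one extra copy per append on average, but the single resize at length
# 512 did 512 units of work inside one "constant time" call.
print(arr.copies)  # → 1023
```

A real-time system cares about that worst-case spike, not the average, which is the argument for trees over big hash tables.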
I don't think that the `n` in the case of malloc would always be relevant to the semantics of the query. In that case, it would still be appropriate to refer to it as constant time.
For instance, you don't typically look at the size of the literals in the query when evaluating query complexity. If it's really unbounded, you probably shouldn't use a relational database.
Not necessarily, but it could be. Typical malloc implementations manage at least small and large memory requests differently, using separate pools, in order to reduce fragmentation. Also, operations involving virtual address remapping are expensive (a realloc on a small block is faster with a full memcpy to a different location than with anything that requires the OS kernel to remap virtual addresses).
The "problem" with malloc() (or any equivalent allocation routine) is that it manages free blocks internally (it is a middleman between the OS and the user process, whose purpose is to reduce OS calls); if you have lots of them, dynamic memory operations can take time. By "malloc" I mean malloc/realloc/free, the whole kit. Those operations are not free (in most cases you're not going to have millions of allocations in one process; that was just an example of hidden things that can make your algorithm not behave as expected).
I think you're right. In fact, in the optimizer part I say (in a simple way) that big O (i.e. asymptotic complexity) is not the same as CPU cost, but it's easier for me because the real cost of an operation depends on the CPU architecture.
Someone told me the same on the article comments and here is the answer I gave him:
You’re right and I agree with you.
When I wrote this part, I REALLY hesitated to give the real asymptotic definition and what it means for the number of operations but I chose a simpler explanation since the aim of this post is not to become an expert but to have a good idea. I hope that this won’t mislead people but I thought the real definition was too hard for a “newcomer” and not important to understand a database.
This is also why I added in this part “The time complexity doesn’t give the exact number of operations but a good idea.” and said at the end of the part “I didn’t give you the real definition of the big O notation but just the idea” with a link to the real definition.
Because I am interested in databases, I found SE Radio's 2013 interview with Michael Stonebraker [1] interesting, particularly in regard to traditional database design and more recent ideas:
> When it comes to relational databases, I can’t help thinking that something is missing. They’re used everywhere. There are many different databases: from the small and useful SQLite to the powerful Teradata. But, there are only a few articles that explain how a database works.
That's because the inner workings are really old, as in: emerged before blogging etc. was popular, hell before the internet was invented.
In the 'before/early internet days', we read books like 'An introduction to Database Systems' by C.J. Date. (I had to blow the dust off my copy to read the exact title ;)), which are more in depth than this article, but I like the article better, because it's more to the point and easier to understand. Well done!
Not that Cassandra and Hadoop don't have a place. But because NoSQL is hot, I see lots of young coders (I'm an old DBA) try to turn document-store systems into relational databases. They should all be made to read this post.
(I'm the author of the article) I'm 28 and I’m currently a Big Data developer (I use Hadoop, HBase, Hive …) and I don’t understand the buzz surrounding Big Data and NoSQL.
With a relational database the complexity is hidden (more or less…) whereas with Big Data and NoSQL the developer needs to deal with this complexity himself/herself. As a result, most of the Big Data applications I’ve seen don’t work well.
I really like Big Data because it's more complex, but to be honest, most of the time my work does not require "Big Data scale".
At Couchbase we did a survey of developers (this was ages ago) and the biggest motivator for NoSQL was schema flexibility. Not having to coordinate migrations is seen as a productivity boost. [1]
The other thing document databases can offer that relational databases struggle with is taking subsets (which we use for offline sync.) [2]
I'm in the opposite camp. I don't like NoSQL because of the flexible schema. I have to build tools to make sure my data is consistent or have error handling. Migrations ensure that whenever I pull down a version of the code, the database is in the right state.
I don't see too many use cases where being schemaless is a good thing outside of infrastructure ease of use. If you want to store arbitrary data in a table, JSONB in Postgres is very efficient and flexible in that regard, while allowing you to have a schema for the rest of the data (e.g. if you are collecting data you probably want a schema for things like name, email, timestamp, etc., and then a JSONB field to jam in whatever you need).
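A sketch of that hybrid approach: fixed, schema-enforced columns for the fields you always need, plus one JSON column for the long tail. (This uses SQLite's json_extract as a stand-in for Postgres's jsonb type and its operators, purely so the example is self-contained; the table and column names are made up.)

```python
# Hybrid schema: strict columns + a free-form JSON payload column.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE signup (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,      -- schema-enforced fields
        email TEXT NOT NULL,
        extra TEXT                -- arbitrary JSON for everything else
    )
""")
conn.execute(
    "INSERT INTO signup (name, email, extra) VALUES (?, ?, ?)",
    ("Ada", "ada@example.com",
     json.dumps({"referrer": "hn", "plan": "pro"})),
)

# You can still query into the unstructured part when needed
row = conn.execute(
    "SELECT name, json_extract(extra, '$.plan') FROM signup"
).fetchone()
print(row)  # → ('Ada', 'pro')
```

In Postgres the `extra` column would be `jsonb` with `extra->>'plan'`, and you could even index inside it, so going schemaless for part of a row doesn't mean giving up the schema for the rest.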
Having worked in ERP, e-commerce, financial tech, and general SaaS-based tech stuff, I agree with you on the "not too many use cases where being schemaless is a good thing". In most cases it's a shortcut and an unnecessary tradeoff made by people to avoid a few technical issues like DB migrations (which are also a solved problem in many ways, as long as you don't attempt to reinvent the wheel). The only time I saw a good use for a NoSQL db was storing products in e-commerce. Managing taxonomy and attributes was always a nightmare, and everyone was constantly afraid of performance issues (being on Magento and battling the EAV system didn't help). It would have been great to have only the products stored in a NoSQL instance and the rest of the data in the traditional relational store.
As someone who spent several years studying programming languages, the thing that drives me crazy about traditional relational databases is the assumption that all data is tuple-structured. Much data is structured as unions of alternates or more complex things like maps. Shoehorning your data model into a tuple-based system is always possible, but often unnatural.
The place NoSQL shines is the acknowledgement that most data is complex. Of course it often also does away with ACIDity, which is a huge disaster (EDIT: the doing-away-with is a disaster, I mean).
I don't think this is correct. You can have a relational, columnar-stored key-value map that stores any values you want. Bonus: it's super easy to make these kinds of updates ACID. Of course, if you're maxing out the storage space, you're gonna have a rough time with indexes unless you take the EXACT SAME approach as you would with NoSQL.
I don't think there are any "inherent" problems to relational or NoSQL databases, but there are many tradeoffs. The tradeoff of NoSQL databases is that complexity gets very, very difficult to pull off in a distributed fashion. So throw 99% of the indices out the window, dumb your queries down, and cache any joins or scans as much as possible. The upside, I guess, is that the "schema" is pretty irrelevant if it's not your primary key (or secondary, in some databases). But, you lose joins, schemas, subqueries, orderings, many types of transactions, etc, etc, and a lot of "free" stuff that is really only "free" for small numbers of rows per table or strong assumptions about the data.
The relational model is extremely general; it's been argued (fairly successfully) by Date and Codd to be THE most general model (graph being a close second). It's a rigorous approach to managing data with integrity.
I used to be a programming language oriented person, was big into data structures and objects, but then I read Date and my mind was blown at how beautiful and expressive the relational model is -- for its intended purpose (managing data for logical integrity and ad hoc queriability).
The main issues are:
1. Many implementations don't include some features, such as unions.
2. Certain things (tree traversal) have also been hard to express in older versions of SQL, or in older versions of Tutorial D (Chris Date's language, which is closer to the model).
3. Sometimes you don't care about long-term data management (i.e. ad hoc queriability and integrity); you just want programmatic data persistence with pre-baked access paths that are FAST.
4. Relational integrity features are often crude implementations that slow things down too much or require custom triggers.
5. Queriability in reality requires decent knowledge of the physical layout and indexing if you're going to make it performant.
6. Most relational databases were not built in a cloud-native era where we assume distribution across ephemeral disks and compute.
So... great mathematical model, great way to think through and organize your data for no ambiguity, but the practical implementations leave a lot to be desired.
The problem is that "my data is too complex for the relational model" often means "I haven't thought through my data". Things like maps, unions, ordered sets, N-ary relationships, graphs and trees are actually quite straightforward to represent in relations. The challenge is that many of the lessons and arguments for this are trapped in books from the 70s-90s, not on the Web.
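For instance, a tree fits naturally in a relation as an adjacency list, and modern SQL traverses it with a recursive common table expression. A minimal sketch (SQLite here purely for illustration; the table and column names are invented):

```python
# A tree as a plain relation (child -> parent), traversed with a
# recursive CTE: no special "hierarchical" database required.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, label TEXT);
    INSERT INTO node VALUES
        (1, NULL, 'root'),
        (2, 1,    'left'),
        (3, 1,    'right'),
        (4, 2,    'leaf');
""")

# Walk the subtree rooted at node 1, tracking each node's depth
rows = conn.execute("""
    WITH RECURSIVE subtree(id, label, depth) AS (
        SELECT id, label, 0 FROM node WHERE id = 1
        UNION ALL
        SELECT n.id, n.label, s.depth + 1
        FROM node n JOIN subtree s ON n.parent = s.id
    )
    SELECT label, depth FROM subtree ORDER BY depth, id
""").fetchall()
print(rows)  # → [('root', 0), ('left', 1), ('right', 1), ('leaf', 2)]
```

`WITH RECURSIVE` is standard SQL (and is exactly the kind of thing point 2 above says older SQL versions lacked); before it existed, people reached for nested sets or materialized paths, which are also plain relations.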
Agree, 100%. I too early in my career was very enamored by object and graph databases (this is pre the 'nosql' buzz), but once I started reading Date and Codd (and the inflammatory Fabian Pascal) some lights starting turning on in my brain.
Firstly, it should be made clear to people that SQL is not truly relational, and a lot of the things people dislike about it are nothing to do with the relational model and more to do with its late 70s, early 80s heritage. It was thrown together at a time when business systems were still very focused around COBOL.
The second thing that people are not picking up on is that the industry _already_ went through a pre-SQL "nosql" phase in the 60s and 70s, when network and hierarchical databases were the norm, and the relational model was developed to deal with the perceived faults of those systems: an enforced topology which could not easily be rearranged at query time, lack of rigor in modeling, lack of standardized modeling concepts and notation...
Finally, I did find in previous jobs certain uses for nosql systems -- very low latency high throughput quasi-realtime systems that deal with very small bits of simply structured data and need to distribute it widely across a cluster. For that I used Cassandra (tho I understand now that there are successors that are better).
What I don't get is the point of systems like Redis or MongoDB which don't offer a compelling distribution implementation and simply replace the fairly well understood quasi-relational model of SQL with their own ultimately inferior graph/network/hierarchical models.
As a fan of Cassandra (though not for its relational model ;), what successors do you believe are better? Riak is the only one that comes to mind.
Btw, to me the point of Redis and Mongo is a very fast distributed dictionary. They're data structure servers, for when you want to persist and share data structures across processes, not "manage information". It depends on your goal.
Oh, this is true, and I think I came off too harshly. The relational model is general enough to support (almost?) any data model, it certainly has a lot of advantages in terms of efficient implementation, and the math is elegant.
I just don't think that it naturally reflects the data structures people use, and we should be willing to make the computer do the work, rather than humans.
The buzz around NoSQL is you don't have to worry about scaling the database. There are many, many more options now for e.g. multi-master, sharding, no-downtime copy-on-write migrations, etc., but just the idea of being able to run a tiny subset of queries or writes without having to worry about running out of resource capacity is a HUGE plus.
But having data corruption baked into the system design is a huge minus. Even the big shots at Google and Amazon are constantly firefighting data corruption in their NoSQL systems.
I'm not quite sure what you're referring to. But at certain scales of data (petabytes, probably?) data corruption is inevitable. I do not know if they use ECC.
"With a relational database the complexity is hidden"
That is my main issue. I use Cassandra over relational firstly for its linear scalability and multi-master-esque HA. But even ignoring those, I understand exactly what is being scanned and what is not, I don't have to fight with an optimizer at runtime based on several parameters.
I understand your point (and it's a good one) and here is mine: unless you're working in a team with a lot of good IT guys, you're likely to end up with worse performance and more problems.
For example, when I started in Big Data, in less than 3 weeks I was able to optimize some batches just because I read the documentation of the framework used (Pig in this case) and read a small part of the source code to dig deeper. And these were not touchy optimizations: I used in-memory joins and reduced the number of relations in the scripts to reduce the number of Hadoop jobs generated (which made the batches 4 times faster).
There are often problems with our HBase database because it's often overloaded (I'm not an IT operator so I can't give more details), and no one really masters this database even though it's been in production since 2014.
I do understand that in some cases a NoSQL database is mandatory and like you I like to understand what I’m doing. But:
- I’m not working in Silicon Valley
- Most of my co-workers are not geeks (and I respect that)
- It's VERY hard to find guys with real Big Data or NoSQL skills (this comes from a French technical recruiter)
So, while the geek part of me loves Big Data and NoSQL, the rational part prefers using well-known technologies. If NoSQL and Big Data become mainstream and better known, then the rational part will love them too.
While I agree with a couple of your premises, I don't think they all apply to Cassandra as much, and it's too broad a brush to use.
I don't believe NoSQL means no validation. In fact, I've found things like Cassandra CQL actively prevent me from running expensive queries unless I opt in (e.g. ALLOW FILTERING). Validation is DB specific, but I don't think it's fair to say it's a footgun in NoSQL any more than in SQL databases.
As for choosing what is known by the employee market, I personally don't choose technologies that way (but I do choose based on maturity of course). I rarely look for skills as much as the ability to learn new ones, but I understand it can be a pipe dream when in the market for juniors.
Those optimizers you are fighting with have thousands of man hours of research behind them. For every silly choice they make, they make hundreds or thousands of correct ones.
Teradata, Oracle, PostgreSQL for example are reasonably complex databases to cluster and manage yourself. Just as easy/hard as setting up HDFS and installing Hive. In all cases people who are at big data scale are buying OTS solutions e.g. Cloudera appliance. They aren't rolling their own.
And if you are using Hive then I can understand why you are not feeling the buzz. But play around with Spark for a while and it's easy to see the future. Being able to write Scala/Python/SQL/R against a data set that can be anywhere from 100MB to 100PB without any changes is pretty compelling.
While it can handle 100 MB easily there probably are faster ways to handle that small amount of data. But yes, Spark can handle many PB and doesn't require a ton of changes in the code as you scale up from say 10 TB to 100 PB. The underlying cluster would change, and the performance profile would change a lot (10 TB can be done in-memory ... many PB, not so much)
Which really isn't intended for 100 MB (I bet I could write a unix pipe & filter script that's faster than Spark), but is intended for 10 TB through several PB.
I know. And I'm sorry, bitterly sorry, but I know that... no apologies I can make can alter the fact that in our thread you have been given a dirty, filthy, smelly piece of technical argument.
This is also a pretty accessible quick intro to complexity and data structures, nicely done. Definitely the sort of thing I would include as further reading in a beginner course -- some beginners love to understand "why" and this post answers pretty much all the "why" possible.
I decided to spend some time digging into SQLite. I highly recommend the overviews of their architecture and the details about each part of the puzzle.
It's really understandable, very straight forward, even if a lot of it refers to SQLite v2, it still seems very relevant.
Modern databases are probably some of the most sophisticated software in existence. That being said, you can pick a minimal subset of functionality and roll with that.
Great article! I wish a book were written where a simple database with a query language was implemented from start to finish, even a NoSQL one. I've always wanted to implement my own.
I think the important thing is knowing when not to care. Unfortunately, a lot of developers don't care because time complexity isn't even on their radar. So in the times when it does matter, they get burned.
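A concrete version of "getting burned" (my example, not the commenter's): deduplicating with a list membership test is quietly O(n²), while a set makes it O(n). Counting the comparisons shows how fast the gap grows.

```python
# Same result, wildly different cost: list-based dedupe scans `seen`
# on every lookup; set-based dedupe does ~one hash probe per item.

def dedupe_list(items):
    seen, out, ops = [], [], 0
    for x in items:
        ops += len(seen)          # linear scan behind `x in seen`
        if x not in seen:
            seen.append(x)
            out.append(x)
    return out, ops

def dedupe_set(items):
    seen, out = set(), []
    for x in items:
        if x not in seen:         # ~O(1) hash lookup
            seen.add(x)
            out.append(x)
    return out, len(items)        # ~one probe per item

data = list(range(2000))          # all distinct: the worst case
_, slow_ops = dedupe_list(data)
_, fast_ops = dedupe_set(data)
print(slow_ops, fast_ops)  # → 1999000 2000
```

At 2,000 rows nobody notices; at 2 million, the list version is the bug report.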
If you liked this article, maybe you'll like my article on Shazam. I used the same pattern: I start from the basics of sound processing and computer science and finish with an in-depth explanation of Shazam.
The database part of http://witheve.com/ is written in Rust. It's not very technically interesting yet (eg no query optimiser) but the basics all work.
[1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf