DDIA: Storage and Retrieval

Storage and Retrieval

For me this is largely a review of CSEP 544: Database Management Systems, perhaps from a slightly different perspective. A quick recap:

We consider two main settings for databases:

OLTP: Online Transaction Processing

Think of databases supporting user-facing applications. Typically, many small atomic transactions must happen to keep the database a source of truth representing the current state.

  • E.g. an order is placed, a user creates an account, a message is sent, etc.

In this setting, there is typically a high volume of small requests. By small requests, I mean each interaction with the database likely touches only a few tables, or even just a few records within those tables. A request might be "get this record by key," and the storage engine uses an index to find that record quickly.
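To make the point lookup concrete, here is a minimal sketch (class and file names are illustrative, not from the book) of the simplest such scheme: an append-only log with an in-memory hash index mapping each key to a byte offset, so "get this record by key" is one seek rather than a full scan.

```python
import os

class HashIndexedLog:
    """Append-only log file with an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()   # ensure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()      # record will start at the current end of file
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset   # the newest write for a key shadows older ones

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)         # one seek straight to the record
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexedLog("orders.log")
db.put("order:1", "total=30")
db.put("order:1", "total=42")      # an "update" just appends a newer record
print(db.get("order:1"))           # -> total=42
```

Note that updates never rewrite old data; the index simply points at the latest copy, which is why the per-request cost is dominated by the seek.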

Disk seek time is often the bottleneck in this setting.

Within OLTP there are two popular storage engine architectures:

  • LSM-trees (log-structured merge-trees) use append-only files with a background merging (compaction) strategy.
  • B-trees treat the disk as fixed-size pages that can be updated in place.

The big difference is that, because LSM is append-only, you get sequential writes instead of random-access writes, which ultimately gives you higher write throughput. Conversely, B-Trees are typically said to have higher read throughput.
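The fixed-size-page idea can be sketched as follows (a toy illustration, not a real B-tree; the file name and page size are made up): the file is a sequence of equal-sized pages, and an update seeks to a page's fixed offset and rewrites it in place, which is exactly the random-access write pattern contrasted above.

```python
import os

PAGE_SIZE = 64  # real engines use pages of roughly 4 KB

path = "pages.db"
with open(path, "wb") as f:
    f.write(b"\x00" * PAGE_SIZE * 4)         # preallocate four blank pages

def write_page(page_no, payload):
    assert len(payload) <= PAGE_SIZE
    with open(path, "r+b") as f:
        f.seek(page_no * PAGE_SIZE)          # random access: seek to the page's offset
        f.write(payload.ljust(PAGE_SIZE, b"\x00"))

def read_page(page_no):
    with open(path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)
        return f.read(PAGE_SIZE).rstrip(b"\x00")

write_page(2, b"key=order:1,total=30")
write_page(2, b"key=order:1,total=42")       # update rewrites the same page in place
print(read_page(2))                          # -> b'key=order:1,total=42'
print(os.path.getsize(path))                 # file size unchanged by the update
```

Because the updated page lands at the same offset, the file never grows on update, but each write may require a seek first, which is the trade-off against the LSM's sequential appends.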

However, under very high write throughput, the compaction LSM requires not only competes with incoming writes for disk bandwidth, it can fall behind them entirely, so unmerged segments accumulate until they take up the whole disk. Explicit monitoring is necessary to catch this.
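The LSM write path above can be sketched in a few lines (a toy model with invented names, not a real engine): writes land in an in-memory memtable; when it fills, it is flushed as a sorted, immutable segment in one sequential write; compaction then merges segments, keeping only the newest value per key.

```python
MEMTABLE_LIMIT = 2                  # tiny limit so a flush happens quickly

memtable = {}
segments = []                       # oldest first; each is a sorted list of (key, value)

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        segments.append(sorted(memtable.items()))  # flush: one sequential write
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for segment in reversed(segments):  # newer segments shadow older ones
        for k, v in segment:            # a real engine would binary-search here
            if k == key:
                return v
    return None

def compact():
    merged = {}
    for segment in segments:            # later segments overwrite earlier values
        merged.update(segment)
    segments[:] = [sorted(merged.items())]

put("a", 1); put("b", 2)            # triggers flush 1
put("a", 3); put("c", 4)            # triggers flush 2 ("a" is now shadowed)
compact()
print(segments)                     # -> [[('a', 3), ('b', 2), ('c', 4)]]
```

The caveat in the paragraph above shows up here too: if `put` calls arrive faster than `compact` runs, `segments` grows without bound, which is the degenerate case that needs monitoring.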

OLAP: Online Analytics Processing

This is a rarer setting, simply because by the time it becomes relevant, the company must already have a massive amount of data and is likely already large and successful. In this setting, the storage engine supports a data warehouse: a large database, typically separate from but fed by the various OLTP databases. There might be a periodic data-dump job or an ongoing stream from the application-supporting OLTP databases, followed by a transformation of this data into an analysis-friendly schema, before it ultimately lands in the data warehouse.

In contrast to OLTP, these systems usually handle a very low volume of requests, but each request is very demanding and often requires scanning across all records. A request might be: what is the average order total across all orders? As such, indexes are less useful, and column-oriented storage often fits well here.
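A small sketch (with a made-up orders schema) shows why column orientation suits this workload: the average-order-total query touches only the "total" column, so a columnar engine scans one dense array instead of dragging every field of every row off disk.

```python
# Row-oriented layout: each record's fields are stored together, so a scan
# over one field still reads all the others.
rows = [
    {"order_id": 1, "customer": "ann", "total": 30.0},
    {"order_id": 2, "customer": "bob", "total": 50.0},
    {"order_id": 3, "customer": "cam", "total": 40.0},
]

# Column-oriented layout: one contiguous array per column.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ann", "bob", "cam"],
    "total":    [30.0, 50.0, 40.0],
}

# "What is the average order total across all orders?" reads one column only.
totals = columns["total"]
avg = sum(totals) / len(totals)
print(avg)  # -> 40.0
```

Reading one dense column per query is also what makes disk bandwidth, rather than seek time, the limiting resource: the scan is sequential, but it has to move a lot of bytes.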

Disk bandwidth is often the bottleneck in this setting.