Columnar databases rely on specialized encoding schemes to reduce storage requirements. These encodings also enable efficient in-situ data processing. Nevertheless, many existing columnar databases are encoding-oblivious. When storing the data, these systems rely on a global understanding of the dataset or the data types to derive simple rules for encoding selection. Such rule-based selection leads to unsatisfactory performance. Specifically, when performing queries, the systems always decode data into memory, ignoring the possibility of optimizing access to encoded data. We develop CodecDB, an encoding-aware columnar database, to demonstrate the benefit of tightly-coupling the database design with the data encoding schemes. CodecDB chooses in a principled manner the most efficient encoding for a given data column and relies on encoding-aware query operators to optimize access to encoded data. Storage-wise, CodecDB achieves on average 90% accuracy for selecting the best encoding and improves the compression ratio by up to 40% compared to the state-of-the-art encoding selection solution. Query-wise, CodecDB is on average one order of magnitude faster than the latest open-source and commercial columnar databases on the TPC-H benchmark, and on average 3x faster than a recent research project on the Star-Schema Benchmark (SSB).
CITATION STYLE
Jiang, H., Liu, C., Paparrizos, J., Chien, A. A., Ma, J., & Elmore, A. J. (2021). Good to the Last Bit: Data-Driven Encoding with CodecDB. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 843–856). Association for Computing Machinery. https://doi.org/10.1145/3448016.3457283
Mendeley helps you to discover research relevant for your work.