Compression of Very Sparse Column Oriented Data

Vinicius Fulber Garcia, Sergio Luis Sardi Mergen


Column oriented databases store columns contiguously on disk. The adjacency of values from the same domain leads to a reduced information entropy. Consequently, compression algorithms are able to achieve better results. Columns whose values have a high cardinality are usually compressed using variations of the LZ method. In this paper, we consider the usage of simpler methods based on run-length and symbols probability in scenarios where datasets are very sparse. Our experiments show in which cases the simple methods evaluated provide promising results.


compression; column oriented databases

Texto completo:

PDF (English)


Abadi, D., Madden, S., e Ferreira, M. (2006). Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 671–682. ACM.

Abadi, D. J. et al. (2007). Column stores for wide and sparse data. In CIDR, pages 292–297.

Ailamaki, A., DeWitt, D. J., Hill, M. D., e Skounakis, M. (2001). Weaving relations for cache performance. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pages 169–180, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Burrows, M. e Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm (technical report).

Council, T. P. P. (2008). Tpc-h benchmark specification.

Deutsch, L. P. (1996). Deflate compressed data format specification version 1.3.

Huffman, D. A. et al. (1952). A method for the construction of minimum redundancy codes. Proceedings of the IRE, 40(9):1098–1101.

Lamb, A., Fuller, M., Varadarajan, R., Tran, N., Vandiver, B., Doshi, L., e Bear, C. (2012).

The vertica analytic database: C-store 7 years later. Proc. VLDB Endow., 5(12):1790–1801.

Matei, G. e Bank, R. C. (2010). Column-oriented databases, an alternative for analytical environment. Database Systems Journal, 1(2):3–16.

Skodras, A., Christopoulos, C., e Ebrahimi, T. (2001). The jpeg 2000 still image compression standard. IEEE Signal processing magazine, 18(5):36–58.

Witten, I. H., Neal, R. M., e Cleary, J. G. (1987). Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540.

Ziv, J. e Lempel, A. (1978). Compression of individual sequences via variable-rate coding. Information Theory, IEEE Transactions on, 24(5):530–536.

Zukowski, M., Heman, S., Nes, N., e Boncz, P. (2006). Super-scalar ram-cpu cache compression. In Proceedings of the 22Nd International Conference on Data Engineering, ICDE ’06, pages 59–, Washington, DC, USA. IEEE Computer Society.