Benchmarking Compression Programs

by Szymon LipiƄski
tags: compression

My main question is: which program should I use for compression?

The answer is not very simple.

It’s rather obvious that compressing a bigger file takes longer than compressing a smaller one. But will a file twice as big take twice as long?

Are you willing to spend another hour compressing a file just to make it a little bit smaller?

What about decompression time?

Does the decompression time depend on the compression level?

Data Repository

All the files, including this notebook, can be found at https://github.com/szymonlipinski/compression_benchmark

Testing Procedure

To generate the data, I used the generate_data.py script.

Programs

I have tested five compression programs: bzip2, gzip, lzma, xz, and zip.

Compression Levels

Each of them can be run with a compression level argument, which takes values from 1 to 9.

File Sizes

As this analysis is a side effect of another project, where I analyse chess games downloaded from lichess, I used a lichess file. Each of the files contains only alphanumeric characters with chess PGN notation.

You can find the lichess files at https://database.lichess.org/.

I used files with different sizes:

Machine Spec

Commands

Each archive was decompressed through a pipe, with the output redirected to a new file. I used a pipe because sometimes I need to process the compressed files one row at a time, decompressing them while processing.
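For example, reading a compressed file line by line without fully decompressing it to disk can be done like this (a minimal Python sketch; the file name is only a placeholder):

import bz2

# Read a bz2-compressed PGN file line by line, without writing
# the decompressed data to disk. The path is only a placeholder.
with bz2.open("games.pgn.bz2", "rt") as pgn:
    for line in pgn:
        # process one row at a time, e.g. look at the game headers
        if line.startswith("[Event "):
            pass  # handle a single game here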

Redirecting the pipe to a file shouldn’t be different from asking the program to write directly to that file; at least I hope the speed is similar.
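A single measurement can be sketched in Python like this (the bzip2 invocation and the file names below are only examples, not the exact commands from the repository):

import subprocess
import time

def measure(input_file, level=9):
    # Compress a file at the given level, then decompress it
    # through a pipe redirected to a new file. Returns both times.
    archive = f"{input_file}.bz2"

    start = time.perf_counter()
    with open(archive, "wb") as out:
        subprocess.run(["bzip2", f"-{level}", "-c", input_file], stdout=out, check=True)
    compression_time = time.perf_counter() - start

    start = time.perf_counter()
    with open(f"{input_file}.out", "wb") as out:
        subprocess.run(["bzip2", "-d", "-c", archive], stdout=out, check=True)
    decompression_time = time.perf_counter() - start

    return compression_time, decompression_time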

Compression Commands

Decompression Commands

Data

The benchmark data is stored in CSV files, one file per program.

I have: 5 programs x 9 compression levels x 7 files = 315 observations

Notes

The Analysis

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import warnings
warnings.filterwarnings("ignore")
data = pd.concat([pd.read_csv(f) for f in glob("csv/*.csv")])
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 315 entries, 0 to 62
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            315 non-null    int64  
 1   program               315 non-null    object 
 2   command               315 non-null    object 
 3   compression_level     315 non-null    int64  
 4   compression_time      315 non-null    float64
 5   decompression_time    315 non-null    float64
 6   compressed_file_size  315 non-null    int64  
 7   input_file_size       315 non-null    int64  
dtypes: float64(2), int64(4), object(2)
memory usage: 22.1+ KB
data['compression_ratio'] = data.compressed_file_size / data.input_file_size

# Speed of compressing data in MB/s
data['compression_speed'] = data.input_file_size / data.compression_time / 1024 / 1024

# Speed of decompressing data in MB/s
data['decompression_speed'] = data.compressed_file_size / data.decompression_time / 1024 / 1024

data['input_file_size_mb'] = data.input_file_size / 1024 / 1024
data['input_file_size_mb'] = data['input_file_size_mb'].astype(int)

There are 9 different compression levels, and I think each level can be treated as a different algorithm. So, I have 5 programs x 9 compression levels = 45 different algorithm variants. Each of them can behave differently for each file.

First, let’s check the compressed size and the time needed. A compression ratio of 0.3 means that the compressed file is 0.3 of the size of the original file.

ax = sns.lmplot(x="compression_level", y="compression_time", hue="input_file_size", order=1, col="program", data=data, legend=False)
ax.set(xlabel='Compression Level', ylabel='Compression Time [s]')
ax.add_legend(title="File Size [B]")
plt.show()

ax = sns.lmplot(x="compression_level", y="compression_ratio", hue="input_file_size", order=1, col="program", data=data, legend=False)
ax.set(xlabel='Compression Level', ylabel='Compression Ratio')
ax.add_legend(title="File Size [B]")
plt.show()

The first thing we can notice is that the timings for lzma and xz are completely different from the other programs. Their charts look the same, mainly because they use the same algorithm.

The best compression ratio was achieved with lzma/xz at compression level 9. The only problem is the compression time, which is rather terrible compared to bzip2 (which achieves a similar ratio in much less time).

Also, notice that for the larger files the compression ratio is much better at the higher compression levels.

Let’s check the numeric values for the 1GB file for the best compression level:

lzma_info = data[
      (data['compression_level'] == 9)
    & (data['input_file_size'] == 1000000000)
]

lzma_info[['program', 'compression_time', 'decompression_time', 'compression_ratio', 'compression_level']]

As you can see, bzip2 produces the same file size (17% of the original file size) in only 11% of the time needed by lzma/xz.

However, the decompression time for bzip2 is 31s, compared to 11s for lzma.

Let’s ignore lzma/xz from now on. The time needed for compression makes these programs totally useless here, especially when we can get the same compression ratio with bzip2.

Also, let’s look at the characteristics of the 3 biggest files: 10MB, 100MB, and 1GB.

data = data[
      (data['program'].isin(['bz2', 'zip', 'gzip']))
    & (data['input_file_size'] >= 10000000)
]

Let’s take a closer look at the same charts as above:

ax = sns.lmplot(x="compression_level", y="compression_time", hue="input_file_size", order=2, col="program", data=data, legend=False)
ax.set(xlabel='Compression Level', ylabel='Compression Time [s]')
ax.add_legend(title="File Size [B]")
plt.show()

ax = sns.lmplot(x="compression_level", y="decompression_time", hue="input_file_size", order=2, col="program", data=data, legend=False)
ax.set(xlabel='Compression Level', ylabel='Decompression Time [s]')
ax.add_legend(title="File Size [B]")
plt.show()

As you can see, the compression times at compression level 9 are similar for all the programs. Let’s stick to this level only:

data = data[(data['compression_level'] == 9)]

Let’s see how fast the files are compressed and decompressed:

data = data[[
    "program", "input_file_size_mb", "compression_ratio", "compression_speed",
    "decompression_speed", "compression_time", "decompression_time"]]
data

means = data.groupby('program', as_index=False).mean()
means

Final Remarks

Zip and gzip have similar characteristics, so let’s compare just gzip with bzip2.

The bzip2 program makes files about 40% smaller. But it’s not for free: the compression time is about 30% longer. However, the biggest difference is in the decompression time, where bzip2 is 5.4 times slower.
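These numbers come from the means table above and can be reproduced like this (a small sketch assuming the means DataFrame computed earlier):

# Compare the mean results of bzip2 and gzip; `means` is the
# DataFrame computed above, so the column names match the earlier code.
bz = means[means['program'] == 'bz2'].iloc[0]
gz = means[means['program'] == 'gzip'].iloc[0]

print(f"bzip2 files are {1 - bz.compression_ratio / gz.compression_ratio:.0%} smaller")
print(f"bzip2 compression takes {bz.compression_time / gz.compression_time - 1:.0%} more time")
print(f"bzip2 decompression takes {bz.decompression_time / gz.decompression_time:.1f}x the gzip time")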

On the other hand, in the case of compressing the 1GB file and decompressing the resulting archive, the times are:

               bzip2   gzip
compressing      89s    69s
decompressing    31s     5s

I think you should answer this question yourself:

is saving an additional 11 percentage points of the archive size worth a decompression time that is 5.4 times longer?

# The total size of the lichess bz2 archives, in GB
lichess_bzip2_size = 267

bz2_compression_ratio = means[means['program']=='bz2']['compression_ratio'].values[0]
print(f"Bzip2 compression ratio {round(bz2_compression_ratio*100)}%")
gzip_compression_ratio = means[means['program']=='gzip']['compression_ratio'].values[0]
print(f"Gzip compression ratio {round(gzip_compression_ratio*100)}%")

unpacked_files_size = lichess_bzip2_size / bz2_compression_ratio
print(f"Estimated size of unpacked files {round(unpacked_files_size)}GB")

lichess_gzip_size = unpacked_files_size * gzip_compression_ratio
print(f"Estimated size of lichess files as compressed with gzip {round(lichess_gzip_size)}GB")

bzip2_decompression_speed = means[means['program']=='bz2']['decompression_speed'].values[0]
gzip_decompression_speed  = means[means['program']=='gzip']['decompression_speed'].values[0]

bzip2_decompression_time = (lichess_bzip2_size * 1024) / bzip2_decompression_speed / 3600
gzip_decompression_time  = (lichess_gzip_size * 1024)  / gzip_decompression_speed  / 3600

print(f"Estimated bzip2 decompression time: {round(bzip2_decompression_time)} hours")
print(f"Estimated gzip decompression time: {round(gzip_decompression_time)} hours")

Bzip2 compression ratio 17.0%
Gzip compression ratio 28.0%
Estimated size of unpacked files 1530.0GB
Estimated size of lichess files as compressed with gzip 434.0GB
Estimated bzip2 decompression time: 14.0 hours
Estimated gzip decompression time: 3.0 hours

Small Notes about Chess Files

This analysis is a side effect of my work on analysing lichess data. I have 267GB of bz2 files. With the compression ratio of 17%, the size of the decompressed files is about 1.5TB.

If the files were packed with gzip, the archives would take about 434GB.

Then the decompression time would be about 3 hours; in the case of the bzip2 files, it’s about 14 hours.

As I’m not going to decompress the files to disk, but analyze them as streams (decompressing them and redirecting the output to the input of a program analyzing the content), it’s possible that I will need to run the decompression more than once.

So, in this case, I’m not sure that using bzip2 is a good idea.

Or maybe it would be a good idea to repack the bzip2 files into gzip ones.
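Such a repack can be done as a stream, without writing the decompressed data to disk. A minimal Python sketch (the file names are placeholders):

import bz2
import gzip
import shutil

# Repack a bz2 archive into a gzip one in a streaming fashion,
# so the decompressed content never lands on disk.
def repack(bz2_path, gzip_path):
    with bz2.open(bz2_path, "rb") as src, gzip.open(gzip_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)

# The file names here are only placeholders.
repack("games.pgn.bz2", "games.pgn.gz")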