File compression is a method of reducing the size of a file without a significant loss of information, in order to save storage space and make files easier to transfer. Database compression is a method of reorganizing data to save space and improve performance. Whether it is applied to a file or a database, compression is generally meant to reduce size and speed up data transfer.
There are two main types of compression: lossy and lossless. This blog provides insight into file and database compression, with a short summary of common compression algorithms.
Lossy compression removes insignificant information and is commonly applied to video and audio files to reduce file size. The loss is usually not noticeable; however, if the file is heavily compressed, there will be a noticeable drop in quality.
This compression is a great option for image or video uploads because it preserves the required information while reducing the file size, which makes the file easier to transfer. Lossy compression does reduce the quality of the media, but the difference is usually not visible to the naked eye. The picture below is a simple example of how a PNG file is compressed to a JPEG file. While the size is reduced, there is little visible effect on the output.
Lossy compression is not ideal for files where every detail of the information is crucial (such as spreadsheets and word documents), because discarded data cannot be recovered: when the file is decompressed, the text might be garbled, leading to loss of information.
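To make the trade-off concrete, here is a minimal sketch of one lossy step, quantization, written in Python. The sample values and the step size are made up for illustration, and real audio or image codecs are far more sophisticated than this; the sketch only shows that the discarded detail cannot be brought back.

```python
def quantize(samples, step):
    """Lossy step: snap each sample to the nearest multiple of `step`."""
    return [round(s / step) for s in samples]


def dequantize(levels, step):
    """Rebuild approximate samples from the stored levels."""
    return [level * step for level in levels]


# Hypothetical sample values, e.g. pixel brightness or audio amplitude.
samples = [100, 103, 98, 101, 250, 252, 249]
step = 10

levels = quantize(samples, step)      # small, highly repetitive numbers
restored = dequantize(levels, step)   # close to the input, but not identical

print(levels)
print(restored)
```

The stored levels contain long runs of the same value, which is what makes them cheaper to store, while the restored values differ from the originals by at most half a step.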
Lossless compression reduces the file size without losing any information. The algorithm works on the principle of eliminating or handling redundancy. It can be applied to both text and image files. As there is no loss of information, decompressing the file restores it to its original state.
This is a perfect technique for text compression, where no loss of data after decompression is acceptable. The built-in ZIP support in Windows uses this method of compression.
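The round trip can be checked with a few lines of Python. This sketch uses the standard-library zlib module, which implements DEFLATE, the algorithm commonly used inside ZIP archives; the sample text is made up for illustration.

```python
import zlib

# Repetitive sample text; any bytes object would work here.
text = ("Life is great and fun but complicated. " * 50).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Lossless: the decompressed bytes are identical to the input.
print(restored == text)
print(len(text), "bytes before,", len(compressed), "bytes after")
```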
Compression for databases can also be done using lossy or lossless techniques. The type of compression to apply depends on the data stored in the database, but most database applications use lossless compression to preserve the quality of data. Database applications employ different types of compression algorithms, such as run-length encoding, prefix encoding, compression using clustering, sparse matrices, dictionary encoding, or their own proprietary compression methods.
Popular Compression Algorithms
Run-Length Encoding (RLE)
Run-Length Encoding is a lossless compression technique that scans data sequentially for runs of repeated values and encodes each run as the value and its count.
For example, take the line WWWWHHHHYYYY. With the RLE algorithm applied to this line, it will be encoded to the following: 4W4H4Y
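A minimal Python sketch of this scheme follows. It uses the count-then-value layout of the encoded output 4W4H4Y, and assumes an input of WWWWHHHHYYYY, which is what that output decodes to. It also assumes the input contains no digits, because a digit in the data would be indistinguishable from a count.

```python
import re
from itertools import groupby


def rle_encode(text):
    """Encode each run of identical characters as <count><char>."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))


def rle_decode(encoded):
    """Expand every <count><char> pair back into a run."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(char * int(count) for count, char in pairs)


encoded = rle_encode("WWWWHHHHYYYY")
print(encoded)
print(rle_decode(encoded))
```

Note that data without runs gets longer under this scheme: "ABC" becomes "1A1B1C", so RLE only pays off when values repeat consecutively.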
Prefix Encoding
Prefix encoding also removes redundant data, but it is rather difficult to implement. This technique stores a prefix shared by many values only once, so that only the differing remainder of each value needs to be stored and transferred.
In the example above, the data shares the same prefix, so it was simple to choose the prefix. The usual problem with this compression is selecting the right prefix for data that does not have much similarity. This compression algorithm is good for dates and times and for geolocation data, as these have a good prefix pattern.
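As a rough illustration, the sketch below stores the longest common prefix of a list of values once and keeps only the remainders. The timestamps are hypothetical, and real database engines use more elaborate variants, so treat this as a simplified model rather than any particular product's implementation.

```python
import os


def prefix_encode(values):
    """Split values into one shared prefix plus per-value suffixes."""
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on file paths.
    prefix = os.path.commonprefix(values)
    return prefix, [value[len(prefix):] for value in values]


def prefix_decode(prefix, suffixes):
    """Rebuild the original values from the prefix and suffixes."""
    return [prefix + suffix for suffix in suffixes]


# Hypothetical timestamp column with a long shared prefix.
timestamps = ["2024-03-15 10:01", "2024-03-15 10:07", "2024-03-15 10:42"]
prefix, suffixes = prefix_encode(timestamps)
print(prefix, suffixes)

# Dissimilar data has no useful prefix, so nothing is saved.
print(prefix_encode(["apple", "banana"]))
```

The second call shows the difficulty described above: when the values have little in common, the shared prefix is empty and the encoding gains nothing.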
Dictionary Encoding
Dictionary compression algorithms replace long strings with shorter codewords. The codewords are compiled into a dictionary and stored in the header row. As an example, let’s say a dictionary has the codeword values 1-life, 2-great, 3-and, 4-but, and the file content is “Life is great and fun but complicated”. The matching words are replaced with their indexes from the dictionary, so the file now contains 1is23fun4complicated. The same dictionary is used to decompress the file to its original form. If the file has many words matching the values in the dictionary, there will be a significant reduction in the file size.
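The same idea can be sketched in Python with the four-entry dictionary from the example. Two simplifications are assumed here: tokens stay separated by spaces so that decoding is unambiguous, and the sentence is lowercase so that it matches the dictionary entries exactly. The sketch also assumes the text contains no words that are themselves digits.

```python
# Codebook from the example: word -> short codeword.
CODEBOOK = {"life": "1", "great": "2", "and": "3", "but": "4"}


def dict_encode(text, codebook):
    """Replace every word found in the codebook with its codeword."""
    return " ".join(codebook.get(word, word) for word in text.split(" "))


def dict_decode(encoded, codebook):
    """Map codewords back to words using the reversed codebook."""
    reverse = {code: word for word, code in codebook.items()}
    return " ".join(reverse.get(token, token) for token in encoded.split(" "))


sentence = "life is great and fun but complicated"
encoded = dict_encode(sentence, CODEBOOK)
print(encoded)
print(dict_decode(encoded, CODEBOOK))
```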
Benefits of Database Compression
- Compression reduces the overall database size, which saves storage space. The compression rate is better if there is a lot of repetition in the data values, or if many tables hold little data or zero-value data.
- Read speed is faster when the file size is small; however, write operations might be a little slower because the data needs to be decompressed before the write and compressed again afterwards.
- Resource utilization is lower because more data can fit in memory or the buffer.
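The effect of repetition on the compression rate can be measured with a short Python sketch. The two "columns" below are made-up data, and zlib stands in for whatever algorithm a real database engine would use.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data))


# Hypothetical status column: the same value in every row.
repetitive = ",".join(["ACTIVE"] * 1000).encode()

# Hypothetical ID column: scrambled numbers with little repetition.
varied = ",".join(str(i * 2654435761 % 2**32) for i in range(1000)).encode()

print(f"repetitive column: {compression_ratio(repetitive):.1f}x")
print(f"varied column:     {compression_ratio(varied):.1f}x")
```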
Disadvantages of Database Compression
- Dictionary-based compression algorithms build a keyword dictionary that is stored as part of the compressed database. If the database is small, the compressed file together with the keyword dictionary may be larger than the original database.
- Compression and decompression add overhead because of the extra CPU and memory utilization.
- It is not recommended to compress numerical data and non-repetitive strings, as the result is unpredictable and the file size might even increase.
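The size increase mentioned above is easy to reproduce. The following Python sketch uses zlib as a stand-in compressor on a very small input and on random bytes; both inputs are made up for illustration. In both cases the format's fixed overhead outweighs any savings.

```python
import os
import zlib

# A very small value: header and checksum cost more than is saved.
tiny = b"id=42"

# Random bytes contain no redundancy for the compressor to remove.
random_bytes = os.urandom(1024)

for name, data in [("tiny", tiny), ("random", random_bytes)]:
    compressed = zlib.compress(data)
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes")
```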
File Compression vs Database Compression
File data compression looks at how to represent individual data items in less space to reduce the size. Database compression, on the other hand, is an aggregate compression because it compresses data, indices, views, clusters, and heaps. The compression is done on data across rows, across columns, and at the field level. BLOB data such as images, video, and audio stored in the database can be compressed using lossy compression.
In summary, compression is a technique to reduce the size of a file or database. It is very useful for reducing resource utilisation as well as storage costs. A smaller file or database is easier to transfer and faster to process. In addition, a smaller file requires less memory and CPU utilization, which is another way to cut overhead costs and increase data processing power. Large-scale applications at companies like Facebook, Google, or Oracle use high-end compression algorithms like Zstandard, Snappy, XPress, and Oracle Advanced Compression to compress large data sets.