Open-Source Articles


Compression Programs for Chess Programmers and Tournament Directors

by Ron Murawski © 2006

Introduction

Chess is a hobby which invariably leads to mountains of data. Larger-sized PGN collections tend to be in multi-megabyte sizes. The most popular way to squeeze down the sheer size of the data is to compress these files using the Zip format. Zip is quick, convenient, and seems to do a great job. But... What if I told you that zip files are not very good at compressing data?

If you actually test it, you will find that the Zip method of compression rates near the bottom of the list of the many archiving programs that are available! It turns out that Zip is optimized for execution speed only. Zip's compression is woeful by comparison to better compression programs. So, what are these better archiving programs? They have names like 777, 7-ZIP, ACB, ARI, BOA, BZIP, COMPRESSIA, DURILCA, ENTROPY, IMP, SBC, and ZZIP. Never heard of them?

Most of these programs are somewhat experimental in nature and are available at the renowned Elf Stuba ftp site. This site allows a limited amount of simultaneous anonymous users so you may have to try repeatedly before you finally connect. Many of the same programs plus a few more can be downloaded from SAC. Some of the programs are available in source code form -- these are mostly GNU "copyleft" programs. Some are commercial efforts and are offered as shareware. Hint: When running an unfamiliar commandline compression program always type "<program_name> -help" so you can find out how to use it.

Believe it or not, but compression programs have become a hotbed of refinement using older ideas in new ways. Almost all create "solid" archives and call older archives "hollow". The truly curious can visit DataCompression.info. There you can read doctoral theses on the subject or download source code and libraries. It's an almost endless listing of people, conferences, and information devoted to compression.

Most lossless compression programs use run-length encoding (RLE) and/or Huffman coding, which is a specialized form of arithmetic coding, in some part of their algorithms. After that the programs diverge, using either some form of LZ compression (Limpel-Ziv), Burrows-Wheeler Transform (BWT, also called block-sorting compression), or PPM (Prediction by Partial Matching). There are many other compression methods but the others are usually specialized and not good for general use. As a rule of thumb LZ seems to do well compressing exe files, BWT excels at large files with repetitions, and PPM does best compressing predictable data.

I've done some experiments that are aimed at finding the "best" compressor for chess authors and tournament directors. This is my set of rules of what constitutes a "best" compression program:

  1. It must compress reasonably well
  2. It must compress/decompress in a reasonably fast time
  3. It must be convenient for an average user to decompress the file

Rule (1) is self-evident. I'm assuming that you have a website with files that are available for public download. You will want the compression to be very good. Many web hosts limit the total number of bytes transferred per day or per month. Going over the limit will either shut down your site temporarily or it can mean a larger bill to pay at the end of the month. Other hosts charge extra if your files get over a certain-sized quota -- exceed the quota and your monthly payment will be larger. The smaller the size of your files the better off you are.

A good example of a program that violates rule (2) is the PAQ program. You can brew a pot of coffee and bake some muffins while waiting for it to compress a single large file, but the resulting file may very well be the smallest compressed archive you have ever seen. Rule (2) is a yes/no item for me -- it's either "fast-enough" or it's not.

Rule (3) is the controversial one. "Convenient for the average user" -- hmmmmmmm. What that means to me is that there must be an easy way for a user to decompress the file. It must have either an easy-to-use GUI program or, the compressor must be able to create a self-extracting exe file that, when run, decompresses itself automatically. I also don't like shareware programs -- every time I'm forced to use WinRar I feel guilty for using it beyond its 40-day evaluation period.

Running various benchmarks and following the above rules left me with just three programs to choose from. All the others failed at least one of the rules. For informational purposes I will include some other popular programs and some others that just seem to compress chess files really well.


The PGN Test

I decided to compress a huge pgn file. The original size was 219,033 KB, or about 214 megabytes. PGN files are highly repetitive and predictable -- compression programs do especially well with them. The maximum compression option was used throughout all tests except for SBC where maximimum compression took much too long.

ProgramMethodTime (mm:ss)Size (KB)% Orig SizeComment
7-ZipPPMD2:2536,98916.89%very good
7-ZipLZMA11:0643,76519.98%too slow!
7-ZipBWT6:3242,61419.46%slow
BZip2BWT1:4842,62019.46% 
Compressia???2:1639,71718.13% 
Durilca Light-t13:1735,93116.40%best!
SBCBWT2:2537,89217.30% 
WinRarLZ772:3737,97217.34% 
ZipLZ0:4065,17729.76%worst!
ZZipBWT2:2440,94718.69% 

First of all, please note that for the 7-Zip compression method I have used the true method name of BWT rather than the "BZip2 method" the program calls it. Also notice that the Durilca Light file is about *half* the size of the Zip archive: I warned you that file compression with Zip wasn't very good! 7-Zip PPMD does well here.


The WinBoard Engine Test

In the past I have always used Zip files for Horizon downloads. For this test I compressed the files from Horizon 4.3. Uncompressed there are 9 files that total 364 KB in size. The Horizon exe file is by far the largest being about 75% of the total size.

ProgramMethodTime (mm:ss)Size (KB)% Orig SizeComment
7-ZipPPMD0:0116846.15% 
7-ZipLZMA0:0215743.13%good
7-ZipBWT0:0218149.73% 
BZip2BWT0:0118550.82% 
Compressia???0:0116645.60% 
Durilca Light-t10:0414339.29%best!
SBCBWT0:0116144.23%good
WinRarLZ770:0116745.88% 
ZipLZ0:0118651.10%worst!
ZZipBWT0:0117848.90% 

Notice that the Durilca Light file is about 3/4 the size of the Zip archive. Zip compression is again the worst but the slight size savings are perhaps not worth switching away from it.


The Crafty Book Test

I have a Crafty book, book.bin, that is 72,702 KB in size. Opening books are not as predictable nor as repetitive as PGN files.

ProgramMethodTime (mm:ss)Size (KB)% Orig SizeComment
7-ZipPPMD1:5431,42543.22% 
7-ZipLZMA2:0725,06734.48%very good
7-ZipBWT2:3631,27243.01% 
BZip2BWT0:3631,61243.48% 
Compressia???0:5225,48935.06%very good
Durilca Light-t13:1232,12944.19%worst!
SBCBWT1:4524,84234.17%best!
WinRarLZ770:5328,14438.71% 
ZipLZ0:4829,28240.28% 
ZZipBWT0:5030,63742.14% 

Finally! A test where Zip compression does not finish last... The best compressor's size is 85% of Zip's size. Durilca falters badly and, once again, one of the 7-Zip methods finishes as near-best.


Comparing Compressors

First of all some general observations and comments:

  • 7-Zip: never use the BZIP2 option (using BWT method) to compress a file -- it gives comparible-size archives as the genuine BZip2 program but runs about 4 times slower
  • Durilca Light results are very inconsistent. It's either the best or the worst.
  • ZZip seems to be an improved version of BZip2, squeezing out more using block-sorting (BWT) than BZip2 can manage, but taking longer.
  • SBC is even better than ZZip. It would be a program of choice if it had a GUI interface or could create self-extracting archives.
  • Compressia, with a Windows GUI and an option to create a self-extracting archive, would have been a program of choice if it was still available. The author has removed it from his website and it is no longer supported.

One way to view the results is to present the list of programs along with the average compression and totalling the amount of KB saved on the 3 tests:

ProgramAvg reductionKB saved vs Zip
7-Zip31.50%32,432
SBC31.90%31,750
Compressia32.93%29,283
Durilca Light33.29%26,442
WinRar33.98%28,362
Zzip36.58%22,883
BZip237.92%20,228
Zip40.38%0

The last column is interesting -- Yes, it's true: storing 7-Zip files instead of Zip files would save about 32 MB of space! Now: Which are the programs that can pass the three rules? Certainly not Zip because the compressed files are much too large compared to the others. Not Compressia because the program is no longer available. Not BZip2 nor Durilca Light nor SBC because they all require commandline arguments that might confuse a non-technical person. Zzip also requires commandline args but it can create a self-extracting exe file. In the final analysis, we are left with three candidate programs as "best". 7-Zip, WinRar, and ZZip. What are the pros and cons of using these three?


ZZip

ZZip is a good block-sorting compression program; it always seems to outperform BZip2. The biggest drawback to using ZZip is that you need to deal with a commandline interface. But if you are reading this, you are either a programmer or a tournament director and this should not dissuade you. This is an open-source project with a nice source code library that can be included in C programs. Two programs need to be run to create self-extracting files: Zzip.exe creates the archive, then Zzip-sfx.exe prepends the extractor. Keep in mind that any self-extractor contains not only the compressed archive, but the code for the decompressor as well. For ZZip this is about 27 KB of overhead which may be (or may not be) fairly small in comparison to the total archive size.


WinRar

I've said it before and I'll say it again -- I feel guilty when I use WinRar because I didn't pay for it. It's nagware so the first splash screen always reminds me that I *really* should purchase the program. If you can get past the guilt you'll find a very professional GUI interface that's as easy to use as any. It has a good compression engine as well: it's not the best, but it's in the respectable range. But, in all my tests, both 7-Zip and SBC out-performed it. If you decide to put Rar files on your website you are forcing others to use WinRar and perhaps bothering their consciences as well. (There is a free commandline UnRAR available, but I disqualify it from consideration because of the difficulty it would cause the average user.) The only ethical solution I see is to buy the program and use it to create self-extracting files. (The self-extracting option adds about 46K to the size of the archive.) Then your users can feel relief.


7-Zip

7-Zip has a nice, easy-to-use GUI interface. It also adds a 7-Zip sub-menu to the normal Windows menu that pops up when right-clicking on a file. Using the right-click or the GUI makes it easy to create or extract an archive. The only downside is that 7z files are rarely encountered and posting a link to the 7-Zip site is mandatory so people who download 7z files can decompress them. 7-Zip can create a self-extracting exe file which may be preferred by users to the 7z format, but it will cost an extra 129K of overhead. 7-Zip is an open-source project. When creating an archive, the encoding method must be chosen. Each method excels at compressing certain types of files. A few experiments will quickly reveal the best compression method. There is a source code library, but it's only for the LZMA method of compression.


Conclusion

7-Zip is the winner of my compression contest. Among the three finalists it produced the smallest files when all three tests were combined. 7-Zip's compression is so good and the Windows program GUI interfaces so clearly designed and easy-to-use that it is my compelling choice as the best chess-related compression program. Did I mention that it can also encode/decode plain old Zip files, BZip2 files, and WinRAR files as well?


Disclaimer

I don't claim to be an expert in compression, so if you find an error on this page, please email me and I will correct it. If you have a favorite compressor that you feel is superior to those mentioned, please email me and tell me about it; if it is comparable to the top three programs, I will add it to this page.

If you find an error, a bad link, or just want to comment, please email Ron Murawski