The Deliverator – Wannabee

So open minded, my thoughts fell out…

The Case for Archiving

Posted by Deliverator on April 25th, 2006

Today’s hard drives are huge. 500 GB drives have been on the market for quite a while, and 750 GB drives that utilize perpendicular recording techniques are on the horizon. Even so, today’s media junkies still find ways to fill these huge drives. Most people’s approach to data bloat has been to simply add another drive, and with the $/GB ratio as favorable to the consumer as it is, there are few obvious reasons not to.

Earlier in computing’s history, the picture was very different. My first real computer had an 80 MB hard drive, which was absolutely huge for the time. Even with that vast 80 MB, I still found ways to fill it and was forced to run many programs (ok, games) from floppy disk. I got to know my filesystem really well in those days and pruned unnecessary files like an overzealous gardener. Adding a second hard drive was not an option: the cost of storage was huge, and my computer couldn’t have accommodated another hard drive even if I could have afforded one. When CD-ROM drives finally came on the market at a reasonable price point, I had to hook mine into my system through the port on my Sound Blaster card, for lack of a port on the motherboard.

It was in this era of storage frugality that disk and file compression utilities started being developed in earnest. Utilities like Disk Doubler offered to magically increase your storage capacity, albeit at a rather heavy performance cost. My system, a 386/16 with 4 MB of RAM, couldn’t really afford to take that hit, so instead of compressing the filesystem, I archived less frequently used files with pkzip. Many new archive/compression formats have been introduced over the years, but the zip format has remained one of the most commonly used, especially for distributing files online, where modems still predominate.

With huge hard drives readily available, what then is the case for archiving files today?

  1. Archiving files leads to reduced filesystem complexity and better performance. Most file systems are still not very good at managing large numbers of files. The more files you have, the slower your computer will be at accessing them; in large part this is because file allocation tables and similar structures are lists that must be scanned to locate the physical position of a file on the disk, and the more entries on that list, the longer it takes to find any particular one. This is a simplification to be sure, but let’s just say that having fewer files on your hard drive is good for system performance.
  2. Reduced slack space. File systems allocate space on a hard drive in discrete chunks; on most systems, your computer allocates 4 kilobytes at a time. A 3 KB text file therefore actually takes 4 KB to store on disk, and a 6 KB file takes two 4 KB chunks. Having lots of small files on your hard drive can eat your available capacity quickly. A single big file wastes at most 4 KB of “slack” space, while 5000 small files will average around 2 KB of slack apiece. If you can replace those 5000 files with one big archive, you have saved roughly 10 MB in slack space alone, above and beyond the space gained by file compression (see the sketch after this list).
  3. Although zip is still widely used as an archive format, many better alternatives exist. For the past few years I have been using 7zip, a great open source program that handles almost every archive format known to man and also comes with its own compression algorithm, LZMA. With the right settings, LZMA is often 20% more efficient than zip; I managed to free up somewhere between 15 and 20 GB simply by recompressing old zip and rar archives into 7zip format (a sketch of that kind of recompression pass follows this list). The big downside to the 7zip format is that it is very resource hungry. LZMA needs a lot of memory during compression: on the maximum compression setting (a 48 MB dictionary), 7zip uses over 500 MB of RAM. Decompression only needs a little more RAM than the dictionary size used to create the archive, so it takes about 500 MB of RAM to compress an archive and roughly 50 MB to decompress it. 7zip might not be the best choice for lower performance systems, but I haven’t found anything else that comes close to its compression ratios.
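
To make the slack arithmetic in point 2 concrete, here is a rough Python sketch. The 4 KB cluster size and the "old_stuff" directory name are assumptions for illustration; it just walks a directory tree and adds up the space lost to rounding every file up to whole clusters.

    import os

    CLUSTER = 4096  # assumed 4 KB allocation unit (a typical NTFS default)

    def slack(size_bytes):
        """Bytes wasted by rounding one file up to whole clusters."""
        if size_bytes == 0:
            return 0
        return (CLUSTER - size_bytes % CLUSTER) % CLUSTER

    def total_slack(path):
        """Sum the slack of every file under the given directory."""
        wasted = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                wasted += slack(os.path.getsize(os.path.join(root, name)))
        return wasted

    # 5000 small files at roughly 2 KB of slack each is about 10 MB of waste.
    print("Slack in my archive folder: %.1f MB" % (total_slack("old_stuff") / 1048576.0))

Replacing a folder like that with a single archive recovers all of that slack before compression even enters the picture.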
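And here is a minimal sketch of the kind of recompression pass described in point 3, assuming the 7z command-line tool is on your PATH. The *.zip pattern and the -mx=9 (maximum compression) setting are just illustrative defaults, and the new .7z is kept only if it actually turns out smaller than the original.

    import glob
    import os
    import shutil
    import subprocess
    import tempfile

    def recompress(zip_path):
        """Unpack a .zip and repack its contents as a .7z using LZMA."""
        seven_path = os.path.abspath(zip_path[:-4] + ".7z")
        workdir = tempfile.mkdtemp()
        try:
            # Extract the old archive, then repack the extracted tree at max compression.
            subprocess.check_call(["7z", "x", "-o" + workdir, zip_path])
            subprocess.check_call(["7z", "a", "-mx=9", seven_path, "."], cwd=workdir)
            old, new = os.path.getsize(zip_path), os.path.getsize(seven_path)
            if new < old:
                os.remove(zip_path)
                print("%s: saved %d KB" % (zip_path, (old - new) // 1024))
            else:
                os.remove(seven_path)  # no gain, keep the original zip
        finally:
            shutil.rmtree(workdir)

    for archive in glob.glob("*.zip"):
        recompress(archive)

Nothing fancy, but it shows the basic unpack-and-repack loop.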

I spent the last few days doing spring cleaning on my system. I hadn’t done a thorough expunging of garbage in years. It is really remarkable how much space I saved and how much snappier my system feels now.