Posted by Deliverator on September 25th, 2005

I have found Wikipedia to be a wonderful quick reference and jumping-off point for more serious research. Being, as Ryan so aptly puts it, “crazy,” I set out to figure out how I could carry this veritable Encyclopedia Galactica with me at all times.

An older, offline version of Wikipedia is available in several versions of the TomeRaider ebook format, but there is as yet no reader for this format for older PPC/HPC devices like my Jornada. While TomeRaider’s formatting abilities have improved in more recent versions, it still mangles complex tables and other layout-dependent data. I looked into alternatives and found a script that goes through a downloaded copy of the Wikipedia database and extracts all the articles as flat HTML files. Another alternative may be to use the Wikifilter Apache filter to generate the HTML on the fly. This would require me to run Apache on my Jornada (which is doable), but I would probably have to do the indexing elsewhere. I also tried a PHP-based extractor, but ran into too many issues.

For now, I have stuck with the most recent downloadable output from the tero-dump script and, after working through some curiosities relating to NTFS reserved words and the characters FAT allows in file names, I now have a copy of Wikipedia on a 2.2 GB CF Microdrive for use on my Jornada! This version does not include any images, but manages to maintain most of the same formatting as the online version. The script produced 208,000+ HTML files taking up some 1.25 GB (more with FS overhead). That is a whole lot of text :0
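For the curious, the file name trouble comes from article titles that are perfectly fine as titles but not as Windows file names: a page called "Con" or "Nul" collides with a reserved device name, and titles containing characters like ? or / cannot be stored on FAT or NTFS at all. Below is a minimal Python sketch of the kind of cleanup involved. It is not what the tero-dump script actually does; the function name safe_filename and the percent-escaping scheme are my own assumptions for illustration.

```python
import re

# Device names Windows reserves regardless of extension (e.g. "con.html").
RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{prefix}{n}" for prefix in ("COM", "LPT") for n in range(1, 10)
}

# Characters not allowed in Windows file names, plus control characters.
ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title: str, ext: str = ".html") -> str:
    """Map an article title to a name FAT and NTFS should both accept."""
    # Escape each illegal character as %XX so distinct titles stay distinct.
    stem = ILLEGAL.sub(lambda m: "%{:02X}".format(ord(m.group())), title)
    # Windows silently drops trailing dots and spaces, so strip them up front.
    stem = stem.rstrip(". ")
    # A name is reserved based on the part before its first dot.
    if not stem or stem.split(".")[0].strip().upper() in RESERVED:
        stem = "_" + stem
    return stem + ext


if __name__ == "__main__":
    for title in ("Wikipedia", "Con", "AC/DC", "What?"):
        print(title, "->", safe_filename(title))
```

This sketch ignores two things a real run over 208,000+ articles would probably have to deal with: titles that differ only in case (both file systems treat those as the same name) and titles that already contain a literal % sequence.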

