IF Archive download-the-whole-thing link

Occasionally someone asks us how to download the entire IF Archive at once.

Currently the only supported answer is to use a web-scraping tool based on the public index pages or Master-Index.xml. I know from our bandwidth usage that people sometimes do this.
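For example, a scrape of a single directory might look something like this (the wget flags and the rate limit are illustrative, not an official recommendation):

wget -r -np -N --wait=1 https://ifarchive.org/if-archive/games/zcode/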

As an experiment, I have created a single-file download source. For information about how to get it, read this page.

I plan to update that download weekly.

(The experiment is: (a) does this reduce scraping of the main Archive server? And (b) how much does it wind up costing us in AWS hosting fees? I’ll report back in a couple of months.)


I appreciate what you did, truly, but I’m not about to download 30GB worth in one sitting. In fact, I keep thinking that anything over 500MB is rather stiff. I’ve downloaded 6GB zip files before and the experience was always unpleasant, regardless of my available bandwidth.

Any chance of splitting the zip file by its top-most directories? There are directories that I want to download wholesale, but not all of them!

Actually, depending on the directory, it may even make sense to split the subdirectories further, as well as providing a “new file additions” download so people can just grab that monthly and stay up to date.

Just a thought.

It’s safe to say that if you’ve never tried to figure out how to download every Archive file with a script, you’re not the target audience for this announcement.

That wouldn’t change much, as /games would be on the high side of 25 gigabytes by itself.

I don’t want to get into trying to guess what subsets people want. That really is a job for a scraping script. “Recent files” is possibly a workable idea, though – thanks.

You mean something like rsync?
I’ve used that before with a 15-minute timeout, although I mostly use it on a local intranet.

Edit: Hmmm. Maybe I can filter the master list and use wget from there. We’ll see.
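A rough sketch of that idea, assuming the master index lives at /indexes/Master-Index.xml and that file paths appear in it under if-archive/ (both assumptions; the real location and schema may differ):

curl -s https://ifarchive.org/indexes/Master-Index.xml |
  grep -o 'if-archive/[^"<]*' | sort -u |
  while read -r path; do wget -N "https://ifarchive.org/$path"; done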

Maybe you could also add a link to the HTTrack website copier to make it easier for people who just want to download a subdirectory?

It can also update an existing mirrored site, and resume interrupted downloads, so that might save you some bandwidth in the long run.
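For example, something along these lines might do it (the option names are from memory and worth checking against HTTrack’s own help output; the mirror directory name is just an example):

httrack https://ifarchive.org/if-archive/games/zcode/ -O ./zcode-mirror
httrack --update -O ./zcode-mirror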


Followup note: This link was costing us more in AWS bandwidth fees than it seemed like it was worth. I’ve disabled it.

If you want a snapshot as of February, see here: https://archive.org/details/ifarchive-archive

Rsync seems like the next sensible thing to try.


Have you considered seeding a torrent? More accessible than rsync and clients usually let users download what they want without trouble.

People keep suggesting bittorrent, but I think it wouldn’t fit the use case at all. The dump file changes more often than people want to grab it.

Great! Downloaded.

Rsync may be overkill if done regularly by a lot of people. Maybe those who download the whole site can be persuaded to open up their sites as mirrors?

Maybe monthly patches (diffs) would be desirable? I don’t know how non-Unix users are going to cope, though. diff and patch may not be standard commands on Windows or iOS.

Edit: Added mirroring site possibility.

There is a torrent for the archive.org file, but I’m not sure if anyone is seeding it. Maybe archive.org does it itself?

As a Windows user, I searched for an alternative to rsync and the closest free tool I could find was FreeFileSync.

For downloading websites, HTTrack (which I linked above) is likely to be easier to use.

I’ve added a page describing rsync access:

Downloading the entire IF Archive

macOS and Linux come with rsync installed. Windows can install it via WSL or the Chocolatey package manager.

The nice thing about rsync is that it’s incremental. For example, this command will download all the files in games/zcode:

rsync -a rsync://rsync.ifarchive.org/if-archive/games/zcode destdir

It takes about 70 seconds the first time you run it. After that, destdir has the files, so if you re-run the command, it only downloads changed or new files. If nothing has changed, the command determines this and exits immediately.
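One related trick: if you give rsync a source but no destination, it simply lists the directory instead of copying it, which is a cheap way to see what’s available before downloading anything. For example:

rsync rsync://rsync.ifarchive.org/if-archive/

prints the top-level directories of the archive module.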


Tried it successfully.

It took me 5 minutes to download the zcode directory on my end, so I set the timeout to 60 seconds so as not to burden the bandwidth.

The magazines directory took 3 tries.

I use the command rsync -avv to tell me which files are already present and not updated.
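For example, adding --dry-run previews what a run would transfer without actually downloading anything:

rsync -avv --dry-run rsync://rsync.ifarchive.org/if-archive/games/zcode destdir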

Thank you for providing this service!