Tuesday, June 28, 2011

Node serving zipfiles

One problem I've been looking at recently is how to serve - efficiently - directories containing large numbers of tiny files to HTTP clients. At the moment, we just create the files, put them on a filesystem, and let Apache go figure it out.

The data is grouped, though, so that each directory is an easily identifiable and self-contained chunk of data. And, if a client accesses one file, chances are that they're going to access many of the other neighbouring files as well.

We're a tiny tiny fraction of the way into the project, and we're already up to 250 million files. Anyone who's suffered with traditional backup knows that you can't realistically back this data up, and we don't even try.

What I do, though, is generate a zip file of each directory. One file instead of 1000 means about three orders of magnitude fewer entries in your backup catalog (with this sort of data, generating a backup index or catalog can be a killer), and you can stream large files instead of doing random I/O to tiny little files. We save the zip archives, not the original files.

So then, I've been thinking, why not serve content out of the zip files directly? We cut the number of files on the filesystem dramatically, improve performance, and make it much more manageable. And the access pattern is in our favour as well - once a client hits a file, they'll probably access many more files in the same zip archive, so we could prefetch the whole archive for even more efficiency.

A quick search turned up this article. It's not exactly what I wanted to do, but it's pretty close. So I was able to put together a very simple node script using the express framework that serves content out of a zip file.

// requires express and zipfile
// npm install express
// npm install zipfile
//

// zip entries that are directories have names ending in a slash
function isDir(pathname) {
    if (!pathname) {
        return false;
    } else {
        return pathname.slice(-1) === '/';
    }
}

var zipfile = require('zipfile').ZipFile;
var myzip = new zipfile('/home/peter/test.zip');

// hash of filenames and content
var ziptree = {};
for (var i = 0; i < myzip.names.length; i++) {
    if (!isDir(myzip.names[i])) {
        ziptree[myzip.names[i]] = myzip.readFileSync(myzip.names[i]);
    }
}

var app = require('express').createServer();

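app.get('/zip/*', function(req, res) {
    var name = req.params[0];
    // hasOwnProperty stops inherited keys such as "constructor" from matching
    if (Object.prototype.hasOwnProperty.call(ziptree, name)) {
        res.send(ziptree[name], {'Content-Type': 'text/plain'});
    } else {
        res.send(404);
    }
});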

app.listen(3000);
console.log("zip server listening on port %d", app.address().port);

So, a quick explanation:

  • I open up a zipfile and, for each entry in it that isn't a directory, shove it into a hash with the name as the key and the data as the value.

  • I use express to route any request under /zip/. The filename is everything after the /zip/, so I just grab that path from the hash and return the data.


See how easy it is to generate something pretty sophisticated using Node? And even I can look at the above and understand what it's doing.

Now, the above isn't terribly complete.

For one thing, I ought to work out the correct content type for each request. This isn't hard but adds quite a lot of lines of code.
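
As a rough idea of what that might look like, here is a minimal sketch that assumes a small hand-written table keyed on file extension is good enough. The contentTypes table and the contentTypeFor helper are names made up for this example; they are not part of express or zipfile.

```javascript
var path = require('path');

// Hypothetical table: extend it with whatever the archives actually contain.
var contentTypes = {
    '.html': 'text/html',
    '.txt': 'text/plain',
    '.xml': 'application/xml',
    '.json': 'application/json',
    '.png': 'image/png',
    '.jpg': 'image/jpeg'
};

// Return the content type for a filename, falling back to a generic
// binary type when the extension is missing or unknown.
function contentTypeFor(filename) {
    var ext = path.extname(filename).toLowerCase();
    if (Object.prototype.hasOwnProperty.call(contentTypes, ext)) {
        return contentTypes[ext];
    }
    return 'application/octet-stream';
}
```

In the route handler, contentTypeFor(req.params[0]) would then take the place of the hard-coded 'text/plain'.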

The other thing that I want to do is to have the application handle multiple zip files. So you get express to split up the request into a zipfile name and the name of a file within the zip archive. And then keep a cache of recently used zipfiles.
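
A sketch of how those two pieces could fit together, under some assumptions of mine: the first path segment after /zip/ names the archive, and the cache throws away the least recently used archive once it holds maxEntries of them (maxEntries should be at least 1). splitRequest, ZipCache and the load callback are all hypothetical names for this example, not anything provided by express or zipfile.

```javascript
// Split "archive/path/inside.txt" into the archive name and the entry name.
// Returns null if either part is missing.
function splitRequest(reqPath) {
    var slash = reqPath.indexOf('/');
    if (slash <= 0 || slash === reqPath.length - 1) {
        return null;
    }
    return {
        archive: reqPath.slice(0, slash),
        entry: reqPath.slice(slash + 1)
    };
}

// Tiny least-recently-used cache. `load` is supplied by the caller and
// turns an archive name into its hash of filenames and content.
function ZipCache(maxEntries, load) {
    this.maxEntries = maxEntries;
    this.load = load;
    this.names = [];   // archive names, least recently used first
    this.trees = {};   // archive name -> loaded hash
}

ZipCache.prototype.get = function (archive) {
    var pos = this.names.indexOf(archive);
    if (pos !== -1) {
        // already cached: remove it so it can be re-added as most recent
        this.names.splice(pos, 1);
    } else {
        this.trees[archive] = this.load(archive);
        if (this.names.length >= this.maxEntries) {
            // cache is full: evict the least recently used archive
            delete this.trees[this.names.shift()];
        }
    }
    this.names.push(archive);
    return this.trees[archive];
};
```

Here load would wrap the ZipFile loop from the script above, and it should check the archive name against the set of known archives before opening anything, since that name comes straight from the URL.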

Which leaves a little work for version 2.

1 comment:

Dobes said...

I think if the main issue is backup you might find a way to use a filesystem to store the files, where the filesystem is a file on another filesystem. I think it's called a loopback device or somesuch.

Then get your "traditional" backup system to back up the whole filesystem. Assuming the "traditional" backup system compresses the files it backs up, you'll have one big file to save and it'll compress away the empty areas.

Another good option: compare the cost of all this development and IT mucking around versus buying a better backup system.