Say you need to protect access to an entire website that’s already built. The easy way to do that is to stick it behind Apache’s standard HTTP Basic authentication thingamajig, like this: http://httpd.apache.org/docs/2.0/howto/auth.html
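
For reference, the Apache side of that boils down to a few lines of configuration. This is a minimal sketch; the AuthName and the password-file path are made up for illustration:

AuthType Basic
AuthName "Members Only"
AuthUserFile /path/to/output.passwd
Require valid-user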

That’s all well and dandy for a few users, but what if you’ve got 10,000 users and you just get a flat file full of them on a daily basis?

I did the simplest thing possible at first: I just iterated over that flat file and ran htpasswd for each user. That went fine until a few days ago, when suddenly it didn’t anymore.
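
Concretely, the first version amounted to something like this (a rough sketch, using the same hypothetical path as the script below; user_list stands in for the parsed flat file):

import os

for user in user_list:
    # -b makes htpasswd take the password from the command line
    # instead of prompting for it. Note that each invocation
    # re-opens and re-scans the entire password file.
    os.system('htpasswd -b /path/to/output.passwd %s %s' % (user[0], user[1]))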

Running htpasswd repeatedly means that for each user, htpasswd has to open the password file and scan through it looking for that user, updating the entry in place if it finds one and appending to the end otherwise. My desktop PC can do that for 10,000 users in 5 minutes or so, but the web host we’re using suddenly hit a huge I/O bottleneck, and the job wasn’t finishing within a day, which is when it’s supposed to run again. Or at least the hosting service killed it before then; I’m not sure whether it ever finished. Anyway, adding a new user to a 10,000-user file was taking somewhere around 30 seconds per user.

Two things let me fix this easily. First, I know ahead of time that I’ll just be recreating the password file from scratch each time; since I get a fresh flat list of users every day, recreating means I don’t have to worry about tracking people who’ve moved out. Second, Apache HTTPD lets you use a bog-standard SHA-1 hash in the password file (base64-encoded and prefixed with {SHA}), so I can skip all that tedious scanning of the file and just write it out in one go.

In Python 2.3 (that’s what the host runs), that looks something like this:

#!/usr/bin/env python
import sha
import base64

# I'm skipping the reading in and parsing of the flat file;
# assume this is a list of (id, pin) tuples
user_list = [...]

# Open the file we're writing to
output_file = file('/path/to/output.passwd', 'w')

for user in user_list:
    # base64.encodestring appends a trailing newline, which
    # conveniently terminates each line of the password file
    hashed_password = base64.encodestring(sha.new(user[1]).digest())
    output_file.write("%s:{SHA}%s" % (user[0], hashed_password))

output_file.close()

And now the whole process finishes in about 0.7 seconds on a heavily loaded shared web host.
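
If you want to sanity-check the output format, htpasswd itself can print a matching line without touching any file: htpasswd -nbs someuser somepin writes a someuser:{SHA}... entry to stdout, which should be identical to what the script produces for the same id and pin.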