APEWS: Doing the Math

I'm guilty. I admit it. I've called APEWS listings "random," which isn't quite right. Arbitrary would be a better word for it. Not to mention broad, and questionable.

APEWS, the "anonymous" blacklist meant to be an early warning system for spam, generates a lot of worry from administrators and end users who find themselves listed after plugging their IP address into an online lookup tool like DNSStuff. Though it doesn't result in much (if any) of anyone's mail being rejected, as it's not widely used, some people still think they're being labeled a spammer, and don't know what to do about it.

They've usually done nothing to warrant the listing; the simple fact of the matter is that they happen to have an IP address on the internet, and there's a better than one-in-three chance that this IP address will be on the APEWS blacklist.

As I've indicated previously, APEWS has IP address entries accounting for about 42% of the raw numerical depth of IPv4 address space, though that figure doesn't exclude non-routable space or overlap between some listings. When one takes those factors into consideration, APEWS seems to list somewhere around 38% of currently routable IPv4 network space.
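For anyone who wants to check that kind of arithmetic themselves, here's a rough sketch of how the "raw depth" and "after overlap" figures can differ. It uses Python's standard ipaddress module; the function name ipv4_coverage and the sample listings are made up for illustration and are not actual APEWS entries. It also doesn't subtract non-routable space, so it only covers the overlap half of the adjustment.

```python
import ipaddress

def ipv4_coverage(cidrs):
    """Fraction of the whole IPv4 space covered by the given CIDR blocks.

    Overlapping and adjacent blocks are merged first, so an address
    listed twice is only counted once.
    """
    networks = [ipaddress.ip_network(c) for c in cidrs]
    merged = ipaddress.collapse_addresses(networks)
    covered = sum(net.num_addresses for net in merged)
    return covered / 2**32

# Hypothetical listings: the /16 sits entirely inside the /8.
listings = ["10.0.0.0/8", "10.1.0.0/16", "192.0.2.0/24"]

# Raw depth simply adds up every entry, overlap and all.
raw = sum(ipaddress.ip_network(c).num_addresses for c in listings) / 2**32

print(f"raw depth:     {raw:.4%}")
print(f"after merging: {ipv4_coverage(listings):.4%}")
```

The raw figure comes out higher than the merged one because the /16 gets counted twice, which is the same effect that separates the 42% and 38% numbers above (along with the non-routable space this sketch ignores).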

Time for an experiment. What if I take a large chunk of address space, say, 42%, and blacklist it all? I've got detailed records of spam and ham, and it's easy to bump my corpus up against an imaginary blacklist I've just made up right here on the back of this napkin.

Here's what happens when I do that: Over the past ten days or so, my 42% blacklisting of IP space would've captured 62.8% of spam, but also incorrectly captured non-spam 31.5% of the time.

When I skinny my imaginary blacklist down to 38% of IPv4 space, I get a 62.21% hit rate on spam and 31.15% false positive rate against non-spam. (In other words, just about the same numbers.)
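The napkin test itself boils down to something like the sketch below. This is not the actual tooling behind the numbers above, just a minimal illustration under assumptions: the corpus is a list of (sending IP, is-it-spam) pairs, the blacklist is a list of CIDR blocks, and the name evaluate_blacklist and all the sample data are invented.

```python
import ipaddress

def evaluate_blacklist(cidrs, corpus):
    """Return (spam_hit_rate, false_positive_rate) for a blacklist.

    cidrs  -- iterable of CIDR strings making up the blacklist
    corpus -- iterable of (ip_string, is_spam) pairs
    """
    networks = list(
        ipaddress.collapse_addresses([ipaddress.ip_network(c) for c in cidrs])
    )

    def listed(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    spam_total = spam_hits = ham_total = ham_hits = 0
    for ip, is_spam in corpus:
        hit = listed(ip)
        if is_spam:
            spam_total += 1
            spam_hits += hit
        else:
            ham_total += 1
            ham_hits += hit

    # Guard against an empty spam or ham pile.
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    fp_rate = ham_hits / ham_total if ham_total else 0.0
    return spam_rate, fp_rate

# Invented sample data, only to show the shape of the inputs.
sample_blacklist = ["10.0.0.0/8"]
sample_corpus = [
    ("10.1.2.3", True),    # spam from listed space: a hit
    ("192.0.2.5", True),   # spam from unlisted space: a miss
    ("10.9.9.9", False),   # ham from listed space: a false positive
    ("8.8.8.8", False),    # ham from unlisted space: correctly passed
]

spam_rate, fp_rate = evaluate_blacklist(sample_blacklist, sample_corpus)
print(f"spam hit rate: {spam_rate:.1%}, false positive rate: {fp_rate:.1%}")
```

The linear scan over networks is fine for a napkin-sized list; a real corpus checked against a large blacklist would probably want a faster lookup structure.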

To me, this is evidence that APEWS seems to be blocking some spam based on the "stopped clock is right twice a day" principle. List a large chunk of IP address space, and you're going to catch a significant amount of spam, though inaccurately.

It further suggests to me that if I added a few rules to pick my starting points with a bit of accuracy, I could probably tune this to get a hit rate close to what I see from APEWS, with its 73% hit rate against spam, and 26% false positive rate against non-spam (21 day average ending on 9/2/2007).

The conclusion I draw from this exercise is that only the barest thought has been given to the processes by which APEWS decides which IP addresses to list and for what reason. If I can get more than halfway there with a couple hours of sloppy bar napkin math, then perhaps they haven't thought it through too deeply.