Monday, November 19, 2007

False Positives

One of the things I measure over at the Blacklist Statistics Center is false positives. What are false positives? How do I use the term, exactly?

Ultimately, there are three different ways to define false positives, depending on whom you ask and who they are. Allow me to explain.

Here's what I think of as a false positive(1) in the context of DNSBLs: You did not receive a mail message you signed up for, and wanted to receive, because it was blocked by your use (or your ISP's use) of that DNSBL.

This is what I consider a false positive. If you signed up to receive news alerts and wanted to receive those alerts, but you couldn't receive them because your spam filter blocked that email, that's what I would call a false positive. This is very end-user focused, or recipient system focused.

That can be quite a bit different than what a blacklist calls a false positive(2). The example above might not be a false positive as far as the blacklist operator is concerned. Maybe somebody sent mail to their spam traps from that IP address. Or maybe the blacklist's policies are such that they choose to list an entire net block because of spam issues elsewhere in that net block.

I used to be a blacklist operator myself. Back then, what I considered a false positive(2) was a blacklisting that shouldn't have taken place, by my own reckoning. I primarily dealt with open relaying mail servers. If I had accidentally listed an IP address, even though it wasn't an open relay, that probably would be considered a false positive.

But, to the person whose mail is getting blocked as a result, that could constitute a whole other kind of false positive(3). Getting deep into that kind of false positive is a bit beyond the scope of what I'm doing here. Anybody whose mail has been blocked for any reason can feel it's unwarranted. Sometimes I would agree, sometimes I would not. But, that's not a debate for right here and right now.

Instead, I focus on the definition of false positives I think is most applicable to end users of DNSBLs: Mail you (or I) wanted to receive, but didn't receive, because that mail was blocked by that DNSBL.

That's the only kind of false positive I'm measuring and reporting on.
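As a rough illustration of that definition, here is a minimal Python sketch. It is not necessarily how the Blacklist Statistics Center computes its numbers; the `Message` type, the `wanted` flag, and the `false_positive_rate` function are hypothetical names made up for this example. It simply counts the share of wanted messages whose sending IP address appears on a given list.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender_ip: str  # IP address of the server that delivered the message
    wanted: bool    # True for ham (mail I signed up for), False for spam


def false_positive_rate(messages, listed_ips):
    """Share of wanted messages that the blacklist would have blocked.

    Assumption for this sketch: a message counts as blocked when its
    sender_ip is in listed_ips.
    """
    ham = [m for m in messages if m.wanted]
    if not ham:
        return 0.0
    blocked = sum(1 for m in ham if m.sender_ip in listed_ips)
    return blocked / len(ham)


if __name__ == "__main__":
    # Documentation-range addresses, used purely as sample data.
    sample = [
        Message("192.0.2.1", wanted=True),
        Message("192.0.2.2", wanted=True),
        Message("192.0.2.3", wanted=True),
        Message("192.0.2.4", wanted=True),
        Message("192.0.2.9", wanted=False),
    ]
    print(false_positive_rate(sample, {"192.0.2.2", "192.0.2.9"}))
```

Note that spam from a listed address does not enter the calculation at all; only wanted mail can produce a false positive under this definition.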

Saturday, November 10, 2007

The Union of UCEPROTECT

The folks behind UCEPROTECT asked me what it would look like if I were using all three UCEPROTECT blacklist zones together. I thought it was a neat idea and decided to share the results publicly. Click here to take a look.
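For readers wondering what "using all three zones together" could mean in practice, here is a minimal Python sketch of a union check: an address counts as listed if any one of the zones lists it. The zone names below are the three UCEPROTECT levels as commonly published, not something stated in this post, and the function names are hypothetical. Treating any successful DNS answer as "listed" and any failure (including a timeout) as "not listed" is a simplifying assumption.

```python
import socket

# Assumed zone names for UCEPROTECT levels 1 to 3.
ZONES = [
    "dnsbl-1.uceprotect.net",
    "dnsbl-2.uceprotect.net",
    "dnsbl-3.uceprotect.net",
]


def dns_listed(ip, zone):
    """Return True if the IPv4 address has an A record in the DNSBL zone."""
    # DNSBL queries reverse the octets: 192.0.2.1 -> 1.2.0.192.<zone>
    name = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        # No such name, or the lookup failed; treated as not listed here.
        return False


def listed_in_union(ip, zones=ZONES, is_listed=dns_listed):
    """True if at least one of the zones lists the address."""
    return any(is_listed(ip, zone) for zone in zones)
```

The `is_listed` parameter exists so the union logic can be exercised without touching the network, for example by passing a function backed by a local set of (address, zone) pairs.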

Spam & Ham: Overview & FAQ

A lot of people have asked how the spam and ham (non-spam) data is compiled for the Blacklist Statistics Center here at DNSBL Resource. Where does it come from? What senders does it represent? Here's an updated overview of what goes into the spam and ham feeds.

On the spam side of things, the input comes from a series of spamtrap domains and email addresses.

  • When I first set this project up, I took a bunch of old, dead email addresses and domains that I have had for years but haven't been using lately. I turned them back on, reviewed long snapshots of incoming data, and weeded out a lot of “edge case” stuff – things that I probably did actually sign up for (like virus notifications, updates from my domain registrar, etc.). Anything that didn't look like something I might have signed up for was assumed to be spam.

  • I also have some filtering in place to try to keep out backscatter. Backscatter (or outscatter) usually consists of misdirected bounces received in response to somebody else's spam run, bounced back by a mail server that should know better. This is clearly a problem, but there is vast disagreement on the anti-spam front as to whether or not backscatter equals spam. Since few agree, and I want to focus on spam, I ignore this as much as possible. A little leaks through here and there, but I don't think it's enough to skew any stats.

  • I recently registered some new domains that I and others knew were already on spam lists. Anybody sending to these new domains clearly is doing a bad thing – sending to very old addresses, ignoring bounces, forging header information, etc. These also feed into the spam results.

  • From all of these sources, I get an average of over twelve thousand spam messages a day.

On the ham (non-spam) side of things, here's what I've done:

  • First, I signed up for a bunch of email lists. Stuff that I think regular users sign up for. Some of it is commercial, some of it isn't.

  • By commercial, I mean newsletters from different retailers, ones where I have a pretty strong suspicion that people actually sign up for their mail. Clothing stores, electronics retailers, etc.

  • Restaurants. Some national chains, but mostly info from my favorites in and around Chicago, Minneapolis, and other places I travel to.

  • Lots of media-related things. By this I mean news alerts from different newspapers and TV stations. Weekly newsletters for my favorite public radio shows. International media, national media, some local media. Movie reviews, too.

  • Some travel-related things. Notifications from different travel sites on upcoming sales, airport delays, etc.

  • A bit of geek stuff. Virus alerts, some how-to newsletters, various tech and science newsletters, etc.

  • In addition to all of this, there's a lot of one-to-one mail in the loop now, too. Mail from users at AOL, Hotmail, Yahoo, Gmail, and other big ISPs.

Frequently Asked Questions about the Spam and Ham Sources

What happens if I receive both spam and ham from the same IP address?
There's no evidence that this is happening yet, but if it happens, the spam is going to show up in the spam bucket, and the ham is going to show up in the ham bucket. I'm calculating based on specific email messages received, not just the IP address of the sender. Under no circumstances have I ever taken spam and counted it as ham, or vice versa.

But big company X is sending you ham (desired mail) and sending other people spam!
I kick senders out of the hamtrap feed if I see them doing something bad, like sending spam or re-purposing email addresses. I don't, however, take a blacklist's word alone that somebody must be a spammer simply because they're blacklisted. Clearly, not every blacklist gets it right every time. Even a good blacklist might list somebody who is sending me wanted mail, perhaps because they're sending unwanted mail to someone else. My take on this is that the more often this happens, the more likely it is that the blacklist is overly aggressive or questionably accurate. It's up to readers of my site to decide if the data I report suggests the same to them. Not everyone is likely to come to the same conclusion.

But the big ISP mail servers also send spam – aren't you going to mislead people by counting a mail from AOL as a false positive hit if that same AOL server is also sending spam?
Sure, every network emits spam sometimes, to some degree. I think the big mail servers at the big ISPs are probably no different. But, can you safely block mail from these IP addresses? After all, they send millions of legitimate messages daily. If you care about not blocking mail that your users want, you are probably going to tread lightly when it comes to deciding whether or not to block servers like that. I suspect that blacklist publishers face similar challenges. Maybe this data reveals exactly how quick on the trigger a blacklist may be in that situation.

But this is too much ham (non-spam) for one person to receive; it's not reflective of normal mail.
Sure, it's a bit concentrated, and the volume is somewhat high, but it's not supposed to be reflective of one single person's mailbox. Instead, it's actually a combination of a bunch of kinds of desired mail, from a bunch of different sources, that regular users (in my humble estimation) are likely to receive. A single user at an ISP is unlikely to receive the 12,000+ spam messages I receive every day – it's similarly a combination of spam sent to a bunch of different users.

Clearly, you must be gaming the ham and spam feeds to make blacklist X look good at the expense of blacklist Y.
No, I am not. I'm simply reporting how these blacklists intersect with my own mail streams. Your mail streams may be different from mine. The same goes for any blacklist – not all are created equal. Not all have access to the same amount, or same quality, of data from which to decide what to list. Some might work better in foreign countries (I am in the US), and some might work better in a hobbyist or educational setting (I think my data is more reflective of what a small to midsize ISP might see). I have had some blacklist operators tell me that my data nearly exactly matches theirs, and I have had other blacklist operators tell me that my data is nothing like theirs. As always, your results may vary.

You really need to show results based on unique IP addresses.
I don't dedupe (remove duplicates from) the results based on IP address because I'm not counting IP addresses; I'm counting email messages. This isn't about who has the biggest list with the most IP addresses; it's about how accurate it is against my own mail stream. Any regular user who finds that a blacklist blocked ten spams from the same IP address is going to call that ten hits, not one hit.
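To make the difference concrete, here is a small Python sketch that counts the same sample data both ways. The function name and the sample addresses are made up for illustration; only the per-message versus per-IP distinction comes from the answer above.

```python
def hit_counts(spam_sender_ips, listed_ips):
    """Count blacklist hits two ways for a stream of spam messages.

    spam_sender_ips has one entry per message received, so an address
    that sent ten spams appears ten times.
    """
    listed = set(listed_ips)
    # Per message: every blocked message is a hit.
    per_message = sum(1 for ip in spam_sender_ips if ip in listed)
    # Per unique IP: repeat messages from one address collapse into one hit.
    per_unique_ip = len(set(spam_sender_ips) & listed)
    return per_message, per_unique_ip


if __name__ == "__main__":
    # Ten spams from one listed address, one spam from an unlisted address.
    stream = ["192.0.2.50"] * 10 + ["192.0.2.51"]
    print(hit_counts(stream, {"192.0.2.50"}))
```

Counted per message, the list caught 10 of 11 spams; counted per unique IP address, it caught 1 of 2 senders. The first number is the one a mailbox owner actually experiences.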

I don't like this data because of X, Y or Z.
The best recommendation I can give in this situation is that you should consider generating your own statistics and sharing them with the world. I know that my mail streams and results definitely match what some people see – because in a lot of cases those people have contacted me and told me so. It's also exactly reflective of my own mail stream. Just because it's what we see doesn't mean that this is exactly what you'll see if you use the same blacklists. There are too many open variables, from the size of my spamtraps, to which spam lists I'm on, to the composition of the mail your users sign up for. As I said above, your results may vary.

Incidentally, I'm not above some friendly competition. I'd love to see more sites like this out there.

If you have any questions or comments about anything here, about the Blacklist Statistics Center, or anything on DNSBL Resource, please don't hesitate to contact me.

Saturday, November 03, 2007

Status of bl.csma.biz: ALIVE

McFadden Associates has been publishing two different, spamtrap-driven DNSBL zones since October 2003.

  • A primary zone, bl.csma.biz, containing only aggressive hosts that have spammed repeatedly during a short (recent) timeframe.
  • An additional zone, sbl.csma.biz, with more aggressive listing criteria. It lists hosts that have generated spam within a 45-day period. They recommend that this one not be used for outright rejections, indicating that it is instead more suitable for use in scoring systems like SpamAssassin.

I personally use these lists in my day job as one of many data points to vet potential clients. In late October, coworkers asked me to look into repeated timeouts in our DNSBL lookup tools. Investigation revealed that the McFadden blacklist name servers and website were no longer reachable on the Internet.

I contacted one of the administrators behind these lists on Saturday, November 3, 2007. He indicated to me that the issue was unintentional and related to problems with a new piece of hardware. The situation has been corrected, and these lists are now again responding to queries.

Status of rbl.spamhaus.org: NOT A BLACKLIST

My friend Mickey Chandler pointed out recently that he's been seeing some unusual bounces that look like this:

Host blacklisted - Found on Realtime Black List server blocklist.address.is.wrong.spamhaus.org

It turns out that you block all of your inbound mail with an error like this if you configure "rbl.spamhaus.org" in your mail server as a DNSBL zone for blocking purposes.

Why? Because there is no such zone as rbl.spamhaus.org. Looks like Spamhaus set up these DNS responses to queries of this non-existent zone so that people would quickly realize that they're querying the wrong thing.

Actual Spamhaus DNSBL zones include sbl.spamhaus.org, xbl.spamhaus.org, zen.spamhaus.org, and others. I recommend using them, the ZEN zone in particular, as they're very accurate. But, regardless of what zone you use, it's important to use one that actually exists.

Users of rbl.spamhaus.org will find no spam blocking value from use of this zone, and are likely to find all of their inbound mail rejected.
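One way to catch this kind of misconfiguration before it rejects mail is to sanity-check a zone when you set it up. The Python sketch below shows how a DNSBL query name is built (reversed IPv4 octets followed by the zone) and applies the test-address convention documented in RFC 5782: a working IPv4 DNSBL is expected to list 127.0.0.2 and not to list 127.0.0.1. The function names are hypothetical, and whether a particular zone follows this convention, or answers your resolver at all, is something to confirm against that list's own documentation.

```python
import ipaddress
import socket


def query_name(ip, zone):
    """Build the DNS name to look up: 192.0.2.10 -> 10.2.0.192.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def resolves(name):
    """Return True if the name has an A record (lookup failures count as no)."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False


def zone_looks_sane(zone, lookup=resolves):
    """Check the test-address convention for an IPv4 DNSBL zone.

    A zone that answers for everything (like the wrong-zone responses
    described above) lists 127.0.0.1 too, so it fails this check.
    """
    lists_test_entry = lookup(query_name("127.0.0.2", zone))
    lists_localhost = lookup(query_name("127.0.0.1", zone))
    return lists_test_entry and not lists_localhost
```

The `lookup` parameter lets you try the logic against canned answers before pointing it at real DNS.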