Updated: November 10, 2007.
A lot of people have asked how the spam and ham (non-spam) data is compiled for the Blacklist Statistics Center here at DNSBL Resource. Where does it come from? What senders does it represent? Here's an updated overview of what goes into both feeds.
On the spam side of things, the input comes from a series of spamtrap domains and email addresses.
When I first set this project up, I took a bunch of old, dead email addresses and domains that I have had for years but haven't been using lately. I turned them back on, reviewed long snapshots of incoming data, and weeded out a lot of “edge case” stuff: things that I probably did actually sign up for (like virus notifications, updates from my domain registrar, etc.). Anything that didn't look like something I might have signed up for was assumed to be spam.
I also have some filtering in place to try to keep out backscatter. Backscatter (or outscatter) usually consists of misdirected bounces received in response to somebody else's spam run, bounced back by a mail server that should know better. This is clearly a problem, but there is vast disagreement on the anti-spam front as to whether or not backscatter equals spam. Since few agree, and I want to focus on spam, I ignore this as much as possible. A little leaks through here and there, but I don't think it's enough to skew any stats.
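As an illustration only, here is a minimal sketch of the kind of heuristic a backscatter filter might use. It is not the filtering actually used for these feeds, which isn't described here. It assumes bounces can be recognized by a null envelope sender (`Return-Path: <>`), a `multipart/report` content type, or a MAILER-DAEMON `From` address; the function name and sample messages are hypothetical.

```python
from email import message_from_string


def looks_like_bounce(raw_message):
    """Heuristic guess at whether a raw message is a bounce (possible backscatter).

    Assumption: bounces usually carry a null envelope sender, a
    multipart/report content type, or a MAILER-DAEMON From address.
    """
    msg = message_from_string(raw_message)
    return_path = (msg.get("Return-Path") or "").strip()
    from_header = (msg.get("From") or "").lower()
    return (
        return_path == "<>"
        or msg.get_content_type() == "multipart/report"
        or "mailer-daemon" in from_header
    )


# Hypothetical sample messages, for illustration only.
sample_bounce = (
    "Return-Path: <>\n"
    "From: MAILER-DAEMON@example.net\n"
    "Subject: Undelivered Mail Returned to Sender\n"
    "\n"
    "Your message could not be delivered.\n"
)
sample_newsletter = (
    "Return-Path: <news@example.com>\n"
    "From: news@example.com\n"
    "Subject: Weekly newsletter\n"
    "\n"
    "Hello.\n"
)

bounce_flag = looks_like_bounce(sample_bounce)
newsletter_flag = looks_like_bounce(sample_newsletter)
```

A heuristic like this will miss bounces that don't follow the usual conventions, which fits with a little backscatter leaking through.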
I recently registered some new domains that I and others knew were already on spam lists. Anybody sending to these new domains is clearly doing a bad thing: sending to very old addresses, ignoring bounces, forging header information, etc. These also feed into the spam results.
From all of these sources, I get an average of over twelve thousand spam messages a day.
On the ham (non-spam) side of things, here's what I've done:
First, I signed up for a bunch of email lists. Stuff that I think regular users sign up for. Some of it is commercial, some of it isn't.
By commercial, I mean newsletters from different retailers, ones where I have a pretty strong suspicion that people actually sign up for their mail. Clothing stores, electronics retailers, etc.
Restaurants. Some national chains, but mostly info from my favorites in and around Chicago, Minneapolis, and other places I travel to.
Lots of media-related things. By this I mean news alerts from different newspapers and TV stations. Weekly newsletters for my favorite public radio shows. International media, national media, some local media. Movie reviews, too.
Some travel-related things. Notifications from different travel sites on upcoming sales, airport delays, etc.
A bit of geek stuff. Virus alerts, some how-to newsletters, various tech and science newsletters, etc.
In addition to all of this, there's a lot of one-to-one mail in the loop now, too. Mail from users at AOL, Hotmail, Yahoo, Gmail, and other big ISPs.
Frequently Asked Questions about the Spam and Ham Sources
What happens if I receive both spam and ham from the same IP address?
There's no evidence that this is happening yet, but if it happens, the spam is going to show up in the spam bucket, and the ham is going to show up in the ham bucket. I'm calculating based on specific email messages received, not just the IP address of the sender. Under no circumstances have I ever taken spam and counted it as ham, or vice versa.
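To make the per-message bookkeeping concrete, here is a minimal sketch, assuming each received message is recorded as a (bucket, sending IP) pair and a blacklist is represented as a simple set of listed IPs. The function name, sample addresses, and data layout are hypothetical and are not the site's actual code.

```python
from collections import Counter


def tally_hits(messages, listed_ips):
    """Return {bucket: (blacklist_hits, total_messages)}, counted per message."""
    totals = Counter()
    hits = Counter()
    for bucket, ip in messages:
        totals[bucket] += 1
        if ip in listed_ips:
            hits[bucket] += 1
    return {bucket: (hits[bucket], totals[bucket]) for bucket in totals}


# Hypothetical data: the same IP shows up in both the spam and ham buckets.
sample_messages = [
    ("spam", "192.0.2.10"),
    ("spam", "192.0.2.10"),
    ("ham", "192.0.2.10"),
    ("ham", "198.51.100.7"),
]
sample_listed = {"192.0.2.10"}

result = tally_hits(sample_messages, sample_listed)
```

In this sketch, the listed IP counts as two spam hits and one ham hit (a false positive), because each message is tallied in the bucket it arrived in.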
But big company X is sending you ham (desired mail) and sending other people spam!
I kick senders out of the hamtrap feed if I see them doing something bad, like sending spam or re-purposing email addresses. I don't, however, take a blacklist's word alone that somebody must be a spammer simply because they're blacklisted. Clearly, not every blacklist gets it right every time. Even a good blacklist might list somebody who is sending me wanted mail, perhaps because they're sending unwanted mail to someone else. My take on this is that the more often this happens, the more likely it is that the blacklist is overly aggressive or questionably accurate. It's up to readers of my site to decide if the data I report suggests the same to them. Not everyone is likely to come to the same conclusion.
But the big ISP mail servers also send spam – aren't you going to mislead people by counting a mail from AOL as a false positive hit if that same AOL server is also sending spam?
Sure, every network emits spam sometimes, to some degree. I think the big mail servers at the big ISPs are probably no different. But can you safely block mail from these IP addresses? After all, they send millions of legitimate messages daily. If you care about not blocking mail that your users want, you are probably going to tread lightly when it comes to deciding whether or not to block servers like that. I suspect that blacklist publishers face similar challenges. Maybe this data reveals exactly how quick on the trigger a blacklist may be in that situation.
But this is too much ham (non-spam) for one person to receive; it's not reflective of normal mail.
Sure, it's a bit concentrated, and the volume is somewhat high, but it's not supposed to be reflective of one single person's mailbox. Instead, it's actually a combination of a bunch of kinds of desired mail, from a bunch of different sources, that regular users (in my humble estimation) are likely to receive. A single user at an ISP is unlikely to receive the 12,000+ spam messages I receive every day, either. That stream is similarly a combination of spam sent to a bunch of different users.
Clearly, you must be gaming this data to make blacklist X look good at the expense of blacklist Y.
No, I am not. I'm simply reporting how these blacklists intersect with my own mail streams. Your mail streams may be different from mine. The same goes for any blacklist: not all are created equal. Not all have access to the same amount, or same quality, of data from which to decide what to list. Some might work better in foreign countries (I am in the US), and some might work better in a hobbyist or educational setting (I think my data is more reflective of what a small to midsize ISP might see). I have had some blacklist operators tell me that my data nearly exactly matches theirs, and I have had other blacklist operators tell me that my data is nothing like theirs. As always, your results may vary.
You really need to show results based on unique IP addresses.
I don't dedupe (remove duplicates from) the results based on IP address because I'm not counting IP addresses; I'm counting email messages. This isn't about which blacklist is the biggest or has the most IP addresses; it's about how accurate each one is against my own mail stream. Any regular user who finds that a blacklist blocked ten spams from the same IP address is going to call that ten hits, not one hit.
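The difference between the two ways of counting can be shown with a short sketch. It assumes a list with one sending IP per received spam message and a set of IPs listed by a blacklist; the names and sample addresses are hypothetical.

```python
def count_hits(sending_ips, listed_ips):
    """Return (message_hits, unique_ip_hits) for one blacklist.

    message_hits counts every message whose sending IP is listed.
    unique_ip_hits counts each listed sending IP only once (the deduped view).
    """
    message_hits = sum(1 for ip in sending_ips if ip in listed_ips)
    unique_ip_hits = len(set(sending_ips) & set(listed_ips))
    return message_hits, unique_ip_hits


# Hypothetical data: ten spams from one listed IP, plus one from an unlisted IP.
sample_spam_ips = ["192.0.2.10"] * 10 + ["203.0.113.5"]
sample_listed = {"192.0.2.10"}

message_hits, unique_ip_hits = count_hits(sample_spam_ips, sample_listed)
```

Counted by message, the blacklist gets ten hits out of eleven spams. Deduped by IP, it gets one hit out of two addresses, which says much less about how much spam it would actually have blocked.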
I don't like this data because of X, Y or Z.
The best recommendation I can give in this situation is that you should consider generating your own statistics and sharing them with the world. I know that my mail streams and results definitely match what some people see, because in a lot of cases those people have contacted me and told me so. It's also exactly reflective of my own mail stream. Just because it's what we see doesn't mean that this is exactly what you'll see if you use the same blacklists. There are too many open variables, from the size of my spamtraps, to which spam lists I'm on, to the composition of the mail your users sign up for. As I said above, your results may vary.
Incidentally, I'm not above some friendly competition. I'd love to see more sites like this out there.
If you have any questions or comments about anything here, about the Blacklist Statistics Center, or anything on DNSBL Resource, please don't hesitate to contact me.