Fighting Comment Spam - There's Gotta Be A Better Way

SPAM! People usually associate spam with unsolicited commercial email that tries to sell you the “little blue pill”, or with Nigerian scam emails phishing for your bank account details. There are many techniques for fighting email spam, either on the server side or in your email client. However, if you run a blog or a forum on the Internet, you will also have experienced fighting comment spam (unless, of course, you run a spam blog yourself :). I have been blogging since 2001 and have employed various techniques to keep the spam at bay. Some of them worked well at the beginning, but sooner or later the spammers wised up, and they almost always manage to slip in a few spammy comments.

When I launched this blog 2 years ago, it was running Akismet for Drupal; I recently changed to Mollom, one of Dries’ startup projects. It has been effective (except for the last few days). Mollom is somewhat similar to Akismet in that it (1) uses a classifier to estimate the likelihood that an incoming comment is spam, and (2) acts as a centralised database for collaboratively identifying spam. Mollom does a few extra things when a comment is in a “not-so-sure” state, but discussing them is beyond the scope of this blog post.
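To make the classifier idea concrete, here is a toy sketch in Python of a comment classifier with a three-way verdict (ham, spam, or "unsure"). It is a plain naive Bayes model over word counts and is only an illustration: it is not Mollom's or Akismet's actual algorithm, and the class name, thresholds and training sentences are all made up for the example.

```python
import math
import re
from collections import Counter


class ToyCommentClassifier:
    """Tiny naive Bayes classifier. Illustrative only, not any real service's algorithm."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Record one example; label must be "spam" or "ham"."""
        self.tokens[label].update(self.tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) or 1
        spam_total = sum(self.tokens["spam"].values())
        ham_total = sum(self.tokens["ham"].values())
        # Prior odds from document counts, with add-one smoothing.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (spam_total + vocab)
            p_ham = (self.tokens["ham"][tok] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on very long comments.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def verdict(self, text, low=0.2, high=0.8):
        """Return "spam", "ham", or "unsure" for scores between the thresholds."""
        p = self.spam_probability(text)
        if p >= high:
            return "spam"
        if p <= low:
            return "ham"
        return "unsure"


classifier = ToyCommentClassifier()
classifier.train("cheap pills buy now", "spam")
classifier.train("buy cheap watches now", "spam")
classifier.train("great post about postfix greylisting", "ham")
classifier.train("thanks for the dspam tips", "ham")
```

A comment that lands in the "unsure" band is where a service could do something extra, such as asking for a captcha, instead of silently accepting or rejecting it.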

Another interesting feature of Mollom is its Flex-based statistics panel, which shows the number of spam comments versus the number of legitimate comments. This is mine over the last 12 days:

Spam comments from Mollom

As you can see, the noise-to-signal ratio is huge: there are many more spam comments than real ones. By the way, even for many of the real comments I am still not sure about their legitimacy, especially the generic one-liners. There was also a big jump today, because quite a few spam comments slipped through.

Although the ratio seems to be in line with most studies online, it still surprises me when I compare it with the signal-to-noise ratio of my email. I run my main MTA at home with Postfix 2.4, and spam is filtered with DSPAM, a fast and lightweight email classification system that yields pretty good results (99.12% currently for my account). Here is the analysis graph over the same period of time (generated by dspam-web).

Spam emails from DSPAM

As you can see from the graph, there is only around 20% more spam email than legitimate email! That is nowhere near the huge gap between spam and legitimate comments on this blog.

Now, do we actually get relatively less email spam than comment spam? Not really. What we do have is better techniques for fighting email spam, which block most of it before it reaches the classifying system (DSPAM in this case). What DSPAM gets is therefore already a filtered subset of all the incoming email spam. A few techniques I have deployed:

  • Greylisting (see my previous article on this subject). I actually don’t use greylisting on my primary MX anymore because of the undesirable delay. However, I found that putting greylisting on my secondary MX is just about as effective, as most spambots pick the last MX entry in DNS to send spam to.
  • DNS-related filtering. For example, ensuring the sender has a FQDN, a valid hostname, etc. I am surprised to see how much spam arrives with invalid sender addresses.
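As a rough illustration of the two techniques in the list above, here is a minimal Python sketch. It is not how Postfix or any particular greylisting daemon implements them: the `Greylist` class, the 300-second delay and the `looks_like_fqdn` helper are assumptions made for the example, and the hostname check is purely syntactic (it does no DNS lookups).

```python
import re
import time


class Greylist:
    """Defer the first delivery attempt for each (client IP, sender, recipient) triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.clock = clock          # injectable so the behaviour can be tested
        self.first_seen = {}

    def check(self, client_ip, sender, recipient):
        """Return "defer" (temporary failure) or "accept"."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "defer"


# One DNS label: letters, digits and hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def looks_like_fqdn(hostname):
    """Syntactic check only: at least two valid labels and a non-numeric last label."""
    name = hostname.rstrip(".")
    labels = name.split(".")
    if len(name) > 253 or len(labels) < 2:
        return False
    if labels[-1].isdigit():
        return False  # looks like a bare IP address, not a hostname
    return all(_LABEL.match(label) for label in labels)
```

A legitimate mail server retries after a temporary failure and gets through on a later attempt; a spambot that fires once and moves on never does.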

That’s about it, but many people I know also employ RBLs, DomainKeys, etc. At least in my case these techniques have effectively reduced the work for the classifier, which also results in less spam getting into my INBOX in the end.

There’s gotta be a better way to fight comment spam: something between the web server and the actual application that filters out the obvious spam. Bad-Behavior? Mod_Security? They also increase undesirable false positives (legitimate visitors being blocked), though.
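To show what such a layer could look like, here is a hedged sketch written as Python WSGI middleware. It is not how Bad Behavior or mod_security work; the two rules (an empty User-Agent, too many links) and the names `CommentPreFilter` and `obvious_spam_reason` are invented for the example, and rules this blunt can certainly block legitimate visitors.

```python
import io
import re
from urllib.parse import parse_qs

LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def obvious_spam_reason(environ, body, max_links=5):
    """Return a short reason if the POST looks like obvious spam, else None."""
    if not environ.get("HTTP_USER_AGENT", "").strip():
        return "missing User-Agent"
    fields = parse_qs(body.decode("utf-8", errors="replace"))
    text = " ".join(value for values in fields.values() for value in values)
    if len(LINK_RE.findall(text)) > max_links:
        return "too many links"
    return None


class CommentPreFilter:
    """WSGI middleware that rejects obvious junk before the application sees it."""

    def __init__(self, app, max_links=5):
        self.app = app
        self.max_links = max_links

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") == "POST":
            try:
                length = int(environ.get("CONTENT_LENGTH") or 0)
            except ValueError:
                length = 0
            body = environ["wsgi.input"].read(length)
            reason = obvious_spam_reason(environ, body, self.max_links)
            if reason:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [reason.encode("utf-8")]
            # Put the body back so the wrapped application can read it again.
            environ["wsgi.input"] = io.BytesIO(body)
        return self.app(environ, start_response)
```

Anything the middleware lets through would still go to the classifier, so the rules here only need to catch the cheapest, most obvious junk.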

Any more thoughts?

Comments

Gravatar

There are a few games you can play: you can randomise the form field names and store a lookup table in a session/cookie etc.; if a submission comes back as gibberish, it most likely isn’t a real user. This alone is pretty effective, since most spam bots expect consistent form names and are usually running someone else’s code, although if everyone started doing it I’m sure the bots would figure it out sooner or later.

I found a neat little PHP-based captcha that is fairly uncommon, which is the trick: the more common something is, the more likely the bots will break it.
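The randomised form name trick from this comment might look roughly like the following Python sketch. The field names, the `field_map` session key and both function names are hypothetical, and a plain dict stands in for whatever session object the blog software provides.

```python
import secrets

REAL_FIELDS = ("name", "email", "comment")  # hypothetical comment form fields


def render_field_names(session):
    """Pick fresh random aliases for this page view and remember them in the session.

    Returns a dict mapping each real field name to the alias to use in the HTML form.
    """
    mapping = {secrets.token_hex(8): real for real in REAL_FIELDS}
    session["field_map"] = mapping
    return {real: alias for alias, real in mapping.items()}


def decode_submission(session, posted):
    """Translate posted aliases back to real names.

    Returns None if the submission does not use exactly the aliases that were
    issued, which is what a bot replaying fixed field names would produce.
    """
    mapping = session.pop("field_map", None)  # one-time use
    if mapping is None or set(posted) != set(mapping):
        return None
    return {mapping[alias]: value for alias, value in posted.items()}
```

Because the lookup table is removed from the session on the first submission, replaying the same form a second time also fails.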

Gravatar

You are completely right: the game here is to block as many bots as possible before they reach the classifier. There are many captcha techniques to reduce automated attempts. I guess I would just like to see more of that built into online publishing systems…

Gravatar

Which would be self-defeating. The point I was trying to make is that, to make things as hard as possible for the bots, everyone has to do something different; the more alike everyone is, the easier it is for someone to script it once.

Gravatar

Well. Just as obscurity does not lead to security, a well-explained system with full source for its implementation does not necessarily mean compromise either. Greylisting, for example, is a well-understood system, but the core of it is to make sending spam as costly to the spammer as possible.

Instead of well-known variables in the form, you can also generate a different (or random) challenge/response in the form (like a captcha). While image-based captchas are pretty much broken these days, they are still costly for the spammers.
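As one possible reading of a per-form challenge/response, here is a small Python sketch that asks a random arithmetic question and signs the expected answer with an HMAC, so the server does not have to store anything. The question format, the token layout and all names are assumptions for illustration; a trivial sum like this only raises the cost for a bot slightly, and the token can be replayed until it expires.

```python
import hashlib
import hmac
import secrets
import time

# Per-process secret. Assumption: a real site would persist this across restarts.
SECRET_KEY = secrets.token_bytes(32)


def _sign(nonce, issued, answer):
    message = f"{nonce}:{issued}:{answer}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_challenge(now=None):
    """Return (question, token). The token goes into a hidden form field."""
    a = secrets.randbelow(9) + 1
    b = secrets.randbelow(9) + 1
    nonce = secrets.token_hex(8)
    issued = int(time.time() if now is None else now)
    token = f"{nonce}:{issued}:{_sign(nonce, issued, a + b)}"
    return f"What is {a} + {b}?", token


def check_response(token, answer, now=None, max_age=3600):
    """Return True if the answer matches the signed challenge and the token is fresh."""
    try:
        nonce, issued_text, signature = token.split(":")
        issued = int(issued_text)
    except ValueError:
        return False  # malformed token
    now = time.time() if now is None else now
    if not 0 <= now - issued <= max_age:
        return False  # expired, or issued in the future
    expected = _sign(nonce, issued, str(answer).strip())
    return hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))
```

The form would show the question to the visitor, carry the token in a hidden field, and call `check_response` with both values when the comment is submitted.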

Gravatar

It has nothing to do with security; it does, however, have everything to do with making more work for the spammers. The more it ‘costs’ the spammers to implement, the less likely you are to get spam. Greylisting works on the same principle: it ‘costs’ spammers more to resend failed messages, since it can be difficult to tell bounces from temporary failures, etc. It also falls in line with my first point, lack of use: greylisting isn’t used widely enough for spammers to bother overcoming it.

Although they did break Google’s and other captchas, which let them spam via Google, so the ‘cost’ of overcoming greylisting was reduced in that sense.

Gravatar

Why not just use mod_security? There are rule sets for spam, and you can block anything remotely spammy before it even hits your WordPress blog!

Gravatar

Because

  1. Rules might be forever changing. We already have a classifier system that learns collaboratively and automatically (Mollom or Akismet), so why should I have another rule-based system in front of it that requires constant updates?
  2. Mod_security is only available for Apache.

Gravatar

If you find a false positive with Bad Behavior, I want to hear about it.

My strategy with Bad Behavior is to let things through if I can’t be certain whether it’s a human being or not, thus keeping false positives (real people being blocked) as close to zero as humanly possible.
