Spam Filtering techniques
August 31st, 2008
Spam is a problem of enormous proportions. Current estimates figure that over 80% of all email is spam.
Some time ago I wrote a post about some changes to the configuration of my mail server that cut down the spam drastically. I thought I might take a moment to talk about the various techniques that are used to combat spam.
Some terminology I’m going to use:
- spam - unwanted email
- ham - wanted email
- false positive - ham that is marked as spam
- client - mail client, eg Thunderbird, Outlook
- server - mail server, eg Exchange, postfix
- host - someone who hosts servers
- Joe Job - when spam is sent using the email address of someone else
Bayesian Filtering
This approach was popularised by Paul Graham. The idea is that you break a message up into tokens and then look each token up in a database of tokens. Each token in your database has a score for how spammy it is. The individual scores are combined to give a score for the whole email, and the email is then rejected or allowed based on that score. This requires that you train your filter on collections of spam and ham.
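To make the token scoring concrete, here is a rough sketch in Python. It assumes a naive Bayes style combination with equal weight given to spam and ham; the whitespace tokenizer, the add-one smoothing and the function names (train, token_spamminess, spam_score) are my own illustrative choices, not the details of any particular filter.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def train(spam_messages, ham_messages):
    """Count in how many spam and ham messages each token appears."""
    spam_counts, ham_counts = Counter(), Counter()
    for msg in spam_messages:
        spam_counts.update(set(tokenize(msg)))
    for msg in ham_messages:
        ham_counts.update(set(tokenize(msg)))
    return spam_counts, ham_counts, len(spam_messages), len(ham_messages)


def token_spamminess(token, model):
    """Estimate how spammy one token is, as a value strictly between 0 and 1."""
    spam_counts, ham_counts, n_spam, n_ham = model
    # Add-one smoothing (an assumption) keeps unseen tokens away from 0 and 1.
    p_token_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_spam / (p_token_spam + p_token_ham)


def spam_score(message, model):
    """Combine the per-token scores into one score between 0 and 1."""
    log_odds = 0.0
    for token in set(tokenize(message)):
        p = token_spamminess(token, model)
        log_odds += math.log(p) - math.log(1 - p)
    # Clamp so that math.exp cannot overflow on very long messages.
    log_odds = max(-700.0, min(700.0, log_odds))
    return 1 / (1 + math.exp(-log_odds))
```

A message would then be rejected when spam_score goes above some threshold you choose; where that threshold sits is a trade-off between letting spam through and producing false positives.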
Spammer responses
- Replacing letters with numbers (v1agra) or adding in spaces. This is generally pretty ineffective.
- Attempting to poison the filters with random text
- Delivering their payload as an image
Advantages
- Generally cuts spam significantly (>75%)
- Can be configured and trained to specific needs
- Can be run on the client (eg Thunderbird) or the server
Disadvantages
- CPU intensive, a burden borne by the receiver of the email.
- Doesn’t tend to scale well across an organisation. One person’s spam is another person’s ham.
Realtime Black List (RBL)
An RBL works by storing a list of IP addresses or IP address blocks that are known to send spam. When a server receives a HELO request, it checks the IP address of the sender against the RBL. If the IP address matches a known spammer IP address, it refuses the email. One issue with RBLs is that they are often easy to get on to and hard to get off. In addition, some RBLs take the view that even if just a single IP address is being used to send spam, they should ban the whole block to encourage the host not to allow spammers on their network. This tends to punish the innocent along with the guilty.
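The check itself is simple. The sketch below assumes the list is held locally as a set of networks; public RBLs are usually queried over DNS instead, and that lookup is left out here. The entries, the function names and the wording of the replies are made up for illustration, and the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical blocklist entries, not taken from any real RBL.
BLOCKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),      # a whole block
    ipaddress.ip_network("198.51.100.25/32"),  # a single address
]


def is_listed(sender_ip, blocklist=BLOCKLIST):
    """Return True if sender_ip falls inside any listed address or block."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in blocklist)


def smtp_response(sender_ip):
    """Decide what to tell the connecting server before accepting any mail."""
    if is_listed(sender_ip):
        return "554 rejected: sending address is on a blocklist"
    return "250 OK"
```

Note how listing 192.0.2.0/24 refuses every address in that block, which is exactly how one spammer can get their innocent neighbours banned too.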
Spammer responses
- Find a host who will allow them to hop between IP addresses
- DDOS against the RBL
- Relay spam through zombies (generally home computers) on dynamic IP addresses
Advantages
- Can have a significant impact on the amount of spam received
- Runs at very little cost to the receiver of the email (no bandwidth spent receiving the email)
Disadvantages
- It can be hard to get off an RBL if you get on one
- The false positive rate can be quite high, depending on which RBL you choose
- If you have a false positive, you never know about it
Whitelisting
This works by storing a list of valid email addresses or IP addresses (generally just email addresses) that your server will receive emails from. In general this is not a terribly effective solution on its own, as it severely limits the list of people you can receive email from. It is typically used to exempt known senders from other tests (eg to avoid running bayesian filters over their email).
Spammer responses
- Joe job
Advantages
- Can have a significant impact on the amount of spam received
- Low requirements (bandwidth, computation)
Disadvantages
- You can only receive email from email addresses/IP addresses on that list
Challenge - Response
This is really a variation on whitelisting for email addresses, with a dynamic whitelist. When someone who is not in your whitelist sends an email, an automatic email with a link goes back to them. Clicking on that link adds them to your whitelist.
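As a rough illustration of the bookkeeping involved, here is a minimal sketch. It assumes that mail from unknown senders is held until the challenge is answered and that the link carries a random token; the class and method names are hypothetical, and actually sending the challenge email is left out.

```python
import secrets


class ChallengeResponseFilter:
    def __init__(self, whitelist=None):
        self.whitelist = set(whitelist or [])
        self.pending = {}  # token -> sender awaiting confirmation
        self.held = {}     # sender -> messages held until confirmed

    def receive(self, sender, message):
        """Deliver mail from known senders; hold and challenge everyone else."""
        if sender in self.whitelist:
            return ("deliver", None)
        self.held.setdefault(sender, []).append(message)
        token = secrets.token_urlsafe(16)
        self.pending[token] = sender
        # The token would be embedded in the link sent back to the sender.
        return ("challenge", token)

    def confirm(self, token):
        """Called when the link is clicked: whitelist the sender, release mail."""
        sender = self.pending.pop(token, None)
        if sender is None:
            return []  # unknown or already used token
        self.whitelist.add(sender)
        return self.held.pop(sender, [])
```

The weakness is visible in receive: it trusts the sender address as given, so a Joe job that forges an address already on the whitelist goes straight through.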
Spammer responses
- Joe job
Advantages
- Can have a significant impact on the amount of spam received
Disadvantages
- Places a burden of work on the people sending ham emails
- Tends to work only if you have a small, known list of people who send you email
Greylisting
Greylisting is one of the more interesting ideas out there. Greylisting checks an internal database to see if the combination of sender, recipient and sender IP address matches that of an email that has been seen before. If there is a match, the email is received. If not, the receiving server tells the sender that it is unable to receive the email at the moment and to retry after a delay. This eliminates a proportion of spam by delivering mail only from MTAs that comply with the standards for email, as a lot of spam software never retries. The real power of greylisting comes when coupled with RBLs. If the email is part of a spam run, by the time the sending MTA resends the email, the IP address is likely to be in an RBL.
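The logic can be sketched in a few lines. This is a simplified model, assuming an in-memory table and a fixed five minute delay; the class name, the delay and the reply texts are my own illustrative choices.

```python
import time


class Greylist:
    def __init__(self, delay_seconds=300, clock=time.time):
        self.delay_seconds = delay_seconds  # assumed minimum retry delay
        self.clock = clock
        self.first_seen = {}  # (sender IP, sender, recipient) -> first attempt
        self.passed = set()   # triplets that have already been accepted

    def check(self, sender_ip, sender, recipient):
        """Return an SMTP-style reply for one delivery attempt."""
        triplet = (sender_ip, sender, recipient)
        now = self.clock()
        if triplet in self.passed:
            return "250 OK"
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.delay_seconds:
            self.passed.add(triplet)
            return "250 OK"
        # 4xx is a temporary failure: a compliant MTA queues and retries.
        return "451 temporarily unavailable, please retry later"
```

Only the first email for a given combination is delayed. Once the triplet has passed, later emails from the same sender to the same recipient are accepted straight away.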
Spammer responses
- Run a compliant MTA that retries
Advantages
- Low bandwidth/CPU cost
Disadvantages
- Delays some emails from arriving immediately
SenderID and SPF
SenderID and SPF are two approaches to deal with one aspect of spam: Joe jobs. Both add records to the DNS for a domain to list the IP addresses that can send emails for that domain. Of the two, SenderID is technically the better tool; however, Microsoft (the creator of SenderID) has patented parts of it. This makes it impossible for it to be implemented in most Open Source mail servers (postfix, qmail, sendmail, exim, etc), which make up a significant proportion of all mail servers. As a result we are unlikely to see SenderID widely implemented.
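To give a feel for what such a record looks like, an SPF record is a DNS TXT record along the lines of "v=spf1 ip4:192.0.2.0/24 -all". The sketch below checks a sending IP address against a record that has already been fetched. It is deliberately incomplete: it assumes only the ip4, ip6 and all terms matter, skips everything that needs further DNS lookups, and the function name is my own.

```python
import ipaddress


def spf_allows(record, sender_ip):
    """Check sender_ip against the ip4:/ip6: terms of an SPF record.

    Only a subset of SPF is handled here: ip4, ip6 and the final "all".
    Terms that need further DNS lookups (a, mx, include, ...) are skipped.
    Returns True, False, or None when the record gives no answer.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return None  # not an SPF record, so no opinion
    address = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # a term without a qualifier means "pass"
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        # Simplification: anything other than "+" is treated as not allowed.
        if name in ("ip4", "ip6") and value:
            network = ipaddress.ip_network(value, strict=False)
            if address in network:
                return qualifier == "+"
        elif name == "all":
            return qualifier == "+"
    return None  # no term matched
```

With the example record above, spf_allows(record, "192.0.2.15") is True, while an address outside that block falls through to "-all" and is refused.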
Spammer responses
- Run an MTA that supports this
Advantages
- Low bandwidth
- Goes some way to dealing with the Joe job issue
Disadvantages
- Not supported by all MTAs, likely to drop some ham
Blue Frog
As far as I am aware there was only one implementation of this. The basic idea was to make a single HTTP request to every link in every incoming spam email. This would bring the sites hosting the products sold by the spam to their knees by the sheer volume of requests. Even if the servers could handle the load, the increased cost of bandwidth would make the spamming uneconomic. Please note that this is not a DDOS, as it makes just one request for each incoming email.
Spammer responses
- multiple DDOS
Advantages
- Hurts the spammers, adds costs to them in proportion to the emails they send
Disadvantages
- Not around any more. Unfortunately the DDOSes brought the service to an end.