Spam Filtering techniques

August 31st, 2008

Spam is a problem of enormous proportions. Current estimates figure that over 80% of all email is spam.

Some time ago I wrote a post about some changes to the configuration of my mail server that cut down the spam drastically. I thought I might take a moment to talk about the various techniques that are used to combat spam.

Some terminology I’m going to use:

  • spam - unwanted email
  • ham - wanted email
  • false positive - ham that is marked as spam
  • client - mail client, eg Thunderbird, Outlook
  • server - mail server, eg Exchange, postfix
  • host - someone who hosts servers
  • Joe Job - when spam is sent using the email address of someone else

Bayesian Filtering

This approach was popularised by Paul Graham. The idea is that you break a message up into tokens and then check those tokens against a database. Each token in the database has a score for how spammy it is. The individual scores are combined to give a score for the email, and emails are then rejected or allowed based on that score. This requires that you train your filter on collections of spam and ham.
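To make the token scoring a little more concrete, here is a minimal sketch in Python. It is not the exact formula from any particular filter: the class name `BayesFilter`, the tokenizer and the add-one smoothing are all assumptions made for illustration, and the combining step is the plain naive Bayes product of per-token probabilities.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into a set of lowercase tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class BayesFilter:
    """Illustrative token-scoring filter; names and smoothing are assumptions."""

    def __init__(self):
        self.spam_counts = Counter()  # messages containing each token, spam
        self.ham_counts = Counter()   # messages containing each token, ham
        self.spam_total = 0
        self.ham_total = 0

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_total += 1

    def token_score(self, token):
        # Add-one smoothing keeps every score strictly between 0 and 1.
        p_spam = (self.spam_counts[token] + 1) / (self.spam_total + 2)
        p_ham = (self.ham_counts[token] + 1) / (self.ham_total + 2)
        return p_spam / (p_spam + p_ham)

    def score(self, text):
        # Combine token scores as prod(p) / (prod(p) + prod(1 - p)),
        # working in log space so long messages do not underflow.
        log_spam = 0.0
        log_ham = 0.0
        for token in tokenize(text):
            p = self.token_score(token)
            log_spam += math.log(p)
            log_ham += math.log(1.0 - p)
        diff = max(min(log_ham - log_spam, 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

A message scoring close to 1 looks like the spam the filter was trained on, close to 0 looks like ham, and a message made up entirely of unseen tokens sits at 0.5. Where you draw the reject line is a tuning decision.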

Spammer responses

  • Replacing letters with numbers (v1agra) or adding in spaces. This is generally pretty ineffective.
  • Attempt to poison the filters with random text
  • Delivering their payload as an image

Advantages

  • Generally cuts spam significantly (>75%)
  • Can be configured and trained to specific needs
  • Can be run on the client (eg Thunderbird) or the server

Disadvantages

  • CPU intensive, a burden borne by the receiver of the email.
  • Doesn’t tend to scale well over an organisation. One person’s spam is another person’s ham.

Realtime Black List (RBL)

An RBL works by storing a list of IP addresses or IP address blocks that are known to send spam. When a server receives a HELO request, it checks the IP address of the sender against the RBL. If the IP address matches a known spammer IP address, it refuses the email. One issue with RBLs is that they are often easy to get on to and hard to get off. In addition, some RBLs take the view that even if just a single IP address is being used to send spam, they should ban the whole block to encourage the host not to allow spammers on their network. This tends to punish the innocent along with the guilty.
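Most RBLs are published over DNS: the receiving server reverses the octets of the connecting IP address, appends the list's zone and looks the name up. An answer means the address is listed; no such name means it is not. The sketch below shows that lookup in Python. The zone `dnsbl.example.org` and the function names are made up for illustration, and a real mail server would also need timeouts and some way to tell a DNS failure apart from "not listed".

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the list answers for this address.

    Simplification: any lookup failure (including a network problem)
    is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` gives the name `99.2.0.192.dnsbl.example.org`. Because the check is a single small DNS query, it costs the receiver almost nothing compared with accepting and scanning the message.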

Spammer responses

  • Find a host who will allow them to hop between IP addresses
  • DDOS against the RBL
  • Relay spam through zombies (generally home computers) on dynamic IP addresses

Advantages

  • Can have a significant impact on the amount of spam received
  • Runs at very little cost to the receiver of the email (no bandwidth spent receiving the email)

Disadvantages

  • It can be hard to get off an RBL if you get on one
  • The false positive rate can be quite high, depending on which RBL you choose
  • If you have a false positive, you never know about it

Whitelisting

This works by storing a list of valid email addresses or IP addresses (generally just email addresses) that your server will receive emails from. In general this is not a terribly effective solution on its own, as it severely limits the list of people you can receive email from. It is typically used to exempt email from other tests (eg to avoid running Bayesian filters over it).

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received
  • Low requirements (bandwidth, computation)

Disadvantages

  • You can only receive email from email addresses/IP addresses on that list

Challenge - Response

This is really a variation on whitelisting for email addresses, with a dynamic whitelist. When someone who is not in your whitelist sends an email, an automatic email with a link goes back to them. Clicking on that link adds them to your whitelist.

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received

Disadvantages

  • Places a burden of work on the people sending ham emails
  • Tends to work only if you have a small, known list of people who send you email

Greylisting

Greylisting is one of the more interesting ideas out there. The receiving server checks an internal database to see whether it has already seen the combination of sender, recipient and sender IP address. If there is a match, the email is received. If not, the receiving server tells the sender that it is temporarily unable to receive the email and to retry after a delay. This eliminates a proportion of spam by delivering mail only from MTAs that comply with the email standards and actually retry. The real power of greylisting comes when coupled with RBLs. If the email is part of a spam run, by the time the sending MTA resends the email, the IP address is likely to be in an RBL.
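The bookkeeping behind this is small. Here is a rough sketch in Python, assuming an in-memory dictionary keyed on the (sender, recipient, IP) triplet and a five minute minimum delay. The class name, the delay and the response strings are illustrative choices, not taken from any particular greylisting implementation.

```python
import time


class Greylist:
    """Illustrative greylist keyed on (sender, recipient, sender IP)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, sender, recipient, ip):
        triplet = (sender.lower(), recipient.lower(), ip)
        now = self.clock()
        # Record the first attempt; later attempts reuse that timestamp.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "250 OK"
        # Temporary failure: a compliant MTA will queue and retry later.
        return "451 4.7.1 Greylisted, please try again later"
```

A real deployment would also expire old triplets and keep the table on disk so it survives a restart.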

Spammer responses

  • Run a compliant MTA

Advantages

  • Low bandwidth/CPU cost

Disadvantages

  • Delays some emails from arriving immediately

SenderID and SPF

SenderID and SPF are two approaches to dealing with one aspect of spam: Joe jobs. Both add records to the DNS for a domain, listing the IP addresses that may send email for that domain. Of the two, SenderID is technically the better tool; however, Microsoft (the creator of SenderID) has patented parts of it. This makes it impossible to implement in most Open Source mail servers (postfix, qmail, sendmail, exim, etc), which make up a significant proportion of all mail servers. As a result we are unlikely to see SenderID widely implemented.
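For SPF, the DNS record is a TXT record along the lines of `v=spf1 ip4:192.0.2.0/24 -all`, and the receiving server checks the connecting IP address against it. The sketch below is a deliberately cut-down illustration in Python: it only understands the `ip4`, `ip6` and `all` mechanisms, treats anything other than a pass as "not allowed", and assumes the record text has already been fetched from DNS. The function name `spf_allows` is hypothetical.

```python
import ipaddress


def spf_allows(record, sender_ip):
    """Evaluate a simplified SPF record against a sender IP.

    Returns True for a pass, False for any other match, and None if the
    record is not SPF or no supported mechanism matched.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return None
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # the default qualifier is "pass"
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6"):
            if ip in ipaddress.ip_network(value, strict=False):
                return qualifier == "+"
        elif name == "all":
            return qualifier == "+"
        # Other mechanisms (a, mx, include, ...) are skipped in this sketch.
    return None
```

With the record above, a message arriving from 192.0.2.55 passes, while one from 203.0.113.9 falls through to `-all` and fails. That is what lets a receiver turn away a Joe job claiming to come from that domain.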

Spammer responses

  • Run an MTA that supports this

Advantages

  • Low bandwidth
  • Goes some way to deal with the Joe Job issue

Disadvantages

  • Not supported by all MTAs, likely to drop some ham

Blue Frog

As far as I am aware there was only one implementation of this. The basic idea was to make a single http request to all links in all incoming emails. This would bring the sites hosting the products sold by the spam to their knees by the sheer volume of requests. Even if the servers could handle the load, the increased cost of bandwidth would make the spamming uneconomic. Please note that this is not a DDOS, as it is making just one request for each incoming email.

Spammer responses

  • Multiple DDOS attacks

Advantages

  • Hurts the spammers, adds costs to them in proportion to the emails they send

Disadvantages

  • Not around any more :( . Unfortunately the DDOSes brought the service to an end.

The dropping cost of hardware

August 23rd, 2008

One thing that never ceases to amaze me is the way that hardware continues to drop in cost. This really came home to me when I specced and built a couple of machines for my parents. My parents have the misfortune to have a son who knows his way around a computer and as a result has been able to keep their computers running far longer than they really should have. My mother’s computer was just over 11 years old this year when I replaced it, and had (from memory) 3 replacement power supplies, more RAM, 2 replacement HDD, 3 replacement DVD/CDRom drives, 1 replacement sound card, 2 replacement NICs.

My parents use their computers largely for email, surfing the web and editing the odd Word and Excel document. In this part of the market the AMD chips win hands down in bang for your buck. In the end I got something like (monitors were not needed):

  1. AM2 4000
  2. nVidia chipset AT motherboard with integrated gfx & dual channel RAM
  3. 2xaGb DDR2 800 RAM
  4. DVD burner
  5. 160Gb 7200rpm Seagate HDD
  6. Antec case
  7. XP Home

For a total of $485 (AUD) per machine.

All name brand parts, none of them really bottom of the market. To keep this in perspective, under 10 years ago I paid ~$800 (AUD) for a 700MHz slot A Athlon for the first computer I ever built, the total cost of the computer was.

The crazy thing about this is that these computers are quite frankly overpowered for their needs. There are people who need more (gaming, video editing, graphical work, programming), but these computers are overpowered for most people’s needs. Even then, moving to a Core2 Duo and an ATX motherboard, adding a larger HDD and adding a gfx card would likely still keep the price under $1000 (AUD). You could probably get it below the price of my prized slot A Athlon processor.

Interestingly that processor is still running … it is in the machine that currently hosts this website.

The other interesting part of this purchase is that the OS makes up $109 of that $485, or 22% of the total. For comparison, the OS was under 10% of the cost of the machine this replaced. This should be a warning to Microsoft, particularly when there are other credible alternatives.

The greatest engine of the air in WWII

August 19th, 2008

I love history, particularly the first 50 years of the 20th century. While reading about aircraft in WWII, I noticed something interesting: a good proportion of the best aircraft on the Allied side were powered by the same engine, the Rolls-Royce Merlin.

Why was the engine so important? It controls the speed that an aircraft could fly at, the range of the aircraft and to a lesser extent the ceiling. A faster aircraft can attack and quickly reposition for another attack, a faster aircraft can escape attacks.

Among the fighters: Spitfires, Hurricanes and probably the greatest fighter of the war, the P51 Mustang (powered by a Packard-built Merlin). Among the bombers: Lancaster, probably the greatest heavy bomber of the war (with the possible exception of the B29).

Most of the aircraft listed here were pivotal in their own way.

The Battle of Britain was won by Hurricanes and Spitfires (and to a certain extent the small fuel tanks of the German fighters crossing the Channel). The Spitfire, an icon of the Battle of Britain, went through 34 revisions and was still in service at the end of the war.

The P51’s performance made it one of the best fighters of the war but, more importantly, with drop tanks it had the range to go all the way to Berlin from England. This enabled the Allies to fly escort on the day bombing missions, drastically reducing losses.

The Mosquito was one of the most versatile aircraft of the war, remaining the fastest aircraft in Bomber Command until the end of the war. The Mosquito was used extensively in reconnaissance, as a medium bomber, for marking targets and even as a nightfighter. The Mosquito could fly at close to the performance of most of the Axis fighters and still deliver 4000lbs of bombs.

The Lancaster was the backbone of the British night bombing offensive. The Lancaster was famous for the dam buster raid and the sinking of the Tirpitz.

The Rolls-Royce Merlin was certainly the best engine among the Allied forces. It isn’t inconceivable that the course of the war might have been different had the engine not been built. Of course there were other interesting aircraft to come out of the war.

Exception handling

August 9th, 2008

I read a recent post that complained about the lack of error handling in Twitter.

My problem with this is that while the author is unhappy with the error handling in Twitter, no reasonable solution is provided.

In my (admittedly limited) experience of web applications, exceptions fall into three basic categories.

  1. The code is broken somewhere
  2. Platform instability: this might be an issue in the hardware/software platform stack that the application runs on. For example your server might have a bad stick of RAM or there might be a bug in php/.net/tomcat etc.
  3. Load issues: the app is overloaded, resulting in inability to connect to the database, file locking etc

None of these three items (although to a lesser extent 3) are issues you can plan for. If you knew where the bugs in your code were, you would fix them (duh). If there is an issue in the platform, you would either code around it or replace the defective parts of the platform. As for the last, load does interesting things, and it is hard to predict exactly what will break under load. In the end you can spend a lot of time writing code to handle expected load situations that do not occur.

My question is, what is the programmer supposed to do with these exceptions? At the least the error should be logged (with enough data to replicate it) for the development team so that they might be able to fix it.

You can take the tried and true option of throwing the whole thing into the user’s lap with a detailed error message. What is the user going to do with this? For a general user this is gobbledegook; even for a user who is a developer this only makes sense if they understand the application itself.

Or you can do what Twitter does: recognise that the information is essentially useless and simply apologise for the problem.
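Putting those two pieces together (log everything for the developers, apologise to the user) might look something like the following sketch. It is written in plain Python rather than for any particular web framework. The `handle_request` wrapper, the logger name and the short incident reference are all assumptions added for illustration; the reference is just one way to let a support request be matched to a log entry without showing the user a stack trace.

```python
import logging
import uuid

logger = logging.getLogger("webapp")  # hypothetical logger name


def handle_request(handler, request):
    """Run a request handler, logging failures and hiding details from the user."""
    try:
        return 200, handler(request)
    except Exception:
        incident = uuid.uuid4().hex[:8]
        # logger.exception records the full traceback for the developers.
        logger.exception("incident %s while handling %r", incident, request)
        # The user only sees an apology plus a reference they can quote.
        return 500, "Sorry, something went wrong on our side (ref %s)." % incident
```

The log entry carries the request and the stack trace, which is the data needed to replicate the problem, while the page the user sees stays free of gobbledegook.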

Jeff Atwood does get something right in this though: Twitter should try to let you know how long the site is going to be down for. However this is only really possible when the developers have assessed the situation resulting from the errors that have been logged and worked out how long the site/feature will be unavailable for.

Coding on whiteboards - interview procedure

August 2nd, 2008

Edited to improve the code samples slightly. Also still tweaking the CSS to get the code to display better.

Update 2: just found the preserve code formatting plugin. Fighting wordpress (which was completely screwing up the code tag) was no fun.

A lot of people recommend including a practical test as part of an interview for a programming position. Quite a few people, including some notable people, recommend doing this on a whiteboard.

I think that this stinks: somebody trying to write code on a whiteboard is no reflection on their abilities as a programmer. It isn’t just that it is so different to the way people normally write code: it penalises people who write code well. It is good programming practice to design the skeleton and then to put some flesh on those bones. For example, I have got into the habit of writing closing braces for blocks as soon as I write the opening brace. In my mind there is no question that this is a good idea, but it is based on the assumption that the space between the braces is effectively infinitely expandable, which is the case when writing normal code but not when writing code on paper or on a whiteboard.

Let’s take a simple function that retrieves some data from the database and writes it to the screen (C#, illustrative purposes only, not tested). I write code in multiple passes. The first pass through might look something like this:


// TODO: retrieve data

// TODO: loop through data

// TODO: write totals

The next pass would fill some of that in:


// retrieve data
DataTable data = this.GetData();

// loop through data
foreach (DataRow dataRow in data.Rows)
{
  TableRow tableRow = new TableRow();

  TableCell cell1 = new TableCell();
  cell1.Text = dataRow["label"].ToString();
  tableRow.Cells.Add(cell1);

  TableCell cell2 = new TableCell();
  cell2.Text = dataRow["amount"].ToString();
  tableRow.Cells.Add(cell2);

  this.Results.Rows.Add(tableRow);
}

// TODO: write totals

And some more in the next pass:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data
foreach (DataRow dataRow in data.Rows)
{
  TableRow tableRow = new TableRow();

  this.AddCell(tableRow, dataRow["label"].ToString());

  this.AddCell(tableRow, dataRow["amount"].ToString());

  total += Convert.ToInt32(dataRow["amount"]);

  this.Results.Rows.Add(tableRow);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total");
this.AddCell(totalRow, total.ToString());
this.Results.Rows.Add(totalRow);

// helper: add a text cell to a table row
private void AddCell(TableRow row, string value)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  row.Cells.Add(cell);
}

And probably a final pass, to alternate colours on the rows and set some styling on the total:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data, using the index to alternate row colours
for (int i = 0; i < data.Rows.Count; i++)
{
  DataRow dataRow = data.Rows[i];
  string style = "background-color:" + (i % 2 == 0 ? "#FFFFFF" : "#CCCCCC") + ";";

  TableRow tableRow = new TableRow();
  this.AddCell(tableRow, dataRow["label"].ToString(), style);
  this.AddCell(tableRow, dataRow["amount"].ToString(), style);
  total += Convert.ToInt32(dataRow["amount"]);

  this.Results.Rows.Add(tableRow);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total", "font-weight:bold");
this.AddCell(totalRow, total.ToString(), "font-weight:bold");
this.Results.Rows.Add(totalRow);

// helper: add a text cell with an optional inline style to a table row
private void AddCell(TableRow row, string value, string style)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  if (style.Length != 0) cell.Attributes["style"] = style;
  row.Cells.Add(cell);
}

And normally this would have been broken out into a number of functions, but I think the point is clear. One of the most frustrating experiences of my life, technology-wise, was hand-writing code as part of an exam.

In more complex code this is even worse: when you are writing the code it is not clear how long a block is.

Multiple failures suck

May 17th, 2008

So, someone might have noticed that my site has been down for a little while.

First my mail/web server died hard, complete hard drive failure. It was about this point that I discovered that my backup scripts were somewhat lacking. It was about this time that my file server died.

Everything backs up to the fileserver, which then backs up to a machine offsite and occasionally to a local desktop. However this doesn’t occur as regularly as it should. Piecing together the files remaining on my fileserver, the files from the offsite backup and the onsite backup got me most of my data back.

I almost lost 6 months of email, but managed to restore this from local stores of email.

All this has taken some time to get things back together, but I’m back online now.

Unfortunately I lost all the photos hosted here. I still have the actual photos, but the resized images and html pages are gone. And I’m somewhat disinclined to bring most of them back, given the work involved. So I’m going to have to delete most of the posts related to that.

MSDN Pricing

October 13th, 2007

I was looking to purchase a couple of new MSDN subscriptions for work recently and discovered something interesting. We were looking to purchase Visual Studio Professional MSDN Professional subscriptions.

The first interesting thing is that the pricing is the same for DVD and online distribution. This doesn’t make a whole lot of sense, as the DVDs cost something to produce. The more interesting thing is the pricing around the world.

Go here and select Australia in the top right corner. Click on Buy direct from Microsoft and select the appropriate subscription. Price: $2084.

Now, go back to the first page and change to United States. Select “How to Buy”, select “Buy or renew MSDN Subscriptions directly from Microsoft” under “Buy from Microsoft” (you might need to clear cookies for this). Select the appropriate subscription level. Price $1199 US.

Converting that to AUD comes to $1,326.33. Even adding GST only brings it up to $1458.96. Given that this is distributed online, there is no difference between distributing this product in Australia or the US. This is flat out ripping people off by over $600. Now it is possible that Microsoft hasn’t caught up with the freefall of the US dollar; however, in a global market this seems pretty silly.
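For anyone who wants to check the arithmetic, here it is spelled out as a few lines of Python. The prices are the ones quoted above; the 10% GST rate is the standard Australian rate, and the variable names are just for illustration.

```python
# Prices quoted in the post, all in AUD.
australian_price = 2084.00
us_price_in_aud = 1326.33  # the US$1199 price after conversion
gst_rate = 0.10            # Australian GST

# US price with GST added, as if it had been bought locally.
us_price_with_gst = round(us_price_in_aud * (1 + gst_rate), 2)

# What the Australian buyer pays on top of that.
difference = round(australian_price - us_price_with_gst, 2)

print("US price with GST: $%.2f" % us_price_with_gst)
print("Extra paid in Australia: $%.2f" % difference)
```

The gap comes out at a little over $625 per subscription, which is where the "over $600" figure comes from.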

I rang Microsoft (I do this occasionally, it doesn’t make any difference but I think it should be done) and a rather helpful guy couldn’t come up with an explanation. He did say I could buy a US subscription, but I would need to use support from the US.

Site re-arrangement

September 7th, 2007

I’ve rearranged the site a bit so it is a little more logical, putting the blog in a directory of its own rather than the root directory for the site.

Long time…

August 20th, 2007

It has been some time since I’ve written a post, largely due to being very busy with work. One thing of note has happened.

New laptop

I got a new work laptop, retiring the much loved Thinkpad T41.

The new laptop is a MacBook Pro, my first Mac. There were a number of really good reasons to get a Mac, but the major one was that we needed one to test the websites we build in Safari. Also, with Parallels (and Boot Camp), I can work on Windows just fine. That is a good thing as I primarily write code that runs on Windows.

Some initial impressions after working with it for a little while:

  • The hardware is very pretty
  • OS X is really, really slick. Very user friendly. Just picking one example, when you’ve used the wireless configuration tools in OS X, anything else seems clumsy.
  • Parallels is really good at virtualising Windows. I write code for Windows, running IIS, SQL Server, Visual Studio and a host of other basic applications. The single issue I’ve run across to date is that SQL Server Profiler doesn’t seem to resume when you pause it.
  • The keyboard on the laptop is really, really annoying if you write code. No single page up/down keys. Function keys don’t work as single keys (an issue for Windows). The killer is that the home/end keys require two keys. Any and every programmer would hit these keys all the time.
  • Wide screen sucks for code. I want longer screens, not wider screens.

The keyboard and screen were almost enough for me to wish for the thinkpad back.

BlueSoleil

May 27th, 2007

I bought a USB bluetooth adapter for Vic to sync her phone with her desktop. The adapter shipped with BlueSoleil, the software used to run the adapter.

About 5 minutes after installing it, I got an error saying:

A pirate copy is in use!
BlueSoleil will run in evaluation mode!

This was interesting because I’d just bought it. After googling for the issue, I found the following forum posts. One of the problems they suggest is that the MAC address of your hardware is already in use (eg 11.11.11.11.11.11).

Now I have some sympathy for IVT in this situation. They write software and they want to ensure that people pay them for using it. However it appears that their means for identifying pirates is pretty poor. It would seem to be similar to WGA, where it associates some sort of hardware hash with the MAC address of the bluetooth adapter. This breaks in numerous situations (moving between machines, changing hardware etc). It assumes that people will not want to move between machines, which is certainly not true for us as I will probably “borrow” this from Vic. IVT’s anti-piracy measures penalise legitimate customers, customers like myself.

You see, the adapter I bought certainly was licensed. However I think what may have happened is that someone at the store I bought it from tested it before selling it to me. The end result is that it has turned up as pirated. In addition, many users are buying the hardware and see software like BlueSoleil as tied to the hardware, not to one specific machine. In my case I would also want to use the adapter on my desktop as well as Victoria’s, which would likely flag the software as pirated again.

Undoing the damage

All in all this made me pretty annoyed. I installed a firewall on Vic’s desktop to find out what it was trying to connect to (in order to check whether it was pirated or not). It turns out this was the following IP address:
211.94.168.252

This closed down my first line of thought, which was to add an entry to the hosts file, redirecting the domain name to 127.0.0.1. With a direct IP address this isn’t possible. I was also disinclined to run a firewall on Vic’s machine permanently (I think the security they add is minimal), and I don’t have anything set up to run egress filtering from my network.

My next approach was to consider patching the binary files to either bypass or redirect the online check. This isn’t something I have done before, however I was feeling annoyed enough to think it might be a good idea. I ran Filemon to see what handles it opened up and sure enough, at the point where it popped up its error message, there was a call to one of the winsock dlls. I ran strings (a tool on Linux that pulls all human readable strings out of a binary file) to see if I could find the actual message. The fact that I couldn’t suggested to me that they may obfuscate the strings in the binaries. I also ran a quick search for the hex, int or dotted decimal representation of that IP address, which also found nothing. At that point I decided it might get a little hard to track down exactly where the problem lies.
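For the curious, the kind of search I mean looks roughly like this in Python. It is a sketch of the idea rather than the exact commands I ran: the function names are made up, and the list of encodings is only a reasonable guess at how an address might be stored in a binary.

```python
import ipaddress


def ip_search_patterns(dotted):
    """Return byte patterns an IPv4 address might appear as inside a binary."""
    addr = ipaddress.IPv4Address(dotted)
    packed = addr.packed  # four raw bytes, network (big-endian) order
    return {
        "dotted_ascii": dotted.encode("ascii"),
        "dotted_utf16le": dotted.encode("utf-16-le"),  # Windows wide strings
        "packed_big_endian": packed,
        "packed_little_endian": packed[::-1],
        "hex_text": packed.hex().encode("ascii"),
        "decimal_text": str(int(addr)).encode("ascii"),
    }


def find_patterns(data, patterns):
    """Map each pattern name found in data to the offset of its first match."""
    return {name: data.find(p) for name, p in patterns.items() if p in data}
```

You would read each dll or exe with `open(path, "rb").read()` and pass the bytes to `find_patterns`. An empty result for every file, as in my case, suggests the address is either assembled at runtime or obfuscated along with the strings.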

I also toyed with the idea of setting up a separate login account for BlueSoleil, one which denied permissions to the dll. I’d then set BlueSoleil to run on startup using runas.