The intelligence of machines and the branch of computer science which aims to create it

Artificial Intelligence Journal


Tokenization: The Building Blocks of Spam

Heuristic components of a statistical spam filter

It is probably better to use an exclusionary list rather than an inclusionary one. With either kind of list, you're likely to miss a few tags, or to leave out tags you never thought could be used in spam (for example, the object tag has recently become popular). If you miss a tag on an exclusionary list, at worst the tag will sit and collect dust in the dataset with some neutral value or will fill up a decision matrix slot in error. If you fail to add a tag to an inclusionary list, though, you're bound to ignore an important data point and may not even realize it.
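The difference between the two kinds of list can be sketched in a few lines of Python. This is only an illustration, not any particular filter's implementation: the EXCLUDED_TAGS set, the TagTokenizer class, and the "TAG:ATTR=value" token format are assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical exclusionary list: tags we choose NOT to tokenize.
# Anything we forget to list here is still tokenized.
EXCLUDED_TAGS = {"br", "p"}


class TagTokenizer(HTMLParser):
    """Collects one token per attribute of every tag not on the excluded list."""

    def __init__(self, excluded=EXCLUDED_TAGS):
        super().__init__()
        self.excluded = excluded
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag in self.excluded:
            return
        if not attrs:
            self.tokens.append(tag.upper())
        for name, value in attrs:
            if value is None:
                # Attribute written without a value
                self.tokens.append(f"{tag.upper()}:{name.upper()}")
            else:
                self.tokens.append(f"{tag.upper()}:{name.upper()}={value}")


def tag_tokens(html):
    """Return the tag tokens found in an HTML fragment."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Because the list is exclusionary, an unanticipated tag such as object still produces tokens. With an inclusionary list, the same tag would be dropped silently unless someone remembered to add it.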

Some of the HTML tags commonly used by spammers (which a filter should definitely be looking at) include the following:

Some filters like to mark the tokens generated from HTML tags with an "HTML" identifier, while others go so far as to mark the particular tag the text belonged to (for example, "BODY:BGCOLOR=#FFFFFF"). Regardless of which tags the filter decides to keep and which get discarded, it's very important to handle HTML comments correctly. Spammers are using many tricks to obfuscate their text so that it's human readable, but not very machine readable. For example, the following may look like a complete mess in its machine-readable format:

Received: from (
Message-ID: <[email protected]>
From: "patsy stamm" <[email protected]>
Reply-To: "patsy stamm" <[email protected]>
Subject: Giving this to you
Date: Fri, 08 Aug 03 07:29:02 GMT
X-Mailer: MIME-tools 5.503 (Entity 5.501)
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="AD0E55.76_15.C"
X-Priority: 3
X-MSMail-Priority: Normal
Content-Type: text/html;
Content-Transfer-Encoding: quoted-printable

Yes you he<!lansing>ard about th<!crossbill>ese weird <!cottony>little pil<!domesday>ls

that are suppo<!=anabel>sed to make you bigger and of cou<!chord>rse you think they're b<!soften>ogus snake potion. Well, let's look at the facts:

<strong>GRX2 has be<!waldron>en sold over 1.9 Mill<!audacity>ion times within the last 18 months</strong>...
With awe<!tapestry>some results for hun<!wield>dreds of thous<!locale>ands of men all over the planet! They all enjoy a seriously enhanced version of their manh<!rescind>ood and <b>why shou<!seoul>ldn't you</b>?

But when the user clicks the message to read it, the HTML comments won't be visible and the user will see this:

Yes you heard about these weird little pills that are supposed to make you bigger and of course you think they're bogus snake potion. Well, let's look at the facts: GRX2 has been sold over 1.9 Million times within the last 18 months... With awesome results for hundreds of thousands of men all over the planet! They all enjoy a seriously enhanced version of their manhood and why shouldn't you?

A simple way to ensure that the message is tokenized correctly is to remove the HTML comments and reassemble the message.
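A minimal sketch of that step in Python follows. The strip_html_comments name and the regular expression are assumptions made for this example. The pattern removes standard comments and also the short "<!word>" form seen in the sample message, which mail clients typically do not display.

```python
import re

# Matches standard comments (<!-- ... -->) first, then any other
# "<!...>" declaration, which covers the "<!lansing>" style used above.
# Note: this also removes declarations such as <!DOCTYPE ...>, which is
# harmless when the goal is only to tokenize the visible text.
COMMENT_RE = re.compile(r"<!--.*?-->|<![^>]*>", re.DOTALL)


def strip_html_comments(html):
    """Remove HTML comments so that split words are rejoined."""
    return COMMENT_RE.sub("", html)


sample = "Yes you he<!lansing>ard about th<!crossbill>ese weird <!cottony>little pil<!domesday>ls"
print(strip_html_comments(sample))
```

The comments are replaced with an empty string rather than a space, so the word fragments on either side are joined back together before tokenization.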

Word Pairs
Using word pairs, a simple form of n-grams, has recently become very popular among authors of statistical filters and adds several benefits to standard single-token filtering. Pairing words together creates more specialized tokens. For example, the word "play" is fairly neutral on its own, because it can describe many different things. Pairing it with the adjacent word yields a token that stands out more when it occurs, such as "play lotto." This approach also helps improve the processing of HTML components by identifying the different types of generators used to create the HTML messages. Each generator, whether it's a legitimate mail client or a spam tool, has its own unique signature, which joining tokens together can help to highlight. Tokenizers that implement these types of approaches are referred to as concept-based tokenizers, because they identify concepts in addition to content.
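As a rough illustration, word pairs can be generated by joining each word with its right-hand neighbor. The word_pairs function and its word-splitting pattern are assumptions for this sketch. A real filter would have its own rules for delimiters and case.

```python
import re


def word_pairs(text):
    """Return adjacent word pairs from the text, lowercased."""
    # Assumed word pattern: letters, digits, and a few characters
    # that often carry meaning in spam, such as "$".
    words = re.findall(r"[A-Za-z0-9$'-]+", text.lower())
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


print(word_pairs("Play lotto today"))
```

A filter would typically store these pairs alongside the single-word tokens, so that "play" keeps its neutral value while "play lotto" earns a value of its own.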

Sparse Binary Polynomial Hashing
Bill Yerazunis originally introduced the concept known as SBPH, or sparse binary polynomial hashing. SBPH is an approach to tokenization using word pairs and phrases. If it weren't so effective at what it does, it would probably be a terrible idea, but Yerazunis has repeatedly astonished the spam-filtering community with the leaps in accuracy made by SBPH tokenization. Paul Graham refers to Yerazunis' work with the same mixed feelings regarding its ingenuity and need for medication:

"Another project I heard about . . . was Bill Yerazunis' CRM114. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing."

SBPH tokenizes entire phrases, up to five tokens across, and allows for word skipping in between. It led the way in terms of accuracy for a long period of time, but it also created an enormous amount of data, which is one of the reasons it presently functions only in a train-on-error environment. SBPH provides the benefit of using the simplest, most colloquial tokens but giving special notice to more complex tokens as well, which are usually much stronger indicators of spam when they appear.
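The phrase generation described above can be sketched as follows. This is a simplified illustration and not CRM114's actual code. It assumes a sliding window of five tokens in which the newest token is always kept and each earlier token is either kept or skipped, and it leaves out the hashing step entirely. The sbph_features name and the "<skip>" placeholder are made up for the example.

```python
from itertools import product

WINDOW = 5         # maximum phrase width, in tokens
SKIP = "<skip>"    # hypothetical placeholder for a skipped word


def sbph_features(tokens, window=WINDOW):
    """Enumerate SBPH-style phrases: every keep/skip combination of the
    earlier tokens in each window, always ending with the newest token."""
    features = []
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        earlier = tokens[start:end]
        newest = tokens[end]
        for keep in product((False, True), repeat=len(earlier)):
            parts = [tok if kept else SKIP for tok, kept in zip(earlier, keep)]
            parts.append(newest)
            # Drop leading placeholders; they carry no information.
            while len(parts) > 1 and parts[0] == SKIP:
                parts.pop(0)
            features.append(" ".join(parts))
    return features


features = sbph_features("play lotto and win today".split())
print(len(features))
```

Under these assumptions, each full five-token window yields 16 phrases, ranging from the single newest word to the complete five-word phrase. This gives a sense of why the approach generates so much data compared with single-token filtering.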

More Stories By Jonathan A. Zdziarski

Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
