Artificial Intelligence Journal


Tokenization: The Building Blocks of Spam

Heuristic components of a statistical spam filter

If the tokenizer we're using considers underscores, dashes, and slashes to be token delimiters, then instead of ending up with four one-word tokens, we'll end up with 14 single-character tokens. Some filter authors believe it's healthy to allow these individual characters to tokenize, while others believe that the resulting information is too generalized to be a good indicator of anything, at least not without risking false positives.

Filter authors who share the latter philosophy can use token reassembly to join the original tokens back together. Token reassembly isn't a perfect science, but it provides more useful tokens to work with. The tokens "VIA" and "GRA" are much more useful than individual characters and are definitely more indicative of spam. Token reassembly concatenates single-character tokens that are adjacent to one another, treating larger runs of white space amid the slicing and dicing as likely word boundaries in order to make an educated guess about which characters go together.

Since statistical filtering involves machine learning and not human learning, tokens like this are very useful to the computer, even though they may not make much sense to us. For example, the token "VIA" really doesn't mean much, which is exactly why it makes a great indicator of spam - you'd rarely see the word "VIA" in a legitimate message unless you were talking about motherboards. The token "GRA" is even rarer in legitimate mail. The fact that these tokens aren't necessarily comprehensible to a human makes them easier to identify in spam. My dataset considers some of these fractional words to be extreme indicators of spam:

Agra S: 00030 I: 00000 P: 0.9999
Eacute S: 00021 I: 00000 P: 0.9999
Prematur S: 00020 I: 00000 P: 0.9999
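
The reassembly step described above can be sketched roughly as follows. This is only an illustration, not the code of any particular filter: the function name `reassemble`, the delimiter set, the `max_space` threshold, and the sample input are all assumptions made for the example.

```python
import re

# Assumed delimiter set: underscores, dashes, slashes, and white space.
# The capturing group makes re.split() return the delimiters as well.
DELIMITERS = re.compile(r"([_\-/\s]+)")

def reassemble(text, max_space=1):
    """Join runs of adjacent single-character tokens into larger tokens.

    A run is broken when the gap before a character contains more than
    `max_space` white-space characters (a hypothetical threshold), which
    is taken as a hint that a new word starts there.
    """
    parts = DELIMITERS.split(text)  # token, delimiter, token, delimiter, ...
    tokens, run = [], ""
    for i in range(0, len(parts), 2):
        token = parts[i]
        if not token:
            continue  # empty string at the start or end of the text
        gap = parts[i - 1] if i > 0 else ""
        wide_gap = sum(ch.isspace() for ch in gap) > max_space
        if len(token) == 1 and run and not wide_gap:
            run += token  # extend the current run of single characters
            continue
        if run:
            tokens.append(run)  # close the run that just ended
            run = ""
        if len(token) == 1:
            run = token  # start a new run
        else:
            tokens.append(token)  # ordinary multi-character token
    if run:
        tokens.append(run)
    return tokens

print(reassemble("V_I_A  G_R_A"))        # ['VIA', 'GRA']
print(reassemble("buy V-I-A-G-R-A now")) # ['buy', 'VIAGRA', 'now']
```

The two spaces in the first input are what split the run into "VIA" and "GRA"; with a single space the characters would be joined into one token.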

Another solution Graham introduced into tokenization is called degeneration. Degeneration allows a token that hasn't been seen before to be reduced in complexity (location, case, and punctuation) step by step until it matches a simpler token that is already in the dataset. For example, consider the use of the word "FREE!!!" in the subject. If it has never been seen before in the subject, degeneration has us reduce the token until it matches something we have seen before.


Degeneration has a lot of room for customization, including the order in which the tokens decrease in complexity. At the very least, degeneration of punctuation is a wise move. If the word "free!" doesn't exist in the dataset yet, it makes good sense to use the value from a similar token.
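
As a rough sketch of how degeneration might be implemented, the generator below yields progressively simpler forms of a token and the lookup stops at the first one found in the dataset. The `Subject*` prefix for marking location, the particular reduction order (punctuation first, then case, then location), the function names, and the probabilities in the sample dataset are assumptions chosen for illustration; as noted above, the order is open to customization.

```python
def degenerate(token):
    """Yield progressively simpler forms of `token`, most specific first.

    Assumes a location prefix, if present, is separated from the word by
    "*" (for example "Subject*FREE!!!") and that the word itself does not
    contain "*".
    """
    context, sep, word = token.rpartition("*")
    stem = word.rstrip("!?.")   # the word without trailing punctuation
    punct = word[len(stem):]    # the trailing punctuation, possibly empty
    seen = set()
    for prefix in (context + sep, ""):         # drop the location last
        for tail in (punct, punct[:1], ""):    # "!!!" -> "!" -> ""
            for body in (stem, stem.lower()):  # original case, then lowercase
                candidate = prefix + body + tail
                if candidate not in seen:
                    seen.add(candidate)
                    yield candidate

def lookup(token, dataset):
    """Return (matched_token, value) for the first degenerate form found."""
    for candidate in degenerate(token):
        if candidate in dataset:
            return candidate, dataset[candidate]
    return None

# Illustrative values only, not taken from a real dataset.
dataset = {"free!": 0.98, "Subject*free": 0.99}
print(lookup("Subject*FREE!!!", dataset))  # ('Subject*free', 0.99)
```

Because location is reduced last in this sketch, a match in the subject is preferred over a match on the bare word, even when the bare word keeps more of its punctuation.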

Header Optimizations
Most filter authors agree that a token in the subject header is very different from a token in the message body, and that the same token appearing in two different headers is distinct enough to warrant tracking each occurrence separately. Header tokens are usually processed differently from body tokens so that the origin of each token is preserved. Let's look at an example of an e-mail with a lot of useful header information.

From: [email protected]
To: [email protected]
Reply-To: [email protected]
Subject: ADV: FREE Mortgage Rate Quote - Save THOUSANDS! kplxl
X-Keywords: Save thousands by refinancing now. Apply from the privacy of your home and receive a FREE no-obligation loan quote.

Rates are Down. YOU Win!
Self-Employed or Poor Credit is OK!
Get CASH out or money for Home Improvements, Debt Consolidation and more. Interest rates are at the lowest point in years-right now! This is the perfect time for you to get a FREE quote and find out how much you can save!

In the spam shown here, several different tokens stand out. First, if my e-mail address happened to be [email protected], I wouldn't expect to be seeing it in the From: header, but it would be very normal in the To: header. Seeing my own e-mail address in the From: header would be a clear indicator of spam, since most people don't usually send e-mail to themselves unless they've had too much to drink.

Second, the word "Save" appears in both the subject line and the message body. I would expect to see it in the message body more frequently in legitimate mail - for example, "Save your files in the blue folder" or "Save me from this dreaded cubicle." Seeing the word "Save" in the subject header is much more suspicious, though, and it makes sense for me to have a different entry in the dataset for each of them.
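
One way to keep separate dataset entries for header and body tokens is to tag each header token with the name of the header it came from. The sketch below uses Python's standard `email` module; the `Header*token` naming, the list of tracked headers, the simple white-space tokenizer, and the example address are assumptions made for illustration, and the message is assumed not to be multipart.

```python
from email import message_from_string

# Headers whose tokens get their own dataset entries (an assumed selection).
TRACKED_HEADERS = ("From", "To", "Reply-To", "Subject")

def words(text):
    """Split on white space and trim common sentence punctuation."""
    return [w for w in (p.strip(".,:;") for p in text.split()) if w]

def tokenize_message(raw):
    """Return header tokens tagged with their origin, then plain body tokens."""
    msg = message_from_string(raw)
    tokens = []
    for header in TRACKED_HEADERS:
        for value in msg.get_all(header, []):
            tokens += [f"{header}*{w}" for w in words(value)]
    tokens += words(msg.get_payload())  # a string for non-multipart messages
    return tokens

raw = (
    "From: me@example.com\n"
    "To: me@example.com\n"
    "Subject: Save THOUSANDS!\n"
    "\n"
    "Save your files in the blue folder.\n"
)
tokens = tokenize_message(raw)
print(tokens[:4])
# ['From*me@example.com', 'To*me@example.com', 'Subject*Save', 'Subject*THOUSANDS!']
print("Subject*Save" in tokens, "Save" in tokens)  # True True
```

With this tagging, "Subject*Save" and "Save" accumulate their own statistics, and the recipient's own address in the From: header ("From*me@example.com" here) becomes a distinct token from the same address in the To: header.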

More Stories By Jonathan A. Zdziarski

Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
