Artificial Intelligence Journal

Tokenization: The Building Blocks of Spam

Heuristic components of a statistical spam filter

Some types of punctuation are very useful; for example, the exclamation point makes a remarkable difference between "free" and "free!" and so you want to use some punctuation marks as constituent characters. One of the problems a filter author might run into when allowing these types of characters, however, is redundancy. Most would agree that there's no real difference between "free!" and "free!!!!" in a message, as both are equally condemning characteristics of spam. On the other hand, messages in which symbols are used to b!r!e!a!k up a word may behave a bit differently.

Some authors will view punctuation as part of a token only if it appears at the end of the token. If an exclamation point appears elsewhere, it will be treated as a delimiter in most cases. For those punctuation marks that are permitted, we should consider working some method of de-duplication into our tokenizer, where only the first occurrence of the punctuation is used. We essentially look at "free!!!", "free!!!!!!!!!!", and "free!" as the same token by truncating the extra chaff. I've found that using the exclamation point as a constituent character slightly improves accuracy, which is the opposite of the effect that question marks appeared to have. This is probably because more spam uses an obnoxiously loud used-car-salesman type of pretense rather than actually posing questions. Perhaps one day, spammers will become more philosophical, and then question marks will become just as useful as exclamation points.
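The end-of-token rule and the de-duplication step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_on_bang is hypothetical, and the exclamation point is assumed to be the only punctuation mark kept as a constituent character.

```python
from typing import List


def split_on_bang(token: str) -> List[str]:
    """Treat interior '!' as a delimiter and keep at most one trailing '!'.

    Assumption: '!' is the only permitted constituent punctuation mark.
    """
    has_trailing = token.endswith("!")
    # Splitting on '!' yields empty strings for repeated marks; drop them.
    parts = [p for p in token.split("!") if p]
    if parts and has_trailing:
        # Re-attach a single '!' to the final piece only.
        parts[-1] += "!"
    return parts


print(split_on_bang("free!!!!!!!!!!"))  # the trailing run collapses to one mark
print(split_on_bang("b!r!e!a!k"))       # interior marks act as delimiters
```

With this rule, "free!", "free!!!", and "free!!!!!!!!!!" all map to the same token, while a word broken up with interior exclamation points falls apart into single letters.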

Some filters permit a certain window size before the token is truncated; for example, tokens may be allowed to have up to three exclamation points before being truncated, giving the filter three different meanings for "free!", "free!!", and the extremely guilty and shameless "free!!!" One of the advantages to doing this, other than measuring the three levels of unbridled fervor, is that it allows a really obnoxious message that uses all three tokens to fill up more slots in the decision matrix.
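A windowed variant might look like the following sketch. The name clamp_trailing and the default window of three are assumptions made for illustration, not taken from any particular filter.

```python
def clamp_trailing(token: str, mark: str = "!", max_run: int = 3) -> str:
    """Allow up to max_run trailing copies of mark and drop the rest.

    Assumption: mark is a single character.
    """
    core = token.rstrip(mark)
    run = len(token) - len(core)  # number of trailing marks found
    return core + mark * min(run, max_run)


for sample in ("free!", "free!!", "free!!!", "free!!!!!!!"):
    print(sample, "->", clamp_trailing(sample))
```

Setting max_run to 1 gives the plain de-duplication behavior, so the window size becomes a single tunable parameter to test against a corpus.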

It's important to truncate extraneous characters at some level, because otherwise spammers could easily exploit the lack of truncation to hide very spammy tokens; for example, a spammer wanting to hide the word "porn" could send "porn!!!" in the first spam and "porn!?!?!" the next time, so that in both cases the token would be considered a new token. Truncating will reduce both of these tokens to "porn!" or even "porn" if exclamation points are ignored altogether. Tokens should generally be limited to only one acceptable punctuation mark at the end, or to an N-sized window of homogeneous punctuation marks at the most.
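Mixed trailing punctuation can be reduced to one canonical form along these lines. This is a sketch under assumptions: the function name canonicalize and its parameters are hypothetical, and only "!" and "?" are treated as trailing punctuation.

```python
def canonicalize(token: str, keep: str = "!", strip_chars: str = "!?") -> str:
    """Strip any trailing mix of strip_chars, then keep at most one mark.

    Assumption: keep is the single permitted trailing mark; pass an
    empty string to ignore trailing punctuation altogether.
    """
    core = token.rstrip(strip_chars)
    tail = token[len(core):]  # the punctuation that was stripped off
    if keep and keep in tail:
        return core + keep
    return core


print(canonicalize("porn!!!"))             # same result as the line below
print(canonicalize("porn!?!?!"))
print(canonicalize("porn!?!?!", keep=""))  # punctuation ignored altogether
```

Both disguised variants collapse to one token, so the history the filter has already learned for that token still applies.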

Other Delimiters
Other delimiters used by many applications include the following:

  • brackets [ ]
  • braces { }
  • parentheses ( )
  • mathematical operators + - / * = < >
  • special characters | & ~ `
  • the at (@) sign
  • underscores and other rare characters
These delimiters frequently prevent the duplication of several different permutations of tokens, such as "when" and "(when". Other characters, such as the new line character, are also treated as delimiters. The nice thing about the way text is delimited is that it's going to result in unique tokens, even if the tokenization isn't perfect. This can be good or bad, but most of the time it's good. Even a token that isn't in human-readable format may be machine-readable and may occur with enough frequency to be a good identifier. In fact, Bayesian antivirus filtering uses an entirely different set of delimiters, because antivirus analysis involves the cataloging and analysis of several different binary sequences.
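As a rough illustration, the delimiter set listed above can be expressed as a single regular-expression character class. The names DELIMITERS and tokenize are hypothetical, and treating all whitespace as a delimiter is an assumption based on the mention of the newline character.

```python
import re

# Brackets, braces, parentheses, math operators, special characters,
# the at sign, underscores, and whitespace (including newlines).
DELIMITERS = r"\s\[\]{}()+\-/*=<>|&~`@_"
_SPLIT = re.compile("[" + DELIMITERS + "]+")


def tokenize(text: str) -> list:
    """Split text on runs of delimiter characters, dropping empty pieces."""
    return [piece for piece in _SPLIT.split(text) if piece]


print(tokenize("when (when [when]"))
```

Here "when", "(when", and "[when]" all produce the same token, which is the de-duplication effect described above.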

Some exceptions to the basic delimiters we've mentioned involve one-off instances where we actually want to preserve certain complete tokens. For example, IP addresses make for good spam markers, as do certain HTML character entities like &copy; and &nbsp;. If you're reading this book, there is most likely no shortage of spam in your inbox (or quarantine). Often the best way to discover new approaches to tokenization is to take a look at some of the text spammers are using in their samples. It's very important that the tokenizing approaches being used aren't biased against present-day spam.
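One way to preserve such tokens is to try the special patterns before the general word pattern. The sketch below assumes dotted-quad IPv4 addresses and named HTML entities only; the pattern and the function name tokenize_preserving are illustrative, not a description of any particular filter.

```python
import re

_TOKEN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"  # dotted-quad IP address, kept whole
    r"|&[A-Za-z]+;"             # named HTML entity, kept whole
    r"|[A-Za-z0-9$']+!?"        # ordinary word with an optional trailing '!'
)


def tokenize_preserving(text: str) -> list:
    """Return tokens, trying the special-case patterns first."""
    return _TOKEN.findall(text)


print(tokenize_preserving("Visit 192.168.0.1 now!&nbsp;&copy; 2004"))
```

Because the alternatives are tried in order, the IP address survives as one token even though the period and ampersand would otherwise act as delimiters.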

The tokenizing algorithm should be generic in such a way that it can easily break down any kind of natural language or new type of message style, but it shouldn't be so plain vanilla that the features it generates are likely to appear as common in all e-mail. It would be relatively easy to tokenize a message into individual characters, but that wouldn't be very useful, since the token "v" could occur in "viagra" or "violin". All-numeric tokens are generally not very useful on their own, but when combined with the proper punctuation (such as a dollar sign or exclamation mark) they can make a significant distinction between "19" and "$19" or between "95" and "95!". Provide enough information to allow the token to be set apart from the rest, but not so much that it is likely to show up only a handful of times.
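The treatment of numeric tokens can be illustrated with a small sketch. Dropping bare numbers entirely is an assumption made here for clarity (the text says only that they are generally not very useful), and the name tokenize_numbers is hypothetical.

```python
import re

# A token may start with '$' and may end with a single '!'.
_TOKEN = re.compile(r"\$?[A-Za-z0-9]+!?")


def tokenize_numbers(text: str, drop_bare_numbers: bool = True) -> list:
    """Keep '$19' and '95!' as distinct tokens; optionally drop bare '19'."""
    tokens = _TOKEN.findall(text)
    if drop_bare_numbers:
        # str.isdigit() is False once '$' or '!' is attached to the number.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


print(tokenize_numbers("only $19 or 95! for 19 days"))
```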

To some degree, this anal-retentive exercise is overrated. Any reasonable level of tokenization will most likely yield levels of accuracy above 99 percent, but making a mistake could cost a few misclassifications on occasion. I've found that using the question mark as a constituent character in my tests resulted in approximately three additional errors per 5,000. Experimentation and thorough testing are among the best ways to decide on the tokenization approach that works best for the filter.

Token Reassembly
Occasionally, tokens will turn out to be a little too small due to attempts by spammers to obfuscate them. When this happens, reassembling individual letters into a token can help improve accuracy. Let's look at an example of obfuscated text:

C/A/L/L/ N-O-W - I/T/S F_R_E_E
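One possible way to reassemble text like this is sketched below. It is only an illustration, not necessarily the approach any given filter takes: it assumes that a whitespace-delimited chunk made of single letters separated by single non-alphanumeric characters is an obfuscated word. The name reassemble is hypothetical.

```python
import re

# A single letter, then one or more (separator, letter) pairs,
# then an optional trailing separator.
_OBFUSCATED = re.compile(r"[A-Za-z](?:[^A-Za-z0-9\s][A-Za-z])+[^A-Za-z0-9\s]?")


def reassemble(text: str) -> list:
    """Join single letters split by separators back into whole tokens."""
    tokens = []
    for chunk in text.split():
        if _OBFUSCATED.fullmatch(chunk):
            # Drop the separators and keep only the letters.
            chunk = re.sub(r"[^A-Za-z]", "", chunk)
        tokens.append(chunk)
    return tokens


print(reassemble("C/A/L/L/ N-O-W - I/T/S F_R_E_E"))
```

A rule this simple will also merge legitimate strings such as "U.S." into "US", so it would need testing against real mail before being trusted.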

More Stories By Jonathan A. Zdziarski

Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
