Artificial Intelligence Journal


Tokenization: The Building Blocks of Spam

Heuristic components of a statistical spam filter

A few filters, such as CRM114, perform this type of word skipping: they tokenize something like "manh+<!rescind>+ood" into its fragments, and also help the filter "see" the original token by rejoining them: "manh+ood." Since tokenization is an imperfect process, approaches like this give the filter more machine-readable tokens to work with without requiring much extra effort. The more permutations of tokens are created, however, the larger and more spread out the dataset becomes, which can hurt accuracy. The sheer volume of data generated by SBPH turns many filter authors off to it, in favor of simpler functions such as HTML comment filtering.
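The comment-splitting behavior described above can be sketched as follows. This is a minimal illustration, not CRM114's actual implementation; the function name and the comment-matching pattern are assumptions for the example.

```python
import re

# Pattern for an embedded HTML-style comment tag such as "<!rescind>".
# This is an illustrative pattern, not the one any particular filter uses.
COMMENT = re.compile(r"<![^>]*>")

def tokenize(text):
    """Emit both the comment-split fragments and the rejoined word."""
    tokens = []
    for word in text.split():
        if COMMENT.search(word):
            parts = [p for p in COMMENT.split(word) if p]
            tokens.extend(parts)           # fragments the spammer created
            tokens.append("".join(parts))  # rejoined word the filter "sees"
        else:
            tokens.append(word)
    return tokens
```

Given "manh<!rescind>ood", this yields the fragments "manh" and "ood" plus the rejoined token "manhood" — the extra permutations that enlarge the dataset, as noted above.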

The tokenization methods discussed thus far have covered only standard character sets; the issue of foreign languages will eventually require a solution of its own. Most spam filters simply replace wide characters with a placeholder, such as the letter "z" or an asterisk. This lets the filter catch just about any message written in a wide character set. Some users, however, may expect to receive e-mail from others who write in such a language, and for them this approach won't function well at all: the filter can classify only on header data, because the rest of the body is reduced to a string of identical placeholders.
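The placeholder substitution described above can be sketched as follows. This is a hypothetical illustration: the function name, the ASCII threshold, and the choice of "z" as the placeholder (one of the two placeholders the text mentions) are assumptions.

```python
def placeholder_tokenize(text, placeholder="z"):
    """Collapse any word containing wide (non-ASCII) characters
    to a single placeholder token."""
    tokens = []
    for word in text.split():
        if any(ord(ch) > 127 for ch in word):
            tokens.append(placeholder)  # whole word becomes the placeholder
        else:
            tokens.append(word)
    return tokens
```

A body written entirely in a wide character set collapses to a run of identical "z" tokens, which is why, for users who legitimately receive such mail, only the headers carry usable evidence.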
Some filters implement internationalization (i18n), which adds support for additional languages. To make matters more complicated, however, some languages don't delimit words with white space, making it very difficult to identify words at all. This commonly calls for more advanced solutions, such as variable-length n-grams.
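Character n-grams sidestep word boundaries entirely by sliding a fixed-size window over the text. The sketch below, with assumed function names and window sizes, shows both a single-size n-gram pass and the variable-length variant mentioned above, which emits grams of several sizes from the same text.

```python
def char_ngrams(text, n=2):
    """Emit all overlapping character n-grams of a single size."""
    text = text.replace(" ", "")  # no reliance on word boundaries
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def variable_ngrams(text, sizes=(2, 3)):
    """Variable-length variant: emit grams for each window size."""
    return [gram for n in sizes for gram in char_ngrams(text, n)]
```

For a whitespace-less script, every substring of the chosen sizes becomes a token, so the filter can learn statistics without ever identifying a "word."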

Final Thoughts
We've run the gamut of approaches to tokenizing in this article. Tokenizing strives to define content by identifying its construct and, more importantly, its root components. This is a noble quest but, as with other areas of machine learning, may eventually be a function better left to the computer. As new types of neural decision-making algorithms surface, the analysis of unformatted text may become one of the next forms of AI. Until then, tokenizing remains one of the few heuristic components of a statistical spam filter. It should therefore be respected and kept somewhat simple, so as not to require maintenance in the years to come.

This article is an excerpt from Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification.
Printed with permission from No Starch Press. Copyright 2005.

More Stories By Jonathan A. Zdziarski

Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next-generation spam filter DSPAM, which achieves up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
