Artificial Intelligence Journal

Tokenization: The Building Blocks of Spam

Heuristic components of a statistical spam filter

The word "FREE" also shows up in both the subject line and the message body but, in this case, both occurrences are strongly guilty indicators of spam. The filter still benefits here because the tokens "FREE" and "Subject*FREE" can now take up two slots in my decision matrix, further condemning the spam. Header tokens are extremely useful for identifying both spam and legitimate mail.
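
As a rough illustration of how a body token and a header token can coexist, here is a minimal Python sketch. The delimiter set and the `tokenize` helper are my own assumptions for the example, not the code of any particular filter; only the "Header*word" naming comes from the discussion above.

```python
import re

# Assumed delimiter set for illustration; real filters choose their own.
DELIMITERS = re.compile(r"[\s.,;:!?\"()<>\[\]]+")

def tokenize(text, header=None):
    """Split text into tokens, prefixing each with 'Header*' when a header name is given."""
    words = [w for w in DELIMITERS.split(text) if w]
    if header is None:
        return words
    return [f"{header}*{w}" for w in words]

# The same word yields two distinct tokens depending on where it was seen.
subject_tokens = tokenize("FREE offer inside", header="Subject")
body_tokens = tokenize("Get your FREE sample today!")
all_tokens = set(subject_tokens) | set(body_tokens)
```

Because "FREE" and "Subject*FREE" are different strings, each keeps its own counts in the dataset and each can occupy its own slot when the message is scored.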

Other types of header tokens are often useful as well, although the set of delimiters used in the headers is usually slightly different from the set used in the message body. For example, if I want to catch whole IP addresses in the Received: headers, I would treat a period as a constituent character (part of the token) instead of a separator. If I wanted to tokenize the message-id, I'd also include the @ sign as a delimiter, since it separates some pieces of the message-id.
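
The per-header delimiter idea can be sketched in Python as follows. The exact delimiter sets, the `tokenize_header` helper, and the sample header values are assumptions made up for this example; the only points taken from the text are that a period stays inside Received: tokens and that @ splits the message-id.

```python
import re

# Assumed per-header delimiter sets (illustrative only).
HEADER_DELIMITERS = {
    # No period here, so dotted IP addresses and hostnames survive as one token.
    "Received": r"[\s;,()\[\]<>]+",
    # '@' is a delimiter, separating the pieces of the message-id.
    "Message-Id": r"[\s<>@]+",
}
DEFAULT_DELIMITERS = r"[\s.,;:!?()<>\[\]]+"

def tokenize_header(name, value):
    """Tokenize one header value with that header's own delimiters."""
    pattern = HEADER_DELIMITERS.get(name, DEFAULT_DELIMITERS)
    return [f"{name}*{t}" for t in re.split(pattern, value) if t]

received = tokenize_header(
    "Received", "from mail.example.com ([192.0.2.10]) by mx.example.org"
)
message_id = tokenize_header("Message-Id", "<20040101.12345@mail.example.com>")
```

With these assumed settings, the address 192.0.2.10 comes out as the single token "Received*192.0.2.10" rather than four separate numbers.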

Another advantage of including the header name as part of the token is that it helps to create a virtual "whitelist" of users you trust. If I exchange a lot of correspondence with a user named bobsmith, tokens like "From*bobsmith", along with a From* token for his mail domain, will start to appear in the dataset, usually with very innocent values. This works equally well for identifying the hostnames of trusted mail servers in the Received: header.

URL Optimizations
Everyday, innocent-sounding words like "order" and "cgi" often appear in the body of messages I receive from legitimate mailing lists. Seeing them appear in a URL, however, is much more suspicious. URLs are the spammers' preferred means of contact. It's much easier to run a scam using a Web site as your point of contact than it is to pay for the overhead of a phone system or mail processing department. Spammers also like their privacy, since the rest of the free world hates them, and they prefer that even customers not know how to contact them or the companies they spam for. Whether it's a link to click to visit a site or the URL of an image inside the message, URLs provide a lot of useful information specific to their own kind. Even nonsensical numbers will frequently stand out in URLs. This makes really good data for identifying not only spam but also some legitimate mailing lists that use URLs in their unsubscribe tag lines. Users who are subscribed to mailing lists that frequently include embedded advertisements (such as Yahoo Groups) will notice that the URLs used in these advertisements have specific characteristics that help the filter distinguish between advertising and real spam.

URLs are frequently tokenized differently from the rest of a message. The only delimiters usually used when tokenizing a URL are the slash, question mark, equals sign, period, and colon, although some filter authors perform the same basic type of token separation as they do in the rest of the message body. URL-specific delimiters are used because the individual tokens are more meaningful as components of the URL's path than as words in some specific context inside the URL. Regardless of how they are tokenized, URLs, when analyzed, can yield a lot of useful information. They can be categorized as places you want to go and places you don't want to go. A spam containing places you don't want to go is just as informative as a legitimate message containing places you do.
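
A minimal Python sketch of URL-specific tokenization, using exactly the five delimiters named above. The `tokenize_url` helper and the sample URL are hypothetical; the "Url*" prefix follows the token naming used in this article.

```python
import re

# The five URL delimiters: slash, question mark, equals sign, period, colon.
URL_DELIMITERS = re.compile(r"[/?=.:]+")

def tokenize_url(url):
    """Split a URL on URL-specific delimiters and prefix each piece with 'Url*'."""
    return [f"Url*{part}" for part in URL_DELIMITERS.split(url) if part]

# Hypothetical URL, used only to show the shape of the output.
tokens = tokenize_url("http://www.example.biz/cgi/order?id=4471")
```

Words such as "order" and "cgi" end up as "Url*order" and "Url*cgi", so they are tracked separately from the same words seen in ordinary body text.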

Url*getitrightnowwholesale  S: 00026  I: 00000  P: 0.9999
Url*thesedealzwontlast      S: 00026  I: 00000  P: 0.9999
Url*biz                     S: 00008  I: 00000  P: 0.9998
Url*us                      S: 00000  I: 00050  P: 0.0001
Url*java                    S: 00018  I: 00000  P: 0.9999
Url*www                     S: 00000  I: 00030  P: 0.0001
Url*com                     S: 00000  I: 00033  P: 0.0001
Url*img                     S: 00066  I: 00000  P: 0.9999

Ordinary URL components such as "www" and "com" rarely score as spammy, while the wild and obnoxious names always pop up. The exception is "java," which appeared as spammy only because this user doesn't use Java (not because Java programmers were spamming). The appearance of certain naming conventions, such as the extensive use of "img," makes the task of identifying malicious URLs pretty easy. If we wanted to, we could probably determine the disposition of the message based on the URL information alone.
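
To give a feel for judging a message on URL tokens alone, here is a hedged Python sketch. The probabilities are copied from the listing above, but the combining rule is an assumption on my part: it is the simple naive Bayesian product popularized by Paul Graham, not necessarily what any given filter uses. The helper names and the two sample URLs are hypothetical.

```python
import math
import re

# P values taken from the token listing above.
TOKEN_PROBABILITY = {
    "Url*thesedealzwontlast": 0.9999,
    "Url*biz": 0.9998,
    "Url*img": 0.9999,
    "Url*www": 0.0001,
    "Url*us": 0.0001,
    "Url*com": 0.0001,
}

def url_tokens(url):
    """Tokenize a URL on slash, question mark, equals sign, period, and colon."""
    return [f"Url*{part}" for part in re.split(r"[/?=.:]+", url) if part]

def spam_score(url):
    """Combine known token probabilities with a naive Bayesian product (assumed rule)."""
    probs = [TOKEN_PROBABILITY[t] for t in url_tokens(url) if t in TOKEN_PROBABILITY]
    if not probs:
        return 0.5  # nothing known about this URL: stay neutral
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

spammy = spam_score("http://www.thesedealzwontlast.biz/img/offer.gif")
innocent = spam_score("http://www.example.us/")
```

In this sketch, tokens missing from the table are simply skipped. A real filter would also have to decide how to treat hapaxes and how many tokens to admit into the decision matrix.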

URLs containing well-known Web addresses, by contrast, are likely to appear as innocent tokens or as hapaxes (tokens seen too rarely to have earned a reliable value). Not a single URL token containing the following words has ever appeared in my corpus as spammy:

  • Url*microsoft
  • Url*quicken
  • Url*whitehouse
  • Url*intuit
  • Url*sco
  • Url*_amazon
  • Url*linux
  • Url*fbi

HTML Tokenization
One area that has plagued many filter authors is the decision as to which HTML to include and which parts of the message to ignore. For example, should we ignore JavaScript? What about font tags? Most filters pay attention to all HTML tags except those on an exclusionary list, namely, a specific set of tokens that are common to all types of e-mail. This approach works quite well, but there's still room for improvement. Ignoring data is always something to be concerned about, and you shouldn't do it unless you have good reason. The justification for ignoring some HTML data is that many people normally converse only with senders who do not use HTML. If every tag were counted, the filter could learn to treat HTML itself as guilty and reject any message with embedded HTML as spam, which could be bad for the recipient if their boss suddenly started using an HTML-enabled mail client. The tags most filters ignore include:

  • td
  • !doctype
  • blockquote
  • table
  • tr
  • div
  • p
  • body
  • Short tags, with fewer than N characters of content
  • Tags whose content contains no spaces
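
The exclusion rules in the list above could be sketched in Python like this. It is a simplified illustration built on several assumptions: a regular expression stands in for a real HTML parser, the value chosen for N is arbitrary, and `interesting_tags` is a hypothetical helper rather than any filter's actual routine.

```python
import re

# Tag names from the exclusionary list above.
IGNORED_TAGS = {"td", "!doctype", "blockquote", "table", "tr", "div", "p", "body"}
MIN_TAG_LENGTH = 5  # stands in for "N"; the value 5 is an arbitrary assumption

TAG_PATTERN = re.compile(r"<([^<>]+)>")

def interesting_tags(html):
    """Return the content of each tag that survives the exclusion rules."""
    kept = []
    for content in TAG_PATTERN.findall(html):
        content = content.strip()
        parts = content.lstrip("/").split()
        if not parts:
            continue  # nothing but a slash or whitespace
        if parts[0].lower() in IGNORED_TAGS:
            continue  # on the exclusionary list
        if len(content) < MIN_TAG_LENGTH:
            continue  # short tag, fewer than N characters of content
        if " " not in content:
            continue  # tag content contains no spaces
        kept.append(content)
    return kept

sample = (
    '<body bgcolor="white"><table><tr><td>'
    '<font color="red">Hi</font>'
    '<a href="http://x.example/img">go</a>'
    "</td></tr></table><b>x</b></body>"
)
kept = interesting_tags(sample)
```

Only the font and anchor tags survive here. The structural tags that nearly every HTML message shares are dropped, and the surviving tag content can then be tokenized like any other part of the message.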

About Jonathan A. Zdziarski

Jonathan A. Zdziarski has been fighting spam for eight years and has spent a significant portion of the past two years working on DSPAM, a next-generation spam filter with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
