By Jonathan A. Zdziarski
October 3, 2005 03:00 AM EDT
This article is an excerpt from Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. Printed with permission from No Starch Press. Copyright 2005.
Unlike older spam filters, in which the author programs the characteristics of spam, statistical filtering automatically chooses the characteristics (or "features") of spam and nonspam directly from each e-mail. Two years from now, when spam has evolved in content, statistical filters will have learned enough to continue doing their job, because they identify the damning features of a spam from message content rather than from rules fixed in advance by the author.
Tokenization is the process of reducing a message to its component parts. These components can be individual words, word pairs, or other small chunks of text. Data generated by the tokenizer is ultimately passed to the analysis engine, where it is interpreted. How the data is interpreted is important, but not as important as the quality of the data being passed. In other words, the way that a message is tokenized matters more than what we do with it later; even a simple change in tokenization can affect the accuracy of the filter. From a philosophical point of view, this raises the question, "What is content?" If content were just words on a page, then tokenizing only complete alphabetical words should be sufficient, but content is much more than that, as we'll see throughout this article.
Tokenizing a Heuristic Function
The one heuristic aspect of statistical filtering is tokenization. Even though the process of identifying features is dynamic, the way those features are initially established (how they are parsed out of an e-mail) is programmed by a human. Fortunately, languages change slowly, and only a few minor tweaks are necessary to adapt the tokenization process to handle some of the wrenches thrown at it by spammers. Tokenization is the type of heuristic process that is usually defined once at build time and rarely requires further maintenance. Despite its simplicity, many attempts are still being made to establish tokenization through artificial intelligence, to remove all sense of heuristic programming from the equation. Within a few years, filters should be able to efficiently perform their own type of "DNA sequencing" on messages, determining the best possible way to extract data. In fact, this is already being researched as a solution to filtering some foreign languages that don't use spaces or any other type of word delimiter.
Basic Delimiters
Besides deciding how best to break apart a message, there are many other issues to consider when tokenizing. For example, we need to determine which characters are delimiters (token separators) and which are constituent characters (part of a token). Do we break apart some pieces of a message differently than others? What data do we ignore (if any)?
The fundamental goal of tokenization is to separate and identify specific features of a text sample. This starts with separating the message into smaller components, which are usually plain old words. So our first delimiter would be a space, since spaces commonly separate words in most languages. This makes it very easy to tokenize a phrase like the following:
For A Confidential Phone Interview, Please Complete Form & Submit.
which can be broken up into the following words:
For A Confidential Phone Interview,
Please Complete Form & Submit.
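The space-only split described above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer of any particular filter; the function name `tokenize_on_spaces` is hypothetical.

```python
def tokenize_on_spaces(text):
    """Split text into tokens using the space character as the only delimiter."""
    # str.split(" ") yields empty strings for runs of spaces, so drop them.
    return [token for token in text.split(" ") if token]


phrase = "For A Confidential Phone Interview, Please Complete Form & Submit."
tokens = tokenize_on_spaces(phrase)
print(tokens)
# ['For', 'A', 'Confidential', 'Phone', 'Interview,', 'Please',
#  'Complete', 'Form', '&', 'Submit.']
```

Note that the comma and the period stay attached to "Interview," and "Submit." because only the space is treated as a delimiter.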
As we've learned, each word is typically assigned one of two primary dispositions: spam or nonspam. Splitting on spaces alone handles a lot of text, but we're left with a few punctuation issues. For example, is the word "submit" on its own likely to have a different disposition from the word "submit." with a period after it? How about "interview" and "interview," containing a comma? In these cases, it makes sense to add some types of punctuation to the set of delimiters, as punctuation suggests a break in most languages. The following are some widely accepted punctuation delimiters:
- period (.)
- comma (,)
- semicolon (;)
- quotation marks (")
- colon (:)
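As a rough sketch, the delimiter set listed above (plus the space) can be expressed as a regular-expression character class. The names `DELIMITERS` and `tokenize` are assumptions made for illustration; a production tokenizer would need to handle more cases than this.

```python
import re

# The space plus the punctuation delimiters listed above: . , ; " :
DELIMITERS = ' .,;":'

# One or more consecutive delimiter characters act as a single separator.
_SPLITTER = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(text):
    """Split text on any run of delimiter characters, dropping empty tokens."""
    return [token for token in _SPLITTER.split(text) if token]


phrase = "For A Confidential Phone Interview, Please Complete Form & Submit."
print(tokenize(phrase))
# ['For', 'A', 'Confidential', 'Phone', 'Interview', 'Please',
#  'Complete', 'Form', '&', 'Submit']
```

With these delimiters, "Interview" and "Submit" come out without trailing punctuation, so each maps to a single token. The ampersand is not in the delimiter set, so it survives as a token of its own.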
Including too much punctuation in the makeup of tokens could result in five or ten different permutations of a single word in the database, which can very rapidly diminish the usefulness of those tokens. On the other hand, not having enough tokens can cause the tokens to become so common among both classes of e-mail that they become uninteresting. The trick is to end up with tokens that would stick out in one particular corpus. If there were 100 spams about warts in the user's corpus, but only one posing a question in which "warts?" was used, the filter is likely to overlook this feature in the one message, since a token seen only once carries little statistical weight.
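To make the permutation problem concrete, the following sketch counts the tokens produced from one made-up sentence under two delimiter sets. The sample text and the helper name `count_tokens` are invented for illustration only.

```python
import re
from collections import Counter


def count_tokens(text, delimiters):
    """Count tokens after splitting text on runs of the given delimiter characters."""
    splitter = re.compile("[" + re.escape(delimiters) + "]+")
    return Counter(token for token in splitter.split(text) if token)


# Made-up sample in which one word appears with different punctuation attached.
sample = 'submit. please submit, then "submit" again: submit; submit'

space_only = count_tokens(sample, " ")
with_punct = count_tokens(sample, ' .,;":')

# With the space as the only delimiter, every punctuated form is its own token.
variants = sorted(token for token in space_only if "submit" in token)
print(len(variants), variants)

# With punctuation delimiters, all occurrences collapse into one token.
print(with_punct["submit"])
```

Under the space-only split, the database ends up with five separate entries for what a reader would call one word, each seen once. With the punctuation delimiters, a single "submit" entry accumulates all five occurrences.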
Note: I've found that treating a question mark as a delimiter results in slightly better accuracy (on the order of a few messages) in my corpus testing, as opposed to treating it as a constituent character. This could change in the future, however.
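The difference between the two treatments of the question mark can be sketched with a toy corpus that mirrors the "warts" example. The corpus below is fabricated for illustration and does not reproduce the author's accuracy testing; `build_tokenizer` and `BASE_DELIMITERS` are hypothetical names.

```python
import re
from collections import Counter

# The space plus the punctuation delimiters listed earlier.
BASE_DELIMITERS = ' .,;":'


def build_tokenizer(delimiters):
    """Return a tokenizer that splits on runs of the given delimiter characters."""
    splitter = re.compile("[" + re.escape(delimiters) + "]+")

    def tokenize(text):
        return [token for token in splitter.split(text) if token]

    return tokenize


# Toy corpus: 100 messages using "warts" and one posing a question.
corpus = ["remove warts fast"] * 100 + ["do you have warts?"]

as_constituent = build_tokenizer(BASE_DELIMITERS)       # "?" stays in the token
as_delimiter = build_tokenizer(BASE_DELIMITERS + "?")   # "?" separates tokens

counts_constituent = Counter(t for msg in corpus for t in as_constituent(msg))
counts_delimiter = Counter(t for msg in corpus for t in as_delimiter(msg))

print(counts_constituent["warts"], counts_constituent["warts?"])
print(counts_delimiter["warts"], counts_delimiter["warts?"])
```

When the question mark is a constituent character, the one question produces a separate "warts?" token with a single occurrence. When it is a delimiter, that message contributes to the same "warts" token as the other hundred.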
Copyright © 2005 Ulitzer, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Jonathan A. Zdziarski
Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next-generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
Ulitzer content is offered under Creative Commons "Attribution Non-Commercial No Derivatives" License.
For any reuse or distribution, you must make clear to others the license terms of this work.
The best way to do this is with a link to this web page.
Any of the above conditions can be waived if you get written permission from Ulitzer, Inc., the copyright holder.
Nothing in this license impairs or restricts the author's moral rights.