Paul Graham provides stunning answer to spam e-mails
Probability theory shows impressive results
Paul Graham, the co-creator of what is now the Yahoo Store, has published a strikingly effective new method of filtering spam.
I myself have long been critical of anti-spam and “family” filters, most of which are ineffective at best and brain-dead at worst. One filter I evaluated halted the user’s PC whenever it encountered the word “bomb,” supposedly to stop children from using the Internet to learn how to make bombs. Unfortunately, this also stopped students from researching the Unabomber or other legitimate topics. Dumb.
Graham’s new method, by contrast, is an intelligent application of the science of probability theory. In his latest iteration, his filter correctly flags 99.5 percent of spam, with 0.0 percent “false positives.”
False positives are legitimate messages, often important personal ones, that an anti-spam filter wrongly discards. They are a huge problem for ordinary methods. As Graham explains, “For most users, missing legitimate email is an order of magnitude worse than receiving spams, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient.”
Like many anti-spam crusaders, Graham himself started with an ordinary filter approach, looking for specific “bad words.” This initially showed some promise. Simply filtering out all e-mails that contain the word “click,” he says, correctly eliminates 79.7 percent of spam messages, while wrongly trashing only 1.2 percent of legitimate mail.
Those success rates, however, quickly degrade as more and more words are added to the “bad” list. This makes the crude filtering approach unusable.
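To make the trade-off concrete, here is a minimal Python sketch of a crude "bad word" filter and the two rates discussed above. The function names and the eight toy messages are invented for illustration; they are not Graham's code or his mail, and the resulting percentages say nothing about real-world rates.

```python
def keyword_filter(message, bad_words):
    """Flag a message as spam if it contains any word on the bad list."""
    words = set(message.lower().split())
    return any(word in words for word in bad_words)


def rates(spam, legit, bad_words):
    """Return (catch rate, false-positive rate) as fractions of each pile."""
    caught = sum(keyword_filter(m, bad_words) for m in spam)
    wrongly_flagged = sum(keyword_filter(m, bad_words) for m in legit)
    return caught / len(spam), wrongly_flagged / len(legit)


# Invented toy messages, for illustration only.
SPAM = [
    "click here for free money",
    "free offer click now",
    "guaranteed money today",
    "click to unsubscribe",
]
LEGIT = [
    "lunch is free on friday",
    "the money transfer cleared",
    "apparently the meeting moved",
    "though i agree send the draft",
]

print(rates(SPAM, LEGIT, {"click"}))
print(rates(SPAM, LEGIT, {"click", "free", "money"}))
```

On this toy data, lengthening the bad-word list catches the remaining spam but also starts trashing legitimate mail, which is the degradation the crude approach runs into.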
The solution, Graham found, was to expand the technique in a statistically sophisticated way. Because all spam is trying to hype something, certain words have a high probability of indicating a spam message. Other words almost never appear in spam.
Words such as “though” and “apparently,” for example, increase the probability that a message is legitimate, because spam isn’t big on subtlety. At the same time, a genuine message isn’t rejected simply because it uses a single instance of a term that might also appear in an adult-oriented spam message.
Instead of mere “dumb” filtering, Graham’s elegant method analyzes the 15 “most interesting” words in each message: those whose weights point most strongly toward either spam or legitimate mail. Through a technique known as Bayesian analysis, the weights of these 15 words are then used to compute the probability that a message is spam. This analysis is where his 99.5 percent catch rate comes from.
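The combining step can be sketched in a few lines of Python. This column does not give Graham's exact formula, so the version below is an assumption: the standard naive Bayes combination, which treats the words as independent and spam and legitimate mail as equally likely beforehand. The function names and the sample weights are hypothetical.

```python
from math import prod


def most_interesting(weights, n=15):
    """Keep the n weights farthest from the neutral value 0.5."""
    return sorted(weights, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def spam_probability(weights, n=15):
    """Combine per-word spam weights into one probability (naive Bayes).

    Assumes every weight lies strictly between 0 and 1.
    """
    top = most_interesting(weights, n)
    p_spam = prod(top)                     # evidence that the message is spam
    p_legit = prod(1.0 - p for p in top)   # evidence that it is legitimate
    return p_spam / (p_spam + p_legit)


# Hypothetical weights: two spam-like words outweigh one innocent-looking word.
print(spam_probability([0.99, 0.95, 0.2]))
# Two innocent-looking words outweigh one mildly suspicious word.
print(spam_probability([0.01, 0.1, 0.6]))
```

Because the weights are multiplied together, a single suspicious word cannot condemn a message on its own; it takes several strong signals pointing the same way.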
To get the weights, Graham ran the analysis on 4,000 spam messages and 4,000 legitimate ones. That may not seem like a large sample, but it has proved large enough to produce the results above.
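One plausible way to turn two piles of mail into per-word weights is sketched below in Python. It is an assumption, not Graham's published procedure: each word's weight is the share of spam messages containing it, relative to the combined share across both piles, clamped so that no single word is ever treated as absolute proof. The names `tokenize` and `word_weights`, the clamping bounds, and the four sample messages are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Split a message into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


def word_weights(spam_msgs, legit_msgs, floor=0.01, ceiling=0.99):
    """Estimate, for each word, the probability that a message containing it is spam."""
    # Count how many messages in each pile contain each word at least once.
    spam_counts = Counter(w for m in spam_msgs for w in set(tokenize(m)))
    legit_counts = Counter(w for m in legit_msgs for w in set(tokenize(m)))

    weights = {}
    for word in spam_counts.keys() | legit_counts.keys():
        spam_rate = spam_counts[word] / len(spam_msgs)
        legit_rate = legit_counts[word] / len(legit_msgs)
        p = spam_rate / (spam_rate + legit_rate)
        # Clamp so one word alone can never be decisive.
        weights[word] = min(ceiling, max(floor, p))
    return weights


# Invented sample mail, for illustration only.
weights = word_weights(
    ["click here now", "click for money"],
    ["though the money arrived", "apparently it did"],
)
print(weights["click"], weights["though"], weights["money"])
```

Words seen only in spam end up near the ceiling, words seen only in legitimate mail end up near the floor, and words common to both land in the uninformative middle.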
Graham proposes that his research be used to create a “seed filter” that would become part of users’ e-mail programs. Users would also be equipped with two Delete commands: the regular Delete key, for genuine messages, and a Delete-As-Spam key, for spam. Each deletion would further train the filter, so after a short time each user would have an even more accurate, personalized filter, and spammers would have no single seed file to work around.
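The two-key idea can be illustrated with a small Python sketch. Everything here is hypothetical: the class name `PersonalFilter`, its methods, the seed counts, and the simple count-based weight are assumptions made for illustration, not a description of any real e-mail program or of Graham's implementation.

```python
from collections import Counter


class PersonalFilter:
    """A seed filter that keeps learning from the user's two Delete keys."""

    def __init__(self, seed_spam=None, seed_legit=None):
        # Start from the shared seed counts, then diverge for each user.
        self.spam_words = Counter(seed_spam or {})
        self.legit_words = Counter(seed_legit or {})

    def delete(self, message):
        """Regular Delete: the message was genuine."""
        self.legit_words.update(set(message.lower().split()))

    def delete_as_spam(self, message):
        """Delete-As-Spam: the message was spam."""
        self.spam_words.update(set(message.lower().split()))

    def weight(self, word):
        """Share of this word's sightings that were in spam; 0.5 if never seen."""
        spam = self.spam_words[word]
        legit = self.legit_words[word]
        if spam + legit == 0:
            return 0.5  # assumption: an unseen word is treated as neutral
        return spam / (spam + legit)


# Hypothetical seed: "click" was seen in three spam and one genuine message.
mail_filter = PersonalFilter(seed_spam={"click": 3}, seed_legit={"click": 1})
print(mail_filter.weight("click"))
mail_filter.delete("click the attached report")  # this user gets genuine "click" mail
print(mail_filter.weight("click"))
```

Because every user's deletions pull the counts in a different direction, no two trained filters end up alike, which is what denies spammers a single target.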
I’ve long been an advocate of suing spammers out of existence, using state laws that prohibit false identities (employed by almost all spammers). Graham, too, supports anti-spam laws, but mainly because they make spam easier to identify: when spammers deny that their messages fall under the laws, certain terms predictably appear in their text. Meanwhile, Graham has made a believer of me.
Probability theory finally makes a filter that works:
– – – – – – – – – – – – – – – – – – – – – – – – – – – –
Livingston’s Top 10 News Picks o’ the Week
1. E-business is enjoying double-digit growth again
2. New keyword tool calculates return on investment
3. Increase sales by handling your Web failures well
4. Lessig on tech: Ours is less and less a free society
5. Hollywood’s DVD “region coding” system is collapsing
6. Hackers find it easy to get into military PCs
7. Ten rules for writing for a dynamic World Wide Web
8. Pros share secret tricks of the new DreamWeaver MX
9. HTML tips: Importance of font sizing for usability
10. See great Flash animations: Solemates (plays music)
– – – – – – – – – – – – – – – – – – – – – – – – – – – –
Wacky Web Week: Apple’s “Ellen Feiss” video rocks
The latest hilarious video making the rounds of the Net involves a spacey — some say stoned — teenage girl explaining that Windows ate her homework.
The work is a TV spot in Apple’s “Switch” series, but it can easily be enjoyed by Windows users as well (most of whom can probably relate).
As explained by Ellen Feiss, the young woman in the ad, “It was, like, beep beep beep beep beep beep beep, and then, like, half of my paper was gone.” She adds it was “kind of a bummer.”
A small cult has sprung up around the subject, with numerous fan sites worshipping the non-actress, complete with video clips of her at MacWorld and altered versions of the Apple original. Wired News, below, has the best links to the classic video and its many imitators.
“Windows ate my homework, you know, like, buy an Apple”:
– – – – – – – – – – – – – – – – – – – – – – – – – – – –
E-Business Secrets: Our mission is to bring you such useful and thought-provoking information about the Web that you actually look forward to reading your e-mail.
About the Author: E-Business Secrets is written by InfoWorld contributing editor Brian Livingston. Research director is Vickie Stevens. Brian has published 10 books.
Win a gift certificate good for a book, CD, or DVD of your choice if you’re the first to send a tip Brian prints. mailto:[email protected]