
Warmed-over Bayesian ham and spam

By Carel Alberts, ITWeb contributor
Johannesburg, 16 Oct 2003

A white paper from security and messaging company GFI describes how to block over 98% of incoming spam using Bayesian filtering techniques.

The idea behind the Bayesian techniques was first developed in the 18th century by Thomas Bayes, a mathematician, and has been used by anti-spam technologists since as early as 1998, with varying degrees of success.

However, GFI's white paper says its implementation of the techniques is new. "Bayesian filtering is new in the format we are offering it - namely as a server-based anti-spam technology, rather than a client-based one," says Angelica Micallef Trigona, of GFI in Malta.

"The technology (as an anti-spam tool) has not been widely available - a number of end-user anti-spam solutions have incorporated this technology, but few enterprise-level products have introduced it," she says.

ITWeb's reading shows that anti-spam expert Paul Graham is a leading exponent of using Bayesian filtering to combat spam.

A talk Graham gave in January at the 2003 Spam Conference explains the divergent results between early tests of these tools, which claimed just over 92% effectiveness (branded insufficient by Graham), and his own tests, which filtered out over 99.6% of spam with few false positives or false negatives. GFI's own tests show 98% effectiveness, and Graham concedes that the size of the sample affects results and efficacy.

How it works

GFI explains that techniques currently used - such as blacklist checking, databases of known spam and keyword checking - are static, making it fairly easy for spammers to evade them. "These technologies...cannot be used...effectively...if not combined with a new adaptive technique that remains familiar with spammers' tactics as they change over time. The answer lies in Bayesian mathematics, ...an adaptive, statistical intelligence technique that is much harder for spammers to circumvent," the press release states.

Before mail can be filtered using this method, the user must generate a tailor-made history for each word (or token) specific to the company. A probability value is assigned to each word or token, based on how often that word occurs in spam as opposed to legitimate mail ("ham"). Once the word probabilities have been calculated, the filter is ready for use.
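The training and scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration of the general naive Bayesian approach, not GFI's implementation. The function names (tokenize, train, spam_score), the add-one smoothing, the neutral 0.5 value for unseen tokens and the way token probabilities are combined are all assumptions made for the example.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Split a message into lowercase tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(spam_messages, ham_messages):
    """Return a dict mapping each token to an estimated P(spam | token)."""
    # Count, per token, how many messages of each kind contain it.
    spam_counts = Counter(t for m in spam_messages for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_messages for t in set(tokenize(m)))
    n_spam, n_ham = len(spam_messages), len(ham_messages)
    probs = {}
    for token in set(spam_counts) | set(ham_counts):
        # Add-one smoothing (an assumption) keeps every value strictly
        # between 0 and 1, so no single token can decide the outcome alone.
        p_in_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_in_ham = (ham_counts[token] + 1) / (n_ham + 2)
        probs[token] = p_in_spam / (p_in_spam + p_in_ham)
    return probs


def spam_score(message, probs, unknown=0.5):
    """Combine token probabilities into one score between 0 and 1,
    treating tokens as independent (the "naive" assumption)."""
    log_spam = log_ham = 0.0
    for token in set(tokenize(message)):
        p = probs.get(token, unknown)  # unseen tokens are treated as neutral
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Same as prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    diff = max(min(log_ham - log_spam, 700.0), -700.0)  # avoid exp overflow
    return 1.0 / (1.0 + math.exp(diff))
```

A filter built this way would be trained on a company's own mail, for example `probs = train(spam_samples, ham_samples)`, and would then flag a message when `spam_score(message, probs)` exceeds a chosen threshold. The threshold is a tuning decision: setting it higher trades missed spam for fewer false positives.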

If this sounds cumbersome, Trigona says: "The learning process is taken care of by our product, GFI MailEssentials. From a user point of view, all he needs to do is wait one week (depends on the mail quantities of that particular organisation). The rest is automatic.

"Once the filter has completed the learning process, it knows what you think is spam and ham, making it much more effective than any other spam-catching technology. Once the system is in place, users can easily review mail addressed to them that has been flagged as spam, as GFI MailEssentials permits spam to be forwarded to individual customisable folders in the end-users' inboxes."

To counter the problem of false positives, GFI MailEssentials includes an automatic whitelist management tool that lets administrators ensure that mail from particular senders or domains is never flagged as spam.
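A whitelist check of this kind can be pictured as a step that runs before the statistical score is consulted. The sketch below is hypothetical and says nothing about how GFI MailEssentials manages its whitelist automatically. The function names, the set-based whitelist of lowercase addresses and domains, and the 0.9 threshold are assumptions made for illustration.

```python
def is_whitelisted(sender, whitelist):
    """True if the sender's address or domain appears in the whitelist.

    The whitelist is assumed to be a set of lowercase addresses and domains.
    """
    sender = sender.strip().lower()
    domain = sender.rpartition("@")[2]  # text after the last "@"
    return sender in whitelist or domain in whitelist


def classify(sender, score, whitelist, threshold=0.9):
    """Label a message "ham" or "spam"; whitelisted senders are never spam."""
    if is_whitelisted(sender, whitelist):
        return "ham"
    return "spam" if score >= threshold else "ham"
```

With a whitelist such as `{"partner.example", "alice@client.example"}`, mail from any address at partner.example would be delivered as ham regardless of its Bayesian score.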
