Akismet reports that over 90% of blog comments are spam. The University of Maryland reports that over 50% of all blog pings come from spam blogs (splogs). Messagelabs reports that nearly 60% of all email sent is spam.
These numbers paint a very dark image of the Web, one where over half of the "content" put forth is junk. Though creative filtering by search engines, email providers and software companies has kept much of this garbage away from end users, the problem isn’t going away any time soon. The lure of the easy dollar is too great to ignore.
But while these problems are annoying for the public at large, they are greater problems for owners and operators of legitimate Web sites, the people who have to deal with the problem both as end users and as targets for abuse.
However, while Webmasters are generally familiar with protecting their email addresses, filtering out comment spam and locking down their contact forms, many don’t realize that spammers might be abusing their site another way, by reusing their content.
An Endless Appetite For Content
While it’s impossible to know exactly how many spammers there are out there since spammers don’t provide the courtesy of a signature, the number is generally assumed to be small. Most estimates of the number of spammers put it somewhere between a few hundred and a few tens of thousands.
Even if we accept the highest spammer counts available, spammers make up less than one hundredth of one percent (0.01%) of a Web with an estimated one billion users.
Yet, if we accept the numbers above, that small percentage of users is putting up half, or possibly more, of the Web’s “content” (Note: I use this term very loosely for the duration of this piece) every day. That’s at least 5,000 times their share of content, give or take, if we assume everyone is posting at least something to the Web.
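The arithmetic behind that figure can be checked with a quick sketch. The one billion users and the "half of all content" share come from the numbers above; the count of 100,000 spammers is an assumption, chosen as a deliberately generous ceiling on "a few tens of thousands".

```python
# Back-of-the-envelope check of the "5,000 times their share" figure.
web_users = 1_000_000_000      # estimated Web users, from the article
spammers = 100_000             # assumption: generous upper bound on spammer counts
spam_share_of_content = 0.5    # "half, or possibly more" of the Web's content

# Fraction of Web users who are spammers.
spammer_share_of_users = spammers / web_users

# How many times larger the spammers' content share is than their user share.
overrepresentation = spam_share_of_content / spammer_share_of_users

print(f"Spammers as a share of users: {spammer_share_of_users:.2%}")
print(f"Content output relative to fair share: {round(overrepresentation)}x")
```

With a smaller spammer count the multiple only grows, which is why 5,000 is a floor rather than an estimate.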
With every spammer having to generate the equivalent content for thousands of people, the question becomes clear: How is this possible and where does it come from?
Spam Content 101
There are five basic ways you can obtain content for your site, email or other Internet message:
- Write it yourself
- Pay someone else to write it
- Generate it
- Steal it
- Any combination of the items above
Spammers, for a long time, wrote the bulk of content themselves. Like the bulk mail advertisers that came before them, spammers constantly tweaked and honed their sales pitches hoping to improve conversion rates and their own bottom line.
The “spam” element of spam was the endless duplication. Spam messages were real emails written by real people, just copied and sent to hundreds of thousands of people.
However, as people’s annoyance with spam grew, so did anti-spam measures. Spammers began to turn to generated content to help them around the newfound security measures, using bits of generated text to prevent emails from being identical or blocks of random words to, hopefully, trigger a whitelist filter that pushed the spam through.
But while spam was annoying and forced Webmasters to take precautions to prevent their systems from being misused (closing off open SMTP relays, using secure formmail scripts, etc.), spammers, by and large, avoided content theft.
Things changed though when anti-spam laws and advanced filters made life hard for the email spammer. Many traditional spammers began to change tactics, focusing more on the Web itself.
Now Webmasters found themselves square in the sights of the spam kings and became a target in new and frightening ways.
A New Target
While the first attempts at spamming the Web targeted users, as many today still do, most spammers quickly realized the money wasn’t in trying to contact people directly, but in tricking the search engines to drive large volumes of targeted traffic, seemingly legitimately, to their door.
The scam was simple. Search engines rely heavily on inbound links to know which sites to rank high. All one had to do was create a lot of links to a site, wait for the search engines to pick them up and watch it climb in the rankings. Then, when it was on top, large numbers of targeted searchers would be beating down their door.
The scam worked on multiple levels. First, it was resilient, with enough buffers between the spam end and the money end to make shutting it down difficult. Second, people had no idea they were buying their products from a spammer; they got there via a search engine. Finally, anti-spam measures on the Web paled in comparison to those for email. More junk could feasibly get through.
The only problem with the scam was that the hunger for content grew exponentially. Duplication was no longer the answer as content repeated excessively on the Web is, at least theoretically, penalized by search engines. Computer-generated content was, by and large, detected by the search engines and blacklisted. Finally, there was no way to pay people to write enough content to fill all of these pages and ads.
Something had to change or the gravy train would stop. Needless to say, something changed.
This is Where You Come In
Spammers need content and lots of it. They need it reliably, they need it fast and, most importantly, they need it cheap.
Sadly, the cheapest content available is yours and mine, the material we put out there every day for free. It’s free to read, free to copy and free to paste. It might break an (additional) law or two, but there is no denying the cost/benefit relationship. For the spammer, stolen content is great.
However, stealing content comes with its own problems. First off, while copy and paste might be a quick and dishonest way to write a term paper, it’s too slow to fill thousands of sites (spammers aren’t known for their work ethic). Second, copyright infringement suits, DMCA notices and other complaints threaten to derail the entire scheme.
RSS solved the first problem for the spammers. With over 40 million feeds out there, according to Technorati, there was a vast amount of content placed in a standard format that could easily be scraped and reposted by a simple piece of software.
The second problem was addressed by returning to the spammer’s old favorite of generated content. By taking this scraped content and then modifying it, for example, by running it through a thesaurus, rearranging the words or pulling from dozens of different sources and rebuilding an article from that, they avoided simply reposting duplicate content and all of the potential hazards that came with that.
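To see why spinning defeats a naive duplicate check, here is a minimal sketch that compares texts by their overlapping three-word sequences ("shingles") using Jaccard similarity. This is only an illustration: the function names and sample sentences are made up for this example, and it is not a description of how any search engine actually detects duplicates.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Jaccard similarity of two texts' shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # both texts too short to compare; treat as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample texts: an original sentence, a straight copy,
# and a "spun" version with a few thesaurus-style substitutions.
original = "spammers need content and lots of it and they need it cheap"
copied = original
spun = "spammers require content and plenty of it and they want it cheap"

print(jaccard(original, copied))  # a straight copy matches completely
print(jaccard(original, spun))    # a few swapped words remove most of the overlap
```

Swapping only three words out of twelve leaves just a small fraction of the shingles intact, which is why even crude spinning can slip past a check that looks for exact or near-exact repetition.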
“Scrape, spin and spam” quickly became the spammer’s new mantra. Though the method produced mostly gibberish that could easily be identified as junk by human readers, it was good enough to fool the search engines and, thus, good enough to get by.
Almost overnight we found ourselves in a strange new world where our own site’s evil twin (or perhaps sinister cousin) was doing the dirty work for spammers.
It’s a discomforting thought, but it gets worse.
Welcome to the WWW (World Wide Wasteland)
The scrape, spin and spam (SSS) method has long been associated with spam blogging (splogging) and other, static, spam sites. However, other kinds of spammers have picked up on its success and have begun using it in various ways with new generations of spam software to do the dirty work for them.
First, comment spammers have begun scraping content from the entries they spam to make their comments appear legitimate and escape filtration.
Second, in a similar vein, message board and forum spammers, though not nearly as popular as they once were, have started using bits of earlier entries to escape detection.
Finally, even classic email spammers have started including blocks of text that look suspiciously like scraped and spun content. While this is impossible to confirm at this time, it appears that email spammers are using at least some Web content as well to keep their messages unique and, seemingly, legitimate.
Even though it’s hard to see how this fits in with a traditional marketing strategy, the purpose is pretty clear. All spam filters, regardless of medium, are designed to let legitimate content through. By mixing spam messages with real verbiage, spammers increase their chances of breaking through the invisible wall and reaching your eyes.
Back to Generation
In some positive news, at least for those posting their works to the Web, this SSS binge has been slightly blunted by more advanced content generators that at least have the potential to make scraping obsolete.
The problem is simple. Scraping, while great for the spammer, is far from ideal. Aside from the problems mentioned earlier, it’s always easier and faster to have the content just appear than to have to take it from somewhere else.
However, search engines have no desire to watch their databases fill up with random and useless content and have declared war on automatic generators, using a variety of methods to detect and blacklist offenders. These methods have, by and large, been effective at preventing “point and click” content from gaining ground online.
However, a new generation of automatic content creators has stepped forward to challenge the search engines using more advanced techniques that not only thwart classic detection methods, but more accurately mimic human writing (though still not enough to fool an actual human being).
The dangers, especially for search engines, are obvious. However, Webmasters will likely feel the pinch too as they deal with more active and difficult-to-detect comment spammers as well as a deluge of hard-to-filter email.
Even if scraping goes the way of the dodo, the war with spam won’t be over; it will just change fronts yet again.
So What Can We Do?
All of this raises a simple question: how do we stop this?
We can, and should, shut down obvious plagiarists. We can also truncate our RSS feeds and take a few other steps that make our content harder to access, but these are all temporary measures that simply frustrate legitimate users as well.
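As a rough illustration of what truncating a feed means in practice, the sketch below cuts an entry down to its first few words before it is published in the feed. The function name, the 50-word default and the "[...]" marker are all assumptions for this example; most blogging platforms offer an equivalent setting, so hand-written code is rarely needed.

```python
def truncate_summary(text, max_words=50, suffix=" [...]"):
    """Return at most `max_words` words of `text`, marking any cut with `suffix`."""
    words = text.split()
    if len(words) <= max_words:
        return text  # short entries go out unchanged
    return " ".join(words[:max_words]) + suffix


# Hypothetical entry text, truncated to a short teaser for the feed.
entry = "Spammers need content and lots of it. They need it reliably and cheaply."
print(truncate_summary(entry, max_words=7))
```

A scraper pulling this feed gets only the teaser, but so does every legitimate subscriber, which is exactly the trade-off described above.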
As long as there is a profit motive behind spamming, someone, somewhere, will do it. It’s unavoidable. Money and greed are just too compelling to some people.
Until we can find a way to remove the profit motivation from spamming, which would likely involve removing the profit motivation from the Web itself, people will continue to spam.
Despite this, we need to be aware of the problem so that we can look for ways to thwart spammers, protect our own homes and help others do the same. Spamming may not be going away, but we can minimize it and we can limit the impact it has on us.
It’s not a way to win the war, but a way to limit the casualties. In the end though, it’s probably the best that we can hope for.