Nearly every time I speak with someone at length about the issues of plagiarism and content theft online, the topic comes around to splogging and scraping at least briefly. As the particulars of how it is done and why it is done begin to come to the surface, many, if not most, ask the simple question: "Are blog feeds the problem?"
My answer is the same every time: No.
While I grant that Really Simple Syndication (RSS) has certainly not helped the scraping situation and definitely made content easier to steal, it’s not the cause of the problem. Not only does it make no sense to blame a new technology for an act of man, but content theft, even en masse, was possible well before RSS and would be equally possible without it.
My First Scraper
When I first started developing Web sites over ten years ago, HTML was in version 3.01, CSS was still years away from acceptance, as were most advanced Web languages, and the vast majority of Web development was done in programs that looked and acted like fancy versions of Notepad. The Internet, much like myself at that age, was young and still "finding itself" in many ways. Meanwhile, countless computer geeks like myself were working in darkened computer labs pounding out code for our own personal place on the Web.
Like many people making sites back then, I typed my content directly into my HTML editor, skipping the word processor altogether. While this worked fine and avoided many of the copy-and-paste problems that plagued software at the time, it also created a challenge: how to get the text back out for storage in an offline format. One could try to copy and paste from the browser, but, in those days, results were mixed.
It was then that a friend of mine handed me a small piece of software called HTML2TXT (Note: I’d link to the application, but there are so many with that name or similar ones that I have no way of knowing which one it is or whether it is still around). The function of HTML2TXT was very simple: it would take an HTML file, strip out all of the tags and spit out a regular text file that you could open and copy from with ease.
The algorithm was basic at best: it grabbed all of the text on an HTML page, including headers, footers and navigation links. Though this produced a lot of unneeded content and necessitated a fair amount of editing, when combined with its spider, which could extract text from an entire site with one click, it was still quicker than wrangling with the unwieldy browsers of the era and their limited copy/paste functionality.
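In modern terms, the whole trick fits in a few lines. Here is a minimal sketch of the same idea (my own illustration, not the original HTML2TXT code), using Python’s standard HTML parser:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, discarding every tag.

    Like HTML2TXT, this is indiscriminate: headers, footers and
    navigation text all come along with the body copy.
    """
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<html><body><h1>Title</h1><p>Hello, <b>world</b>.</p></body></html>")
print(extractor.text())
```

The same handful of lines, pointed at a spider’s output instead of a single file, is all an "extraction" tool ever was.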
Of course, back then, I had no way of knowing that this tool would be the great grandfather of tools later used to steal massive amounts of content. Though over the years terms such as "extraction" have turned into more vile ones such as "scraping" and advances in other areas of the Web greatly limited the legitimate use of this particular breed of software, back then it was just a friendly tool written to help make the everyday Webmaster’s life a little bit easier.
However, like so many well-intended creations before it, extracting tools would be put to nefarious use soon enough. All it took was the creation of an Internet where content was king and the cheats of the Web needed all of the material they could get their hands on in order to make their easy money.
How RSS Helps Scrapers
RSS offers two critical advantages to scrapers.
First, it provides content in a specified format. Though there are several specifications for RSS, it is significantly easier to program a scraper to pull content from a handful of formats than from the endless array of site templates. While it can be done, and at least one scraper has that ability already, it’s much easier and faster to pull from a known, standardized format than from an actual HTML page.
Second, it provides a means for scrapers to automatically receive and grab new content, helping them keep their spam sites fresh and sparing them from having to constantly rescrape sites for new material.
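Both advantages come down to structure. Because every RSS 2.0 feed keeps its entries in the same place, one tiny parser works against any site that publishes one. A sketch (the feed content here is invented for illustration), using only Python’s standard library:

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed; the tag layout, though, is fixed by the spec.
feed_xml = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First Post</title><description>Full text of the first post.</description></item>
    <item><title>Second Post</title><description>Full text of the second post.</description></item>
  </channel>
</rss>"""

# Every compliant feed keeps its entries at channel/item, so this one
# path works on any site's feed -- no per-site template parsing needed.
root = ET.fromstring(feed_xml)
posts = [(item.findtext("title"), item.findtext("description"))
         for item in root.findall("./channel/item")]
for title, body in posts:
    print(title, "-", body)
```

Compare that to writing and maintaining a custom scraper for every HTML template on the Web, and the appeal to spammers is obvious.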
These features, which were originally designed to help Webmasters syndicate their content for legitimate use, have been a boon for spammers and other scrapers, removing most of the work from stealing content. In short, the fast lane to splog Hell was paved with good intentions and millions of RSS feeds.
Despite that, blaming RSS for the problem is still short-sighted. RSS didn’t create the splog, the spammer did and, most likely, he would have done it even if RSS hadn’t been there at all.
The Road Less Traveled
As HTML2TXT shows, the technology to scrape content from Web pages has been around for at least a decade. Furthermore, the ability to automatically check Web pages for updates, even without RSS, has been around for nearly as long, as software applications such as Website Watcher prove. Though none of the apps mentioned here are necessarily "black hat" or even especially useful to spammers, their functions are based on well-known technology and can easily be recreated and reapplied.
Even if RSS had never gained acceptance or if all of our wonderful feeds disappeared tomorrow, spammers would only have to shift strategies. It might not be as easy or as neat as RSS scraping, but considering the quality of content put forth by scraped sites, perfection is clearly not a desired trait. The outcome only has to be good enough to get by.
How long would it take sploggers to switch away from RSS feeds? Probably not long. Some software already exists that scrapes from both Web sites and their feeds and retrofitting old scrapers to take advantage of Web page content rather than RSS feeds would, most likely, be a trivial matter.
As it sits right now, most sploggers use RSS feeds for three simple reasons: They are available, they are easy to scrape and there are a lot of them.
If ever that were to change, sploggers would change their strategies and hit us in a whole new way.
Some Good News
While RSS scraping is frustrating, it’s worth noting that the splogger’s obsession with RSS is also his/her weakness. As long as sploggers are dependent on feeds to make their ill-gotten living, Webmasters will have solutions at their disposal to help combat the plague, many of which would not be available if sploggers turned to other means to scrape content.
- Detection: RSS feeds are relatively easy to track. Anyone who signs up for a Feedburner account will get a list of all the "uncommon" uses of their feeds, many of which will turn out to be sploggers. More tech-savvy users have taken to adding their own copyright statements to the footer of their feed entries, and others have been adding invisible images to track where their content is reused.
- Truncating: Though the source of much debate, truncating an RSS feed can do a great deal to hamper RSS scrapers. By providing only a minimal amount of content, scrapers get almost no benefit from using your feed and are likely to move on to other targets. This helps minimize both the damage to your site and the potential benefit for a splogger.
- Advertising: Many have turned RSS scraping into an advertising opportunity, injecting ads into their feeds that aren’t on their main site and watching as they get blindly reposted by sploggers. While the legality of this is up for debate, it is a fascinating idea and simply would not work if scraping software had even an iota of intelligence.
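Of these, truncation is the easiest to automate. A minimal sketch of the idea follows; the function name and the forty-word cutoff are illustrative choices of mine, not part of any feed standard or blogging tool:

```python
def truncate_entry(text, max_words=40):
    """Trim a feed entry to its first few words.

    A sketch of feed truncation: a scraper copying this gets only a
    teaser, while legitimate readers can still click through. A real
    implementation would also strip markup first and append a link
    back to the full post.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " [...]"

print(truncate_entry("word " * 100))
```

The point is not the code but the asymmetry: a one-line change on the publisher’s side wipes out most of the value a scraper gets from the feed.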
The bottom line is that, as talented as some scrapers are at manipulating content and reposting it, the actual scraping side of the splogger’s software of choice is generally pretty dumb and has been kept that way through a laziness fostered by RSS feeds. This gives Webmasters at least a temporary advantage when dealing with sploggers and offers us the chance to hit back, at least until spamming techniques evolve.
The Real Danger of RSS
It’s a scenario that’s played out many times. Someone approaches me wanting help and guidance with a plagiarism matter. After going over the various options they have to get the work removed, I explain that the plagiarist is likely using their RSS feed to steal their content as it goes live. I tell them that they might want to consider, at least temporarily, truncating their RSS feed to limit future theft.
The response is swift, "What’s an RSS feed?"
The truth is that most veteran bloggers take for granted that everyone knows what an RSS feed is. However, as my experiences have shown, many do not. In fact, many novice bloggers don’t even know that their service is publishing a feed since, most likely, it was set up without their knowledge. Blogger, Myspace, Xanga, LiveJournal and countless other blogging services/social networking sites provide RSS feeds with each account. However, few explain what an RSS feed is and none, that I know of, explain the risks in having one.
This has resulted in the creation of literally millions of RSS feeds whose owners have no knowledge of the feed’s existence at all. Millions more are owned by individuals who don’t fully understand what RSS feeds do, and millions more still belong to people completely unaware of the risks involved in having one.
While the overall percentage of unknown feeds is, most likely, relatively small compared to the over forty million feeds out there, there is still plenty of reason for concern. Feed owners without a good understanding of RSS can’t do anything to thwart or deter misuse. They don’t know that, right now, scrapers could be taking their content wholesale and reposting it to make an easy living off of their content.
If there is a problem with RSS feeds, it’s not the technology itself, but the fact that, as the technology has become ubiquitous and commonplace, many have been left in the dark about it. This has not only hampered the use of RSS feeds as a means to disseminate information, but also made it a much more appealing target for sploggers.
The bottom line is that many bloggers learn too late the full power, and the risk, that comes with RSS feeds, often not seeing the truth until after something goes wrong.
Though the history of RSS may be marked with controversy and disagreement, there is little arguing that feeds are a powerful and useful tool for distributing content on the Web. The simplicity of RSS, which is hinted at in the name itself, has revolutionized the way many get information and content delivered to them. However, it’s also made content theft and scraping significantly easier and helped bring about a revolution in spamming, one that uses other people’s content to fill tomes of useless Web pages.
Despite that fact, the technology behind RSS isn’t at fault for the splogging phenomenon, it’s just the current front of the war. Even if we decided to obliterate all RSS feeds tomorrow, Web spamming would still continue, just in a different format. In fact, having RSS be the current front of the war is somewhat fortuitous as there are easy ways to minimize scraping and protect feeds from theft.
Still, to do that, one must have a good understanding of RSS, how it works and the dangers that come with it. In that regard, it’s like any other technology. However, such a basic understanding is severely lacking in many circles and more information is clearly needed. Much of this burden could be lifted by the blog services that provide both the blogs and the feeds to novice users.
In the end, while RSS publishers may need to look at new ways to use the technology or take extra precautions to prevent theft, the technology itself is not to blame for the splogging epidemic and, in truth, does far more good than bad, even in the worst cases. Reasonable precautions, coupled with better understanding and education about RSS, can eliminate many of the problems that exist with it.
There’s no sense in throwing the baby out with the bath water. Blaming RSS for a problem that likely would exist even if it had never been invented makes little sense and does no good. Our energies are better invested in finding ways to use the technology we have to both free up works for legitimate distribution and hinder plagiarists, sploggers and others that seek to misuse our content.
[tags]Plagiarism, Content Theft, RSS, Scraping, Splogging, Blogging[/tags]