With “Really Simple Stealing” … full versus abbreviated RSS feeds.
While advocates of full RSS feeds point to the improved convenience and readership that having the entire article in the feed provides, and proponents of truncated RSS feeds talk of improved security and greater potential for ad revenue, both camps are ignoring a potential bombshell that could render the whole debate moot.
Simply put, the defense that truncated RSS feeds provide against plagiarism is, at best, temporary. The technology to defeat the technique already exists and is either waiting to be applied or is being used already, just under our collective radar.
What an RSS Feed Accomplishes
According to XML.com, RSS, which actually stands for “Really Simple Syndication”, is a “format for syndicating news and the content of news-like sites.” It enables readers to use third-party applications to check for updates on any number of Web sites. This means that readers, rather than visiting dozens of sites individually, can have all of the new content laid out for them every day or whenever they check.
For surfers, the benefit is obvious: they can skip hours of surfing around just to find out what’s new at their favorite sites. For Webmasters, RSS feeds replace email newsletters, “What’s New” boxes and other, more traditional means of letting people know new content is up. It’s an automatic and immediate notification system.
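To make the mechanics concrete, here is a minimal sketch of what a feed reader does with a feed, using only Python's standard library. The feed itself, the `example.com` addresses and the `list_updates` helper are all invented for illustration; a real reader would download the feed from the site rather than use a hard-coded string.

```python
import xml.etree.ElementTree as ET

# A hypothetical, hand-written RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>An invented feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <description>Opening lines of the first post...</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second-post</link>
      <description>Opening lines of the second post...</description>
    </item>
  </channel>
</rss>"""


def list_updates(feed_xml):
    """Return a (title, link, description) tuple for each item in the feed."""
    root = ET.fromstring(feed_xml)
    updates = []
    for item in root.iter("item"):
        updates.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return updates


# Show what is new without visiting the site itself.
for title, link, _description in list_updates(SAMPLE_FEED):
    print(f"{title}: {link}")
```

Every item carries both its content and a link back to the original page, which is the detail the rest of this piece turns on.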
Why Plagiarists Love RSS
Plagiarists love RSS feeds because they deliver raw content to their computer. With just one link, plagiarists can lift (or “scrape”) an entire site’s worth of content and automatically scrape any future content that goes up. This potentially enables them to run thousands of scraper sites, all controlled by automated software that lifts and posts content from RSS feeds.
Just as RSS feeds save legitimate readers from surfing the Web for hours on end just to collect a few updated articles, they also save plagiarists the legwork of actually checking sites for updates that they can steal. In short, RSS feeds completely remove the human element from plagiarism and make scraper empires simply a matter of “setting and forgetting”.
For plagiarists, this is a very lucrative and very easy game; for content creators and copyright holders, it is a frightening possibility.
How Truncated Feeds Protect
Fortunately, the current crop of RSS scrapers appears to be pretty stupid. They copy and paste pretty much blindly. They aren’t capable of any deep digging or much critical thinking. Thus, if you don’t post your entire story in the feed, your average scraper can’t get at it. Where humans simply click the link to follow up, scrapers simply post a synopsis and a link to the original article, an outcome far preferable to the one the scraper was designed for.
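On the publisher's side, truncation is about as simple as it sounds. The sketch below is one way it might be done, assuming a cut-off measured in words; the `truncate_for_feed` name, the default of 40 words and the “[...]” marker are my own choices for illustration, not how any particular blogging package does it.

```python
def truncate_for_feed(article_text, max_words=40):
    """Return a summary of at most max_words words for use in a feed item."""
    words = article_text.split()
    if len(words) <= max_words:
        # Short pieces go out whole; there is nothing to cut.
        return " ".join(words)
    # Keep the opening words and flag that the piece continues on the site.
    return " ".join(words[:max_words]) + " [...]"


article = "Truncated feeds feel safe, but the link gives the game away."
print(truncate_for_feed(article, max_words=5))
```

A scraper that republishes the feed item blindly gets only those opening words plus the link, which is exactly the synopsis-and-link outcome described above.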
Since scraper sites are designed, primarily, to be search engine fodder, people setting them up tend to avoid using abbreviated feeds. Since plenty of sites offer full feeds, often without realizing it, the scrapers never run out of content and the software never has to get any brighter.
However, as more and more sites begin to truncate their feeds, many lured by the illusion of greater security, scrapers will slowly begin to develop a stronger interest in defeating the technique. When that time comes, a combination of preexisting technology and the very design of the RSS system will likely be used to trivially defeat this shield and escalate the battle between author and plagiarist.
The Achilles Heel and ‘Smart’ Scrapers
The greatest weakness in RSS is also one of its greatest design achievements. Nearly every entry in every feed, both full and truncated, contains a link to the original article. This encourages readers to visit the site, participate in a discussion via comments, create trackbacks on their site and, possibly, view advertisements that help support the efforts of the content’s creator.
The problem is that the technology to extract text from HTML (or any other markup language) has been around for at least a decade, probably much longer. It’s a trivial matter to pull the content out of the Web page and it’s only a matter of time before scrapers get smart enough to look at the truncated feed, compare the text with the page and extract the full article.
In short, with a simple extra step, all feeds, regardless of content, become ripe victims for plagiarists. Where once plagiarists needed full RSS feeds because their “dumb” scrapers couldn’t work with anything but raw content, a moderately intelligent scraper would just need the links, a starting point and a place to stop.
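To show how little is involved, here is a rough sketch of that extra step using nothing but Python's standard library, which publishers can run against their own pages to see what a truncated feed really exposes. The page, the summary and the “Comments (” stop marker are invented, and the `page_text` and `locate_article` helpers are hypothetical names of my own; this is an assumption about how such matching could work, not a description of any existing scraper.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return the page's visible text as one whitespace-normalized string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def locate_article(html, feed_summary, stop_marker, probe_words=5):
    """Find the text that starts where the summary starts and ends at stop_marker."""
    text = page_text(html)
    # The starting point: the first few words of the truncated feed item.
    probe = " ".join(feed_summary.split()[:probe_words])
    start = text.find(probe)
    if start == -1:
        return None
    # The place to stop: an assumed marker that follows the article body.
    end = text.find(stop_marker, start)
    if end == -1:
        end = len(text)
    return text[start:end].strip()


# An invented page standing in for the article the feed item links to.
PAGE = (
    "<html><head><title>Post</title><style>p {color: black}</style></head>"
    "<body><div id='nav'>Home | About</div><h1>Better Mousetraps</h1>"
    "<p>Truncated feeds feel safe, but the link gives the game away.</p>"
    "<p>The full text sits one request away from the summary.</p>"
    "<div id='comments'>Comments (3)</div></body></html>"
)
SUMMARY = "Truncated feeds feel safe, but the link [...]"

print(locate_article(PAGE, SUMMARY, stop_marker="Comments ("))
```

The summary supplies the starting point, the page layout supplies the place to stop, and the link in the feed supplies the page. None of it requires anything the truncated feed doesn’t already hand over.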
The bad news is that the technology is already here and has been for some time; it’s just waiting to be applied. We’re all, more or less, sitting ducks and there’s not much we can do within the current bounds of the technology. Even my favorite technique, adding copyright footers to your feed, is easily defeated by this simple approach to stealing.
Plagiarists take advantage of RSS feeds because they’re easy and let them create hundreds, if not thousands, of sites quickly. The more difficult stealing from a site becomes, the less likely it is to be targeted. Scrapers need to grab a lot of content quickly to make their advertising clicks add up.
One potential solution would be to use random text, either as the footer or in the body of the piece. Though computers can be trained to ignore or omit text, if the software doesn’t know what text it needs to skip over, it is incredibly difficult to determine what to omit. Though humans will immediately recognize something as a copyright footer or a notice to them, computers won’t have such an easy time if it is different in every article.
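As a rough illustration of the idea, the sketch below picks a differently worded footer for each article and adds a random reference token, so there is no single fixed string for software to strip out. The wording of the templates, the `random_footer` name and the token format are all assumptions of mine; whether this would hold up against a determined scraper is an open question.

```python
import random

# Hypothetical footer wordings; a real site would want many more of these.
FOOTER_TEMPLATES = [
    "This article first appeared at {site}. Reference: {token}.",
    "Originally published on {site} ({token}). All rights reserved.",
    "Copyright notice {token}: reuse requires permission from {site}.",
]

# Characters for the token, leaving out easily confused ones like 0/o and 1/l.
TOKEN_ALPHABET = "abcdefghjkmnpqrstuvwxyz23456789"


def random_footer(site, rng=None):
    """Return a footer whose wording and token vary from call to call."""
    if rng is None:
        rng = random.Random()
    token = "".join(rng.choice(TOKEN_ALPHABET) for _ in range(8))
    return rng.choice(FOOTER_TEMPLATES).format(site=site, token=token)


print(random_footer("example.com"))
```

Passing in a seeded `random.Random` makes the output repeatable, which is handy for testing; in production you would leave it out so each article gets a fresh footer.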
Another possibility might be to tailor every article for the person reading it. By generating each feed dynamically on the server, it might be possible to date-stamp, timestamp and identify the machine reading the article. Combine that with some random text and all of this information will be copied by any scraper software that tries to steal the content.
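A minimal sketch of such a stamp follows, assuming the server knows the address of the machine requesting the feed. The `reader_stamp` and `make_tag` helpers, the secret key and the line format are hypothetical. The short keyed hash is my own addition so that the publisher can later confirm a stamp found on a scraper site is genuine. Bear in mind that many readers fetch feeds through shared aggregators, in which case the address identifies the aggregator, not the person.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical private key; only the publisher should know the real one.
SECRET_KEY = b"replace-with-a-private-key"


def make_tag(article_id, reader_address, stamp):
    """Return a short keyed hash tying the article, reader and time together."""
    message = f"{article_id}|{reader_address}|{stamp}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:12]


def reader_stamp(article_id, reader_address, when=None):
    """Return a line of text to embed in the copy served to this reader."""
    if when is None:
        when = datetime.now(timezone.utc)
    # Normalize to UTC so the trailing "Z" is accurate.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    tag = make_tag(article_id, reader_address, stamp)
    return f"Served {stamp} to {reader_address} [ref {tag}]"


# 192.0.2.10 is a documentation-only address standing in for a real reader.
print(reader_stamp("post-42", "192.0.2.10"))
```

If a stamped copy turns up elsewhere, recomputing `make_tag` from the article, address and time printed in the line shows whether the stamp really came from your server.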
However, the real problem is that these steps, if possible, would just be further escalations in the war between plagiarists and copyright holders. Real solutions are only going to come about when we take a look at the technology of RSS feeds and find ways to discourage automatic reuse.
After all, no one wants to prevent fair use or legitimate copying of work, at least no one I know. But until the technology can somehow be hardened against automatic reuse of content, automated plagiarism is going to be a fact of life and every solution we come up with is just going to be waiting for the next great evolution in the software.
The old adage about what happens when you build a better mousetrap comes to mind, as I’m sure it does for the authors of the scraping software we’ve grown to despise.