Content Theft and Microformats

As the Web has evolved, the content on it has become more and more accessible. The results have been both very good and very bad.

Users have more choice than ever in how they view content, legitimate syndication is easier than ever and information, on the whole, travels much more quickly. On the other hand, content theft has become significantly easier (and more profitable), increasing not only the rate of plagiarism but also the pace at which the Web fills up with junk.

The trend, however, is unyielding, and microformats may very well be the next step in it.

However, as with any advancement that increases one's ability to access and reuse a work, microformats come with their own set of potential hazards and pitfalls that must be made clear. In the case of microformats, though, it may be possible to address some of these issues before the standards become widely accepted, greatly reducing the potential for damage.

What Are Microformats?

Getting a straight answer as to what microformats are is very difficult. Even the about page on the microformats site is of very little help, doing more to explain what the concept isn't than what it is.

However, the concept of microformats is actually fairly simple. Microformats are a set of standard conventions for marking up data within a regular HTML document. They let you tell applications that can read microformats what each part of a page is (title, body, attribution, etc.) without having to alter the design, flow or readability of the page.

In short, microformats allow Webmasters to embed much of the functionality of an RSS feed, which lets applications easily process, reformat and redisplay content, into a regular HTML page without making any visible changes.

There are many different kinds of microformats, including hCard, which is used to mark up contact information, hReview, which is used to format product, movie and other kinds of reviews, and hCalendar, which makes it easy to share events and appointments. Theoretically, one can take these formats, use them to invisibly mark up an already-existing HTML file, and then distribute the information much as one does with an RSS feed today.
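To make the idea concrete, here is a minimal sketch of how an application might read hCard data out of an ordinary page. The sample markup, the person and company in it, and the names `SAMPLE_HTML`, `HCardParser` and `extract_hcard` are all made up for illustration; only the class names (`vcard`, `fn`, `org`, `url`) come from hCard itself. It handles just this flat case and is not a full hCard parser.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: normal, visible HTML whose class names
# happen to follow the hCard conventions.
SAMPLE_HTML = """
<div class="vcard">
  <span class="fn">Jane Doe</span> works at
  <span class="org">Example Widgets</span>.
  <a class="url" href="http://example.com/">Home page</a>
</div>
"""


class HCardParser(HTMLParser):
    """Collects a few hCard properties from flat (non-nested) markup."""

    TEXT_PROPERTIES = {"fn", "org"}  # properties whose value is the element text

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # text property currently being read, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        for name in classes:
            if name in self.TEXT_PROPERTIES:
                self._current = name
                self.fields[name] = ""
        # For "url" the value lives in the href attribute, not in the text.
        if "url" in classes and attr.get("href"):
            self.fields["url"] = attr["href"]

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


def extract_hcard(html):
    """Return a dict of the hCard properties found in the given HTML."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return {name: value.strip() for name, value in parser.fields.items()}


if __name__ == "__main__":
    print(extract_hcard(SAMPLE_HTML))
```

A visitor sees only a sentence and a link, while the program gets a name, an organization and a URL as separate fields. That is the same structure an RSS reader relies on, recovered from the page itself.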

It's an exciting concept that could facilitate syndication by eliminating the need for two versions of a document, one human readable and one machine readable, while opening up more types of material for syndication and providing better formatting for it. However, microformats have yet to achieve the broad support of RSS and many of the actual formats are still in the draft stage. Still, Technorati has created an experimental microformats search engine and a site for pinging microformat content.

But with this great new potential come great new dangers.

Scraping and Microformats

Since microformats share many of the functions of RSS feeds, they come with many of the same perils, most notably scraping.

Since microformats identify individual elements of a created work (author, item, description, rating, etc.), microformatted content provides more information to potential scrapers, better enabling them to pick and choose the content that they want. Furthermore, since microformats can, and most likely will, be applied to any kind of material, they make content that previously wasn't available by RSS feed easy to scrape.

In short, microformatted content can be scraped as easily as, if not more easily than, RSS feeds and, sadly, such scraping appears to have already taken place.

Ja, whose comment on my previous article about cloaking was the inspiration for this article, recently discovered someone reposting hReview content in a "fairly shady way" and drew attention to it with an hReview of his own. The site has since shut down, but it serves as a proof of concept that scraping through microformats is not only possible, but inevitable as the standards become more widely accepted.

The creators of microformats, however, have not been blind to the problem and have taken some steps to stave it off. Those steps might not prove as useful as one would hope.

Defense

Unlike RSS and most other syndication formats, microformats have a built-in license standard (Note: RSS has a proposal for a similar standard). In fact, the default HTML code offered by Creative Commons includes the microformat, effectively tagging nearly every CC page with a small microformat (though not one that affects the text).

However, since microformats are purely for markup, there is no way to actually enforce the license. While it provides a powerful tool for legitimate aggregators to seek out content they can use safely, unethical ones are able to simply ignore the license altogether and continue scraping. The most that it can do to prevent unwanted scraping is offer a warning to scrapers that might be doing it out of ignorance rather than intentionally.
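The sketch below shows, under stated assumptions, how a well-behaved aggregator might look for the license microformat before reusing a page. The sample page and the names `LicenseFinder`, `find_licenses` and `may_republish` are hypothetical. The sketch assumes the license is expressed as a link carrying `rel="license"`, as in the markup Creative Commons hands out.

```python
from html.parser import HTMLParser

# Hypothetical page carrying a Creative Commons style license link.
SAMPLE_PAGE = """
<p>My article text goes here.</p>
<a rel="license" href="http://creativecommons.org/licenses/by/2.5/">
  Some rights reserved
</a>
"""


class LicenseFinder(HTMLParser):
    """Collects the href of every link marked with rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, e.g. "license nofollow".
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels and attr.get("href"):
            self.licenses.append(attr["href"])


def find_licenses(html):
    """Return the license URLs declared in the page, in document order."""
    finder = LicenseFinder()
    finder.feed(html)
    finder.close()
    return finder.licenses


def may_republish(html, allowed_prefixes):
    """True only if the page declares a license the aggregator accepts."""
    return any(
        url.startswith(prefix)
        for url in find_licenses(html)
        for prefix in allowed_prefixes
    )


if __name__ == "__main__":
    accepted = ["http://creativecommons.org/licenses/"]
    print(find_licenses(SAMPLE_PAGE))
    print(may_republish(SAMPLE_PAGE, accepted))
```

Note that the check lives entirely in the aggregator's own code. A scraper that never calls anything like `may_republish` is not slowed down at all, which is exactly the enforcement gap described above.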

On the other hand, since microformats allow the creator as much control as an RSS feed, if not more, it's easier to put copyright information and attribution into the work itself. There's no need to learn a new syntax, edit a template or make any other complicated changes.

In that regard, microformats could actually strengthen the user's ability to protect his content, at least marginally.

Conclusions

Sadly, microformats are likely to become a boon for sploggers and scrapers, who gain both a new format to scrape content from and new resources to ping, obtain links from and be found in. Worse still, the tricks and methods that can detect and stop scraping of RSS feeds would need to be either altered or discarded when trying to protect microformatted works.

Unless the standard is somehow guarded against scraping, which no open system can be, content creators will likely find that scraping, especially keyword scraping, will increase drastically.

This is not to say that microformats are a bad thing; they certainly are not. However, they will come with a heavy price, and it is a price that will be paid by copyright holders, search engines (who have to fight the spam) and even end users.

As more and more big names, such as Yahoo, throw their weight behind microformats, it appears to be a technology that may start directly impacting our world soon. Most likely, we will all be using some form of microformats in the not-too-distant future.

The question is, are we really ready to pay the price?

