Stopping Self Content Theft

Photo credit: "Office Space" by WallTea, licensed under Creative Commons.

Feeding Google’s insatiable appetite for content is one of the main reasons infringers scrape and plagiarize content, and it is also one of the biggest reasons it is important to monitor for such misuse and, in many cases, defend against it.

The logic is simple enough: the more copies of a work that appear, especially without proper attribution, the less likely the search engines are to give credit to the original source. This can erode search engine performance, especially for smaller, less-established sites or those in highly competitive fields.

However, duplicate content doesn’t just come from plagiarists and spammers; it also comes from our own actions when dealing with our own content. Some of it stems from errors within our sites, and some from how we approach social networking and social news.

So even as we enforce our rights elsewhere, we have to be careful about how we use our own works. Though it might not be infringement, it can certainly have a very negative impact on you and your site and is worth dealing with all the same.

Starting at Home

The first steps in dealing with duplicate content have to start on your own site or blog. Many people don’t realize how many opportunities there are to create duplicate content on their own sites, even by pure accident.

Consider the following examples from a simple blog:

  1. Tag Pages: Tag pages have much of the same content as individual post pages and are generated by most blogging applications.
  2. Archive Pages: Monthly, yearly and other archive pages, similar to tag pages, have the same content, or significant portions of it, repeated.
  3. Category Pages: As with Archives and Tags, category pages repeat content.
  4. Printable Pages: Many themes include printable versions of content pages that can be indexed as duplicates.
  5. Comment Pages: Finally, depending on how comments are set up, a separate, duplicate page may be created for the version of the post that displays comments.

Depending on how your blog is set up, it is entirely possible that a single article appears six or more times on your site. Google and the other search engines have to decide which of those pages is the best one and link to it. However, they don’t always make the right decision and, in extreme cases, can even decide that the site is spamming and either lower its ranking or remove it.

Thus, it is important to make sure that you keep this duplicate content to a minimum and do your part to let the search engines know what you want them to link to. Here are a few tips:

  1. Show Summaries: When possible, display only article summaries and link to the full article. There is no reason for your tag, archive or category pages to show the full text of every entry.
  2. Use Robots.txt: Use your robots.txt file to prevent search engines from indexing unneeded pages, such as your printable pages. However, use caution with this method (a sample file follows this list).
  3. Use the Canonical Tag: Google, Yahoo and Bing all support the canonical tag, which tells search engines which version of a page is the best one to include in the index (see the example below).

In short, be very clear about which versions of your content are ideal and keep the duplicates to an absolute minimum. Doing so will greatly help the search engines tell which page to link to, helping both you and the search engines provide a better service.
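
To illustrate the robots.txt approach, here is a minimal sketch. The /print/ and /tag/ paths are hypothetical; substitute whatever paths your blogging software actually uses for its duplicate-prone pages:

    # Keep crawlers out of duplicate-prone sections
    # (these paths are examples only; match them to your own site's structure)
    User-agent: *
    Disallow: /print/
    Disallow: /tag/

Remember that a page blocked this way cannot be crawled at all, which means it also cannot pass along any link credit. That is one reason to use this method with caution.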
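
The canonical tag, by contrast, is a single line placed in the head of each duplicate version of a page, pointing at the preferred URL. The address below is only a placeholder:

    <link rel="canonical" href="http://www.example.com/original-article/" />

With that tag on your tag, archive and printable pages, the search engines know to treat the full article page as the one to index and link to.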

Away from Home

The other problem with self-defeating content use lies away from the home site. Where once a person’s entire presence was on their home page, it can now be scattered all over the Web, including other sites they run and the social networking sites they integrate with and use.

While it may seem like a great idea to post your content on every site you take part in, doing so can confuse the search engines. You want your efforts in social media to support your search engine strategy, not replace your original site. However, many people unwittingly do exactly that.

Here are a few ways to avoid that:

  1. Unique Content for Each Site: If you run multiple sites, you need unique content for each. You can use snippets of content to cross-promote, and certainly link between them, but don’t repost everything. Doing so confuses search engines and readers alike.
  2. Use Snippets: When posting your content on other sites, use snippets and link to the original works. The likelihood of a snippet replacing your content, in human or search engine eyes, is slim to none.
  3. Require Links: Whenever any of your content appears on another site, even in snippet form, request a link back to the original, specifically a search-engine-friendly one (see the example after this list).
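
For instance, an attribution link on a reposting site might look something like the sketch below; the URL and wording are placeholders. The important part is that it is an ordinary link, without a nofollow attribute, so search engines can follow it back to the original:

    <a href="http://www.example.com/original-article/">Read the original article at Example Blog</a>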

In short, be careful how you use your content. Though linked use isn’t likely to hurt you with the search engines, if you aren’t careful you can undercut your own site by spreading your work too thin, too carelessly.

Bottom Line

When we think of content reuse, we think of what others do with our work. The fact is, however, that we are each the biggest reuser of our own work and, perhaps, the most important one.

Though we can and should track how others use our content, and prevent uses that are against our wishes, it is also important to keep an eye on ourselves and make sure that our own actions are working for us and with our strategy.

As with anything else in life, the best place to start your content strategy is by looking at yourself and your own actions. After all, you are your own biggest customer.

COMMENTS

  1. Well, I think this has a lot to do with your other post "How to Correctly Use Creative Commons Works". What you say here is plain sane, but sadly many webmasters manage their content in a way that seeks only to please Google's policies (Google brings the most visits). That makes me think: besides the fact that monopolies aren't good at all, what if those policies keep changing, becoming more restrictive every time?

    On the other hand, one point was not very clear to me: if an article under CC links directly to the original source article, it doesn't count as duplicate content, right?

    • The answer to your second question is that it does count as duplicate content. However, the theory is that by linking to the original source, the repost should be treated as the duplicate and the original as, well, the original.

      As the reply below points out, that doesn’t always work, and different sites seem to be vulnerable to it in different ways. Some seem to be immune to it; others find themselves hurt by it, literally by accident.

      The comment below is right in that you should monitor where your content appears, stop uses that are against your terms and track closely how it affects you in Google. If you’re big enough and established enough, duplicate content from others might not hurt at all. If you aren’t, it can cripple you.

      Just keep on top of it and see what the situation is for you and your site.

  2. I got into a debate with someone the other day regarding showing summaries. They said it doesn't prevent your content from being scraped. I told them they were right, but it does prevent my entire post from showing up on the scraper's site. Great post!

    • The problem with showing summaries in your main RSS feed is that, while it limits the impact of scraping, it also limits the usefulness of your feed and the number of people willing to use it.

      It has to be weighed carefully, but you are right that it can be a valuable tool when appropriate.

  3. @David – Yes, it does count as duplicate content. It is a *huge* misconception that, if there is a link, you will get credit.

    Any time your content is taken and posted somewhere else it *is* duplicate content. It doesn't matter whether you authorized the duplication or not. A link can help, but it is *not* a guarantee that the original source will get credited properly.

    Yahoo and Bing do a decent job of sorting out the owner and providing proper credit and search engine placement. Google, on the other hand, is horrible at it.

    Older sites can withstand duplication better than newer ones. But the real factor is the size of the site. Very big new sites created with lots of stolen content can rank very well in Google.

    On the other hand, the smaller the site, the more likely it is to have problems with Google — even if a link back to the original source is included with the copy.

    Here's the bigger issue: If enough content is taken from your site, especially in a short amount of time, Google can decide:

    1) not to rank any content that lies along the same path, or within the same section, of the site (it bases this decision on your site structure and directory folders). This means Google will also throw out content that remains unique to your site just because enough other pages in the same section were duplicated; and

    2) not to rank any new content you add after it decides your site is worthless.

    You also might lose your page rank for the duplicated pages, and you might not gain page rank for the new pages you add. This can put your site at a serious disadvantage, making it even easier for people to take your content and undermine your search rankings.

    If you are concerned about Google placement, monitor your content. Don't let others just take it from you without a fight, and make sure you are not duplicating things yourself, either internally or via postings on Facebook or any other social media site.

    I do not understand why Google behaves this way. But it does. So be vigilant.

    On the other side of things, if you are reposting other people's content, make sure you ask first. Please also don't get all angry about it when the response you get is "no" or "you can take a short excerpt, but not the whole document."

    Sorry, I know this is kind of a rant, but this touched a nerve, given the damage someone did to my site over the last two months.

    • To be clear, I agree with most of this, but would say that Yahoo! and Bing are both prone to making mistakes as well when judging original vs. duplicate content. No search engine is perfect, and I don't really think Google is any better or worse in this area. I'm sorry that you got bit by it, though.

      I would like to emphasize that the impact is different for every site. Some sites are not affected at all, others are crippled. It's just a matter of the specific case and there doesn't seem to be much rhyme or reason to it.

      Thank you for your comment and we don't mind if it's a rant. Let me know if there is anything I can do to help you with your site.
