Scraping Starts from the Very First Post

Rachel Radison is a New Orleans-based mortgage broker. After seeing the difficult home-buying climate in the city following Hurricane Katrina, she decided that she could help would-be buyers with her know-how.

Radison, having heard about these new Web sites called “blogs”, decided to create one herself and obtained a WordPress.com account. However, her ignorance about blogs quickly caught up with her. Shortly after her first post, she checked her feed statistics and found that a whopping eleven people had subscribed to her feed.

She then quickly learned they weren’t interested in her feed at all, just her content.

Fortunately, Rachel Radison does not exist. She is a figment of my own imagination. I created both her and her site as an experiment to see both how common scraping is and how long it would take for scrapers to find a blog.

The answer surprised even me.

Background

The idea for the experiment came from an article on A Daily Rant. There, the owner took an ongoing WordPress.com blog and shut it down, leaving only a “this blog has been moved” post to let subscribers know. He then tracked the subscribers to the feed and found that, even after most subscribers had moved over, eighteen still remained.

The experiment was interesting but flawed. Many humans change RSS readers and often forget to unsubscribe from old feeds everywhere they’ve been. For example, I am certain there are many dead feeds in my old Rojo account. It is entirely possible that, of the readers the blog had prior to the shutdown, eighteen were simply old, but legitimate, RSS readers continuing to check the dead feed.

However, the idea seemed valid enough: take a dead feed and monitor the traffic on it. If you can set it up so that no humans should be subscribed to the feed, you can be reasonably sure that all traffic to the feed is from bots.

So, that is exactly what I did.

The Experiment

Using an old Gmail account, I created a new WordPress.com blog. I then gave the blog a theme and a name: “Rachel Radison’s Mortgage Blog”. I specifically chose mortgages as the subject matter because the word is a spam-friendly keyword and because it is a topic I have at least a little knowledge about after buying my new home.

In order to avoid being caught in WordPress.com’s spam filters, I decided to write the posts myself, using my extremely limited knowledge of the subject and a large amount of fluff. After a few moments of faking my way through the first post (being sure to add a footer disclaiming the site as rubbish), I made sure that the blog was going to ping all of the usual notification services and published it.

I then returned the next day to do another post but stopped off to check the feed stats first. What I saw is below:

day12.png

Even I was stunned by this. I had expected the feed to be scraped, but I had not expected eleven “subscribers” in less than 24 hours.

Some of these subscribers could be easily explained. Technorati and Google both picked up the feed almost immediately. Likewise, WordPress seems to have its own crawler. However, no others have picked it up as of this writing and that leaves at least eight “subscribers” unaccounted for.

But Wait, It Gets Worse

I posted again on the seventeenth but, due to difficulties with my move, was very late in posting on the eighteenth. However, before I posted, I checked the feed stats again. What I saw is below:

day3.png

The second day had seen a whopping sixteen subscribers, though no new search engines had picked up the feed. Even stranger, day three had dropped to only three subscribers, a drop of over 80%.

However, shortly after I posted the third, and final, article, the subscriber count more than doubled, reaching eight. Though still a marked drop from the day before, it showed that the feed subscribers were prompted by my posting and not by the creation of the blog.
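
For readers who want to check the arithmetic, here is a minimal sketch in Python. The counts are the ones reported above; the labels and the `percent_change` helper are just names chosen for illustration.

```python
# Feed "subscriber" counts as reported in this article.
counts = [
    ("day 1", 11),
    ("day 2", 16),
    ("day 3, before posting", 3),
    ("day 3, after posting", 8),
]


def percent_change(old, new):
    """Signed percentage change going from old to new."""
    return (new - old) / old * 100.0


# Pair each reading with the one that follows it.
changes = [
    (prev_label, label, percent_change(prev, cur))
    for (prev_label, prev), (label, cur) in zip(counts, counts[1:])
]

for prev_label, label, pct in changes:
    print(f"{prev_label} -> {label}: {pct:+.2f}%")
```

Going from sixteen to three is a drop of 81.25%, and going from three to eight is an increase of well over 100%, which is what “more than doubled” means here.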

To test this, I then let the blog lapse for a few days. Very quickly, the feed subscriber count completely flat-lined, reaching zero.

day7.png

This does not follow a “human” pattern. Feed counts do rise and fall, as anyone at FeedBurner will tell you, but they do not rise and fall like this. This is, almost certainly, the work of bots, both good and bad.

The outcome is pretty damning. It is obvious that there are at least some scrapers waiting for your site from the very first post and that being an unknown blogger is no protection against RSS abuse.

Problems with the Study

This isn’t to say that there aren’t problems with the study. There are several.

First and foremost, the study, by itself, means nothing. It is just one site on one service and on one topic. A more complete study would try more blogs on a variety of topics and services.

But a bigger problem is that I cannot account for all of the subscribers of the feed. I have done several searches for the scraper sites but have had no luck in locating them. Odds are, it is simply too early for them to have been picked up by the search engines. Even the original site is only in Technorati and Google as of this writing.

Also, with most search engines running spam filters, it is very likely that they would catch the scraped blogs before they were indexed. In fact, I have a feeling that several have even determined that the original blog is spam, which it technically is, and refused to index it as well, which would explain why Icerocket, Sphere, Yahoo! and others have not added the original either.

But even if we consider all of the legitimate blog search engines that would likely be looking at the feed, it doesn’t account for all of the subscribers.

That conclusion is further supported by the fact that WordPress could not identify most of the readers of the feed, and most of those it did identify were deemed “Web browsers”.

Also, as you can see in the image below, traffic to the site itself never reached anywhere near traffic to the feed (save today, when I’ve been visiting the site). Most search engines, including Technorati, also visit the site when indexing the feed to ensure they get the full post (it is a way to guard against partial feeds limiting blog search engines).

day7visits.png

If the uses of the feed were legitimate, then traffic to the site would, initially, either meet or exceed the traffic to the feed. With no long-term subscribers that may not visit the site every day, the fact that over a dozen “subscribers” accessed the feed without accessing the site is very suspicious.
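
With raw request logs, the same comparison could be made per client rather than in aggregate. WordPress.com does not expose that data, so the following is only a sketch under assumptions: the request records, the addresses, and the `/feed` path prefix are all hypothetical, and a feed-only client is a reason for suspicion, not proof of scraping.

```python
from collections import defaultdict

# Hypothetical request records: (client address, requested path).
requests = [
    ("203.0.113.5", "/feed/"),
    ("203.0.113.5", "/first-post/"),
    ("198.51.100.7", "/feed/"),
    ("198.51.100.9", "/feed/"),
    ("192.0.2.44", "/"),
]


def feed_only_clients(records, feed_prefix="/feed"):
    """Return clients that requested the feed but never any site page."""
    paths_by_client = defaultdict(set)
    for client, path in records:
        paths_by_client[client].add(path)
    return sorted(
        client
        for client, paths in paths_by_client.items()
        if all(path.startswith(feed_prefix) for path in paths)
    )


suspects = feed_only_clients(requests)
print(suspects)
```

In this made-up sample, two of the three feed clients never touch the site, which is the kind of gap between feed traffic and site traffic described above.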

Conclusions

Though it is hard to draw any solid conclusions from this study, there are at least three things that are obvious:

  1. Suspicious use of a feed begins, literally, with the first post.
  2. Being an unknown blogger is no defense against scraping.
  3. Spammers are basing much of their scraping on the notification services most blogs ping. If a pinged post has a keyword they are targeting, it seems that they then visit the feed to grab the content.

What isn’t clear at this time, and likely won’t be until search engines update or drop their spam filtering, is how many of these suspicious visitors were truly scrapers. Almost certainly some were, but some were also likely pinging services and search engines.

But if simple math is a clue and we believe that most legitimate services would also look at the site, it seems that the vast majority have less than honest intentions.

The bottom line is that something rotten is going on; it is just a matter of how rotten it is.

14 comments
engtech @ internet duct tape

I think you're misinterpreting your data source. The wordpress.com feed stats always follow the ebb and flow of your posting frequency.

I have a popular wordpress.com blog, and my feed readers are split between the wordpress.com feed (http://engtech.wordpress.com/feed or http://internetducttape.com/feed) and the FeedBurner feed (http://feeds.feedburner.com/engtech). Wordpress.com doesn't let me redirect to my feedburner feed.

About 624 of my readers are in FeedBurner, there's another 400-700 who grab the feed directly.

Here are screenshots of my stats from wordpress.com and from FeedBurner. As you can see, there are serious discrepancies. I trust the FeedBurner stats much more.

http://i115.photobucket.com/albums/n296/engtechwp/special/feedburner.png
http://i115.photobucket.com/albums/n296/engtechwp/special/wordpresscom-feeds.png
http://i115.photobucket.com/albums/n296/engtechwp/special/post-freq.png

To make it worse, the wordpress.com stats seem to be pretty dumb in that they count feed reader hits even if it's just someone clicking on your link from another feed. Not an issue for this experiment, but something to note.

Bottom line: no conclusions can be drawn from using wordpress.com feed stats. Set up a blog somewhere that lets you use FeedBurner stats and you'll have a *much* better data sample.

Interesting idea, but the data you're basing it off of is so questionable to start with.

JB

Elf's DH,

I'd heard of that but had not seen an actual case of it taking place. Sure I've gotten the spam with the text in it, but I've never seen my own work used in that way.

Sadly though, you may be very right. If that's the case, the odds of me finding this text are slim to absolutely none.

Elf's DH

“I have done several searches for the scraper sites but have had no luck in locating them.”

Not all scraping is done for splogs. I've gotten (and I'm sure everyone else has gotten) spam emails that have scraped sentences from random websites in order not to be filtered out as gibberish by Bayesian filters. (A particularly amusing one I got reconstituted the descriptions of birds from the Audubon Society).

JB

Morin,

As I said in the article, I'm not sure. There are two things that do disturb me. The first is that WordPress could not identify most of the feed readers. I would think that it would recognize one from an obvious source such as Weblogs.

Second, those it DID identify were listed as "Web Browsers" and there should not have been any human subscribers to the feed (I didn't even subscribe). Many scrapers hide their bots by having them identify themselves as Web browsers, it is a well-known trick.

I would say that about 80% of the subscribers were listed as "unknown" and the rest were Web Browsers. I wish I had taken a screenshot of that as well but I was in a rush due to the move. I might reignite the experiment later today and see what happens.

WillMacc,

Thanks for providing further confirmation to my theory. If you have any statistics on that, I would love to see them, perhaps we should work together and form a more thorough study? This was just quick and dirty to get a feel for the problem.

Obviously more research needs to be done as the problem is greater than even I imagined...

WillMacc

Also... :)
A lot of "pseudo" feeders are attached and monitor other ping services.
I've seen countless visits from known crawlers with "bad intentions" hit the site as soon as a ping is sent out.
If you have a blog hosted on your own domain, you can issue a ping (to only one service - say; pingomatic) and then sit back and watch who starts hitting the site.
You'll quickly see a boatload of crawlers come, and a lot of them will not appear as crawlers, but as regular user-agents. If you follow the trends of the crawlers/visitors after a ping, you'll probably start noticing some visitors will not pull any graphics on the blog, or only pull one hit, whereas most visitors will have line upon line of various content, items, and graphics that are embedded into the blog themes and within the articles.
Those that do that are usually bots and not legit users, but having said that, you'll have to be careful and pick out the RSS readers from the bots and crawlers.

Thanks,
WillMacc
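
The check WillMacc describes can be roughed out in a few lines of Python. This is only a sketch under assumptions: the log lines below are invented, they are assumed to be in Common Log Format, and the list of asset extensions and the `/feed` prefix are guesses that would need adjusting for a real blog. Feed requests are skipped so that ordinary RSS readers are not flagged.

```python
import re
from collections import defaultdict

# Hypothetical access-log lines in Common Log Format.
LOG_LINES = [
    '203.0.113.5 - - [01/Jan/2000:10:00:01 +0000] "GET /first-post/ HTTP/1.1" 200 5120',
    '203.0.113.5 - - [01/Jan/2000:10:00:02 +0000] "GET /theme/style.css HTTP/1.1" 200 900',
    '203.0.113.5 - - [01/Jan/2000:10:00:02 +0000] "GET /theme/header.png HTTP/1.1" 200 20480',
    '198.51.100.7 - - [01/Jan/2000:10:00:05 +0000] "GET /first-post/ HTTP/1.1" 200 5120',
    '192.0.2.44 - - [01/Jan/2000:10:00:09 +0000] "GET /feed/ HTTP/1.1" 200 3000',
]

# Captures the client address and the requested path.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)[^"]*" \d{3}')

# Assumed set of static-asset extensions a real browser would also request.
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".ico")


def pages_without_assets(lines, feed_prefix="/feed"):
    """Return clients that fetched pages but never any embedded assets."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # ignore lines that do not parse
        client, path = match.group(1), match.group(2)
        bare = path.split("?", 1)[0].lower()
        if bare.startswith(feed_prefix):
            continue  # feed readers never load graphics, so leave them out
        if bare.endswith(ASSET_EXTENSIONS):
            assets[client] += 1
        else:
            pages[client] += 1
    return sorted(client for client in pages if assets[client] == 0)


flagged = pages_without_assets(LOG_LINES)
print(flagged)
```

A client that shows up here is only a candidate for a closer look; text-only browsers and image-blocking setups would be flagged too.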

WillMacc

The moving of the blog and all that came after really wasn't to see who was scraping, but it did provide some confirmation that a lot of those still visiting the site and the site feed were more than likely bots that are still habitually visiting the site/feed daily.
I understand that people don't regularly check RSS readers - as in Google Feed Fetcher - but I would dare to guess that's only a small percentage of the visits.
Moving the blog also provided me a chance to see who/what visits the site, whereas with my WordPress.com blog, I could only see hits and not the actual visit information.
Since the move, probably 10 crawlers have been shut down from scraping the content of the blog, BUT scraping a blog like mine isn't that big of a deal since it's the information on the blog that's important. So, if my content gets scraped and ends up on another blog - fine; the information is still valid and people still get to see who's doing what with what and whom.. :)

Thanks,
WillMacc

Randy Charles Morin

Are you sure the scraping was for splogging? If you ping weblogs.com with a new blog, then many RSS based search engines will pick it up and begin reporting 1 subscriber immediately. It's not that they are incorrectly reporting 1 subscriber, but rather that they don't report subscribers to FeedBurner, so FeedBurner assumes 1 subscriber.

Trackbacks

  1. [...] get asked by reporters and bloggers alike exactly how bad scraping is on the Web. I discuss my past experiments on the topic and how, depending on your keywords, suspicious traffic starts showing up with the [...]

  2. [...] Scraping Starts from the Very First Post: A fascinating experiment to see how quickly sploggers start scraping a feed. The results are rather shocking. On PlagiarismToday. [...]
