Saturday, November 04, 2006
Posted by GeoffreyF67

There's been quite a bit of discussion over the years about duplicate content.
For spammers, duplicate content has become a pretty big problem, because they're constantly scraping the SERPs, RSS feeds, and even sites themselves to get content to put on their pages.
Of course, spammers have come up with all kinds of ways to circumvent the issue, including Markov chains, reordering content on the page, adding RSS feeds to the page, and so on.
A while back, someone on Syndk8 ran across a post about Google allegedly using "shingles".
So after a little bit of research, I came across this document. I can't explain the math behind any of it (it usually makes my eyes glaze over), but the gist of what I read is that you can take any document and break it down into overlapping groups of words.
For instance, take the sentence "the quick brown fox jumped over the lazy dog".
Breaking it up into four-word shingles would look like:
"the quick brown fox", "quick brown fox jumped", "brown fox jumped over"
and so on.
If you do this to two pages and then compare the hashes of those shingles, you can tell whether one page is similar to the other.
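To make that concrete, here's a minimal sketch in Python. It's my own illustration of the idea, not code from the paper; the 4-word shingle size and MD5 hashing are just assumptions for the example.

```python
import hashlib

def shingles(text, size=4):
    """Break text into overlapping word shingles of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def shingle_hashes(text, size=4):
    """Hash each shingle so pages can be compared by sets of small values
    instead of raw text."""
    return {hashlib.md5(s.encode()).hexdigest() for s in shingles(text, size)}

def resemblance(text_a, text_b):
    """Overlap of shingle hashes divided by their union: a Jaccard-style
    score between 0 (nothing shared) and 1 (identical shingle sets)."""
    a, b = shingle_hashes(text_a), shingle_hashes(text_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

page_1 = "the quick brown fox jumped over the lazy dog"
page_2 = "the quick brown fox jumped over the lazy dog again"
print(resemblance(page_1, page_2))  # ~0.86 -> very likely a near-duplicate
```

Real pages produce thousands of shingles, so the score is a lot more stable there than it is on a single sentence.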
Of course, you do have the problem of it being a very intensive calculation, because you're not just comparing A to B; you're comparing every document against every other document. I think they call this an O(n²) problem.
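Just to put numbers on that (mine, not the paper's): with n documents you're looking at roughly n(n-1)/2 pairs, so the work explodes as the index grows.

```python
# n documents -> n*(n-1)/2 pairwise comparisons, which grows quadratically.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} documents -> {n * (n - 1) // 2:,} pairwise comparisons")
```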
An interesting tidbit from the article: given some assumptions, with only 4 fingerprints per page you can estimate the probability that a document is a duplicate. And changing only 36 words would drop that probability very low (close to zero).
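I'm reading "fingerprints" as some kind of min-wise sampling: instead of storing every shingle hash, keep only the few smallest ones and compare those. The 4-fingerprint and 36-word figures come from the article; the sketch below (a bottom-k sample using MD5) is just my guess at how the mechanism might look, not the paper's exact method.

```python
import hashlib

def fingerprints(text, size=4, keep=4):
    """Keep only the `keep` numerically smallest shingle hashes as the
    page's fingerprint (a bottom-k sample of the full shingle set)."""
    words = text.lower().split()
    hashes = sorted(
        int(hashlib.md5(" ".join(words[i:i + size]).encode()).hexdigest(), 16)
        for i in range(len(words) - size + 1)
    )
    return set(hashes[:keep])

def looks_like_duplicate(text_a, text_b, overlap=2):
    """Flag two pages as probable duplicates if their tiny fingerprints
    share at least `overlap` values."""
    return len(fingerprints(text_a) & fingerprints(text_b)) >= overlap
```

The trick is that pages with mostly identical shingles tend to share their smallest hashes too, so comparing 4 numbers per page becomes a cheap proxy for comparing thousands of shingles.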
It's rather fascinating stuff, and the author of the PDF points out that this was back in 2000, so Google may or may not be using this type of system to look for duplicate content today.
This brings up some interesting questions, none of which I'm positive about, so your input is appreciated:
- From a white hat perspective, what happens when 50 spam sites scrape your feed? Will your content get penalized, or will the spam sites?
- What about WordPress blogs? Are they penalized for having their articles on the main page as well as on the individual post pages?
- What happens to the black hat who gets sneaky, inserts a random word every 10 words or so, and hides it with CSS so the user doesn't see it?
- Anything else I'm missing here?
G-Man
P.S. Don't mess with Granny!