One of the biggest concerns a small blog has is ranking its content in Google’s search results. It’s difficult enough to compete with all of the massive megasites dominating results, so the idea of content theft making the problem worse is devastating.
I’d like to reassure you that such things never happen, but the fact is, Google isn’t perfect. Let’s talk about the problem, how Google tries to prevent it, and what you can do if it happens anyway.
For every legitimate small business blog out there, there’s a spammy or thin blog to match. These blogs are made by the dozens in private networks to sell backlinks, or in tiered link building schemes to boost the money site of one black hat marketer. For the most part, these schemes are firmly against Google’s policies, and when Google identifies sites as being part of spam networks, they will generally deindex the sites.
Since these sites tend to be short-lived, there’s no sense in building them up with content the marketer pays for or writes themselves. Instead, they just plug in their search queries and find content published on the 10th page of Google or whatever, and just steal it. Sometimes, if the original owner is lucky, the black hat will spin the content so it’s “unique”, though that’s not always a guarantee Google won’t identify it as spun.
You can read about Google’s policies on scraped content here. Specifically, Google talks about sites that:
- Copy and republish content from other sites without adding any original content or value.
- Copy content from other sites, modify it slightly, and republish it.
- Reproduce content feeds from other sites without providing some type of unique organization or benefit to the user.
- Embed content such as video, images, or other media from other sites without substantial added value.
This is all to specify what is and isn’t content theft.
- If a site copies content or images from another site with or without attribution, as long as there is no significant added value, it’s theft.
- If a site copies content from another site and spins it, it’s still theft if it’s identified.
- If a site copies content from another site, but adds proper canonicalization and attribution, it’s not theft, it’s syndication.
- If a site copies content from another site but adds additional value, it’s more like a quote or an aggregator. For example, my previous set of bullet points is a word-for-word quote of part of that Google page; it’s not theft, because I properly attributed it and have made clear that it’s a quote.
So to be clear, if you publish a guest post on Yahoo and it’s syndicated to another site, Google isn’t going to penalize anyone for it, regardless of whether you published the content on your own site or not. If you publish a piece of content and then later find it on spamblog.wordpress.com, then Google is more likely to step in and deindex the spam blog.
Stolen content can hurt a site, but not frequently. Contrary to popular belief, duplicate content penalties for stolen content are virtually unheard of. The real threat comes from when stolen content ranks higher than the original copy.
When stolen content outranks the original, the majority of the traffic that content generates goes to the thief. The original loses out on traffic, potential conversions, reputation, name recognition, and all of the other benefits of running a blog and publishing that content. If this happens too frequently, it essentially destroys the value of the smaller blog. Obviously, this isn’t good for anyone.
Google does have some parts of their algorithm in place to help prevent this from happening.
The general thought is that Google goes by the publication date of the content. If you published a piece of content on January 1, and someone else copies that content on April 4, Google will give preference to the earlier copy.
This is sort of true, but it’s not entirely true. Google doesn’t trust your own publication date, for a number of reasons. For one thing, it’s trivially easy to backdate a piece of content so that it appears to be published earlier than the original. If you just go by publication dates visibly posted, it’s like blindly trusting some random guy on the street who tells you he’s the president. Maybe he is, but all signs point to a lie.
There’s also post-dating content. If I write a piece of content in 2015 and then update it in 2017, I might change the publication date to the 2017 date, to reflect the fact that I updated it. If someone stole the content in 2016, then the stolen version is “older” and would then appear as though it’s the original. Of course, if I significantly change the content it no longer matters since it’s not a direct copy, but it could still be an issue.
If Google can’t trust the publication date, what do they do? The secret is in the indexation date. Google records the first time they find a piece of content online. If they then find that content published elsewhere, they’ll still trust the one they found first.
It’s not that simple, of course. Nothing with Google is ever that simple. There are other factors they add onto the list. Maybe Google indexed the more active spam blog first; what then? Well, maybe the original content was shared on social media before the spam blog was published. If an older link points to the content they found second, they might choose to update the date. Of course, the older links have to be reputable, not just something the spammer could have edited in their favor.
So in 99% of cases, content theft is properly handled by Google’s algorithm. However, that’s not 100%, and in fact John Mueller confirmed that there are “edge cases” where stolen content can outrank the original content. Sometimes it’s for niche keywords no one uses, sometimes it’s because of quoted snippets, and sometimes it’s because of improper syndication. Combine that with other edge cases where content was duplicated accidentally, and you end up with a tricky situation.
What to Do if it Happens to You
So what happens if it’s your content that was stolen, that now outranks your own version? What can you do about it? I’ve provided some options and alternatives.
First up, you need to determine if there’s any stolen content out there. If you’ve found your content scraped and republished on one site, do this step anyway; there could be more out there. I recommend making use of Copyscape. It, or any of the other major plagiarism checkers out there, will be able to scan and search for copies of your content. Simply plug in a snippet or a whole piece – or if you’re paying for Copyscape, your whole site – and let it scan.
If you don’t want to use a third party tool, you can run Google searches for snippets of your content. Pick a lengthy quote, at least one full sentence, that other people are unlikely to quote and that is unique enough that no one else would have written it independently. If you use too generic a sentence, you’ll find other people who simply wrote the same thing.
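If you have a lot of posts to check, picking those sentences by hand gets tedious. Here’s a minimal Python sketch of one way to do it, under one big assumption: that longer sentences are more likely to be unique. That’s my own rough heuristic, not anything Google documents. The function names `pick_search_snippets` and `exact_match_query_url` are made up for illustration. The only standard part is that wrapping a phrase in quotation marks asks Google for an exact match.

```python
import re
from urllib.parse import quote_plus


def pick_search_snippets(text, count=3, min_words=12):
    """Return up to `count` of the longest sentences in `text`.

    Sentence length is only a rough stand-in for "distinctive"; review the
    results by hand before trusting them.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = [s.strip() for s in sentences if len(s.split()) >= min_words]
    # Longest first, measured in words.
    candidates.sort(key=lambda s: len(s.split()), reverse=True)
    return candidates[:count]


def exact_match_query_url(sentence):
    """Build a Google search URL with the sentence wrapped in quotes."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{sentence}"')
```

Open the resulting URLs in a browser and see whether any domain other than yours shows up. Very long sentences may need trimming before they work well as a search.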
If you find domains that are copying your content, write them down and then analyze them.
Second, you need to make sure it’s not actually your fault. This is where the analysis comes in. In some cases, you may have inadvertently copied the content. This used to happen a lot with product descriptions; stores would copy product descriptions from manufacturers, causing duplicate content.
It’s also possible that you submitted a post to be published as a guest post and heard nothing back, so you published it on your own. The original site might have published it after all, perhaps later down the road, and it was simply a failure to communicate that led to the duplication. This one is at least reasonably easy to sort out.
In other cases, it could be a scraper or spammer, as mentioned above. It could also be improper syndication. If you publish content in a location that allows for syndication on other sites within a network, they have the legal right to republish that content. However, it’s up to them to actually implement proper canonicalization, usually a canonical link tag in the head of the republished page. Ideal canonicalization will point to your content as the original. The “copy” might still outrank yours, but at least your site gets credit for it.
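If you want to check whether a syndicated copy actually points back to you, you can look at its HTML for a canonical link tag. Below is a minimal sketch using only Python’s standard library. It assumes you’ve already saved or downloaded the copy’s HTML as a string, and the names `CanonicalFinder`, `find_canonical`, and `credits_original` are hypothetical helpers of mine, not part of any tool mentioned here.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated values; compare case-insensitively.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in `html`, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def credits_original(copy_html, original_url):
    """True if the copy's canonical tag points at `original_url`.

    Only a trailing slash is ignored; other URL variations are not normalized.
    """
    canonical = find_canonical(copy_html)
    return canonical is not None and canonical.rstrip("/") == original_url.rstrip("/")
```

If `credits_original` comes back false for a site that is supposed to be a legitimate syndication partner, that’s worth raising with them before assuming the worst.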
The other possible cause is simple theft. A freelance writer might submit an assignment that simply copies existing content. This still happens, despite how easy it is to double-check submitted content. This is why if you’re ever accepting work from a writer you don’t know or trust fully, you should run the content through something like Copyscape before you publish it. It’s also possible that the writer simply published their work in more than one location, which is functionally the same thing.
At this point, you should start documenting what you can. Copy the links of your content and the stolen content, but also take screenshots in case you end up getting into a legal dispute and the spammer attempts to hide the evidence. Dig into WHOIS information and see what you can come up with. Some black hats are dumb enough to attach their real names to their illegitimate dealings, making it easy to pursue them. All of this can be used in your case against them with a web host, with Google, or with lawyers.
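To keep that documentation consistent, it can help to log every copy you find in the same format. The sketch below is one possible layout, not a legal standard: it records both URLs, a UTC timestamp, a SHA-256 fingerprint of the copied page’s HTML as you saw it, and the path to your screenshot. The field names, the `evidence_log.jsonl` filename, and both function names are my own assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def evidence_record(original_url, copy_url, copy_html, screenshot_path=None):
    """Build one log entry describing a stolen copy as it looked right now."""
    return {
        "original_url": original_url,
        "copy_url": copy_url,
        # Timezone-aware UTC timestamp in ISO 8601 format.
        "captured_at": datetime.now(timezone.utc).isoformat(),
        # Fingerprint of the copied page's HTML at capture time.
        "copy_sha256": hashlib.sha256(copy_html.encode("utf-8")).hexdigest(),
        "screenshot": screenshot_path,
    }


def append_to_log(record, log_path="evidence_log.jsonl"):
    """Append the record as one JSON line so earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

A log like this doesn’t replace screenshots or a lawyer’s advice, but it gives you one tidy file to hand over to a web host or to Google.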
Third, you can reach out to the owner of the site that published the copy and ask for proper redirection or canonicalization. High quality sites that may have inadvertently published stolen content are very likely to respond positively. Once you prove that you own the content and published it first, they are likely to either provide attribution and canonicalization, or remove the content. In both cases, the person responsible for the theft will likely be fired. It might be worthwhile contacting other publications that the writer works for to investigate their work there, if you feel like destroying the “career” of a spammer and thief.
In a lot of cases, though, the owner of the site is the spammer, and they’ll simply ignore you. After all, if you drop the matter, they don’t have to do anything. Maybe by ignoring the problem, it will go away. For that matter, they don’t care about the sites, since they generally have such a short shelf life that it’s not worth worrying about.
If contacting the owner doesn’t work, you can contact the host of the content directly. For example, if WordPress.com hosts the content, you can contact the WordPress admins through their report forms. WordPress and many other web hosts don’t want to be labeled as havens for shady site owners or black hat content – it reflects negatively on them and hurts their business – so they’ll take down the page or the site as a whole, after investigation.
If the web host ignores you or denies your request, you can report directly to Google. In fact, you should probably do this right away in conjunction with other options. Google allows you to report web spam through this form, and if you’ve somehow been hit with a duplicate content penalty in Google Search Console, you can file a reconsideration request with the evidence of stolen content.
If all of this has failed, you can consider taking official legal action. The reason I recommend against legal action right away is that a lot of companies will simply lawyer up rather than fixing the problem. What was once a simple “oh sorry about that, we’ll take it down” becomes a lengthy legal exchange. Still, if you need to file a copyright claim, you can do so with the aid of a copyright lawyer. That will solve your problem for sure.