Yearlong SEO Case Study: What You Need To Know About Googlebot

Ann Yaroshenko
Published Aug 30, 2019 by Ann Yaroshenko in Google, optimization, SEO Tips & Resources
The views of contributors are their own, and not necessarily those of SEOBlog.com

Editor’s Note: Serge Bezborodov, CEO of the JetOctopus crawler, gives expert advice on how to make your website attractive to Googlebot. The data in this article is based on yearlong research and 300 million crawled pages.

A few years ago, I was trying to increase traffic on our job-aggregator website with 5 million pages. I decided to use an SEO agency's services, expecting traffic to go through the roof. But I was wrong. Instead of a comprehensive audit, I got a tarot card reading. That's why I went back to square one and created a web crawler for comprehensive on-page SEO analysis.

I’ve been spying on Googlebot for more than a year, and now I’m ready to share insights about its behavior. I expect my observations will at least clarify how web crawlers work, and at most will help you to conduct on-page optimization efficiently. I gathered the most meaningful data that is useful for either a new website or one that has thousands of pages.

What you need to know about Googlebot

Are Your Pages Showing Up in the SERPs?

To know for sure which pages are in the search results, you should check the indexability of the whole website. However, analyzing each URL on a website with 10 million-plus pages costs a fortune, about as much as a new car.

Let's use log file analysis instead. We work with websites in the following way: we crawl the web pages as the search bot does, and then we analyze log files gathered over half a year. Logs show whether bots visit the website, which pages were crawled, and when and how often bots visited those pages.
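As an illustration of what this log analysis involves, here is a minimal sketch (not JetOctopus's actual tooling) that counts Googlebot requests per URL in a standard combined-format access log. The regex field layout and the `googlebot_hits` helper name are my own assumptions:

```python
import re
from collections import Counter

# Hypothetical sketch: field positions follow the common
# Apache/Nginx "combined" log format.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] '      # client IP, identity, user, timestamp
    r'"(?:GET|POST|HEAD) (\S+) [^"]*" '  # request line; capture the URL
    r'\d+ \S+ "[^"]*" "([^"]*)"'    # status, size, referrer; capture the UA
)

def googlebot_hits(log_lines):
    """Return a Counter mapping URL -> number of Googlebot requests."""
    hits = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group(2):
            hits[m.group(1)] += 1
    return hits
```

A URL that never appears in the resulting counter over a long enough window is a URL the bot never visited. (In production you would also verify Googlebot by reverse DNS, since the user-agent string can be spoofed.)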

Crawling is the process of search bots visiting your website, processing all links on web pages and placing these links in line for indexation. During the crawling, bots compare just-processed URLs with those already in the index. Thus, bots refresh the data and add/delete some URLs from the search engine database to provide the most relevant and fresh results for users.

Now, we can easily draw these conclusions:

  • If the search bot has never visited a URL, that URL probably won't be in the index. 
  • If Googlebot visits the URL several times a day, that URL is high-priority and therefore requires your special attention.

Altogether, this information reveals what prevents organic growth and development of your website. Now, instead of operating blindly, your team can wisely optimize a website.

We mostly work with big websites because if your website is small, Googlebot will crawl all your web pages sooner or later.

Conversely, websites with 100,000-plus pages face a problem: the crawler visits pages that are invisible to webmasters. Valuable crawl budget may be wasted on these useless or even harmful pages. At the same time, the bot may never find your profitable pages, because the website structure is a mess.

Crawl budget is the limited amount of resources Googlebot is ready to spend on your website. It exists to prioritize what to analyze and when. The size of the crawl budget depends on many factors, such as the size of your website, its structure, and the volume and frequency of users' queries.

Note that the search bot isn't interested in crawling your website completely.

The main purpose of the search engine bot is to give users the most relevant answers with minimal loss of resources. The bot crawls as much data as it needs for that purpose. So it's YOUR task to help the bot pick up the most useful and profitable content.

Spying on Googlebot

Over the last year we’ve scanned more than 300 million URLs and 6 billion log lines on big websites. Based on this data, we traced Googlebot’s behavior to help answer the following questions:

  • What types of pages are ignored?
  • Which pages are visited frequently?
  • What does the bot consider worth its attention?
  • What has no value?

Below are our analysis and findings, not a rewrite of the Google Webmaster Guidelines. We don't give any unproven or unjustified recommendations; each point is backed by factual stats and graphs for your convenience.

Let’s cut to the chase and find out:

  • What really matters to Googlebot?
  • What determines whether the bot visits a page or not?

We identified the following factors:

Distance From Index

DFI stands for Distance From Index: how far your URL is from the main/root/index URL, measured in clicks. It's one of the most crucial criteria affecting the frequency of Googlebot's visits. Here is an educational video to learn more about DFI.

Note that DFI is not the number of slashes in the URL path. For instance, https://site.com/shop/iphone/iphoneX.html sits three levels deep in the directory structure, but that is not its DFI.

DFI is counted strictly by CLICKS from the main page:

https://site.com → iPhones Catalog (https://site.com/shop/iphone) → iPhone X (https://site.com/shop/iphone/iphoneX.html) – DFI = 2

Below you can see how Googlebot's interest in a URL gradually declines as its DFI grows, over the last month and over the last six months.

Google Crawl graph 1

As you can see, if DFI is 5 to 6, Googlebot crawls only half of the web pages, and the percentage of processed pages drops further as DFI grows. The indicators in the table were aggregated over 18 million pages. Note that the data can vary depending on the niche of the particular website.
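If you already have crawl data, you can compute DFI yourself: it is simply the shortest click path from the homepage, which a breadth-first search over the internal-link graph gives you directly. A minimal sketch (the `links` adjacency-map input is my own assumption about how crawl output is stored):

```python
from collections import deque

def distance_from_index(links, root):
    """Compute DFI (clicks from the homepage) for every reachable URL.

    `links` maps each URL to the list of URLs it links to.
    BFS guarantees the first time we reach a URL is via the
    shortest click path, which is exactly its DFI.
    """
    dfi = {root: 0}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in dfi:
                dfi[target] = dfi[url] + 1
                queue.append(target)
    return dfi
```

URLs missing from the result are unreachable from the homepage, i.e. candidates for the orphan-page problem discussed later in this article.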

What To Do?

Obviously, the best strategy in this case is to keep DFI at 5 or less, build an easy-to-navigate website structure, pay special attention to links, and so on.

The truth is that these measures are really time-consuming for websites with 100,000-plus pages. Big websites usually have a long history of redesigns and migrations. That's why webmasters shouldn't just delete pages with a DFI of 10, 12 or even 30. Inserting one link from a frequently visited page won't solve the problem either.

The optimal way to cope with long DFI is to check and estimate whether these URLs are relevant, profitable and what positions they have in the SERPs.

Pages with a long DFI but good positions in the SERPs have high potential. To increase traffic to these high-quality pages, webmasters should insert links to them from frequently visited pages. One or two links aren't enough for tangible progress.

You can see from the graph below that Googlebot visits a URL more frequently if more than 10 links point to it.

Links

Google Crawl graph 2

In fact, the bigger a website, the more significant the number of links to its pages becomes. This data comes from websites with 1 million-plus pages.

If you discover there are fewer than 10 links to your profitable pages, don't panic. First, check whether these pages are high-quality and profitable. If they are, insert links to them without rush, in short iterations, analyzing logs after each step.
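As a starting point for that audit, you can count internal links on each crawled page. A minimal sketch using only the standard library (the class and function names are my own; a real crawler would also resolve relative URLs against the page's base URL):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class InternalLinkCounter(HTMLParser):
    """Count <a href> links that stay on the given host.

    Relative links (empty netloc) and absolute links to the same
    host both count as internal.
    """
    def __init__(self, host):
        super().__init__()
        self.host = host
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if urlparse(href).netloc in ("", self.host):
            self.count += 1

def internal_link_count(html, host):
    parser = InternalLinkCounter(host)
    parser.feed(html)
    return parser.count
```

Running this over your crawl output and sorting ascending surfaces the profitable pages that sit below the 10-link threshold.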

Content Size

Content is one of the most popular aspects of SEO analysis. Of course, the more relevant content your website has, the better your crawl ratio. Below you can see how dramatically Googlebot's interest drops for pages with fewer than 500 words.

What To Do?

Based on my experience, nearly half of all pages with fewer than 500 words are trash pages. We saw a case where a website contained 70,000 pages that listed nothing but clothing sizes, so only part of these pages made it into the index.

Therefore, first check whether you really need those pages. If these URLs are important, add some relevant content to them. If you have nothing to add, just relax and leave these URLs as they are. Sometimes it's better to do nothing than to publish useless content.
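To find those candidates in the first place, a thin-content filter over your crawled HTML is enough. A minimal sketch, assuming pages are held as a URL-to-HTML mapping (the names and the 500-word threshold placement are illustrative; visible-text extraction here is deliberately crude and also counts script/style text):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def word_count(html):
    parser = TextExtractor()
    parser.feed(html)
    return len(re.findall(r"\w+", " ".join(parser.chunks)))

def thin_pages(pages, threshold=500):
    """Return URLs whose extracted text falls below the threshold."""
    return [url for url, html in pages.items()
            if word_count(html) < threshold]
```

The output is a review list, not a deletion list: as the article argues, each thin URL should be checked for relevance before anything is changed.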

Other Factors

The following factors can significantly impact the Crawl Ratio:

Load Time

Web page speed is crucial for crawling and ranking. The bot is like a human: it hates waiting too long for a page to load. If your website has more than 1 million pages, the search bot will probably download five pages with a 1-second load time rather than wait for one page that loads in 5 seconds.

What To Do?

In fact, this is a technical task, and there is no one-size-fits-all solution such as simply using a bigger server. The main idea is to find the bottleneck: understand why your web pages load slowly. Only after the cause is revealed can you take action.

Ratio of Unique and Templated Content

The balance between unique and templated data is important. For instance, say you have a website with variations of pet names. How much relevant, unique content can you really gather about this topic?

Luna was the most popular “celebrity” dog name, followed by Stella, Jack, Milo and Leo.

Search bots don’t like to spend their resources on these kinds of pages.

What To Do?

Maintain the balance. Users and bots don’t like visiting pages with complicated templates, a bunch of outgoing links and little content.

Orphan Pages

Orphan pages are URLs that aren't part of the website structure and that you don't even know about, yet bots may still crawl them. To make this clear, look at the Euler diagram in the picture below:

Orphan Page sample graph 1

You can see the normal situation for a young website whose structure hasn't changed for a while. There are 900,000 pages that both you and the crawler can analyze. About 500,000 pages are processed by the crawler but unknown to Google. If you make these 500,000 URLs indexable, your traffic will increase for sure.

Pay attention: even a young website contains some pages (the blue part in the picture) that aren't in the website structure but are regularly visited by the bot.

And these pages could contain trash content, such as useless auto-generated visitors’ queries.

But big websites are rarely so tidy. Very often, websites with a long history look like this:

Orphan Page sample graph 2

Here's the other problem: Google knows more about your website than you do. There can be deleted pages, pages rendered with JavaScript or Ajax, broken redirects and so on. Once we faced a situation where a list of 500,000 broken links appeared in the sitemap because of a programmer's mistake. The bug was found and fixed after three days, but Googlebot kept visiting these broken links for half a year!

As a result, your crawl budget is frequently wasted on these orphan pages.

What To Do?

There are two ways to fix this problem. The first is canonical: clean up the mess. Organize the website's structure, insert internal links correctly, bring orphan pages into the DFI by adding links from indexed pages, set the task for your programmers, and wait for the next Googlebot visit.

The second way is quick: gather the list of orphan pages and check whether they are relevant. If the answer is "yes," create a sitemap with these URLs and submit it to Google. This way is easier and faster, but only about half of the orphan pages will make it into the index.
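Both fixes start from the same data: the set difference between URLs seen in your logs and URLs reachable through your site structure. A minimal sketch of that step plus a bare-bones sitemap for the quick fix (function names are my own; the sitemap follows the sitemaps.org 0.9 schema):

```python
def find_orphans(logged_urls, crawled_urls):
    """Orphans: URLs that bots request (seen in logs) but that the
    crawler never reaches through the site structure."""
    return set(logged_urls) - set(crawled_urls)

def sitemap_xml(urls):
    """Render a minimal sitemap for the orphan URLs you decide to keep."""
    entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in sorted(urls))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n</urlset>")
```

Review the orphan list by hand before generating the sitemap: as noted above, many orphans are trash pages that should not be submitted at all.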

The Next Level

Search engine algorithms have been improving for two decades, and it's naive to think that search crawling can be explained with a few graphs.

We gather more than 200 different parameters for each page, and we expect this number to increase by the end of the year. Imagine your website as a table with 1 million rows (pages) multiplied by 200 columns; a simple sample isn't enough for a comprehensive technical audit. Do you agree?

We decided to dive deeper and used machine learning to find out what influences Googlebot's crawling in each case.

Google crawl sample graph 3

For one website, links are crucial, while for another, content is the key factor.

The main point of this task was to get simple answers from complicated, massive data: What on your website impacts indexation the most? Which clusters of URLs are connected by the same factors? Knowing this, you can work with them comprehensively.

Before downloading and analyzing logs on our HotWork aggregator website, the story about orphan pages that are visible to bots but not to us seemed unrealistic to me. But the real situation surprised me even more: our crawl showed 500 pages with a 301 redirect, while Yandex found 700,000 pages with that same status code.

Usually, technical geeks don't like to store log files because this data "overloads" the disks. But objectively, on most websites with up to 10 million visits a month, the basic log-storage setup works perfectly.

Speaking of log volume, the best solution is to create an archive and upload it to Amazon S3 Glacier (you can store 250 GB of data for about $1 a month). For system administrators, this task is as easy as making a cup of coffee. In the future, historical logs will help reveal technical bugs and estimate the influence of Google's updates on your website.

Written by Ann Yaroshenko
Ann Yaroshenko is a Content Marketing Strategist at JetOctopus crawler. She has a master’s degree in publishing and editing and a master’s degree in philology. Ann has two years experience in human resources management. Ann is also certified in technical SEO, content strategy and Google Analytics.
