Editor’s Note: Serge Bezborodov, CEO of the JetOctopus crawler, shares expert advice on how to make your website attractive to Googlebot. The data in this article is based on a year of research and 300 million crawled pages.
A few years ago, I was trying to increase traffic on our job-aggregator website with 5 million pages. I decided to use an SEO agency’s services, expecting traffic to go through the roof. But I was wrong. Instead of a comprehensive audit, I got a tarot card reading. That’s why I went back to square one and created a web crawler for comprehensive on-page SEO analysis.
I’ve been spying on Googlebot for more than a year, and now I’m ready to share insights about its behavior. I expect my observations will at least clarify how web crawlers work, and at best will help you conduct on-page optimization efficiently. I’ve gathered the most meaningful data, useful whether you have a brand-new website or one with thousands of pages.
To know for sure which pages are in the search results, you should check the indexability of the whole website. However, analyzing each URL on a website with 10 million-plus pages costs a fortune, about as much as a new car.
Let’s use log file analysis instead. We work with websites in the following way: we crawl the web pages the way the search bot does, and then we analyze log files gathered over half a year. The logs show whether bots visit the website, which pages they crawled, and when and how often they visited them.
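For illustration, here is a minimal sketch of what that log analysis looks like, assuming a standard combined (nginx/Apache) access-log format. The file name is a placeholder, and in production you would also verify Googlebot by reverse DNS, since the user-agent string can be spoofed:

```python
import re
from collections import Counter

# Matches the standard "combined" access-log format; adjust if your format differs.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_path):
    """Count how often Googlebot requested each URL, according to the logs."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LOG_LINE.match(line)
            # User-agent match only; a strict check would confirm the IP via reverse DNS.
            if m and "Googlebot" in m.group("agent"):
                hits[m.group("path")] += 1
    return hits

if __name__ == "__main__":
    for url, count in googlebot_hits("access.log").most_common(20):
        print(f"{count:6d}  {url}")
```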
Crawling is the process of search bots visiting your website, processing all the links on its web pages and queuing those links for indexing. During crawling, bots compare just-processed URLs with those already in the index. This way, bots refresh the data, adding and deleting URLs from the search engine database to provide users with the most relevant and fresh results.
From crawl and log data together, we can easily draw conclusions about how the bot treats each part of the site. Altogether, this information reveals what prevents the organic growth and development of your website. Now, instead of operating blindly, your team can optimize the website wisely.
We mostly work with big websites because if your website is small, Googlebot will crawl all your web pages sooner or later.
Conversely, websites with 100,000-plus pages face a problem: the crawler visits pages that are invisible to webmasters. Valuable crawl budget may be wasted on these useless or even harmful pages. At the same time, the bot may never find your profitable pages, because the website structure is a mess.
Crawl budget is the limited amount of resources Googlebot is willing to spend on your website. It exists to prioritize what to analyze and when. The size of the crawl budget depends on many factors, such as the size of your website, its structure, the volume and frequency of users’ queries, and so on.
Note that the search bot isn’t interested in crawling your website completely.
The search engine bot’s main purpose is to give users the most relevant answers with minimal expenditure of resources. The bot crawls only as much data as it needs for that purpose. So it’s YOUR task to help the bot pick up the most useful and profitable content.
Over the last year, we’ve scanned more than 300 million URLs and 6 billion log lines on big websites. Based on this data, we traced Googlebot’s behavior to understand what drives its crawling decisions.
What follows is our own analysis and findings, not a rewrite of the Google Webmaster Guidelines. We don’t give any unproven or unjustified recommendations; each point is backed by factual stats and graphs for your convenience.
Let’s cut to the chase. We identified the following factors:
DFI stands for Distance From Index: how far, in clicks, your URL is from the main/root/index URL. It’s one of the most important factors affecting the frequency of Googlebot’s visits. Here is an educational video to learn more about DFI.
Note that DFI is not the number of slashes in the URL path. For instance, this URL sits three directory levels deep:
site.com/shop/iphone/iphoneX.html
but that is not its DFI. DFI is counted strictly by CLICKS from the main page:
Main page (https://site.com) → iPhones Catalog (https://site.com/shop/iphone) → iPhone X (https://site.com/shop/iphone/iphoneX.html) – DFI = 2
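If you already have the internal link graph from your own crawl, DFI is simply the shortest click path from the main page. Below is a minimal sketch (the `link_graph` dict and URLs are illustrative) that computes it with a breadth-first search:

```python
from collections import deque

def compute_dfi(link_graph, root):
    """Breadth-first search over the internal link graph.

    link_graph: dict mapping each URL to the set of URLs it links to
    root: the main/index page, which has DFI 0
    Returns a dict of URL -> DFI (clicks from the main page).
    Pages never reached stay out of the result (effectively infinite DFI).
    """
    dfi = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, ()):
            if target not in dfi:          # first visit = shortest click path
                dfi[target] = dfi[page] + 1
                queue.append(target)
    return dfi

# Toy example mirroring the iPhone X path above:
graph = {
    "https://site.com": {"https://site.com/shop/iphone"},
    "https://site.com/shop/iphone": {"https://site.com/shop/iphone/iphoneX.html"},
}
print(compute_dfi(graph, "https://site.com"))
# iphoneX.html ends up with DFI 2, the catalog page with DFI 1
```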
Below you can see how Googlebot’s interest in a URL declines as its DFI grows, measured over the last month and over the last six months.
As you can see, at a DFI of 5 to 6, Googlebot crawls only half of the web pages, and the share of processed pages shrinks further as DFI grows. The figures in the table are aggregated across 18 million pages. Note that the data can vary depending on the niche of the particular website.
Obviously, the best strategy in this case is to avoid DFIs longer than 5, build an easy-to-navigate website structure, pay special attention to links, and so on.
The truth is that these measures are really time-consuming for websites with 100,000-plus pages. Big websites usually have a long history of redesigns and migrations. That’s why webmasters shouldn’t simply delete pages with a DFI of 10, 12 or even 30. Nor will inserting one link from a frequently visited page solve the problem.
The optimal way to cope with a long DFI is to check whether these URLs are relevant and profitable and what positions they hold in the SERPs.
Pages with a long DFI but good positions in the SERPs have high potential. To increase traffic to such high-quality pages, webmasters should insert links to them from other pages; one or two links aren’t enough for tangible progress.
You can see from the graph below that Googlebot visits URLs more frequently if there are more than 10 links on the page.
In fact, the bigger the website, the more links its web pages tend to carry. This data comes from websites with 1 million-plus pages.
If you discover that there are fewer than 10 links on your profitable pages, don’t panic. First, check whether those pages are high-quality and profitable. Once you have, insert links on these high-quality pages without rushing, in short iterations, analyzing the logs after each step.
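If you crawl your own pages, you can produce these link numbers yourself. Here is a rough sketch, assuming you hold the raw HTML of each crawled page in a `pages` dict keyed by URL (`pages`, `profitable_urls` and the hostname are placeholders); it counts internal links both on each page and pointing to each page:

```python
from collections import Counter
from urllib.parse import urljoin, urldefrag, urlparse

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def link_counts(pages, site_netloc):
    """pages: dict of URL -> raw HTML from your own crawl (an assumed input format).

    Returns two Counters: internal links found on each page, and internal links
    pointing to each URL from the rest of the crawl.
    """
    on_page, inlinks = Counter(), Counter()
    for page_url, html in pages.items():
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            target, _ = urldefrag(urljoin(page_url, a["href"]))  # resolve and drop #fragments
            if urlparse(target).netloc == site_netloc:           # keep internal links only
                on_page[page_url] += 1
                inlinks[target] += 1
    return on_page, inlinks

# on_page, inlinks = link_counts(pages, "site.com")
# weak = [url for url in profitable_urls if inlinks[url] < 10]   # pages few others link to
```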
Content is one of the most popular aspects of SEO analysis. Of course, the more relevant content there is on your website, the better your crawl ratio. Below you can see how dramatically Googlebot’s interest drops for pages with fewer than 500 words.
Based on my experience, nearly half of all pages with fewer than 500 words are trash pages. We saw one case where a website contained 70,000 pages that listed nothing but clothing sizes, and only part of those pages made it into the index.
Therefore, first check whether you really need those pages. If these URLs are important, add some relevant content to them. If you have nothing to add, just relax and leave these URLs as they are. Sometimes it’s better to do nothing than to publish useless content.
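A quick way to find thin pages in your own crawl data is to count the visible words per page. A minimal sketch, again assuming a `pages` dict of URL to raw HTML; the 500-word threshold mirrors the graph above, and the counting heuristic is ours, not Google’s:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def visible_word_count(html):
    """Rough word count of a page's visible text (a heuristic, not Google's metric)."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()                        # drop code and styles before counting
    return len(soup.get_text(separator=" ").split())

# thin = [url for url, html in pages.items() if visible_word_count(html) < 500]
```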
The following factors can significantly impact the Crawl Ratio:
Web page speed is crucial for crawling and ranking. The bot is like a human: it hates waiting too long for a web page to load. If there are more than 1 million pages on your website, the search bot will probably download five pages with a 1-second load time rather than wait for one page that takes 5 seconds.
In fact, this is a technical task, and there is no one-size-fits-all solution such as simply buying a bigger server. The main idea is to find the bottleneck: you need to understand why the web pages load slowly, and only once the reason is revealed can you take action.
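As a first, rough diagnostic, you can time a sample of URLs from different sections of the site and compare them. The sketch below measures time to response headers with the `requests` library, which is only a proxy for real load speed; the URLs and the 1-second threshold are illustrative:

```python
import requests  # pip install requests

def sample_load_times(urls):
    """Measure response time for a sample of URLs; a rough proxy, not a full speed audit."""
    timings = {}
    for url in urls:
        r = requests.get(url, timeout=30)
        timings[url] = r.elapsed.total_seconds()  # time until the response headers arrived
    return timings

# Usage sketch: compare sections of the site to find the slow ones.
# times = sample_load_times(["https://site.com/shop/iphone/iphoneX.html", "https://site.com/blog/"])
# slow = {url: t for url, t in times.items() if t > 1.0}
```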
The balance between unique and templated data is important. For instance, say you run a website with variations of pet names. How much relevant and unique content can you really gather about this topic?
Luna was the most popular “celebrity” dog name, followed by Stella, Jack, Milo and Leo.
Search bots don’t like to spend their resources on these kinds of pages.
Maintain the balance. Users and bots don’t like visiting pages with complicated templates, a bunch of outgoing links and little content.
Orphan pages are URLs that aren’t in the website structure, so you don’t even know about them, yet they can still be crawled by bots. To make this clear, look at the Euler diagram in the picture below:
It shows the normal situation for a young website whose structure hasn’t changed for a while. There are 900,000 pages that both you and the crawler can analyze. About 500,000 pages are processed by our crawler but unknown to Google. If you make these 500,000 URLs indexable, your traffic will surely increase.
Pay attention: even a young website contains some pages (the blue part of the picture) that aren’t in the website structure but are regularly visited by the bot.
And these pages may contain trash content, such as useless auto-generated visitor queries.
But big websites are rarely that tidy. Very often, websites with a long history look like this:
Here’s the other problem: Google knows more about your website than you do. There can be deleted pages, pages built on JavaScript or Ajax, broken redirects and so on. We once faced a situation where a list of 500,000 broken links appeared in the sitemap because of a programmer’s mistake. The bug was found and fixed after three days, but Googlebot kept visiting those broken links for half a year!
As a result, your crawl budget is frequently wasted on these orphan pages.
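Finding orphan pages is a simple set operation once you have both data sources: the URLs discovered by crawling the site structure and the URLs Googlebot actually requested according to the logs. A minimal sketch, with placeholder variable names:

```python
def find_orphan_pages(structure_urls, log_urls):
    """Orphan pages: URLs the bot requested (seen in logs) but missing from the site structure."""
    return set(log_urls) - set(structure_urls)

# structure_urls: URLs found by crawling the site the way the search bot does
# log_urls: URLs that Googlebot actually requested, extracted from the access logs
# orphans = find_orphan_pages(structure_urls, googlebot_hits("access.log").keys())
```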
There are two ways to fix this problem. The first is canonical: clean up the mess. Organize the website structure, insert internal links correctly, bring orphan pages into the DFI by linking to them from indexed pages, set the task for programmers and wait for the next Googlebot visit.
The second way is prompt: gather the list of orphan pages and check whether they are relevant. If the answer is “yes,” create a sitemap with these URLs and submit it to Google. This way is easier and faster, but only about half of the orphan pages will end up in the index.
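For the second way, here is a sketch of the sitemap step, assuming you have already filtered the orphan list down to relevant URLs (the file name and `relevant_orphans` are placeholders):

```python
from xml.sax.saxutils import escape

def write_sitemap(urls, path="sitemap-orphans.xml"):
    """Write a minimal XML sitemap for the orphan URLs you decided are worth indexing."""
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in sorted(urls):
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")

# Note: a single sitemap file is limited to 50,000 URLs; split larger lists into several files.
# write_sitemap(relevant_orphans)
# Then submit the file in Google Search Console (Sitemaps section).
```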
Search engine algorithms have been improving for two decades, and it’s naive to think that search crawling can be explained with a few graphs.
We gather more than 200 different parameters for each page, and we expect this number to grow by the end of the year. Imagine your website as a table with 1 million rows (pages) multiplied by 200 columns; a simple sample isn’t enough for a comprehensive technical audit. Do you agree?
We decided to dive deeper and used machine learning to find out what influences Googlebot’s crawling in each individual case.
For one website, links are crucial, while content is the key factor for another.
The main point of this task was to get simple answers from complicated, massive data: what on your website impacts indexation the most, and which clusters of URLs are connected with the same factors, so that you can work on them comprehensively.
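The exact model and feature set aren’t spelled out here, so the sketch below only illustrates the general approach with scikit-learn: train a classifier on per-page parameters against a “was it crawled” label and read the feature importances. The CSV export and column names are assumptions, not JetOctopus’s actual pipeline:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumes a per-URL table with page parameters (DFI, inlinks, word count, load time, ...)
# and a label saying whether Googlebot crawled the page during the logged period.
df = pd.read_csv("pages.csv")                                 # hypothetical per-page export
features = ["dfi", "inlinks", "word_count", "load_time_ms"]   # assumed column names
X, y = df[features], df["crawled_by_googlebot"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Which parameters the model relied on most to separate crawled from ignored pages:
for name, importance in sorted(zip(features, model.feature_importances_),
                               key=lambda item: -item[1]):
    print(f"{name:15s} {importance:.3f}")
```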
Before downloading and analyzing the logs of our HotWork aggregator website, the story about orphan pages that are visible to bots but not to us seemed unrealistic to me. But the real situation surprised me even more: the crawl showed 500 pages with a 301 redirect, while Yandex found 700,000 pages with that same status code.
Usually, technical geeks don’t like to store log files because this data “overloads” the disks. But objectively, on most websites with up to 10 million visits a month, the basic log-storage settings work perfectly well.
As for log volume, the best solution is to create an archive and upload it to Amazon S3-Glacier (where you can store 250 GB of data for only $1). For system administrators, this task is as easy as making a cup of coffee. In the future, historical logs will help reveal technical bugs and estimate the impact of Google’s updates on your website.
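For reference, a small sketch of that archiving step with `boto3`; the bucket name, object key and log path are placeholders for your own setup:

```python
import gzip
import shutil

import boto3  # pip install boto3

def archive_logs(log_path, bucket, key):
    """Compress an access-log file and upload it into the S3 Glacier storage class."""
    gz_path = log_path + ".gz"
    with open(log_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)                       # gzip the raw log first
    s3 = boto3.client("s3")
    s3.upload_file(gz_path, bucket, key,
                   ExtraArgs={"StorageClass": "GLACIER"})  # store it in the Glacier class

# archive_logs("/var/log/nginx/access.log.1", "my-log-archive", "logs/2019-06/access.log.gz")
```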