Googlebot crawls and indexes the first 15MB of HTML content

An update to Googlebot’s help document confirms that Googlebot crawls the first 15MB of a webpage and that anything after that threshold is not included in ranking calculations.

Google specifies in the help document:

“All resources referenced in the HTML such as images, videos, CSS, and JavaScript are fetched separately.

After the first 15MB of the file, Googlebot stops crawling and only considers the first 15MB of the file for indexing.

The file size limit is applied to uncompressed data.”

This left some in the SEO community wondering whether Googlebot would completely ignore text that falls below images at the cutoff in HTML files.

“It’s specific to the HTML file itself, as written,” John Mueller, Google Search Advocate, clarified via Twitter.

“Resources/embedded content extracted with IMG tags are not part of the HTML file.”

What this means for SEO

To ensure it is weighted by Googlebot, important content should be included near the top of web pages.

This means that the code should be structured to place the SEO-relevant information within the first 15MB of a supported HTML or text file.
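As a rough illustration of that check, the sketch below tests whether a given piece of content ends before the cutoff in a page’s uncompressed HTML. The function name and the marker string are hypothetical, and reading “15MB” as 15 * 1024 * 1024 bytes is an assumption, since Google’s wording does not spell out the exact byte count.

```python
# Assumption: "15MB" is read as 15 * 1024 * 1024 bytes.
LIMIT_BYTES = 15 * 1024 * 1024


def marker_within_limit(html: bytes, marker: bytes, limit: int = LIMIT_BYTES) -> bool:
    """Return True if the first occurrence of `marker` ends within the first `limit` bytes."""
    start = html.find(marker)
    if start == -1:
        return False  # the marker does not appear in the HTML at all
    return start + len(marker) <= limit


if __name__ == "__main__":
    # Hypothetical page: the key heading sits near the top of the file.
    page = b"<html><body><h1>Key heading</h1>" + b"x" * 100 + b"</body></html>"
    print(marker_within_limit(page, b"Key heading"))
```

The check works on bytes rather than characters because the limit is described in terms of file size.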

This also means that images and videos should be compressed and not encoded directly into HTML code, whenever possible.
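One way to spot media encoded directly into a page is to look for base64 data URIs, which count toward the size of the HTML file itself. The following is a minimal sketch, not a complete parser: the regular expression and the function name are illustrative assumptions, and it only catches simple `data:<type>/<subtype>;base64,` URIs without extra parameters.

```python
import re

# Matches simple base64 data URIs such as data:image/png;base64,iVBORw0...
# Assumption: no extra media-type parameters (e.g. ;charset=...) before ";base64".
DATA_URI_RE = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,([A-Za-z0-9+/=]+)")


def inline_payload_chars(html: str) -> int:
    """Total number of base64 characters embedded via data URIs in the HTML."""
    return sum(len(match.group(1)) for match in DATA_URI_RE.finditer(html))


if __name__ == "__main__":
    inlined = '<img src="data:image/png;base64,AAAA">'
    linked = '<img src="/images/photo.png">'
    print(inline_payload_chars(inlined), inline_payload_chars(linked))
```

A linked image adds only the length of its URL to the HTML, while an inlined one adds its entire encoded payload.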

SEO best practices currently recommend keeping HTML pages to 100KB or less, so many sites will not be affected by this change. Page size can be checked with a variety of tools, including Google PageSpeed Insights.
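For a quick check outside of those tools, a saved copy of a page can be measured directly. The sketch below is one possible approach under stated assumptions: the function name is hypothetical, “15MB” is taken to mean 15 * 1024 * 1024 bytes, and the response body is assumed to be either plain or gzip-compressed. It compares the uncompressed size against the limit, since that is the figure Google says the limit applies to.

```python
import gzip

# Assumption: "15MB" is read as 15 * 1024 * 1024 bytes.
LIMIT_BYTES = 15 * 1024 * 1024


def size_report(body: bytes, gzipped: bool = False, limit: int = LIMIT_BYTES) -> dict:
    """Compare the transferred size of an HTML body with its uncompressed size."""
    raw = gzip.decompress(body) if gzipped else body
    return {
        "transferred_bytes": len(body),    # what travels over the network
        "uncompressed_bytes": len(raw),    # what the limit is applied to
        "over_limit": len(raw) > limit,
    }


if __name__ == "__main__":
    html = b"<p>hello</p>" * 1000
    print(size_report(gzip.compress(html), gzipped=True))
```

A page that looks small in transfer can still be large once decompressed, which is why the report keeps both numbers.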

In theory, it might sound worrying that you could potentially have content on a page that isn’t being used for indexing. In practice, however, 15MB is a considerable amount of HTML.

As Google indicates, assets such as images and videos are fetched separately. From Google’s wording, it appears that this 15MB threshold only applies to HTML.

It would be difficult to break this limit with HTML unless you were publishing the text of entire books on a single page.

If you have pages that exceed 15MB of HTML, chances are you have underlying issues that need to be fixed anyway.

Editor’s note: An earlier version of this article stated that Google had just announced that this was a new practice. Google’s John Mueller clarified in a tweet: “It’s not a change, it’s just that it hasn’t been officially documented before…” and this article has been updated to reflect that.


Source: Google Search Central
Feature image: SNEHIT PHOTO/Shutterstock
