Ensuring your site is fully crawlable can help you earn more revenue from your content. To make sure you’ve optimized your site for crawling, review the following issues that might affect how crawlable your site is.
Granting Google’s crawlers access in robots.txt
To ensure Google can crawl your sites, make sure you’ve granted access to Google’s crawlers in your robots.txt file.
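For example, a robots.txt that allows Google’s crawlers might look like the sketch below. The exact user-agent tokens depend on which Google product crawls your site; Mediapartners-Google is the AdSense crawler token, and Googlebot is the main web crawler.

```
# Allow Google's AdSense crawler to fetch all pages
User-agent: Mediapartners-Google
Allow: /

# Allow Google's main web crawler as well
User-agent: Googlebot
Allow: /
```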
Providing access to any content behind a login
If you have content behind a login, ensure you’ve set up a crawler login. If you haven’t provided Google’s crawlers with a login, they may be redirected to a login page, which could result in a “No content” policy violation. Alternatively, they may receive a 401 (Unauthorized) or 407 (Proxy Authentication Required) error and be unable to crawl the content.
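A minimal sketch of the server-side decision involved, assuming the server identifies the crawler by its user-agent token (the token list and function name are illustrative; a real deployment should also verify the crawler’s identity, e.g., via reverse DNS, rather than trusting the User-Agent header alone):

```python
# Hypothetical allow-list of crawler user-agent tokens.
CRAWLER_TOKENS = ("Mediapartners-Google", "Googlebot")

def response_for(user_agent: str, logged_in: bool) -> int:
    """Return the HTTP status the server would send for this request:
    200 serves content, 302 redirects to the login page."""
    if logged_in:
        return 200  # normal authenticated visit
    if any(token in user_agent for token in CRAWLER_TOKENS):
        return 200  # crawler login: serve the content instead of redirecting
    return 302      # everyone else is sent to the login page
```

With this setup, a request identifying as Mediapartners-Google is served content (200) rather than being redirected to the login page (302).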
Page not found
If the URL sent to Google points to a page that doesn’t exist (or no longer exists) on your site, or returns a 404 (Not Found) error, Google’s crawlers cannot successfully crawl your content.
Overriding page URLs in ad tags
If you’re overriding the page URL in ad tags, Google’s crawlers may not be able to fetch the content of the page that’s requesting an ad, especially if the overridden page URL is malformed. Generally speaking, the page URL you send to Google in your ad request should match the actual URL of the page you’re monetizing, so that Google acts on the right contextual information.
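One way to sanity-check an overridden page URL before sending it in an ad request is to verify that it parses as an absolute http(s) URL. This is a minimal sketch using Python’s standard library; real validation may need to be stricter:

```python
from urllib.parse import urlparse

def is_well_formed_page_url(url: str) -> bool:
    """Check that an overridden page URL is absolute and fetchable:
    it must have an http(s) scheme and a non-empty host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A relative URL such as `example.com/article` (no scheme) would fail this check, since a crawler has no way to resolve it to a fetchable address.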
Name server issues
If the name servers for your domain or subdomain are not properly directing Google’s crawlers to your content, or restrict where requests can come from, then the crawlers may not be able to find your content.
Redirects
If your site uses redirects, Google’s crawlers may have trouble following them. For example, if a chain contains many redirects, if an intermediate redirect fails, or if important parameters such as cookies get dropped along the way, crawl quality can suffer. Consider minimizing the use of redirects on pages with ad code, and ensure any redirects are implemented properly.
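The failure modes above can be illustrated with a small sketch that resolves a redirect chain and flags chains that are too long or circular (the redirect map and hop limit are illustrative, not a real crawler’s behavior):

```python
def resolve_redirects(url: str, redirect_map: dict, max_hops: int = 5) -> str:
    """Follow a chain of redirects described by redirect_map
    (source URL -> target URL) and return the final URL.
    Raises on circular or overly long chains, the kinds of
    chains that degrade crawl quality."""
    seen = set()
    hops = 0
    while url in redirect_map:
        if url in seen or hops >= max_hops:
            raise RuntimeError("redirect chain too long or circular: " + url)
        seen.add(url)
        url = redirect_map[url]
        hops += 1
    return url
```

A two-hop chain resolves fine, but a loop (A redirects to B, B back to A) or a chain longer than the hop limit is rejected, just as a real crawler eventually gives up.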
Overloaded or unavailable servers
Sometimes when Google’s crawlers try to access site content, the site’s servers are unable to respond in time. This can happen because the servers are down, slow, or overloaded by requests. We recommend hosting your site on a reliable server or with a reliable service provider.
Geographical, network or IP restrictions
Some sites restrict the geographies or IP ranges that can access their content, or host their content behind restricted networks or IP ranges (e.g., 127.0.0.1). If these restrictions prevent Google’s crawlers from reaching all your pages, consider removing them or making your content publicly accessible so that your URLs can be crawled.
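A quick way to spot addresses that are never reachable from the public internet is Python’s ipaddress module (a sketch; loopback addresses like 127.0.0.1, private ranges, and link-local ranges cannot be crawled from outside):

```python
import ipaddress

def is_publicly_reachable(host_ip: str) -> bool:
    """Return False for loopback, private, and link-local addresses,
    which external crawlers cannot reach over the public internet."""
    addr = ipaddress.ip_address(host_ip)
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)
```

For instance, 127.0.0.1 and 192.168.1.10 are unreachable from outside, while a public address passes the check.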
Freshly published content
When you publish a new page, you may make ad requests before Google’s crawlers have had a chance to crawl the content. Examples of sites that frequently publish new content include those with user-generated content, news articles, large product inventories, or weather data. Usually after the ad request is made on a new URL, the content is crawled within a few minutes. However, during those initial few minutes, because your content has not yet been crawled, you may experience low ad volume.
Personalized pages (using URL parameters or dynamically generated URL paths)
Some sites include extra parameters in their URLs that identify the logged-in user (e.g., a session ID), or other information that may be unique to each visit. When this happens, Google’s crawlers may treat each such URL as a new page, even when the content is the same.
This could result in a lag of a few minutes between the first ad request on a page and when the page gets crawled, as well as increased crawler load on your servers. Generally, if the content on a page does not change, consider removing these parameters from the URL and persisting that information another way. A simpler URL structure helps make your site easily crawlable.
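Persisting per-visit state outside the URL can be sketched by stripping such parameters before the URL is used. The parameter names below are illustrative; adjust them to whatever your site actually uses:

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

# Parameter names that vary per visit but don't change the content
# (illustrative; substitute your site's own session parameters).
SESSION_PARAMS = {"sessionid", "sid", "visitid"}

def canonical_url(url: str) -> str:
    """Drop per-visit parameters so identical content maps to one URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))
```

With this, `https://example.com/page?sid=abc123&topic=news` and the same page visited with a different session ID both canonicalize to `https://example.com/page?topic=news`, so the crawler sees one URL instead of many.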
Using POST data
If your site sends POST data along with URLs (for example, passing form data via a POST request), it’s possible that your site rejects requests that are not accompanied by POST data.
Note that Google’s crawlers do not send any POST data, so such a setup prevents them from accessing your page. If the page content is determined by the data the user enters into the form, consider using a GET request instead.
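The difference matters because form data sent with GET lives in the URL’s query string, which a crawler can fetch by URL alone, while POST data lives in a request body the crawler never sends. A minimal sketch of reading the same fields either way (the function is illustrative, not any particular web framework’s API):

```python
from urllib.parse import parse_qs

def form_fields(method: str, url_query: str, post_body: str) -> dict:
    """Read form fields from the query string for GET requests and
    from the request body for POST requests. A crawler that sends
    no POST body can only see the GET-addressable version."""
    raw = url_query if method == "GET" else post_body
    return {k: v[0] for k, v in parse_qs(raw).items()}
```

A GET request for `/search?q=shoes&size=9` carries its fields in the URL itself, so the crawler fetching that URL sees the same content a user would; the POST equivalent yields an empty form for a crawler that sends no body.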