What is a Web Crawler?

In the overall scheme of SEO, one foundational area to keep an eye on is how a website is being crawled, and how deep that crawl goes. It is vitally important to make sure that every page you want the search engine to crawl, index, and rank is visible to the web crawler.

Let’s look into what a web crawler is, how it works and what needs to be done to make a website crawlable.

What is crawling? Crawling is the automated process by which a search engine's program, or script, searches for relevant pages to index. The program that performs it is usually referred to as a web crawler, and is also known as an Internet bot or web spider.

How does a crawler work? A web crawler visits a web page and reads its code: the text, the hyperlinks, and the meta tags. The pages it can read, it then indexes for later use in determining rank.
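The "read" step described above can be sketched in a few lines of Python using the standard library's HTML parser. This is a simplified illustration, not how any real search engine is implemented: the page HTML is hardcoded here, where a real crawler would fetch it over HTTP, queue the discovered links, and feed them back into the same loop.

```python
from html.parser import HTMLParser

# Minimal sketch of what a crawler extracts from a fetched page:
# hyperlinks (to discover more pages) and meta tags (for indexing).
class PageReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

# A real crawler would fetch this HTML over HTTP; the page and its
# URLs are made up for illustration.
html = """
<html><head>
  <meta name="description" content="Example product page">
</head><body>
  <a href="/category/widgets">Widgets</a>
  <a href="/about">About</a>
</body></html>
"""

reader = PageReader()
reader.feed(html)
print(reader.links)  # hyperlinks the crawler would visit next
print(reader.meta)   # metadata available for the index
```

Running this prints the two discovered links and the description meta tag; a crawler repeats exactly this cycle, page after page, until its crawl budget for the site is spent.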

How to make a website crawlable:

  • Good site architecture – Make sure that your most important pages are easy to find, usually within a few clicks of the home page. A good basic structure starts with the home page, then category pages, then sub-category pages, down to the product pages. Keep in mind that a search engine such as Google allots only so much of its crawling resources to a website; how much and how deep it crawls a site is determined by its crawler program. Good site architecture therefore helps the crawler crawl as deeply as possible.
  • Use of robots.txt – It is good practice to list pages in the robots.txt file that you do not want the search engines to crawl. For example, pages such as the terms of service or privacy policy may be listed if you don’t want them crawled and ranked.
  • Use of a sitemap – Sitemaps are a good way of letting the web crawlers know about the pages of a website. In the Google Webmaster Tools Help section, Google says “Creating and submitting a Sitemap helps make sure that Google knows about all the pages on your site, including URLs that may not be discoverable by Google’s normal crawling process.” A Sitemap is also a good way of letting the search engines know about a site’s metadata, such as specific types of content on a site, including video, images, mobile, and news.
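A robots.txt file of the kind described above is just a plain text file at the root of the site. Here is a minimal sketch matching the terms-of-service and privacy-policy example; the paths and domain are hypothetical and would need to match your own site's URLs:

```
# robots.txt – placed at https://www.example.com/robots.txt
User-agent: *
Disallow: /terms-of-service/
Disallow: /privacy-policy/

# Optionally point crawlers at the sitemap as well
Sitemap: https://www.example.com/sitemap.xml
```

Note that Disallow only asks compliant crawlers not to fetch those pages; it is a request, not an access control.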
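A basic XML Sitemap following the sitemaps.org protocol looks like the sketch below. The URL and the lastmod date are illustrative placeholders; each page you want the crawler to know about gets its own url entry:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/category/widgets</loc>
    <!-- lastmod and changefreq are optional hints for the crawler -->
    <lastmod>2014-01-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```

The file is then submitted through Google Webmaster Tools or referenced from robots.txt.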

How many pages, and which pages, are being crawled can be determined from the site’s log files, as well as from Google’s Webmaster Tools. Checking these on a regular basis will help you keep an eye on the crawlability of a website.
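Checking the log files for crawler activity can be as simple as filtering requests by user agent. The sketch below assumes a common combined-log-style format and uses made-up sample lines; with a real log you would read the lines from the file and adjust the parsing to your server's format:

```python
from collections import Counter

# Hypothetical access-log lines in a combined-log style; in practice,
# read these from your web server's log file.
sample_log = [
    '66.249.66.1 - - [10/Mar/2014:06:25:11] "GET /category/widgets HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Mar/2014:06:25:14] "GET /about HTTP/1.1" 200 2048 "-" "Googlebot/2.1"',
    '203.0.113.9 - - [10/Mar/2014:06:26:02] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

# Count which paths the crawler requested, ignoring ordinary visitors.
crawled = Counter(
    line.split('"')[1].split()[1]  # the requested path in "GET /path HTTP/1.1"
    for line in sample_log
    if "Googlebot" in line
)
print(crawled)
```

Comparing the set of crawled paths against the pages you expect to be crawled quickly shows which sections of the site the crawler is missing.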

Bottom line: if a website is not getting fully crawled, it is not getting the full value of having its pages indexed and ranked.
