Lesson 24: The Robots.txt – Website SEO Tutorial
The robots.txt file is located in the root directory of a web server (or of a hosted webspace) and gives crawlers and bots instructions on how to handle the site.
Among other things, the robots file specifies which pages crawlers may or may not visit and where the sitemap is located. The latter is especially important, because search engines can then evaluate the sitemap directly and get an overview of all indexable pages and content.
Excluding subpages can be useful, especially for automatically generated pages such as keyword pages or category pages in content management systems. In WordPress, for example, tag pages and category pages are created automatically. These pages often have little value for the user and are a frequent cause of duplicate content. Excluding them from crawling via the robots.txt file helps to prevent duplicate content and makes the most of the available crawling resources.
Whether a robots file is accepted by the crawler and passes the correct instructions on to the bots can be checked in the Google Search Console using the “robots.txt tester”. An example robots file might look like this:
User-agent: *
Disallow: /tag/
Sitemap: …
This robots.txt file consists of only three lines, which tell all crawlers where the sitemap is located and that the tag (keyword) pages of the WordPress site should not be crawled. Here is a brief explanation of each line of this sample file:
- User-agent: * – This line indicates that the rules that follow apply to all types of crawlers and bots
- Disallow: /tag/ – This line instructs crawlers not to crawl any pages within the tag directory. If you want the tag pages to be crawled, simply omit this line
- Sitemap: … – This line gives the location of the site's sitemap, which should be specified as a full (absolute) URL
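The effect of these three lines can be checked locally with a short Python sketch. It uses the standard library module `urllib.robotparser`. The domain `www.example.com`, the sitemap URL, and the bot name `ExampleBot` are placeholders assumed for illustration only, and `site_maps()` requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The sitemap URL is a placeholder,
# not the address used in the lesson's original example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tag/
Sitemap: https://www.example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is an assumed crawler name; the "*" group applies to it.
tag_allowed = rules.can_fetch("ExampleBot", "https://www.example.com/tag/seo/")
post_allowed = rules.can_fetch("ExampleBot", "https://www.example.com/a-post/")
sitemaps = rules.site_maps()

print("tag page may be crawled: ", tag_allowed)
print("post page may be crawled:", post_allowed)
print("declared sitemaps:       ", sitemaps)
```

This only simulates how Python's parser reads the rules. Individual search engines may interpret edge cases differently, so a check in the search engine's own tools remains worthwhile.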
In principle, it is an advantage to have as many individual pages as possible in the Google index, since this increases the probability of good rankings. However, bad pages, i.e. pages with duplicate content or with weak or too little content, can do more harm to the rankings of the whole domain than the additional indexed pages do good.
Even if a subpage has been excluded via robots.txt, it can still appear in the search results, shown only as a bare URL. This is because robots.txt ONLY forbids crawling, not indexing. If a page must not appear in the index under any circumstances, use the noindex directive instead. Note that such a page has to remain crawlable, since a crawler that is blocked by robots.txt never sees the noindex instruction.
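To verify that a page actually carries a noindex instruction, its HTML can be inspected for a robots meta tag. The following is a minimal sketch using only Python's standard library. The class name `RobotsMetaParser`, the function `has_noindex`, and the sample HTML are hypothetical and exist only for this illustration. The sketch does not cover a noindex sent via the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser already lowercases tag and attribute names.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical sample page used only to demonstrate the check.
SAMPLE_PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, follow">'
    "</head><body>Tag archive</body></html>"
)

print(has_noindex(SAMPLE_PAGE))
```

The check reads only the generic `robots` meta tag. Tags addressed to a specific crawler (for example a meta tag named after a single bot) would need an additional comparison.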