What is robots.txt?
Robots.txt is a simple text file that tells search engine crawlers which URLs they can access on your site.
A robots.txt file consists of one or more rules. Each rule blocks or allows a given crawler's access to a specified file path on the website.
Why do you need a robots.txt file?
A robots.txt file isn’t necessary for most websites, because Google can usually find and index all the important pages on your site on its own.
However, a robots.txt file is still advantageous for the following three reasons:
- To keep certain pages out of search results. For example, you might have a staging version of a page that you need but don’t want random visitors to land on. In this case, robots.txt can block that page from search engine crawlers (see the sketch after this list).
- To focus crawling on pages that matter. If your site has a very large number of pages, you can use robots.txt to block the unimportant ones so crawlers concentrate on the pages that actually matter.
- To block non-HTML resources. For preventing pages from being indexed, meta directives can be just as effective as robots.txt, but they can’t be embedded in multimedia resources such as PDFs and images. That’s where robots.txt comes in.
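To make the first and third reasons concrete, here is a minimal sketch. The /staging/ directory and the PDF pattern are hypothetical, and the * and $ wildcards used in the last rule are supported by Google’s crawlers:
# Keep a hypothetical staging area and all PDFs away from crawlers
User-agent: *
Disallow: /staging/
Disallow: /*.pdf$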
Here is a simple robots.txt file with two rules:
User-agent: Googlebot
Disallow: /nogooglebot/
User-agent: *
Allow: /
Sitemap: http://www.sample.com/sitemap.xml
The meaning of the above robots.txt file is as follows:
- The user agent named Googlebot is not allowed to crawl any URL that starts with http://www.sample.com/nogooglebot/.
- All other user agents are allowed to crawl the entire site.
- The site’s sitemap file is located at http://www.sample.com/sitemap.xml.
Create a robots.txt file using these 4 steps:
- Create a file with the name robots.txt.
- Add rules to the robots.txt file.
- Upload the robots.txt file to your website.
- Test the robots.txt file.
Create a file with the name robots.txt
A robots.txt file can be created with any text editor. Make sure to save the file in UTF-8 encoding.
- The file’s name must be robots.txt.
- Only one robots.txt file is allowed per website.
- The robots.txt file must be placed in the root directory of the website to which it applies. For instance, to control crawling on all URLs below https://www.sample.com/, the robots.txt file must be located at https://www.sample.com/robots.txt.
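As a quick check, using the hypothetical sample.com domain from earlier:
https://www.sample.com/robots.txt (valid: at the site root, where crawlers look for it)
https://www.sample.com/files/robots.txt (invalid: crawlers will not discover it under a subdirectory)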
Add rules to the robots.txt file
Search engine crawlers need rules to know which portions of your site they can crawl. When adding rules to your robots.txt file, keep the following guidelines in mind.
- A robots.txt file includes one or more groups.
- Each group consists of one or more rules, or directives (instructions), one per line. Each group begins with a User-agent line that identifies the group’s target.
- A group provides the following information:
- Who the group applies to (the user agent).
- Which directories or files the agent can access.
- Which directories or files the agent cannot access.
- Rules are case-sensitive. For instance, disallow: /file.asp applies to https://www.sample.com/file.asp, but not https://www.sample.com/FILE.asp.
- The # character indicates the start of a comment, as the annotated example after this list shows.
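Putting these guidelines together, here is a short annotated group; the path is hypothetical:
# Comment lines like this one are ignored by crawlers
User-agent: * # the group applies to all crawlers
Disallow: /Private/ # case-sensitive: blocks /Private/ but not /private/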
Google’s crawlers support the following robots.txt directives:
- user-agent: Names the automatic client, that is, the search engine crawler, that the rule applies to. This is the first line of any rule group. An asterisk (*) matches all crawlers except the various AdsBot crawlers, which must be named explicitly.
# Example 1: Block only Googlebot
User-agent: Googlebot
Disallow: /
# Example 2: Block Googlebot and AdsBot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /
# Example 3: Block all but AdsBot crawlers
User-agent: *
Disallow: /
- disallow: A directory or page, relative to the root domain, that the user agent should not crawl. If the rule refers to a page, use the complete page name as displayed in the browser. The value must begin with a / character, and if it refers to a directory, it must end with a / character. See the combined example after this list.
- allow: A directory or page, relative to the root domain, that the user agent may crawl. For a single page, specify the complete page name as it appears in the browser; for a directory, end the rule with a / character.
- sitemap: The fully qualified URL of a sitemap for the site. Sitemaps are an effective way to tell Google which content it should crawl.
Sitemap: https://sample.com/sitemap.xml
Sitemap: http://www.sample.com/sitemap.xml
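Here is a brief sketch combining disallow and allow; the /admin/ paths are hypothetical:
User-agent: *
Disallow: /admin/
Allow: /admin/login.html
Because /admin/login.html is the more specific path, the allow rule wins for that page, while everything else under /admin/ stays blocked.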
Upload the robots.txt file to your website
Once you’ve saved your robots.txt file to your computer, you’re ready to make it available to search engine crawlers. How you upload the file depends on your site and server setup, so contact your hosting provider or look through their documentation.
Test the robots.txt file
Open a private browsing window in your browser and navigate to the location of your newly uploaded robots.txt file to see if it’s publicly accessible. For example, https://sample.com/robots.txt. If you see the contents of your robots.txt file, you’re ready to test the file.
Test using the robots.txt Tester in Search Console. Use this tool for robots.txt files that are already accessible on your site. It shows your robots.txt file and any errors and warnings found.
Once you’ve uploaded and tested your robots.txt file, Google’s crawlers will automatically find it and start using it.
Setting up your robots.txt file doesn’t take much effort. It’s mostly a one-time setup, and you can make slight adjustments as needed.
A properly created robots.txt file will improve your SEO as well as your visitors’ experience.