Google and others update and submit Robots Exclusion Protocol to create new standard

Twenty-five years after Martijn Koster created the Robots Exclusion Protocol, Google has teamed up with him to address edge cases and to have robots.txt formalized as an international standard.

Did you know that the robots.txt standard webmasters use isn’t a standard at all? Martijn Koster created the Robots Exclusion Protocol (REP) 25 years ago in an attempt to give site owners more control over how bots access and crawl their sites. However, the IETF never turned REP into an international standard.

Google representatives said on the Webmaster Central Blog that the current spec is too ambiguous and doesn’t address edge cases that occur on the modern web. That’s why they teamed up with Martijn Koster, webmasters, and other (unnamed) search engines to update REP and submit it to the IETF for standardization.

The REP draft, authored by Koster and Google’s Gary Illyes, Henner Zeller, and Lizzi Harvey, addresses scenarios that were previously undefined and left open to interpretation. In particular, it adds four new rules:

  1. Robots.txt can now be used with any URI-based transfer protocol, not just HTTP. For example, an FTP server can now officially serve a robots.txt file.
  2. Crawlers may impose a parsing limit that must be at least 500 kibibytes (KiB) to ensure connections aren’t open for too long.
  3. Crawlers shouldn’t use a cached version of the robots.txt file for more than 24 hours.
  4. If a robots.txt file becomes inaccessible, pages that were previously found to be disallowed will continue not to be crawled for a reasonable amount of time. The draft does not define “reasonable.”
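
As a rough illustration of rules 2 through 4 from a crawler's point of view, here is a minimal Python sketch. The names `CachedRobots`, `truncate_for_parsing`, `needs_refresh`, and `rules_after_failed_fetch` are hypothetical, and the seven-day grace period is purely an assumption, since the draft leaves "reasonable" undefined. What a crawler does once that period expires is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional

PARSE_LIMIT_BYTES = 500 * 1024     # 500 KiB, the minimum parsing limit in the draft
MAX_CACHE_SECONDS = 24 * 60 * 60   # 24-hour ceiling on using a cached copy
GRACE_SECONDS = 7 * 24 * 60 * 60   # ASSUMPTION: "reasonable" is not defined in the draft


@dataclass
class CachedRobots:
    body: bytes         # robots.txt content, already truncated to the parse limit
    fetched_at: float   # Unix timestamp of the last successful fetch


def truncate_for_parsing(raw: bytes) -> bytes:
    """Keep only the portion of the file this crawler commits to parsing."""
    return raw[:PARSE_LIMIT_BYTES]


def needs_refresh(entry: CachedRobots, now: float) -> bool:
    """True once the cached copy is more than 24 hours old."""
    return now - entry.fetched_at > MAX_CACHE_SECONDS


def rules_after_failed_fetch(entry: Optional[CachedRobots], now: float) -> Optional[bytes]:
    """When a refetch fails, keep honoring the old rules during the grace period.

    Returns None if nothing is cached or the (assumed) grace period has run out;
    the caller then has to decide what to do.
    """
    if entry is None:
        return None
    if now - entry.fetched_at <= GRACE_SECONDS:
        return entry.body
    return None
```

Passing `now` in explicitly, rather than calling the clock inside each function, keeps the policy easy to test without waiting a day.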

In addition to the new rules, the full syntax was (re)defined in the proposal. It includes examples of how special characters and patterns are used.

REP Character Examples

| Character | Description | Example |
| --- | --- | --- |
| “#” | Designates an end-of-line comment. | “allow: / # comment in line” |
| “$” | Designates the end of the match pattern; the URI must end at that point for the rule to match. | “allow: /this/path/exactly$” |
| “*” | Designates 0 or more instances of any character. | “allow: /this/*/exactly” |

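
To make the three special characters above concrete, here is a small Python sketch that translates a path pattern into a regular expression. It is one possible reading of the syntax, not the draft's reference implementation; the helper names `strip_comment`, `pattern_to_regex`, and `path_matches` are made up for this example, and the sketch assumes that a pattern without a trailing "$" matches as a prefix.

```python
import re


def strip_comment(line: str) -> str:
    """Drop an end-of-line comment introduced by '#'."""
    return line.split("#", 1)[0].strip()


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a REP path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' stands for zero or more of any character; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # A trailing '$' pins the end of the path; otherwise the pattern is a prefix.
    return re.compile("^" + body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if the URI path is covered by the pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(strip_comment("allow: / # comment in line"))                       # allow: /
print(path_matches("/this/path/exactly$", "/this/path/exactly"))         # True
print(path_matches("/this/path/exactly$", "/this/path/exactly/more"))    # False
print(path_matches("/this/*/exactly", "/this/any/depth/exactly"))        # True
```
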
REP Pattern Examples

| Pattern | URI |
| --- | --- |
| /path/file-with-a-%2A.html | https://www.example.com/path/file-with-a-*.html |
| /path/foo-%24 | https://www.example.com/path/foo-$ |
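
The pattern examples above show that a literal "*" or "$" in a URI is matched by writing its percent-encoded form ("%2A" or "%24") in the pattern. One way a crawler might handle this, sketched below in Python, is to percent-encode the URI path before comparing it with the pattern. The function names `encoded_path` and `literal_prefix_match` are hypothetical, and the normalization is deliberately simplified: it also encodes other reserved characters and only handles patterns without wildcards.

```python
from urllib.parse import quote, urlsplit


def encoded_path(uri: str) -> str:
    """Return the URI's path with special characters percent-encoded."""
    path = urlsplit(uri).path or "/"
    # Keep "/" and existing "%XX" escapes; encode everything else that is
    # not an unreserved character, which includes "*" and "$".
    return quote(path, safe="/%")


def literal_prefix_match(pattern: str, uri: str) -> bool:
    """Prefix-match a wildcard-free pattern against the encoded URI path."""
    return encoded_path(uri).startswith(pattern)


print(encoded_path("https://www.example.com/path/file-with-a-*.html"))
# /path/file-with-a-%2A.html
print(literal_prefix_match("/path/foo-%24", "https://www.example.com/path/foo-$"))
# True
```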

If the draft is accepted without changes, it will become the official standard for writing directives for crawlers. You can read the draft online or download a PDF version from the IETF.