For Webmasters
You may be wondering what "ia_archiver" is and why it is visiting your site, or you may want to block it from crawling your site. In either case, please read on.
Additional information regarding our privacy policy, web crawling philosophy, and technology can be found on the following pages: Privacy Policy and Technology. If you wish to change the contact information for your site, please visit our contact information editor.
The Alexa crawler (robot), which identifies itself as ia_archiver in the HTTP "User-agent" header field, uses a web-wide crawl strategy. It starts with a list of known URLs from across the entire Internet, then fetches the local links it finds as it goes. There are several advantages to this approach, most importantly that it creates the least possible disruption to the sites being crawled.
We will not index anything you would like to remain private. All you have to do is tell us. How? By using the Standard for Robot Exclusion (SRE).
The SRE was developed by Martijn Koster at Webcrawler to allow content providers to control how robots behave on their sites. All of the major Web-crawling groups, such as Google, Yahoo, Bing, and Baidu, respect this standard. Alexa Internet strictly adheres to the standard:
The Alexa crawler looks for a file called "robots.txt". Robots.txt is a file website administrators can place at the top level of a site to direct the behavior of web crawling robots.
The Alexa crawler will always pick up a copy of the robots.txt file prior to its crawl of the Web.
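As a rough illustration of the two points above (the file lives at the top level of the site, and it is consulted before any page is fetched), here is a minimal Python sketch built on the standard library's urllib.robotparser. It is not Alexa's crawler code. The helper names robots_url and crawlable, the example.com URLs, and the /private/ rule are all hypothetical and exist only for this example.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the top-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; robots.txt always sits at the site root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def crawlable(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules: every robot is asked to stay out of /private/.
HYPOTHETICAL_RULES = """\
User-agent: *
Disallow: /private/
"""
```

Calling crawlable(HYPOTHETICAL_RULES, "ia_archiver", urls) before fetching anything drops the URLs under /private/ and keeps the rest, which is the order of operations a polite crawler follows.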
To exclude all robots, the robots.txt file should look like this:
User-agent: *
Disallow: /
To exclude just one directory (and its subdirectories), say, the /images/ directory, the file should look like this:
User-agent: *
Disallow: /images/
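If you want to confirm what a directory rule like the one above covers, one option is to test it locally. The sketch below feeds the same two lines to Python's urllib.robotparser; the host www.example.com, the sample paths, and the helper name allowed are assumptions made for illustration, and other robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules shown above: keep every robot out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, user_agent: str = "ia_archiver") -> bool:
    """Report whether user_agent may fetch the given path on a hypothetical host."""
    return parser.can_fetch(user_agent, "http://www.example.com" + path)
```

With these rules, allowed("/images/logo.gif") and allowed("/images/icons/home.png") are both False, while allowed("/index.html") is True. The rule is a path prefix, so a path such as /imagesets/a.html, which does not begin with /images/, remains allowed.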
Website administrators can allow or disallow specific robots from visiting part or all of their site. Alexa's crawler identifies itself as ia_archiver, so to allow ia_archiver to visit (while preventing all others), your robots.txt file should look like this:
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
To prevent ia_archiver from visiting (while allowing all others), your robots.txt file should look like this:
User-agent: ia_archiver
Disallow: /
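Per-robot rules are easy to get subtly wrong, so it can help to check them before publishing. The following sketch uses Python's urllib.robotparser to evaluate two files: one that blocks only ia_archiver, and one that admits only ia_archiver. Note that the second file needs a catch-all "User-agent: *" record, because a robot with no matching record is allowed by default. The constant names, the helper may_visit, the robot name SomeOtherBot, and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Blocks ia_archiver; robots with no matching record are allowed by default.
BLOCK_IA_ARCHIVER = """\
User-agent: ia_archiver
Disallow: /
"""

# Admits ia_archiver (an empty Disallow permits everything) and
# blocks every other robot through the catch-all record.
ONLY_IA_ARCHIVER = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def may_visit(robots_txt: str, user_agent: str) -> bool:
    """Report whether user_agent may fetch a sample page under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com/index.html")
```

Under BLOCK_IA_ARCHIVER, may_visit returns False for "ia_archiver" and True for "SomeOtherBot"; under ONLY_IA_ARCHIVER the results are reversed.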
For more information regarding robots, crawling, and robots.txt, visit the Web Robots Pages at www.robotstxt.org, an excellent source for the latest information on the Standard for Robot Exclusion.