About Blocking the Internet Archive Crawler

The Internet Archive no longer seems to respect robots.txt, so what are the alternative methods for blocking its crawler?

By Jacob

Edited: 2020-06-03 15:25

The Wayback Machine, ia_archiver user agent

I just realized that blocking the archive.org (The Wayback Machine) bot with robots.txt no longer works. It used to work, but then they started blatantly ignoring the rules set by website owners.

Of course, changing something that is so widely in use is highly problematic. It is much worse than deleting content without explaining why it was deleted, or changing a URL without implementing a redirect; we simply should not do such things.

The old page that explained how to block archive.org has also disappeared. It used to be available at http://archive.org/about/exclude.php, but an archived version can be found here: https://web.archive.org/web/20150322111536/http://archive.org/about/exclude.php

I have had a robots.txt file like this for a long time:

User-agent: ia_archiver
Disallow: /

Sitemap: /sitemap.xml

As it turns out, the ia_archiver user agent is not even used by archive.org anymore, and it now belongs to Alexa instead. If you have this in your robots.txt, you might have blocked Alexa unintentionally. The user agent is also misleading, since "archiver" strongly indicates that it belongs to archive.org; it makes no sense to connect it with Alexa.

See also: https://support.alexa.com/hc/en-us/articles/200450194-Alexa-s-Web-and-Site-Audit-Crawlers

To have a site removed, apparently we will now have to send an e-mail to [email protected] and request removal.

The new user agent is archive.org_bot, but blocking it in robots.txt will probably not work. We can, however, check for it from PHP and block the crawler, or we can block the associated IP addresses instead; neither option is very convenient for less technical site owners.
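
A minimal PHP sketch of the user agent approach could look like the following; it simply compares the request's user agent string against archive.org_bot and responds with a 403. Where you place the check (for example a shared header include) and the exact response are up to you:

<?php
// Sketch: deny the Internet Archive crawler based on its user agent string.
// Include this in a file that runs on every page, e.g. a common header.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (stripos($userAgent, 'archive.org_bot') !== false) {
    http_response_code(403);
    exit('Access denied.');
}

Blocking by IP address works much the same way, except you would compare $_SERVER['REMOTE_ADDR'] against a list of addresses that you maintain yourself.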

Mark Graham wrote a blog post about this in 2017 that explains the change: Robots.txt meant for search engines don’t work well for web archives

Why block archive.org

There is sometimes good reason to block the Internet Archive. For one, it allows users to browse your content without ads, causing you to lose revenue, and archive.org provides no compensation for such use.

Of course, this is not a huge problem, since the number of users who think to use the Internet Archive for this purpose is very low. We face a far more challenging battle against ad blockers, since browser developers have made it far too easy to create such malicious plugins, and Firefox now even ships with tracking protection enabled by default.

Other, less convincing, reasons would be that your content updates often, or that you might not want certain information to be indexed forever by other websites. The solution in that case is not to block the Internet Archive altogether; it is just one of many services that might index your content, so what can we do instead?

If the information is sensitive enough, we place it behind a login. Doing this grants us absolute control over access to the content. Facebook, for example, requires users to log in to see much of its content, or at least annoys them sufficiently that they decide to log in.

Blocking the Internet Archive

Login systems are not very difficult to make for web developers. It just takes some time.
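
To give an idea of the scale, the core of a session-based gate is only a few lines. The sketch below assumes a login page at /login.php (a hypothetical path) that sets $_SESSION['user_id'] on a successful login:

<?php
// Sketch of a login gate for a protected page.
session_start();

if (empty($_SESSION['user_id'])) {
    // Not logged in: send the visitor (or crawler) to the login page.
    header('Location: /login.php', true, 302);
    exit;
}

// Only logged-in users reach the content below; crawlers never do.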

I prefer using a framework for this purpose myself, but I am sure you could also rely on WordPress or Drupal plugins. In fact, WordPress has a plugin called Code Snippets which makes it very easy to place specific pages behind a login; you do not even need to know much about plugin development.

However, I realize this is not going to be practical for most people reading this, so what can you do instead?

The fact that there might be countless services crawling and indexing our websites should not deter us from blocking specific ones. The Internet Archive is a well-known website, and probably the biggest of its kind. It is hard to imagine it suddenly going away, but smaller services come and go all the time, so focusing on the biggest of these services might still be effective.

To block the Internet Archive entirely, or just to block it from indexing specific pages, we can block its IP addresses. Alternatively, blocking the new user agent in .htaccess might also work:

RewriteEngine On
# Return "403 Forbidden" to requests identifying as the Internet Archive crawler
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
RewriteRule .* - [F,L]
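
Blocking by IP address can also be done in .htaccess. The Internet Archive does not publish a stable list of crawler addresses, so the range below is only a placeholder (a documentation range) used to show the Apache 2.4 syntax; substitute the addresses you actually see in your access logs:

<RequireAll>
    # Allow everyone except the listed address range (placeholder).
    Require all granted
    Require not ip 203.0.113.0/24
</RequireAll>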

It is also possible to contact them and request removal of a website, but we might not want to do that if the block is only going to be temporary, or if we want the freedom to quickly change our mind later without their involvement. To request removal, write an e-mail to [email protected].

Only allow search engines

We could use some kind of opt-out method that only permits search engines to index our site; we do not yet have that, but if it is ever introduced, it could look similar to this:

User-agent: search_engines_only
Allow: /

User-agent: *
Disallow: /

Sitemap: /sitemap.xml

Note. This example will not currently work, but I can certainly see a need for something similar, for copyright and legal reasons.
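
A rough approximation is already possible with ordinary robots.txt rules, by allowing specific, well-behaved crawlers by name and disallowing everything else. A sketch using the documented user agents of two major search engines:

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /

Sitemap: /sitemap.xml

The obvious drawback is that you have to maintain the allowlist yourself, and, as discussed above, it only affects crawlers that respect robots.txt in the first place.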

It is not ideal that website owners are forced to make manual decisions about all kinds of different services, but potentially blocking something useful would also be bad. It is probably mainly the scraping of content that annoys most of us.

Conclusion

Obviously internet scraping does have valid purposes, but it can also violate the copyright of website owners and cause lost revenue.

The question then is, instead of making data publicly available, should we put it in PDF files and sell it as e-books? I personally doubt anyone will be willing to buy an e-book or pay to access content when they can find the same information for free on other websites. Website owners are simply forced to make their content available for free in order to remain competitive. We would also lose traffic by doing it.

"Free" in this context means sponsored by ads.

I also think archiving dead websites is completely fine. However, it can be argued that content should not be made available on archive.org while the live site is still responding to requests and its content is identical to the archived version; making it available in an archive would then serve no purpose.

Considering that the problems are minor, and that most users actually seem to prefer browsing the live (up-to-date) version of content, we probably should not spend too much energy on this issue.

Links

  1. Robots.txt meant for search engines don’t work well for web archives - archive.org
  2. User Agent: archive.org_bot - archive.org
