Poster: blackduckhistorian
Date: September 21, 2012 02:00:45am
Forum: web
Subject: Domain resellers blocking waybackmachine
This is actually a problem with wider implications than my own current case.
The problem is that when a domain expires, a domain name reseller frequently snaps up the domain name for resale, and amends the robots.txt to block the wayback machine from displaying even the historic pages that were there before this transfer of ownership. Potentially, this makes waybackmachine fairly useless for the future: it is effectively only a temporary archive that lasts while the original ownership continues.
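To illustrate the mechanism, here is a rough sketch of what such a robots.txt might look like and how a rule of this kind is evaluated. It uses Python's standard `urllib.robotparser`. The robots.txt content is hypothetical (I have not seen the reseller's actual file), and `ia_archiver` is assumed to be the user-agent token that the archive honours; `SomeOtherBot` is just a made-up name for comparison.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind a new domain owner might install.
# Assumption: "ia_archiver" is the user-agent token the archive honours.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


page = "http://www.tcb.co.uk/gene/kelsall.htm"
blocked_for_archive = is_blocked(ROBOTS_TXT, "ia_archiver", page)
blocked_for_others = is_blocked(ROBOTS_TXT, "SomeOtherBot", page)  # made-up bot name

print("archive blocked:", blocked_for_archive)
print("other bots blocked:", blocked_for_others)
```

The point of the sketch is that the rule itself says nothing about dates: it only names a path on the domain as it is today, so applying it to old captures is a policy choice on the archive's side.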
So, what is the policy going forward?
For my personal problem, I have all the specific URLs from web.archive.org, but they are no longer available to me due to robots.txt. Had I known that all these pages would, in the near future, no longer be available, I would have copied them in various ways (the original author was pleased that they were, at the time, available on wayback).
So, with my list of web.archive.org URLs, is there any way we can recover these pages? There are 47 pages in all, beginning with:
http://web.archive.org/web/20091008020755/http://www.tcb.co.uk/gene/kelsall.htm
It is my intention, with the original author's permission (already granted), to replicate the pages and host them on my site. So help in recovering these pages (or even the HTML for these pages) would be greatly appreciated. I can provide the full list of web.archive.org URLs.
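In case it helps anyone handling a list like mine, here is a minimal sketch of how an archive URL of this form can be split into its capture time and the original address. It assumes the URL has the shape shown above, with a 14-digit timestamp read as YYYYMMDDhhmmss; the names `Snapshot` and `parse_wayback_url` are my own and not part of any archive.org tool.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Snapshot(NamedTuple):
    """Hypothetical record for one archived page."""
    captured: datetime
    original_url: str


# Assumed shape: http://web.archive.org/web/<14-digit timestamp>/<original URL>
_WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(url: str) -> Snapshot:
    """Split an archive URL into its capture time and the original URL."""
    match = _WAYBACK_RE.match(url.strip())
    if match is None:
        raise ValueError(f"not a recognised archive URL: {url!r}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return Snapshot(captured, match.group(2))


snap = parse_wayback_url(
    "http://web.archive.org/web/20091008020755/http://www.tcb.co.uk/gene/kelsall.htm"
)
print(snap.captured.isoformat(), snap.original_url)
```

Keeping the original address and the capture time as separate fields makes it easier to sort the 47 pages and to match each one against scans or other sources.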
Poster: randomdestructn
Date: September 26, 2012 03:27:18pm
Forum: web
Subject: Re: Domain resellers blocking waybackmachine
I just created an account to reply, as I found your post while googling a similar problem.
I just went to load an old copy of a website of mine, only to find out that the new owner of the domain has retroactively blocked access to the wayback machine.
I understand an update of robots.txt applying to all future scrapes, or even going back a few months. But how can a new owner of a domain block pages that were published more than a decade before they took ownership?
I really hope a solution is found, as I feel the current policy will greatly degrade the usefulness of the wayback machine as time goes on.
Poster: blackduckhistorian
Date: September 27, 2012 05:06:16am
Forum: web
Subject: Re: Domain resellers blocking waybackmachine
Thank you for replying, it reassures me that I am not alone in this! I only created this account to make this situation public here.
The problem is wayback's retroactive policy, which sounds good in theory but means that basically EVERY website whose ownership lapses will disappear from the archive. Effectively, wayback becomes just a temporary archive, which I am sure was not the vision when the project commenced. For example, on the FAQ page they say:
Can I link to old pages on the Wayback Machine?
Yes! The Wayback Machine is built so that it can be used and referenced. If you find an archived page that you would like to reference on your Web page or in an article, you can copy the URL.
What they don't say, or highlight, is that as soon as someone else buys that domain, the content (and the unique archive.org URLs) is likely to be gone, and gone forever. So what is even the point of the project, archiving all of this, if it is only a temporary repository?
It is now fairly standard that when a domain name lapses, one of these domain name resellers purchases it and installs a robots.txt to block archive.org. Therefore the archive is only secure, for that particular content, whilst that particular owner has the domain. Whether the domain passes to a new owner or to a domain name reseller, all previous content is likely to disappear forever.
I really hope someone from archive.org is aware of this situation.
Had I known what was going to happen, I would have saved offline copies of all the relevant pages. It was a shock to discover the content gone. The original author and I are trying to recreate most of them from scans and other sources, but it would have been so much easier to copy and paste the text of all the pages. I emailed archive.org about recovery from specific URLs, and they did not reply.