Project Background

Building the Collection

The End of Term Web Archive collaboration began in the summer of 2008, when the project partners, all members of the International Internet Preservation Consortium (IIPC) and partners in the National Digital Infrastructure and Preservation Program (NDIIPP), agreed to join forces to archive the U.S. Government web at the end of the Bush administration. The goal of the project team was to execute a comprehensive harvest of the Federal Government domains (.gov, .mil, .org, etc.) in the final months of the Bush administration, and to document changes in federal government websites as agencies transitioned to the Obama administration. The archive includes Federal Government websites in the Legislative, Executive, and Judicial branches of government.

A broad, comprehensive crawl of sites began in August 2008, and project participants crawled, at varying frequencies, areas of the government web of particular interest to their organizations. The project team also called upon government information specialists, including librarians, political and social science researchers, and academics, to assist in the selection and prioritization of websites to be included in a focused crawl. These included sites identified as being potentially at greater risk of rapid change or disappearance. The prioritized URLs were collected in December 2008, and again after the inauguration in January 2009.

In Spring and Fall 2009, a final broad, comprehensive crawl was performed to document any further changes that had occurred. Each of the project partners then transferred the content they had collected to form a single consolidated archive. The Internet Archive hosts the public access copy of this archive, the Library of Congress holds a preservation copy, and the University of North Texas holds an additional copy for data analysis.

Our Collaborative Approach

By collaborating, the partners brought various skills and interests to the project. The Internet Archive performed the broad crawls and crawls of prioritized websites nominated by volunteers, and provides an access copy of the data for this site. UNT provided project management and subject expertise, created a Nomination Tool that supported a collaborative selection process and easy management of seeds, and performed some focused crawls. California Digital Library performed focused crawls based on the interests of subject specialists, and has developed the front-end interface to the collection. Library of Congress coordinated volunteers, provided project management support, performed more focused crawls of the Legislative Branch, and supported the transfer of data from one partner institution to another. The Government Printing Office administers the Federal Depository Library Program and has a strong interest in government publications, so it joined project calls to stay informed about the work.

Driving Technical Innovation

The scale of the collection and the scope of collaboration behind the End of Term Web Archive itself drove new developments in web harvesting and access technologies. Here are some of the tools that were used, modified, or created in the process of carrying out this work:

  • To identify, prioritize and describe the thousands of U.S. Government web hosts, the University of North Texas built the Nomination Tool. This tool enables collaborative collection development for web archiving, and has since been used in other archiving efforts.

  • All of the partners who collected content used the Heritrix web crawler, developed by the Internet Archive with support from the IIPC.

  • To solve the challenges presented by transferring and aggregating the End of Term content, the Library of Congress developed BagIt Library, an open source Java library to support large-scale data transfer among many institutions. The Library also released Bagger, an open source desktop version of the BagIt Library. For further detail, see The End of Term was only the beginning.

  • The Internet Archive reconfigured existing in-house tools to automatically generate metadata records for the more than 3,000 websites in the End of Term Web Archive. With the California Digital Library providing input on the Dublin Core format, IA generated the records and thumbnail images you see when you browse the archive.

  • The University of North Texas has used the End of Term content as the data source for additional study of automatic classification of government agency web content. For further information on UNT's work, see Classification of the End of Term Archive: Extending Collection Development to Web Archives.

  • The California Digital Library modified its open source eXtensible Text Framework (XTF) digital library access platform to provide a gateway to web-archived materials. This work is part of broader analysis underway at CDL to more fully integrate discovery systems and formats.

The challenges posed by the scope and scale of this collaborative effort have been met with innovation at each of the partner institutions, and have resulted in considerably more than the archive alone.