We are excited to announce the release of Heritrix 1.12.0. It is available for download on SourceForge.
Release 1.12.0 is the first of several planned releases enhancing Heritrix with “smart crawler” functionality. The smart crawler project is a joint effort between Internet Archive, British Library, Library of Congress, the Bibliothèque Nationale de France and members of the IIPC (International Internet Preservation Consortium). This is the first year of a multi-year project.
The first stage of the smart crawler project aims to detect and avoid crawling duplicate content when sites are crawled at regular intervals. The new release of Heritrix addresses this in two ways. First, it uses a conditional GET when fetching pages from HTTP servers. Second, if the responding server does not support conditional GET, Heritrix compares the hash of the newly fetched content with the hash recorded when the page was previously crawled. Additional de-duplication features will be added later this year.
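The two checks described above can be sketched roughly as follows. This is a minimal illustration in Python, not Heritrix's actual implementation: the PreviousFetch record and the function names are hypothetical, and SHA-1 is assumed as the content hash purely for the example. The only parts taken from the HTTP standard are the If-None-Match and If-Modified-Since request headers and the 304 Not Modified status code.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PreviousFetch:
    """Hypothetical record of what an earlier crawl stored for one URL."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    content_digest: Optional[str] = None


def conditional_headers(prev: Optional[PreviousFetch]) -> Dict[str, str]:
    """Build conditional GET headers from the validators seen last time."""
    headers: Dict[str, str] = {}
    if prev is None:
        return headers  # first visit: nothing to validate against
    if prev.etag:
        headers["If-None-Match"] = prev.etag
    if prev.last_modified:
        headers["If-Modified-Since"] = prev.last_modified
    return headers


def content_digest(body: bytes) -> str:
    """Hash the response body (SHA-1 is an assumption for this sketch)."""
    return hashlib.sha1(body).hexdigest()


def is_duplicate(status: int, body: bytes, prev: Optional[PreviousFetch]) -> bool:
    """Decide whether a response repeats what was already crawled."""
    if prev is None:
        return False
    # Check 1: the server honoured the conditional GET.
    if status == 304:
        return True
    # Check 2: the server sent a full response anyway, so compare hashes.
    if status == 200 and prev.content_digest is not None:
        return content_digest(body) == prev.content_digest
    return False
```

In this sketch a crawler would send the headers from conditional_headers with its request, then pass the response status and body to is_duplicate to decide whether the page needs to be stored again.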
Release 1.12.0 also includes updated WARC readers and writers to match the latest revision of the specification, 0.12 revision H1.12-RC1. WARC is the next-generation archiving file format, a revision of the Internet Archive ARC file format. Please see the release notes for more information about these and other included features and bug fixes.
Subsequent phases of the smart crawler project will also focus on enhanced URL prioritization and crawling that is sensitive to the rate at which individual web pages change.
As always, all Heritrix code is open source. We are proud to help support the open source community. If you would like to get more involved or contribute code to Heritrix, please visit crawler.archive.org.