
July 15, 2003; updated November 22, 2004

Part of the role libraries play in society's information infrastructure is as memory institutions. The advent of the Web has made it difficult for libraries to play this role because they lack tools to collect, preserve and disseminate material published on the Web. The Andrew W. Mellon Foundation is funding Stanford Libraries to develop and deploy suitable tools. Over 80 libraries worldwide, ranging from the Library of Congress to the University of Otago, are preserving collections in the LOCKSS network of persistent, self-healing web caches.

LOCKSS peers audit each other and repair damage using a peer-to-peer protocol. With our ITR grant, a collaboration between Stanford, Harvard, Intel Labs and HP Labs is investigating how these protocols can survive failures and attacks. Our award-winning paper "Preserving Peer Replicas By Rate-Limited Sampled Voting" at SOSP 2003 described how to adjust the cost of protocol operations using provable computational effort so as to resist adversaries with unlimited resources whose goal was undetected modification of preserved content. Our next major paper, "Resisting Attrition Attacks On A Peer-to-Peer System", applies similar techniques to raise the cost and slow the progress of both network-level and application-level denial-of-service attacks.

Why does each institution need its own copy?

Under the DMCA, publishers must give permission for institutions to preserve copies of their copyrighted material. We designed LOCKSS so that each peer will only supply a copy to another peer if it remembers that peer proving that it had a copy in the past. On this basis, we have been able to reassure publishers that the system does not represent a threat to their business models, and many have agreed to use blanket license terms to permit institutions to use LOCKSS to preserve their access to the material to which they subscribe. It is clear that they would not agree to a system in which it was possible for non-subscribers to access material they had not paid for.
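
The rule that a peer supplies a copy only to a peer it remembers proving prior possession can be illustrated with a minimal sketch. The `Peer` class, its method names and the idea of recording a proof by name are assumptions made for illustration; they are not the LOCKSS implementation, and the sketch does not show how possession is actually proven.

```python
class Peer:
    """Hypothetical peer; all names here are illustrative, not LOCKSS APIs."""

    def __init__(self, name):
        self.name = name
        self.content = {}         # unit id -> bytes held by this peer
        self.proven_holders = {}  # unit id -> names of peers that proved possession

    def record_proof(self, unit_id, peer_name):
        """Remember that peer_name once demonstrated it held unit_id."""
        self.proven_holders.setdefault(unit_id, set()).add(peer_name)

    def request_repair(self, unit_id, requester_name):
        """Return a copy only to a requester that previously proved possession."""
        if unit_id not in self.content:
            return None
        if requester_name not in self.proven_holders.get(unit_id, set()):
            return None  # no remembered proof, so no copy is supplied
        return self.content[unit_id]


stanford = Peer("stanford")
stanford.content["journal-v1"] = b"article bytes"
stanford.record_proof("journal-v1", "harvard")

print(stanford.request_repair("journal-v1", "harvard") is not None)
print(stanford.request_repair("journal-v1", "stranger") is None)
```

In this sketch a subscriber that loses its copy can be repaired by a peer that remembers it, while a peer that never held the material gets nothing.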

What are the risks to preservation LOCKSS is addressing?

There are many risks to preservation, but we believe the major one is economic. At some points in the history of any individual copy, there will be budget crises that will cause material to be lost through triage. This is particularly likely in the case of digital preservation, which involves regular spikes in cost as hardware is replaced, backup media must be converted, and formats are migrated. It is likely that some of these spikes will coincide with budget stringency. Our approach is three-fold. We focus on keeping the cost of each copy as low as possible by using low-cost hardware and minimal staff time. We spread the total cost of preservation across many separate budgets, reducing the risk from budget cuts. And by randomizing the timing of the cost spikes at each copy, we reduce the impact of the spikes on the overall cost of the system.
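
The effect of randomizing the timing of cost spikes can be shown with a small arithmetic sketch. Every number below (20 sites, a 5-year replacement cycle, a 20-year horizon, a spike of 3000 per replacement) is invented for illustration, and the model ignores all costs other than the spikes.

```python
import random


def yearly_costs(offsets, cycle_years, horizon_years, spike_cost):
    """Sum the spike costs of all sites for each year of the horizon."""
    costs = [0] * horizon_years
    for offset in offsets:
        # A site with this offset pays a spike every cycle_years years.
        for year in range(offset, horizon_years, cycle_years):
            costs[year] += spike_cost
    return costs


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
SITES, CYCLE, HORIZON, SPIKE = 20, 5, 20, 3000

# Every site replaces hardware in the same years.
synchronized = yearly_costs([0] * SITES, CYCLE, HORIZON, SPIKE)

# Each site starts its cycle in a randomly chosen year.
rng = random.Random(0)
offsets = [rng.randrange(CYCLE) for _ in range(SITES)]
randomized = yearly_costs(offsets, CYCLE, HORIZON, SPIKE)

print("total cost:", sum(synchronized), sum(randomized))
print("worst year:", max(synchronized), max(randomized))
```

The total spent over the horizon is the same in both cases; what changes is how much of it lands in any single year.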

Isn't keeping so many copies more expensive than keeping a single copy in a central archive?

There is as yet no experience on which to base this assessment - LOCKSS is not in service but neither are the proposed central e-journal archives. It is not obvious which approach is cheaper. Central archives have much higher costs than an individual LOCKSS cache - they have to use much more expensive hardware, they have to use media and staff time to back it up, they need staff time to manage the archive and bill their customers for the subscriptions, and so on. The cost curves for low-cost disk drives and generic PCs mean that the hardware cost for a LOCKSS cache is a few thousand dollars, and the caches' high level of automation and peer-to-peer cooperation mean that the administration cost is very low. Further, the cost of providing access to the preserved material should be taken into account. Central archives provide an access mechanism different from that provided by the original publisher, so readers have to be trained and helped to use it. Access in the LOCKSS system is transparent: the system behaves as a proxy cache and thus provides the same access mechanism as the original publisher. There are no training or support costs.

Why not just hash the documents when you first crawl them and keep the hashes on CD-ROM, in a storage room of CDs? Periodically, if the hash of a document does not match the hash on the CD-ROM, repair it from the CD-ROM or from a CD in your storage room.

First, storage is unreliable. Keeping the hash on the CD does not help if the CD goes bad. CD-R media manufacturers claim lifetimes of the order of 70 years subject to controlled room-temperature dark storage and careful handling. These numbers are for expensive media. Independent studies suggest that the methodology behind these estimates is flawed. We assume that low-cost generic media have shorter lifetimes. Further, the process of recording on low-cost generic CD-R media is not in practice reliable, as we observe when our beta sites download and burn a new version of our boot CD. Second, staff costs to keep the storage room of back-up CDs would be high, and our aim here is to provide low-cost and low-maintenance digital preservation. Third, our aim is to allow library caches to provide seamless access to material. Hashes that do not match would require searching the back room and manually loading the appropriate CD. Unless the library keeps maintenance staff around the clock, a bad hash could occur at night or on a weekend, and not be tended to until the following business day.

Why not have each LOCKSS peer encrypt the documents and store the encrypted versions on other peers?

Encryption requires that the encryption key be kept secret. Our aim is to provide digital preservation over decades. It is very difficult to keep secrets for such a long period of time. We therefore choose not to rely on solutions that depend on long-term secrets.

Why not use Byzantine Fault Tolerance techniques to check one's documents against a random sample of the population?

The LOCKSS sampled voting process has a number of advantages over BFT for our purposes. BFT requires a fixed population of replicas; our techniques cope with a dynamic population. BFT gets expensive rapidly as the number of replicas increases, whereas the cost of a LOCKSS poll is linear in the size of the quorum, which can be much smaller than the total population. BFT fails abruptly if the number of malign or failed peers exceeds one-third, whereas the LOCKSS polling mechanism fails gradually and slowly.
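
The claim that poll cost grows with the quorum rather than with the population can be sketched as follows. This is a heavily simplified illustration, not the LOCKSS protocol: the function names, the use of a plain SHA-256 comparison and the in-memory `population` dictionary are all assumptions, and the sketch leaves out rate limiting, proofs of effort and the rules for interpreting the tally.

```python
import hashlib
import random


def content_hash(data):
    """Hash a copy so that copies can be compared without exchanging them."""
    return hashlib.sha256(data).hexdigest()


def sampled_poll(my_copy, population, quorum, rng):
    """Compare our copy with a random sample of peers.

    population maps a peer name to that peer's copy (bytes).
    Returns (agree, disagree). The number of comparisons equals quorum,
    however large the population is.
    """
    voters = rng.sample(sorted(population), quorum)
    mine = content_hash(my_copy)
    agree = sum(1 for name in voters if content_hash(population[name]) == mine)
    return agree, quorum - agree


# Hypothetical population: 90 intact copies and 10 damaged ones.
good, damaged = b"original article", b"damaged article"
population = {f"peer{i}": (good if i < 90 else damaged) for i in range(100)}

agree, disagree = sampled_poll(good, population, quorum=10, rng=random.Random(1))
print(agree, disagree)
```

Growing the population from 100 to 10,000 peers leaves the work per poll unchanged here, because only the sampled voters are consulted.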

Why not just have peers call opinion polls where they only invite peers they trust? Stanford invites Harvard, MIT, and UC Berkeley and no one else because Stanford only trusts those three libraries and trusts they will not be subverted.

Every peer is free to invite whomever it chooses into its polls, so Stanford could always invite just its friends in a poll. What does Stanford do in the event that a majority of its trusted library friends are subverted? Moreover, if peers choose to invite only well-known or reputable peers into their polls, those well-known peers may quickly be bogged down with poll requests, and may be unable to respond when invited.

Why not just take the documents or hashes of the documents and store them across a RAID (Redundant Array of Inexpensive Disks)? As you say, storage is unreliable, and RAID is supposed to address this.

RAID is a useful technique for reducing the impact of disk failures, but it has limitations for our purposes. First, it requires many local disks. We have many more disks but they are remote. Our techniques use the remote copies to make the system highly reliable, rather than having many local copies at each site, which would be much more expensive. Second, RAID's predicted resilience depends on the assumption that disk failures are not correlated. There is good evidence that disk failures in the same system or room show significant correlation - as the number of war stories about second failures during RAID restoration demonstrates. Third, RAID doesn't protect at all against many causes of failure in actual systems, including operator error and buggy or malicious software.
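
To see why the independence assumption matters, consider the chance that a second disk fails while an array is being rebuilt. The sketch below uses the standard formula for at least one failure among independent disks; the disk count and both per-disk probabilities are invented for illustration, and correlation is modeled crudely as a higher per-disk failure probability during the rebuild.

```python
def p_second_failure(remaining_disks, p_fail_during_rebuild):
    """Probability that at least one remaining disk fails during the rebuild,
    treating each disk as failing independently with the given probability."""
    return 1 - (1 - p_fail_during_rebuild) ** remaining_disks


# Hypothetical array: 8 disks, one already failed, 7 remaining.
# Independent case: the first failure says nothing about the other disks.
independent = p_second_failure(7, 0.001)

# Correlated case (assumed): shared heat, power or manufacturing batch
# raises each surviving disk's failure probability after the first failure.
correlated = p_second_failure(7, 0.02)

print(f"independent: {independent:.4f}  correlated: {correlated:.4f}")
```

Remote replicas at separate institutions are less likely to share a room, a power supply or an operator, which is the kind of correlation this sketch stands in for.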