Meeting Report: Long Term Archiving of Digital Documents in Physics

November 5-6, 2001
Lyon Villeurbanne

Report prepared by Dr. Arthur Smith, APS

Conference Sponsors:

  • Ministère de la Recherche (CCSD and CNRS)
  • American Physical Society
  • American Institute of Physics
  • EDP Sciences
  • Elsevier
  • Highwire Press
  • International Center for Theoretical Physics
  • Institute of Physics
  • Institute for Pure and Applied Physics
  • IUPAP / ICSU Press
  • Japan Society of Applied Physics
  • Physical Society of Japan

The Lyon conference was well attended, with vigorous discussion among the physicists, librarians, and publishers present. The final program is appended, and rough transcripts of the meeting are also available.

From the discussion and summary in the final session, two draft recommendations for IUPAP were presented and generally agreed upon. These recommendations were as follows:

To ensure that publishers and libraries preserve for the long term the electronic entities (archives) that are being, and will be, put online, we propose that IUPAP organize a registry of the electronic physics archives that currently exist, with information such as where each is held, what formats it uses, what hardware it runs on, and so on. (The purpose of the registry is to enable early warnings that particular hardware or software is in danger of becoming obsolete or unreadable. Suggestions for migration could then be given to the publishers and/or libraries involved.)

We propose that IUPAP establish an expert subgroup on XML as a format standard for archiving, and that other similar expert subgroups on archiving be also considered.
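As a rough illustration of the registry in the first recommendation, the sketch below records where an archive is held and which formats and hardware it depends on, then flags archives that depend on anything judged to be at risk. The record layout, field names, function name, and sample entries are all assumptions made for this example; the meeting did not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class ArchiveEntry:
    """One registry record. All field names are illustrative assumptions."""
    name: str
    holder: str          # where the archive is held
    formats: set[str]    # file formats in use
    hardware: set[str]   # storage hardware or media in use


def early_warnings(entries, at_risk_formats, at_risk_hardware):
    """Map each affected archive name to the sorted at-risk items it depends on."""
    warnings = {}
    for entry in entries:
        hits = (entry.formats & at_risk_formats) | (entry.hardware & at_risk_hardware)
        if hits:
            warnings[entry.name] = sorted(hits)
    return warnings


# Hypothetical entries, not real archives.
registry = [
    ArchiveEntry("Example Journal Archive", "Example Library",
                 {"XML", "PDF"}, {"magnetic tape"}),
    ArchiveEntry("Example Preprint Archive", "Example Society",
                 {"TeX"}, {"disk array"}),
]

print(early_warnings(registry, at_risk_formats={"PDF"}, at_risk_hardware=set()))
```

Only the first hypothetical archive is reported here, since it is the only one that lists the format marked as at risk. The holders of a flagged archive could then be sent migration suggestions.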

The topic of archiving is a broad one, and even on the first question of "what is an archive?" there was some dispute. A brief summary of the major topics of discussion follows.

There was a general emphasis on getting started now with small steps, and working together as publishers, libraries, and end-users. Several participants pointed out the need for trust, and that there were both sociological and technical issues to be resolved. Not all publishers were ready, but there was a general feeling that we have reached a point where we can declare the electronic version of a journal to be the "archival" authoritative version. Some of the libraries wanted to know when they could simply get rid of print.

Multiple copies are essential to long term preservation; they can also help improve short-term access. This is easiest if ownership problems are resolved (free access makes many things simpler). Along these lines, no single entity should be responsible for a digital archive of everything. Separate paths to "refreshing" the archive are needed, but you also need to avoid divergence in content. There should be at least two independent lines of authority, to reduce the possibility of malicious actions destroying content.

The issue of what exactly an archive is was raised repeatedly. Consensus seems to be forming around the Open Archival Information System (OAIS) model, developed by NASA. Mirror sites do not seem sufficient to qualify as real archives, nor do "local loads" of content (they may improve access, but provide little additional guarantee of preservation). One suggestion was to form an organization to rate or certify digital archives; a milder form of this is the registry in the first proposal to IUPAP.

"Lit" archives, rather than "dark" archives with limited access, are essential to ensure the archived material is usable, and remains in use. So there must be access, in addition to preservation. And that access must be by people, not just automated software, at least at this time.

There was also the question of what to archive - there is not only "content" in an electronic journal, but also "services", and the two should be clearly distinguished; the services seem likely to be more difficult to preserve, as they depend on technologies that change more rapidly than underlying data formats. But services are important: searchability and linking provide added value to journals now. There were participants who felt it was only essential to preserve the fixed "content", and others who wanted the services somehow preserved also.

National libraries have a role to play as a last resort in preservation, through depository laws or publisher agreements. In many countries the law and procedures need to be enhanced to cover deposit of electronic content, including published databases. Two approaches that are being developed are targeted deposit of specific electronic content (journals etc.), and the harvesting of large portions of the web.

Access for developing nations was discussed at some length. The lack of reliability of electronic infrastructure is as much a problem as actual access. Innovations such as the new ICTP email program for providing access to journal content may help resolve some of these problems. Journals from developing countries will be more accessible to the rest of the world now, in digital form, but long-term preservation is not an issue viewed as important at this time by the editors and publishers of these journals.

Preservation of the gray literature, including preprints, technical reports, experimental documentation, and even email communications, is an important issue. Historians are very interested in preservation of this material, but physicists are often not. Authors may not even want this sort of thing archived, if asked. There was almost uniform consensus that we want all peer reviewed material to be archived permanently. But then who decides on whether we preserve the rest?

There is a real need for standards that will allow files to remain readable for the long term. This means ensuring that future migrations to new formats can be both lossless and automatic. Is XML the answer? There is a need for authoring tools, so we can automate more conversions. But authors don't want to spend any more time preparing materials to conform to rigid rules!

Related to this issue of standards for file formats is the actual extraction of information; how to turn "bits" into "information". So part of the preservation process has to include preserving these procedures for extracting information from the "bits", particularly where raw data is concerned. But even for files associated with scientific articles, whether text, images, or multimedia, the formats used for archiving should be open and published, not proprietary secrets. Even PDF, while an open format, is perhaps too complex to be acceptable as a long-term format. Ideally we should have formats that can be decoded even if the format descriptions or programs that display them have been lost over time.

A further related issue is ensuring the integrity of the original archived material, and any copies. In particular, compressed formats may be very susceptible to loss if minor, even single-bit, errors appear. There seems to be no current standard for ensuring this long-term integrity. A digital signature or "error-correction" standard is needed, along with appropriate software to ensure errors are detected and corrected, and corrections recorded, in a regular, automated fashion.
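One way to picture the kind of automated integrity check described above is sketched below. It is not the standard the meeting called for: it assumes a SHA-256 digest recorded at deposit time, detects damaged copies by recomputing the digest, and "corrects" them simply by restoring from a copy that still matches, logging each correction. The function names, site names, and sample content are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest of the content, as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit(copies: dict[str, bytes], recorded: str):
    """Check every copy against the recorded digest and repair damaged ones.

    Returns (repaired copies, log of corrections). A damaged copy is
    restored from the first intact copy found, if there is one.
    """
    good = [name for name, data in copies.items() if digest(data) == recorded]
    repaired = dict(copies)
    log = []
    for name in copies:
        if name in good:
            continue
        if good:
            repaired[name] = copies[good[0]]
            log.append(f"{name}: digest mismatch, restored from {good[0]}")
        else:
            log.append(f"{name}: digest mismatch, no intact copy available")
    return repaired, log


original = b"archived article text"
recorded = digest(original)                            # stored at deposit time
damaged = bytes([original[0] ^ 0x01]) + original[1:]   # flip a single bit

copies = {"site_a": original, "site_b": damaged}
repaired, log = audit(copies, recorded)
print(log)
```

A digest of this kind only detects errors; repair depends on having another intact copy, which is one more argument for keeping multiple independent copies. A true error-correcting code would allow repair from a single copy, but that is beyond this sketch.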

Standards are also needed for versioning, and for electronic journals there seems a need to set a policy on retaining all versions. Changes may come about through published errata with a new article identifier, or through changes to the published article with the same (perhaps modified) identifier.
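A policy of retaining all versions could look something like the following minimal sketch, in which new versions are appended to an article's history rather than replacing earlier ones, and separately published errata are linked by their own identifiers. The class, its fields, and the identifiers are assumptions made for illustration, not an existing standard.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleRecord:
    """Append-only version history for one article (illustrative only)."""
    identifier: str
    versions: list = field(default_factory=list)  # (label, content), oldest first
    errata: list = field(default_factory=list)    # identifiers of separate errata

    def add_version(self, label: str, content: str) -> None:
        # Append only: earlier versions are never overwritten or removed.
        self.versions.append((label, content))

    def link_erratum(self, erratum_identifier: str) -> None:
        # An erratum published under its own identifier is linked, not merged.
        self.errata.append(erratum_identifier)

    def current(self) -> str:
        return self.versions[-1][1]

    def history(self) -> list:
        return [label for label, _ in self.versions]


record = ArticleRecord("example-0001")         # hypothetical identifier
record.add_version("v1", "Original text")
record.add_version("v2", "Text with correction applied")
record.link_erratum("example-0002")            # hypothetical erratum identifier

print(record.history(), record.current(), record.errata)
```

Both routes to change mentioned above are covered: a modified article under the same identifier becomes a new entry in `versions`, while an erratum with a new identifier is recorded in `errata`.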

There were several discussions concerning author copyright. There seem to be some underlying issues here that could block some preservation efforts. On the other hand, it seemed prudent to just forge ahead under the assumption of the author's approval (perhaps with some advertised description of what is being done), with the option to remove material if an author actually objects.

Finally there was the perennial question of how do we pay for archiving and preservation - what are the business models? Nobody had a real answer, but all the end-users were agreed that archives should be free to the researcher at the point of use. Even that caused considerable debate, over what exactly was meant by it. There did seem to be a general consensus that, while publishers are responsible for much of the electronic archiving work now, at least some of that direct expense will move to libraries to ensure longer-term preservation.

To return to the first point raised, continued interaction and cooperation between physicists, librarians, and publishers is essential to improving the state of long-term preservation of important electronic material. We now seem ready to make larger strides, establishing standards for archiving, and sharing best practices.


4 November 2001

Centre de Calcul de l’IN2P3
University Campus (Université Claude Bernard)
27 bd. du 11 Novembre 1918

17:00 - 20:00 Registration

18:00 - 20:00 Cocktails

5 November 2001

CNRS building, 2 av. Albert Einstein
University Campus (Université Claude Bernard)
27 bd. du 11 Novembre 1918

Session 1 9:00-10:00

Welcome and Housekeeping 9:00-9:15
Franck Laloë - CCSD Laboratoire Kastler Brossel

Introductory Talks: “What is the Problem?” 9:15-10:00
Martin Blume - APS & IUPAP Working Group on Communications in Physics
Ann Okerson - Yale University Library

Session 2 Archiving and Publishers 10:00-12:30

(Coffee Break at 11:00)

Roundtable 1
Session Chair - S. Ushioda - Tohoku University

  • Seiichi Kagoshima - IPAP
  • Kurt Paulus - IOP
  • Mark Doyle - APS
  • K.K. Phua - World Scientific Publishing
  • Hubertus v. Riedesel - Springer-Verlag

Lunch 12:30-14:00

Session 3 Archiving and Publishers 14:00 - 16:00

Roundtable 2
Session Chair - Robert Kelly - APS

  • Jean-Marc Quilbe - EDP Sciences
  • Pieter Bolman - Elsevier
  • Marc Brodsky - AIP
  • Paul Ginsparg - Cornell

Coffee break 16:00 - 16:30

Session 4 The View of End-User Physicists 16:30-18:30

Session Chair - Ian Butterworth - Imperial College
  • Denis Jérome - Université Paris-Sud
  • Ana María Cetto - Instituto de Física, UNAM
  • Claus Montonen - European Physical Society
  • Eberhard Hilf - Institute for Science Networking - Oldenburg
  • John Enderby - IOPP

Banquet 19:30 - 23:00
At an “auberge” in Pérouges; bus transportation will be provided.

6 November 2001

CNRS building, 2 av. Albert Einstein
University Campus (Université Claude Bernard)
27 bd. du 11 Novembre 1918

Session 5 Archiving and Libraries

Roundtable 1 9:00-10:30

Session Chair - Sarah Thomas - Cornell

9:00- 9:20 Digital Preservation Practices and Principles - Anne Kenney - Cornell and CLIR

9:20- 9:30 Discussion

9:30-10:00 The Mellon Archiving Projects, with specific examples from Cornell and Yale - Sarah Thomas and Ann Okerson

10:00-10:20 LOCKSS, the Dark Cave - Andrew Herkovic - Stanford University

10:20-10:30 Discussion

Coffee Break 10:30-11:00

Roundtable 2 - National Libraries 11:00-12:30

Session Chair - David Russon

11:00-11:20 The National Preservation Program of the Library of Congress - Winston Tabb - US Library of Congress

11:20-11:30 Discussion

11:30-11:50 Electronic publications: working on long term access - Johan Steenbackers - De Koninklijke Bibliotheek, The Netherlands

11:50-12:00 Discussion

12:00-12:20 The legal deposit of electronic publications in France - Catherine Lupovici - Bibliothèque nationale de France

12:20-12:30 Discussion

Lunch 12:30-14:00

Session 6 14:00 - 16:00

Round Up of What we have learnt and Next Steps and Recommendations
Session Chair - Roger Elliott - Oxford and ICSU Press


