Language Trees and Zipping

Phys. Rev. Lett. 88, 048702 – Published 8 January 2002
Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto


In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We apply the method to linguistically motivated problems, obtaining highly accurate results for language recognition, authorship attribution, and language classification.
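
The abstract does not spell out how the remoteness measure is computed, so the following is only a minimal sketch of one way a compression-based measure of this kind could be realized. It assumes a general-purpose compressor (Python's zlib, chosen here for convenience) and estimates remoteness as the extra compressed bytes needed to encode a short probe once it is appended to a reference text. The function names `compressed_size`, `remoteness`, and `closest` are hypothetical and introduced for illustration only.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def remoteness(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed to encode probe after reference.

    Assumption: a smaller value means the probe shares more structure
    with the reference, because the compressor can reuse patterns it
    has already seen in the reference.
    """
    return compressed_size(reference + probe) - compressed_size(reference)


def closest(references: Mapping[str, bytes], probe: bytes) -> str:
    """Label of the reference text with the smallest remoteness to probe."""
    return min(references, key=lambda name: remoteness(references[name], probe))


if __name__ == "__main__":
    # Toy corpora, invented for this sketch; real use would need long texts.
    corpora = {
        "english": b"the cat sleeps on the warm windowsill while the rain falls softly " * 3,
        "italian": b"il gatto dorme sul davanzale caldo mentre la pioggia cade piano " * 3,
    }
    print(closest(corpora, b"the rain falls softly on the warm windowsill"))
```

In a sketch like this the probe should stay short relative to the reference, so that the compressor is still working with statistics learned from the reference when it reaches the probe. zlib also only looks back over a 32 KB window, which limits how much of a long reference can influence the result.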


  • Received 29 August 2001
  • Revised 13 September 2001
  • Published 8 January 2002

© 2002 The American Physical Society

Authors & Affiliations

Dario Benedetto (1), Emanuele Caglioti (1), and Vittorio Loreto (2,3)

  • (1) “La Sapienza” University, Mathematics Department, Piazzale Aldo Moro 5, 00185 Rome, Italy
  • (2) “La Sapienza” University, Physics Department, Piazzale Aldo Moro 5, 00185 Rome, Italy
  • (3) INFM, Center for Statistical Mechanics and Complexity, Rome, Italy

