Formally Verified Suffix Array Construction

Louis Cheung 📧 and Christine Rizkallah 🌐

September 25, 2024

This is a development version of this entry. It might change over time and is not stable. Please refer to release versions for citations.

Abstract

A suffix array (Manber and Myers, 1993) is a data structure that is extensively used in text retrieval and data compression applications, including query suggestion mechanisms in web search, and in bioinformatics tools for DNA sequencing and matching. This wide applicability means that algorithms for constructing suffix arrays are of great practical importance. The Suffix Array by Induced Sorting (SA-IS) algorithm (Nong, Zhang and Chan, 2009) is a conceptually complex yet highly efficient suffix array construction technique, based on an earlier algorithm (Ko and Aluru, 2005). As part of this formalization, we have developed the SA-IS algorithm in Isabelle/HOL and formally verified that it is equivalent to a mathematical functional specification of suffix arrays. This required verifying a wide range of underlying properties of lists and suffixes, that could be reused in other contexts. We also used Isabelle’s code extraction facilities to extract an executable Haskell implementation of SAIS. In particular, this entry includes the following: an axiomatic characterisation of suffix array construction; a formally verified encoding of a straightforward but inefficient suffix array construction algorithm (validating the specification); and a formally verified encoding of the linear time SA-IS algorithm.

License

BSD License

Topics

Related publications

  • Nong, G., Zhang, S., & Chan, W. H. (2009). Linear Suffix Array Construction by Almost Pure Induced-Sorting. 2009 Data Compression Conference. https://doi.org/10.1109/dcc.2009.42
  • Ko, P., & Aluru, S. (2005). Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms, 3(2–4), 143–156. https://doi.org/10.1016/j.jda.2004.08.002
  • Wikipedia

Session SuffixArray