
Managing Gigabytes
Compressing and Indexing Documents and Images
Second Edition

Preface to the Second Edition


The computer revolution has produced a society that feeds on information. Yet much of the information is in its raw form: data. There is no shortage of this raw material. It is created in vast quantities by financial transactions, legal proceedings, and government activities; reproduced in an overwhelming flood of reports, magazines, and newspapers; and dumped wholesale into filing cabinets, libraries, and computers. The challenge is to manage the stuff efficiently and effectively, so that pertinent items can be located and information extracted without undue expense or inconvenience.

The traditional method of storing documents on paper is expensive in terms of both storage space and, more important, the time it takes to locate and retrieve information when it is required. It is becoming ever more attractive to store and access documents electronically. The text in a stack of books hundreds of feet high can be held on just one computer disk, which makes electronic media astonishingly efficient in terms of physical space. More important, the information can be accessed using keywords drawn from the text itself, which, compared with manual document indexing schemes, provides both flexibility, in that all words are keywords, and reliability, because indexing is accomplished without any human interpretation or intervention. Moreover, organizations nowadays have to cope with diverse sources of electronic information such as machine-readable text, facsimile and other scanned documents, and digitized graphics. All these can be stored and accessed efficiently using electronic media rather than paper.

This book is about how to manage large numbers of documents -- gigabytes of data. A gigabyte is approximately one thousand million bytes, enough to store the text of a thousand books. The term has gained currency only recently, as the capacity of mass storage devices has grown. Just two decades ago, requirements measured in megabytes (one million bytes) seemed extravagant, even fanciful. Now personal computers come with gigabytes of storage, and it is commonplace for even small organizations to store many gigabytes of data. Since the first edition of this book, the explosion of the World Wide Web has made terabytes (one million million, or one trillion, bytes) of data available to the public, making even more people aware of the problems involved in handling this quantity of data.

There are two challenges when managing such huge volumes of data, both of which are addressed in this book. The first is storing the data efficiently. This is done by compressing it. (We use data in the singular throughout this book. To us, data seems like a large -- often formidably large -- entity in itself, rather than a collection of "datums.") The second is providing fast access through keyword searches. For this, a tailor-made electronic index must be constructed. Traditional methods of compression and searching need to be adapted to meet these challenges. It is these two topics that are examined in this book. The end result of applying the techniques described here is a computer system that can store millions of documents, and retrieve the documents that contain any given combination of keywords in a matter of seconds, or even in a fraction of a second.

Here is an example to illustrate the power of the methods described in this book. With them, you can create a database from a few gigabytes of text -- each gigabyte is a thousand books, about the size of an office wall packed floor to ceiling -- and use it to answer a query like "retrieve all documents that include paragraphs containing the two words 'managing' and 'gigabytes'" in just a few seconds on an office workstation. In truth, given an appropriate index to the text, this is not such a remarkable feat. What is impressive, though, is that the database that needs to be created, which includes the index and the complete text (both compressed, of course), is less than half the size of the original text alone. In addition, the time it takes to build this database on a workstation of moderate size is just a few hours. And, perhaps most amazing of all, the time required to answer the query is less than if the database had not been compressed!
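
To make the core idea concrete, here is a minimal sketch of the kind of inverted index that supports such a query. It is written in Python purely for illustration; the documents and the boolean_and function are invented for this example, it works at whole-document rather than paragraph granularity, and a real system such as mg adds compression and far more careful data structures.

    # Minimal illustrative inverted index with a conjunctive (AND) query.
    # Not mg's implementation, just the underlying idea.
    from collections import defaultdict

    documents = {
        1: "managing large collections of documents",
        2: "compressing and indexing gigabytes of text",
        3: "managing gigabytes of compressed text and images",
    }

    # Map each word to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    def boolean_and(*terms):
        """Return, in increasing order, the documents containing every query term."""
        postings = [index.get(term, set()) for term in terms]
        return sorted(set.intersection(*postings)) if postings else []

    print(boolean_and("managing", "gigabytes"))   # -> [3]

Given an index of this kind, answering the query amounts to intersecting two posting lists; the interesting part, addressed throughout the book, is storing the index and the text compactly enough that the whole database is smaller than the original text.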

Many of the techniques described in this book have been invented and tested recently, and are only now being put into practice. Ways to index the text for rapid search and retrieval are thoroughly examined; this material forms the core of the book. Topics covered include text compression and modeling, methods for the compression of images, techniques for compressing textual images such as scanned or facsimile documents, and page layout recognition to separate pictures and diagrams from text.

Full-text indexes are inevitably very large, and therefore potentially expensive. However, this book shows how a complete index to every word -- and, if desired, every number -- in the text can be provided with minimal storage overhead and extremely rapid access.
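
One way to see why the overhead can be kept small is that the document numbers in each posting list can be stored as differences between consecutive entries, with each difference coded in as few bytes as it needs. The sketch below uses a simple variable-byte code purely as an illustration (the encode_postings and decode_postings functions are invented for this example); the book describes considerably more effective codes.

    # Illustrative posting-list compression: differences ("d-gaps") plus a
    # variable-byte code (7 data bits per byte; the high bit marks the last byte).
    def encode_postings(doc_ids):
        out = bytearray()
        previous = 0
        for doc_id in doc_ids:            # doc_ids must be sorted and distinct
            gap = doc_id - previous       # frequent words give small gaps
            previous = doc_id
            chunks = []
            while True:
                chunks.append(gap & 0x7F)
                gap >>= 7
                if gap == 0:
                    break
            for chunk in reversed(chunks[1:]):
                out.append(chunk)         # leading bytes: high bit clear
            out.append(chunks[0] | 0x80)  # final byte of the gap: high bit set
        return bytes(out)

    def decode_postings(data):
        doc_ids, previous, gap = [], 0, 0
        for byte in data:
            gap = (gap << 7) | (byte & 0x7F)
            if byte & 0x80:               # last byte of this gap
                previous += gap
                doc_ids.append(previous)
                gap = 0
        return doc_ids

    postings = [3, 7, 11, 12, 40, 1000, 1003]
    data = encode_postings(postings)
    assert decode_postings(data) == postings
    print(len(data), "bytes, versus", 4 * len(postings), "for 32-bit integers")

Common words have dense posting lists and therefore small gaps, which is precisely where short codes pay off.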

The objective of this book is to introduce a new generation of techniques for managing large collections of documents and images. After reading it, you will understand what these techniques are and appreciate their strengths and applicability.

Accompanying software

A comprehensive public-domain system, mg (standing for "Managing Gigabytes"), has been created to illustrate the ideas in the book. Complete source code for mg is freely available on the Internet (start at http://people.eng.unimelb.edu.au/ammoffat/mg/). Written in ANSI C and running under Unix, it is an operational example of the techniques that we develop and explain. It compresses, stores, and accesses a collection of text, scanned documents, and images in a fully integrated fashion. Any Boolean combination of keywords can be used to retrieve all documents meeting the specification. Informal ranked queries, where the user merely specifies a list of words and documents are retrieved in order of estimated relevance, are also supported. Consider the query mentioned earlier, to retrieve all documents that include paragraphs containing the two words "managing" and "gigabytes". With a database of 750,000 documents, amounting to two gigabytes of text, it took mg just one second to access and decode the index entries for these two words -- 159,458 and 961 appearances respectively -- and well under a minute to fetch and decompress the seven megabytes containing the 554 documents that match the query.
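
For readers who have not met ranked querying before, the sketch below conveys its general flavour: each document is scored by summing a weight for every query term it contains, and documents are returned in decreasing score order. The TF x IDF weighting and the docs and ranked_query names are chosen only for this illustration, not taken from mg; Chapter 4 examines the alternative formulas and their efficient implementation.

    # Illustrative ranked retrieval: score documents by summed term weights.
    import math
    from collections import Counter, defaultdict

    docs = {
        1: "managing large collections of documents and images",
        2: "compressing and indexing gigabytes of text",
        3: "managing gigabytes compressing and indexing documents",
    }
    N = len(docs)
    term_freqs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    doc_freq = Counter(word for tf in term_freqs.values() for word in tf)

    def ranked_query(query):
        scores = defaultdict(float)
        for word in query.lower().split():
            if word not in doc_freq:
                continue
            idf = math.log(1 + N / doc_freq[word])     # rare words weigh more
            for doc_id, tf in term_freqs.items():
                if word in tf:
                    scores[doc_id] += tf[word] * idf   # simple TF x IDF weight
        return sorted(scores.items(), key=lambda item: -item[1])

    print(ranked_query("managing gigabytes"))   # document 3 ranks first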

Audience

The book should be of interest to general readers who want to learn about this subject, to information professionals who need to become acquainted with the new technology for information management, and to those who wish to gain a detailed technical understanding of what is involved. It is written for an eclectic audience of information systems practitioners, programmers, consultants, librarians, information distributors, professors, students, developers, specification writers, patent examiners, and curious lay people who need an accessible description of this new technology. Distributors of CD-ROM databases such as books, encyclopedias, and even computer software will benefit directly from the techniques described. We have avoided requiring any specific theoretical or mathematical knowledge, except in some sections that are marked by a light gray bar in the margin. These contain material for the more technical or theoretically inclined reader, and may be skipped without loss of continuity. We have also highlighted "rules of thumb" in the body of the text.

There are several ways in which the book can be used as the basis of courses at the senior undergraduate, graduate, and professional levels. Each chapter deals with a different component of a full-text retrieval system, including compression methods for the text, index, and pictures; consequently, many of the chapters are suitable for short courses in their own right. For example, Chapter 2 is a comprehensive survey of text compression methods, and has been used for a short course on compression. Whole books could be written on this subject -- indeed, they have been, and one such book is Text Compression, by two of the authors of this book in collaboration with John G. Cleary -- but this chapter provides a self-contained, practical guide to the methods that are most commonly used in practice, giving the right amount of information to learn how each one works. Likewise, Chapter 6 is a self-contained account of current techniques, and international standards, for the compression of images. Chapter 4 covers the basic notions of information retrieval using Boolean and ranked queries, and gives a detailed discussion of the issues involved in their implementation.

The book has been organized so that two groups of chapters provide deep and extensive coverage of particular sub-areas. Chapters 1, 3, 4, and 5 have been used as the basis of a graduate course on information retrieval, while Chapters 6, 7, and 8 form a stand-alone module on the analysis and compression of document images. A more comprehensive senior undergraduate or graduate course on information systems and data compression would cover the entire book, or the text can be used to supplement general undergraduate courses on information systems or practical data structures.

Finally, readers who are more concerned with the concepts involved than the technical details will get the general message of the book by reading the first and last chapters. Chapter 1 introduces the problems tackled, and illustrates them with real-world examples. It looks at how concordances have been constructed in the past, and how they are being replaced by full-text retrieval systems. It also introduces the key ideas of the book: compressing and indexing large collections of text and images. Chapter 10 looks to future developments and applications for this new technology. One such development is the integration of audio and multimedia information into indexed retrieval systems. This is surprisingly straightforward; any kind of information that needs to be retrieved on the basis of a specified set of keywords can be incorporated into an indexed compression system, and any compression scheme for such information can be incorporated as well. It seems certain that such systems will see rapidly increasing use in the future to store large collections of disparate kinds of information.

Where more detail is required, readers can turn to the initial parts of other chapters. In all cases the more technical material does not begin until well into the chapter, and the first part provides a general overview of what is involved. Many of the chapters contain some sections that are "optional" (marked by a light gray bar) and need not be studied unless the material is felt to be particularly relevant to the reader's interests.

Updated and revised content

The first edition of this book was essentially complete by the close of 1993, and now, in October 1998, we are just polishing this second edition. What a lot has happened in the world of information during the intervening five years! -- the rise of the World Wide Web, the idea of digital libraries, the internationalization of information, Java and the "network computer," virtual reality in your living room (and, on the down side, pornography, virtual sex, gambling). Today, the largest information system that humanity has ever known is pushed "in your face" daily -- whoever you are -- by TV, magazines, advertisements everywhere. Today, information workers experience the power -- and the frustration -- of full-text searching of vast masses of data -- gigabytes -- on a routine basis, every working day. All this in just five years. One of the more esoteric topics this book covers (or so we thought), textual image compression, is being turned into an international standard and will soon be used in your fax machine. Some of the changes that were foreseen in 1993 have not happened (yet): for example, this second edition is not called Managing Terabytes as we rashly hinted it might be. So much for technology forecasting.

On the one hand, the world of information has become assimilated into everyday life to an extent that would have stretched our credulity in 1993. On the other, the topics covered in this book are not dated: indeed, they are ever more relevant to today's scene. The need to compress and index documents and images is even stronger. The basic ideas of compression, information science, full-text retrieval, and image representation are just the same. The idea of compressed full-text indexing is just as good, and almost as unusual. So far as we know, no commercial search engines yet use all of the techniques we espouse: they do it the hard way -- the hardware way -- with enormous disk and main memory installations. (And they don't store the text, only the index, making "404 Not Found: The requested URL was not found on this server" a catchphrase for technology failure that makes "Bus error: core dumped" seem quaint, almost friendly.) This book is just as timely and important now as when it was first published.

While the basic core of material in this second edition is the same as in the previous edition, we have made the most of our opportunity to update it to reflect the changes that have taken place over five years. Of course, there have been errors to fix, errors that we had accumulated in our publicly-available errata file. In fact, surprisingly few were found, and we are hopeful that there are even fewer in this second edition. (The errata for the second edition may be found via the home page at http://people.eng.unimelb.edu.au/ammoffat/mg/.) We have thoroughly edited the chapters and brought them up to date where appropriate, inserting many new references into the "Further reading" sections. The most enjoyable part has been adding new material: here are the major additions.

Chapter 2 has been updated to include recent developments in text compression, including the block sorting method ("Burrows-Wheeler transform"), approximate arithmetic coding, and fast Huffman coding algorithms. More details of some of the methods have been added, the performance comparison has been updated to include recent compression programs, and comparative results have been obtained using the more recent "Canterbury corpus" instead of the "Calgary corpus" as before.
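
For readers meeting the block sorting method for the first time, the toy sketch below shows the forward Burrows-Wheeler transform: all rotations of a block of text are sorted and the last column is kept, together with the position of the original block. The transform compresses nothing by itself, but it clusters characters that occur in similar contexts, which is what makes the follow-on coding stage effective. This is only a bare illustration of the idea, not the tuned algorithm discussed in Chapter 2.

    # Toy forward Burrows-Wheeler transform (illustration only).
    def bwt(block):
        rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
        last_column = "".join(rotation[-1] for rotation in rotations)
        return last_column, rotations.index(block)

    print(bwt("banana"))   # -> ('nnbaaa', 3): similar contexts end up adjacent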

Chapter 3 discusses indexing techniques, and a new section has been added on context-based index compression, including a description of the recently-developed technique of "interpolative coding." The material on signature files and their performance relative to inverted files has been extensively revised.

Four sections have been added to Chapter 4. The first describes the use of blocked inverted indexes, which provide fast Boolean querying. The second covers frequency-sorted inverted indexes, which improve performance for ranked queries. The third describes some of the operational issues associated with the World Wide Web search engines that we now -- millions of us -- take for granted. The fourth new section discusses distributed querying. The sections describing ranked querying and the TREC project have been extensively revised to incorporate the substantial developments that have taken place over the last five years.

Chapter 6 includes significant new material on lossless image compression, including de facto image compression standards that are in widespread use for pictures in Web pages (GIF and PNG), a high-performance lossless image compression algorithm called CALIC, and the JPEG Lossless or JPEG-LS scheme that has been proposed as a new international standard for lossless compression.

Chapter 7 includes a new section on JBIG2, an upcoming international standard for the compression of document images. Although the details will not be finalized until close to the year 2000 -- well after this edition is published -- the form that the standard will take is reasonably clear, and it will incorporate many of the techniques described in Chapter 7.

The discussion in Chapter 9 has been revised and the various experimental results updated to reflect developments in compression technology and computer hardware that have taken place over the last five years. In particular, a detailed section (including new figures) has been added that describes length-limited Huffman coding.

Chapter 10 contains new information on the Internet and World Wide Web, a major new section on digital libraries, and new material on Web search engines and agent-based information retrieval.

An important new Appendix has been added describing a large-scale application of some of the key ideas presented in this book, the New Zealand Digital Library. This is a repository of public information, freely available on the World Wide Web, that uses mg as its kernel. It is intended to demonstrate the utility and flexibility of the mg software for information retrieval, and this Appendix explains the facilities that are offered and the mechanisms that are necessary to provide them.

Finally, from our experience in teaching this material in a variety of classroom settings we have put together an "Instructors' Supplement" that includes review and test questions for use when teaching from the book. It is a separate booklet, available to teachers on request to Tim C. Bell, Department of Computer Science, University of Canterbury, Christchurch, New Zealand.

Older but wiser, we are refraining from making any forecasts of what will have changed when we begin work on the third edition of Managing Gigabytes, probably around the year 2003.

Acknowledgements

Writing the acknowledgements is always the nicest part! A lot of people helped us, and it is lovely to have this opportunity to thank them. We have been fortunate to receive a great deal of encouragement and assistance in this book from many of our highly esteemed colleagues in the data compression and information retrieval areas, principally Abe Bookstein, Nigel Horspool, Tomi Klein, Glen Langdon, Timo Raita, and Jim Storer. We have learned a great deal from all these people, and part of what we have learned is reflected in this book. Worthy of special mention are John Cleary, Radford Neal, Ron Sacks-Davis, and Justin Zobel, with whom we have worked extensively over the years. The results we present here are theirs just as much as they are ours. Rob Akscyn and Bob Kruse gave valuable help and advice at critical points of this project, and Rod Harries and Todd Bender have provided helpful information and advice on concordances. Several other people have also contributed either directly or indirectly at some stage over the last five years: Gill Bryant, Sally Jo Cunningham, Tony Dale, Daryl D'Souza, Mike Fleischmann, Peter Gutmann, Jan Pedersen, Bill Pennebaker, Art Pollard, Marcelo Weinberger, and Ross Wilkinson. David Abrahamson was heavily involved in the early stages of the book: he helped us decide what should -- and should not -- be included. He also provided the motivation for our work on textual image compression, and contributed some material to Chapter 7. We also thank the reviewers who were there at the beginning to convince our publisher that we were worth supporting, and who were there again at the end, when the manuscript was nearly complete. Douglas Campbell provided particularly detailed and helpful comments for the second edition. Valuable input in the review process was also provided by Ron Murray, Rob Akscyn, Robert Gray, David Hawking, Yann LeCun, Michael Lesk, Darryl Lovato, Karen Sparck Jones, Jan Pedersen, and Peter Willett.

Jennifer Mann and Karyn Johnson of Morgan Kaufmann have worked hard to shape the second edition. Joan Mitchell of IBM has provided valuable practical assistance in helping us to define, refine, and produce this book.

Many of our students have been extremely helpful. Mary-Ellen Harrison and Mark James at the University of Calgary, Canada, and Hugh Emberson at the University of Canterbury, New Zealand, were instrumental in the development of our ideas on textual image compression. Stuart Inglis and Abdul Saheed at the University of Waikato, New Zealand, have done sterling work on document layout recognition and template matching using model-based compression. Craig Nevill-Manning at Waikato and Calgary was involved in some of our early research on index compression and has been a great source of all sorts of practical help. Peter Thompson at the University of Melbourne, Australia, contributed to the material presented in Chapter 5. We are grateful to students who field-tested various sections of the book, including Tim A.H. Bell (a student at the University of Melbourne, and not to be confused with the Tim C. Bell who is an author of this book), Gwenda Bensemann, Mike Ciavarella, Craig Farrow, Andrew Kelly, Alex McCooke, Chris Stephens, John Tham, Bert Thompson, Lang Stuiver, Andrew Turpin, and Glenn Wightwick, who relished their role as guinea pigs and provided valuable comments.

In producing the second edition, we are most grateful to Peter Fenwick of the University of Auckland for his assistance with the new material on the Burrows-Wheeler transform. Nasir Memon kindly supplied some information for Chapter 6, and much of the section on JPEG-LS is derived from a paper that he wrote. Paul Howard reviewed our description of the embryonic new standard for textual image compression, JBIG2. Harold Thimbleby made valuable comments on Appendix B. Yvonne Simmons helped with the indexing. Andrew Turpin provided an improved implementation of the huffword program, together with some of the length-limited coding results listed in Chapter 9. Owen de Kretser and Lang Stuiver reran many of the results listed in Chapters 3 and 4. Tim A.H. Bell provided a perceptive proofreading service for almost all of the revised book, and Tetra Lindarto and Elizabeth Ng also helped in the proofreading effort. Nelson Beebe has been a source of useful information and encouragement since the first edition was published.

The first mg system was developed with the support of the Australian Research Council and the Collaborative Information Technology Research Institute, and with assistance from Lachlan Andrew, Gary Eddy, and Neil Sharman. Since then many people have contributed: Owen de Kretser, Tim Shimmin, and William Weber, to name just those directly involved. There are many others who have developed modifications for specific purposes, and space precludes us naming them all. The minimal perfect hashing routines were written by Bohdan Majewski at the University of Queensland, and his permission to incorporate these procedures is gratefully acknowledged. A lot of other people also went well beyond the call of duty in helping us with technical aspects of the book. All the images in Chapter 6 were produced by Neil Sharman, as were some of the figures in Chapter 2 and Appendix A. Many of the illustrations in Chapters 7 and 8 were produced by Kerry Guise and Stuart Inglis.

Parts of this research have been funded by the Australian Research Council and the Natural Sciences and Engineering Research Council of Canada. Furthermore, the departments of computer science at Calgary, Canterbury, Melbourne, and Waikato have all been very supportive of our work.

Last, but definitely not least, we are grateful to our families, who were quite aware of the implications of having an author in the house but let us go ahead and write the book anyway. Thank you Judith, Pam, and Thau Mee for your support. You have helped us in innumerable ways and this is your book too. We are grateful to our children, Andrew, Anna, Anne, Kate (who arrived shortly after the first edition), Michael, and Nikki, who are growing up immersed in the information age, for keeping us in touch with reality and continually reminding us that managing gigabytes is not all that important.



Ian H. Witten, Alistair Moffat, Timothy C. Bell