Managing Gigabytes

Reviews of the First Edition


The first edition of Managing Gigabytes: Compressing and Indexing Documents and Images, by Ian H. Witten, Alistair Moffat, and Timothy C. Bell, was published in 1994 by Van Nostrand Reinhold, New York.

The second edition was published in 1999 by Morgan Kaufmann, San Francisco. Details of the second edition are at http://people.eng.unimelb.edu.au/ammoffat/mg/.

This page contains reviews of the first edition by:


Joy A. Thomas

IBM T.J. Watson Research Center, Yorktown Heights, NY.
Published in IEEE Transactions on Information Theory, vol 41, no 6, pp 2101-2102, November 1995.

Review

Information theory is often confused with the management of information systems, even though there appears to be little in common between the two fields. One deals with the fundamental limits in the representation and communication of information, and the other deals with the efficient organization and searching of data. It is therefore very interesting to find an excellent book that lies at the interface between the two areas. As the title indicates, this book deals with the problem of organizing, compressing, indexing and querying a large database of text and images. The book is written in a clear and witty style, and is a very readable and comprehensive introduction to the field.

Many of the ideas in the book have been implemented in a public domain document management system called mg, which is described in an appendix and is available on the Internet. The authors have tackled a number of practical issues in building such a system, and the book reflects their work. This is very much an "engineering" book; there are no theorems or proofs, and the mathematics is limited to simple back-of-the-envelope calculations analyzing the efficiency of some of the algorithms and heuristics proposed here. The authors describe some of the basic methods of compression and indexing, and provide practical examples of their use in real systems. Many of the methods are illustrated with numbers from sample databases, such as the TREC database, a large (2 GB) collection with three quarters of a million documents and half a million terms in the index.

The book is essential reading for anyone interested in building or maintaining a large database for text or images. It is also a good introduction to some practical applications of information theory. During a course on information theory, I used material from the book to illustrate how ideas like Huffman coding or arithmetic coding are actually implemented in standard compression algorithms. As good applications inspire good theory, the book could also provide a source of interesting research problems, e.g., formalizing or improving some of the heuristics using information theoretic ideas.

The first chapter introduces the problem of indexing and compressing text. Examples of manual construction of concordances or indexes are used to motivate the need for efficient computer algorithms to do the same. While single books or a few megabytes of documents could be searched using simple intuitive algorithms, these methods would be infeasible for the large databases envisioned in this book. The current explosion in the availability of information necessitates efficient techniques for storing and searching this information. As an example, the authors discuss the notion of a "memex", a kind of generalized personal digital assistant envisaged by Vannevar Bush in 1945, which would store all the books, records and communications of an individual. Such a device is quite conceivable with current technology, but its usefulness would depend on efficient storage and search mechanisms, which are the main subject of this book.

The second chapter introduces the need for compression with Parkinson's law for space: despite increases in computer storage capacity or bandwidth, there is always more data to fill all the space available. The chapter is a brief survey of the fundamental techniques for lossless source coding, including Huffman coding, arithmetic coding and Lempel-Ziv coding, and modeling methods such as Dynamic Markov Compression (DMC) and Prediction by Partial Matching (PPM). The presentation is necessarily brief, and much of the material is covered in greater detail in Bell, Cleary, and Witten's Text Compression (see the references below). Some comparisons of compression performance and speed are given for files in the Calgary Corpus.
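
As a purely illustrative aside (this sketch is not taken from the book or from the mg system, and all function names are invented for the example), the classic Huffman construction the chapter surveys can be written in a few lines of Python: repeatedly merge the two least frequent symbols, so that common symbols end up with short codewords and rare ones with long codewords.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Return {symbol: codeword} for a static Huffman code over `text`."""
        freq = Counter(text)
        # Each heap entry: (weight, tie-break id, {symbol: codeword-so-far}).
        heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        if len(heap) == 1:                        # degenerate one-symbol alphabet
            return {sym: "0" for sym in heap[0][2]}
        next_id = len(heap)
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)       # the two least frequent subtrees
            w2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (w1 + w2, next_id, merged))
            next_id += 1
        return heap[0][2]

    if __name__ == "__main__":
        codes = huffman_codes("compressing and indexing documents and images")
        for sym in sorted(codes, key=lambda s: (len(codes[s]), s)):
            print(repr(sym), codes[sym])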

The next three chapters deal with indexing. The first of these introduces the basic form of index, i.e., the inverted file, which is essentially a table that, for every word or term in the lexicon, lists the numbers of the documents in which that term appears. The authors discuss problems such as handling upper-case and lower-case letters and the problem of stemming, which reduces similar words to a common root, e.g., reducing "compressed" and "compression" to the root "compress". They also consider other forms of indexing, such as signature files and bitmaps. The main focus of this chapter is on efficient storage of the index, and for this purpose there is a discussion of coding schemes for the integers, such as the Golomb code. The index compression methods allow the index for the TREC database to be a little more than 4% of the size of the data it indexes (and yet index every word in the collection). Thus, when combined with compression of the original data, the combined database is only a fraction of the size of the original data, and yet can be queried efficiently without scanning the entire database.
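
To make the storage idea concrete, here is a small Python sketch (illustrative only, not mg's implementation; the function names are invented, and the parameter b is chosen by the usual rule of thumb of roughly 0.69 times the number of documents divided by the length of the posting list): a posting list is stored as the gaps between successive document numbers, and each gap is Golomb-coded as a unary quotient followed by a truncated binary remainder.

    import math

    def golomb_encode(gap, b):
        """Golomb-code a gap >= 1 with parameter b; returns a string of bits."""
        q, r = (gap - 1) // b, (gap - 1) % b
        bits = "1" * q + "0"                      # unary part: q ones, then a zero
        if b == 1:
            return bits                           # no remainder part needed
        k = math.ceil(math.log2(b))
        cutoff = (1 << k) - b                     # truncated binary remainder
        if r < cutoff:                            # only possible if b is not a power of two
            return bits + format(r, "0{}b".format(k - 1))
        return bits + format(r + cutoff, "0{}b".format(k))

    def compress_postings(doc_ids, num_docs):
        """Gap-encode a sorted posting list and Golomb-code the gaps."""
        gaps = [doc_ids[0]] + [y - x for x, y in zip(doc_ids, doc_ids[1:])]
        b = max(1, round(0.69 * num_docs / len(doc_ids)))
        return b, "".join(golomb_encode(g, b) for g in gaps)

    if __name__ == "__main__":
        b, bits = compress_postings([3, 7, 8, 15, 40], num_docs=100)
        print("b =", b, " encoded:", bits, "({} bits)".format(len(bits)))

Because each gap costs only a handful of bits, even an index that records every word of every document can remain a small fraction of the size of the text it indexes.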

The fourth chapter deals with querying. The initial portion deals with Boolean queries, which look for documents that match all the search terms. The later part of this chapter deals with ranked queries, where the objective is to find documents that match as large a fraction of the search terms as possible. In this process, it is advisable to give higher weights to infrequent query terms, and various heuristics to find a ranked set of matching documents are described.
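
As a toy illustration of that weighting idea (a generic tf-idf style scorer with invented names, not the particular ranking formula developed in the book), a ranked query might be evaluated like this:

    import math
    from collections import Counter, defaultdict

    def rank(query_terms, documents):
        """documents: {doc_id: list of terms}.  Returns (doc_id, score) pairs, best first."""
        n_docs = len(documents)
        df = Counter()                            # in how many documents does each term occur?
        for terms in documents.values():
            df.update(set(terms))
        scores = defaultdict(float)
        for doc_id, terms in documents.items():
            tf = Counter(terms)
            for t in query_terms:
                if tf[t] and df[t]:
                    idf = math.log(1 + n_docs / df[t])   # infrequent terms get larger weights
                    scores[doc_id] += tf[t] * idf
        return sorted(scores.items(), key=lambda kv: -kv[1])

    if __name__ == "__main__":
        docs = {
            1: "compressing and indexing documents".split(),
            2: "indexing large text collections".split(),
            3: "compressing images and text".split(),
        }
        print(rank(["compressing", "text"], docs))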

The next chapter discusses the construction of indexes. Conceptually, the problem is very simple, since all one has to do is to count the occurrences of each term in each document. However, practical implementation for large databases requires careful algorithm design to avoid using too much memory. For example, for the TREC database, the sophisticated approach using a modified mergesort algorithm reduces the time needed to construct the index from more than a hundred years to about 5 hours on a typical workstation.
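
The flavour of that approach can be conveyed with a much simplified, in-memory Python sketch (not mg's actual algorithm; the run size, names, and toy data are invented): emit (term, document) pairs in limited-size sorted runs, then merge the runs into the inverted file. A real implementation writes the runs to disk and merges compressed runs, but the structure is the same.

    import heapq
    from itertools import islice

    def make_runs(pair_stream, run_size):
        """Split a stream of (term, doc_id) pairs into sorted, memory-sized runs."""
        it = iter(pair_stream)
        while True:
            run = list(islice(it, run_size))
            if not run:
                break
            yield sorted(run)

    def merge_runs(runs):
        """Merge sorted runs into an inverted index {term: [doc_ids]}."""
        index = {}
        for term, doc_id in heapq.merge(*runs):
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc_id:    # drop duplicate postings
                postings.append(doc_id)
        return index

    if __name__ == "__main__":
        docs = {1: "cold text hot text", 2: "hot images", 3: "cold images cold text"}
        pairs = ((term, doc_id) for doc_id, text in docs.items() for term in text.split())
        print(merge_runs(make_runs(pairs, run_size=4)))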

The sixth chapter is a survey of techniques for image compression, with the initial part focused on bilevel images and the latter part on gray-scale images. It includes a very good description of the Group 3/4 FAX compression standards and the JBIG standard, which provide interesting examples of the application of Huffman coding and arithmetic coding respectively. The authors also discuss resolution reduction and progressive transmission for images. The latter half of the chapter deals with the JPEG standard and the FELICS (Fast, Efficient, Lossless Image Compression System) algorithm. The chapter concentrates on algorithms that are already standards, and does not discuss current research in wavelet or fractal image compression. There is also no discussion of other kinds of data that might be found in a multimedia database, e.g., audio or video data.

Since most document images are images of text, the next chapter is devoted to the compression of textual images. The authors describe in detail a method involving the extraction of marks (contiguous regions of black pixels, which are presumed to be characters of the text), forming a library of marks, and representing new marks by finding the closest match in the library. If a good match is not found for a mark, it is assumed to be a new character and added to the library. Thus the page is represented by a sequence of pointers into the library and the corresponding horizontal and vertical offsets of the location of each mark on the page. Although this process is similar to Optical Character Recognition (OCR), the objective is to compress the text, not to decode the letters, and it avoids some of the imperfections of current OCR technology. The algorithm aptly illustrates the interplay between good models, good compression and good recognition. While the basic scheme is lossy (since the marks on the page would not match the marks in the library exactly), it is possible to make it lossless by also storing the residue image (the difference between the original and the reconstructed image). Even though the residue image has far fewer black pixels, it is less compressible than the original image, and applying a standard compression algorithm to the residue results in no savings relative to the original image. This is because the process of extracting marks has removed most of the structure from the original image, leaving mostly noise. However, if the residue is compressed relative to the reconstructed image (which is already available to the decoder) using arithmetic coding, it is possible to compress the residue efficiently. This combination of a lossy and a lossless scheme provides a natural two-level coding scheme, where for most applications the lossy reconstruction would suffice, but the original image is available for archival purposes.
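
A drastically simplified Python sketch of the mark-library idea may help fix the mechanism in mind (it is illustrative only, with invented names and a crude pixel-count matcher; the book's method uses far more careful mark extraction and template matching, and codes the residue with arithmetic coding):

    def pixel_distance(a, b):
        """Number of differing pixels between two equally shaped 0/1 bitmaps."""
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")                   # different shapes: treat as no match
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    def encode_page(marks, threshold=2):
        """marks: list of (bitmap, x, y) triples.  Returns (library, pointers)."""
        library, pointers = [], []
        for bitmap, x, y in marks:
            best, best_d = None, threshold + 1
            for i, proto in enumerate(library):
                d = pixel_distance(bitmap, proto)
                if d < best_d:
                    best, best_d = i, d
            if best is None:                      # no close match: add a new library symbol
                library.append(bitmap)
                best = len(library) - 1
            pointers.append((best, x, y))
        return library, pointers

    if __name__ == "__main__":
        e = ((1, 1), (1, 0), (1, 1))              # tiny stand-ins for scanned characters
        o = ((0, 1), (1, 1), (1, 0))
        library, pointers = encode_page([(e, 0, 0), (o, 3, 0), (e, 6, 0)])
        print(len(library), "library symbols;", pointers)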

Chapter 8 deals with document images that combine text with graphics or halftone images. Using different compression schemes for different parts of the page would be more effective than using a common scheme, so the authors discuss algorithms to segment and classify different regions of the page. They introduce methods such as the Hough transform, which allows one to find collinear marks on the page, and the docstrum or document spectrum, which captures the two-dimensional distribution of distances between neighboring marks. These techniques are combined to orient the page and to segment it into text and image regions.
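
A tiny Python sketch may help illustrate the Hough-transform step (it is not the book's code; using mark centroids as the input points and the coarse accumulator resolution are assumptions made for the example): each point votes for every (theta, rho) line passing through it, and a heavily voted cell corresponds to many roughly collinear marks, such as a text baseline.

    import math
    from collections import Counter

    def hough_peak(points, n_theta=180, rho_step=1.0):
        """Return (theta in degrees, rho, votes) for the most strongly supported line."""
        votes = Counter()
        for x, y in points:
            for t in range(n_theta):
                theta = math.pi * t / n_theta
                rho = x * math.cos(theta) + y * math.sin(theta)
                votes[(t, round(rho / rho_step))] += 1
        (t, rho_bin), count = votes.most_common(1)[0]
        return math.degrees(math.pi * t / n_theta), rho_bin * rho_step, count

    if __name__ == "__main__":
        # centroids of marks lying roughly on the horizontal line y = 5, plus one stray mark
        marks = [(1, 5), (4, 5), (7, 5), (10, 5), (13, 5), (6, 9)]
        theta_deg, rho, count = hough_peak(marks)
        print("strongest line: theta = %.0f deg, rho = %.1f, %d votes" % (theta_deg, rho, count))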

Chapter 9 considers implementation issues, including the choices between the different text compression, indexing and image compression algorithms described in the book. A canonical Huffman coding algorithm is described which allows for fast decompression. Memory and time requirements for example databases are also given.
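
The key property of a canonical code is that codewords of the same length are consecutive binary integers, so the decoder needs only the list of code lengths rather than an explicit tree, and decoding reduces to a few comparisons and table lookups per symbol. The following Python sketch shows one common way of assigning canonical codewords from code lengths (illustrative only, and not necessarily the exact layout used in mg):

    def canonical_codes(lengths):
        """lengths: {symbol: code length in bits}.  Returns {symbol: codeword string}."""
        # Sort by (length, symbol) so that equal-length codes receive consecutive values.
        ordered = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
        codes, code, prev_len = {}, 0, 0
        for sym, length in ordered:
            code <<= (length - prev_len)          # step down to the next code length
            codes[sym] = format(code, "0{}b".format(length))
            code += 1
            prev_len = length
        return codes

    if __name__ == "__main__":
        # Example code lengths, as might be produced by a Huffman calculation.
        lengths = {"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 4, "g": 4}
        for sym, code in canonical_codes(lengths).items():
            print(sym, code)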

The last chapter provides a philosophical overview of the entire field of managing large quantities of information. The current growth of information on the Internet reminds one of the notion of a world brain, suggested by H. G. Wells more than 50 years ago as a means to provide universal access to all human knowledge. Although there are problems such as quality control and copyright for information on the Internet, one big issue is access: how does one find the information one needs? Many tools such as Archie, Gopher and the World Wide Web have made access easier, but the authors suggest that the techniques described in the book would allow one to automatically index and compress textual information in less space than the uncompressed data, and allow easy access and searching. The techniques could ultimately be extended to images as well. Although, as they admit, the techniques are still in their infancy, we can already envision a day when the "memex" of Vannevar Bush becomes a reality and we will have instant and comprehensive access to all the information we need!

It is not often that one comes across a book that succeeds so well in introducing a timely subject to a broad audience. Managing Gigabytes is an essential reference for anyone working with large text or image databases, CD-ROMs, digital libraries, etc. But it is also an excellent introduction for anyone who has ever grappled with the information explosion and wondered about automated means to tackle it.

References

T.C. Bell, J.G. Cleary, and I.H. Witten. Text Compression. Prentice Hall, Englewood Cliffs, N.J., 1990.

P.G. Howard and J.S. Vitter. Fast and efficient lossless image compression. In J. Storer and M. Cohn, editors, Proceedings of the IEEE Data Compression Conference, pages 351-360. IEEE Computer Society Press, 1993.


David J. Craft

IBM Development Laboratory, Austin, Texas

Review

One seldom encounters a book that can be recommended without hesitation, but this work belongs easily within such a category.

The three authors are all major contributors in the field of textual storage, compression, and retrieval systems, and their collaboration has resulted in an excellent and extremely well-structured book, capable of satisfying the needs of a wide variety of readers.

At one level, the first and last chapters provide an easily readable overview and historical perspective of the current information problem in its broadest sense, its past development, present magnitude, and likely future growth. Some comparatively recent advances in computer and telecommunications technologies have had a twofold effect. They have made access to specific information much easier, but have at the same time stimulated even more dramatic growth in the total volume of information available. The central problem, then, becomes not one of simple storage and access to the data, but the determination of what is useful or relevant to the purpose at hand, and where it is located.

The authors argue convincingly that indexing methods based on the use of full text retrieval, in conjunction with automatically generated concordances used as part of an overall compression scheme, offer one promising approach to the problem of storing and indexing such large amounts of information.

Having outlined this basic premise, the authors devote the remaining chapters to in-depth discussion of the various components of such a retrieval system. There are chapters on text compression itself, on indexing, on querying, and on index construction. Since large document databases in general may contain more than coded text, subsequent chapters cover image compression, textual images (such as scanned original documents), mixed text and images, and implementation issues for such systems.

The chapters are extremely well structured, beginning with an excellent opening section that gives an overview of the content. This is followed by several subsections that break out the topic and add more detailed discussion. Within a subsection, a further division is achieved by marking some of the text with a light grey bar in the margin. This material is intended for the more technical or more theoretically inclined reader, and can be skipped if necessary.

The end result is a book that can be used at several different levels. As an overview of significant problems facing our society, it is excellent, yet it also describes in very readable detail how one can implement a system capable of solving a large part of that problem, giving well-reasoned arguments that justify the various individual elements of the system. Finally, the book can be used as a comprehensive reference, since it goes into sufficient depth for this purpose in almost every topic covered.

Each chapter ends with a section on further reading, which is in most cases a very useful guide to the literature; full details of the referenced materials are gathered together in a special section at the end of the book.

An appendix discusses the mg system, an experimental, public domain, full-text retrieval system that runs under the UNIX** operating system and demonstrates some of the key ideas described in the book. It is written in ANSI C and can be obtained via anonymous ftp following the directions given.

Overall, this book is a pleasure to read, covers some very fundamental issues in a manner well suited for the nonspecialist, yet has sufficient detail to serve as a very convenient and compact reference for those working with information retrieval, compression, or related topics.

** Trademark or registered trademark of X/Open Co. Ltd


Mike Fleischmann

Optimus Corporation, Fort Collins, CO.
Published in Inform (published by the Association for Information and Image Management) vol 8, no 4, pp 12-14, April 1994.

Review

An old saying claims that no matter how big the hard disk, it will fill to capacity and you will need more space. Never has this been more true than in this information age. The average person is overrun with information from books, magazines, faxes, newspapers, and electronic sources. Just keeping up with your professional reading is difficult, let alone expanding into other technical areas. Not only is it a struggle to keep up but, inevitably, when you try to locate something it may take hours to find. Don't you wish there were a program that would find that article you read about yellow widgets, or would scan through that current magazine and locate the articles you would be interested in and save you time? This book is for the people who will eventually write those programs - the system designer or experienced programmer who needs to implement a full text system. I will warn you that a large part of this book is not for the novice; however, some areas are good overviews for managerial personnel. I found the book refreshing in that the contents went far beyond the basic algorithms but kept a strong eye on real-world restrictions.

The book is about 450 pages and consists of ten chapters and an appendix. Each chapter deals with a different component of a full text retrieval system and is fairly independent from the others. The book has a logical succession of topics, starting with an overview. Following this are chapters that cover text compression, indexing and the use of several querying techniques. The next three chapters cover image compression, textual images and mixed text and graphic images. The book then winds down by covering real-world implementation considerations and finally gives an overview of the information explosion itself. The appendix is a user guide to mg (Managing Gigabytes), the authors' public domain full text system.

I found that a few chapters stood out. Chapter three, which covers various techniques of indexing and index compression, never lost sight of the need to keep the indexing road map small and time-saving. Chapter four, which covers various methods of querying, was clear, concise, and enlightening. I actually lost track of time while I was reading it (much to the chagrin of my wife, as I got to sleep at 1 A.M.). Also of particular merit was chapter seven, which covered textual images and gave information on how an OCR system identifies individual letters and on different ways of compressing text images.

These authors built a full text system called mg. Although developing software in the course of researching a book is not new, what is unique is that the authors have made their system public domain and available on the Internet with its full source. This gives the reader of their book a useful working platform for testing the concepts expounded in the book. The mg system is also a good starting point for developing a professional full text system. Although the authors developed mg for a UNIX platform, they were careful to write it in ANSI 'C', which means that the code should port to other platforms easily. At the time I wrote this article I had not yet had time to work with mg.

I liked the book, both personally and professionally. It offers the most practical information on compressing and indexing documents of any book I have read. Although it may be a bit much for a novice without determination, I found the book's depth and practicality refreshing. I have several books on data compression and indexing in my personal library, and this is by far the best of the group. The authors are attentive in giving references to other books and papers at the end of each chapter for those who wish to do further reading. I would recommend this book to anyone who is serious about developing a system that requires retrieval and/or searching of text. I will be recommending several copies for my company and requiring my top-level programmers to read it.

Mike Fleischmann is director of research and development for Optimus Corporation in Fort Collins, CO. Optimus specializes in helping large mainframe users put their databases on CD-ROM.


Tom Lane

Independent JPEG Group.

Review:

From: tgl@netcom.com (Tom Lane)
Newsgroups: comp.compression.research
Review: "Managing Gigabytes - Compressing and Indexing Documents and Images"
Date: 27 Apr 1994 05:06:17 GMT

This book covers the methods needed to store large document databases and to find things in those databases. The focus is on compressing large text collections (up to multiple gigabytes, ie, thousands of megabytes) and on indexing such collections so that relevant material can be retrieved. A secondary focus is on compressing scanned images of text documents. Compression of more general image classes is discussed, but not as thoroughly.

The book seems to be intended either as an introductory textbook, suitable for undergrad students, or as a design/implementation guide for seasoned programmers. The authors largely succeed in this dual goal: the book is clear and well written throughout, and yet it manages to present quite detailed information. There are some places where a would-be implementor would want more information, but the extensive references should serve that need. The worst compromise the authors had to make is in the introductory chapters on text and image compression; these are probably too brief for students with no prior exposure to data compression, while compression aficionados will find nothing new in them. Students might want to look at a general data compression textbook (such as the well regarded Text Compression, which two of these three authors contributed to) if they get lost in the early going.

The book's greatest strength is its clear, complete presentation of practical methods for constructing and using indexes for large text collections. The authors make a persuasive case for full text indexing, in which every word of every document is indexed and can be used for retrieval. They show how to use the index for both traditional keyword search with Boolean operators (for example "data AND (compression OR retrieval)") and more general similarity matching (Internet users are likely to be familiar with this technique from its use in WAIS). Remarkably, both the text and the full index can be compressed into less than half the space that would be needed for uncompressed text alone. Querying and retrieval are said to be practical on PC-class machines, although initial construction of the index for a multi-gigabyte database requires a small workstation or large patience.
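
For readers who want a concrete picture, Boolean retrieval over an inverted index amounts to set operations on posting lists. The following Python fragment (an illustration with invented names and toy data, not mg's query engine) evaluates the example query above:

    def postings(index, term):
        """Posting list for a term, as a set of document numbers."""
        return index.get(term, set())

    if __name__ == "__main__":
        index = {
            "data":        {1, 2, 4, 7},
            "compression": {2, 3, 7},
            "retrieval":   {4, 5, 7},
        }
        # The query  data AND (compression OR retrieval)  maps onto intersection and union.
        result = postings(index, "data") & (postings(index, "compression") |
                                            postings(index, "retrieval"))
        print(sorted(result))                     # -> [2, 4, 7]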

A second major area is specialized techniques for compression of textual images: bilevel (black and white) images in which much of the image is printed characters. This is a large and growing need. The authors show how the textual nature of an image can be exploited to give very good lossless or lossy compression, without depending on unreliable OCR technology, and without assuming that everything on the page is text. This is accomplished by segmenting the image and then matching individual marks (connected black areas) against a library of mark patterns. When fully lossless storage is required, the small discrepancies between a mark and its representative symbol can be coded in a separate pass.

As proof of concept, the authors have written software that implements the compression and indexing methods recommended in the book. They have made this software freely available --- Internet users can retrieve it via ftp from munnari.oz.au, file /pub/mg/mg.tar.Z (also see the other files in that directory). Users of mg might find Managing Gigabytes most useful as a guide to the capabilities and design rationale of that software.

In general, I liked the book a lot. It covers its intended topic area in a clear, easy-to-read fashion. To judge from the references, not much of this material will be new to real experts in the field, but it is pulled together and presented well. My only serious complaint is that the back cover blurb promises somewhat more than the book delivers: the blurb mentions JBIG and JPEG, which are covered only cursorily, and OCR which is not really discussed at all.

I would have liked to see more about OCR (optical character recognition, ie, converting a scanned text image into actual text). For one thing, OCR is an essential step if textual image databases are to be automatically indexed using the methods of the first part of the book. For another, the textual image compression process looks like it could share processing with OCR. But the authors had to stop somewhere, and OCR methods could easily fill another book.

If you have any interest in the design of full-text databases, I think you will find Managing Gigabytes well worth your time.

Table of Contents:

Chapter titles and brief summaries of contents:

1. Overview
The usual preliminary stuff.
2. Text Compression
Brief review of standard text compression techniques, including Huffman and arithmetic coding, LZ-family methods, etc. Reasonably complete for the methods advocated, but it helps if you've seen this material before.
3. Indexing
Now we get to the meat of the book. This chapter discusses what a full-text index should contain and how to compress it into a reasonable amount of space.
4. Querying
Efficient methods for processing keyword queries, Boolean combinations, and ranking queries (WAIS-style similarity tests), given the existence of an index file.
5. Index Construction
How to construct a full-text index for a large document collection in a reasonable amount of time.
6. Image Compression
Very brief review of standard image compression methods, including fax (CCITT Group 3 and Group 4), JBIG, and JPEG standards. Unlike the rest of the book, this chapter does not really provide enough info for a would-be implementor.
7. Textual Images
Image compression methods designed for bilevel images that contain mostly text, as would be found in many scanned documents.
8. Mixed Text and Images
Segmenting a mixed image so that subareas can be compressed with appropriate methods.
9. Implementation
Design rationale for the mg software, with performance figures for various alternatives.
10. The Information Explosion
A bit of history and speculation about the future.
A. Guide to the mg System
Obtaining, installing, and using the free "mg" software.

There is an extensive bibliography, and each chapter closes with brief descriptions of the references that are relevant to that chapter.

Dr. Thomas G. Lane
organizer, Independent JPEG Group

Disclaimer: I received a free review copy courtesy of the series editor. I have no connection with the authors or publisher.


Alistair Moffat
alistair/ unimelb.edu (then .au)
28 April 1999.

Mandatory disclaimer: This page, its content and style, are the responsibility of the author and do not necessarily represent the views, policies, or opinions of The University of Melbourne.