The second edition was published by Morgan Kaufmann Publishers, San Francisco, in 1999. Details of the second edition are at http://people.eng.unimelb.edu.au/ammoffat/mg/.
This page contains reviews of the first edition; details of the reviewers, and of where the reviews appeared, are collected at the end of the page.
Review
Information theory is often confused with the management of information systems, even though there appears to be little in common between the two fields. One deals with the fundamental limits in the representation and communication of information, and the other deals with the efficient organization and searching of data. It is therefore very interesting to find an excellent book that lies on the interface between the two areas. As the title indicates, this book deals with the problem of organizing, compressing, indexing and querying a large database of text and images. The book is written in a clear and witty style, and is a very readable and comprehensive introduction to the field.
Many of the ideas in the book have been implemented in a public domain document management system called mg, which is described in an appendix and is available on the Internet. The authors have tackled a number of practical issues in building such a system, and the book reflects their work. This is very much an ``engineering'' book; there are no theorems or proofs, and the mathematics is limited to simple back-of-the-envelope calculations analyzing the efficiency of some of the algorithms and heuristics proposed here. The authors describe some of the basic methods of compression and indexing, and provide practical examples of their use in real systems. Many of the methods are illustrated with numbers from sample databases, such as the TREC database, a large (2 GB) database with three quarters of a million documents and half a million terms in the index.
The book is essential reading for anyone interested in building or maintaining a large database for text or images. It is also a good introduction to some practical applications of information theory. During a course on information theory, I used material from the book to illustrate how ideas like Huffman coding or arithmetic coding are actually implemented in standard compression algorithms. As good applications inspire good theory, the book could also provide a source of interesting research problems, e.g., formalizing or improving some of the heuristics using information theoretic ideas.
The first chapter introduces the problem of indexing and compressing text. Examples of manual construction of concordances or indexes are used to motivate the need for efficient computer algorithms to do the same. While single books or a few megabytes of documents could be searched using simple intuitive algorithms, these methods would be infeasible for the large databases envisioned in this book. The current explosion in the availability of information necessitates efficient techniques for storing and searching this information. As an example, the authors discuss the notion of a ``memex'', a kind of generalized personal digital assistant envisaged by Vannevar Bush in 1945, which would store all the books, records and communications of an individual. Such a device is quite conceivable with current technology, but its usefulness would depend on efficient storage and search mechanisms, which are the main subject of this book.
The second chapter introduces the need for compression with Parkinson's law for space---despite increases in computer storage capacity or bandwidth, there is always more data to fill all the space available. The chapter is a brief survey of the fundamental techniques for lossless source coding, including Huffman coding, arithmetic coding and Lempel-Ziv coding, and modeling methods such as Dynamic Markov Chains (DMC) and Prediction by Partial Match (PPM). The presentation is necessarily brief, and much of the material is covered in greater detail in Bell, Cleary, and Witten's Text Compression (see References). Some comparisons of compression performance and speed are given for files in the Calgary Corpus.
The next three chapters deal with indexing. The first of these introduces the basic form of index, the inverted file, which is essentially a table that, for every word or term in the lexicon, lists the numbers of the documents in which it appears. The authors discuss problems such as correcting for upper-case and lower-case letters and the problem of stemming, which reduces similar words to a common root, e.g., reducing ``compressed'' and ``compression'' to the root ``compress''. They also consider other forms of indexing, such as signature files and bitmaps. The main focus of this chapter is on efficient storage of the index, and for this purpose there is a discussion of coding schemes for the integers, such as the Golomb code. The index compression methods allow the index for the TREC database to be a little more than 4% of the size of the data it indexes (and yet index every word in the collection). Thus, when combined with compression of the original data, the combined database is only a fraction of the size of the original data, and yet can be queried efficiently without scanning the entire database.
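To make the flavour of this index compression concrete, here is a minimal Golomb coder for the gaps between document numbers in an inverted list, written as a small C sketch. It is an illustration only, not code from mg: bits are printed as characters rather than packed into bytes, the gaps are invented, and the parameter b (which in practice would be derived from the term's document frequency) is fixed arbitrarily.

    #include <stdio.h>

    /* Emit one bit; here bits are simply printed, whereas a real coder
     * would pack them into bytes. */
    static void put_bit(int bit)
    {
        putchar(bit ? '1' : '0');
    }

    /* Golomb-encode a gap x >= 1 with parameter b: the quotient is sent
     * in unary, the remainder in truncated binary. */
    static void golomb_encode(unsigned x, unsigned b)
    {
        unsigned q = (x - 1) / b;
        unsigned r = (x - 1) % b;
        unsigned k = 0, threshold, v, nbits, i;

        for (i = 0; i < q; i++)
            put_bit(1);
        put_bit(0);

        while ((1u << k) < b)          /* k = ceiling(log2(b)) */
            k++;
        threshold = (1u << k) - b;     /* remainders below this need k-1 bits */

        if (r < threshold) {
            v = r;
            nbits = (k > 0) ? k - 1 : 0;
        } else {
            v = r + threshold;
            nbits = k;
        }
        for (i = nbits; i > 0; i--)
            put_bit((v >> (i - 1)) & 1u);
    }

    int main(void)
    {
        /* made-up gaps between document numbers in one inverted list */
        unsigned gaps[] = { 3, 1, 7, 2 };
        unsigned i;

        for (i = 0; i < sizeof gaps / sizeof gaps[0]; i++) {
            golomb_encode(gaps[i], 3);   /* b = 3, chosen arbitrarily here */
            putchar(' ');
        }
        putchar('\n');
        return 0;
    }

With b = 3 the four gaps come out as 011 00 1100 010, twelve bits in all; in a real index the parameter would be tuned to each term's document frequency.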
The fourth chapter deals with querying. The initial portion deals with Boolean queries, which look for documents that match all the search terms. The later part of this chapter deals with ranked queries, where the objective is to find documents that match as large a fraction of the search terms as possible. In this process, it is advisable to give higher weights to infrequent query terms, and various heuristics to find a ranked set of matching documents are described.
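As a rough illustration of that weighting idea (one common variant, not necessarily the book's exact formula), the toy C fragment below gives each query term the weight log(1 + N/f_t), where N is the number of documents and f_t the number of documents containing the term, and accumulates term-frequency-weighted contributions into a per-document score; the postings lists are invented.

    #include <math.h>
    #include <stdio.h>

    #define NDOCS 4                      /* N: documents in the collection */

    struct posting { int doc; int tf; }; /* document number, in-document count */

    /* Add one query term's contribution to every document it appears in. */
    static void accumulate(double *score, const struct posting *list,
                           int len, double weight)
    {
        int i;
        for (i = 0; i < len; i++)
            score[list[i].doc] += list[i].tf * weight;
    }

    int main(void)
    {
        /* invented inverted lists for two query terms */
        struct posting rare[]   = { {2, 1} };                          /* f_t = 1 */
        struct posting common[] = { {0, 3}, {1, 1}, {2, 2}, {3, 1} };  /* f_t = 4 */
        double score[NDOCS] = { 0.0 };
        int d;

        accumulate(score, rare,   1, log(1.0 + (double)NDOCS / 1.0));
        accumulate(score, common, 4, log(1.0 + (double)NDOCS / 4.0));

        for (d = 0; d < NDOCS; d++)
            printf("document %d: score %.3f\n", d, score[d]);
        return 0;
    }

Sorting the accumulated scores then yields the ranked answer list; here document 2, the only one containing the rare term, comes out on top.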
The next chapter discusses the construction of indexes. Conceptually, the problem is very simple, since all one has to do is to count the occurrences of each term in each document. However, practical implementation for large databases requires careful algorithm design to avoid using too much memory. For example, for the TREC database, the sophisticated approach using a modified mergesort algorithm reduces the time needed to construct the index from more than a hundred years to about 5 hours on a typical workstation.
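The conceptual simplicity is easy to show in a toy, purely in-memory form: gather (term, document) pairs, sort them, and read off each term's postings list. The C sketch below does exactly that with invented term and document numbers; the book's contribution is performing the same computation when the pairs do not fit in memory, using compressed temporary runs and a multiway merge.

    #include <stdio.h>
    #include <stdlib.h>

    struct pair { int term; int doc; };

    /* Order pairs by term number, then by document number. */
    static int cmp(const void *a, const void *b)
    {
        const struct pair *x = a, *y = b;
        if (x->term != y->term)
            return x->term - y->term;
        return x->doc - y->doc;
    }

    int main(void)
    {
        /* invented (term, document) occurrences gathered from a parse */
        struct pair p[] = { {7,2}, {3,1}, {7,1}, {3,2}, {3,1}, {9,3} };
        size_t n = sizeof p / sizeof p[0], i;

        qsort(p, n, sizeof p[0], cmp);

        for (i = 0; i < n; i++) {
            if (i == 0 || p[i].term != p[i-1].term)
                printf("\nterm %d:", p[i].term);
            if (i == 0 || p[i].term != p[i-1].term || p[i].doc != p[i-1].doc)
                printf(" %d", p[i].doc);     /* suppress duplicate documents */
        }
        putchar('\n');
        return 0;
    }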
The sixth chapter is a survey of techniques for image compression, with the initial part focused on bilevel images and the latter part on gray-scale images. It includes a very good description of the Group 3/4 FAX compression standards and the JBIG standard, which provide interesting examples of the application of Huffman coding and arithmetic coding respectively. The authors also discuss resolution reduction and progressive transmission for images. The latter half of the chapter deals with the JPEG standard and the FELICS (Fast Efficient Lossless Image Coding System) algorithm of Howard and Vitter (see References). The chapter concentrates on algorithms that are already standards, and does not discuss current research in wavelet or fractal image compression. There is also no discussion of other kinds of data that might be found in a multimedia database, e.g., audio or video data.
Since most document images are images of text, the next chapter is devoted to the compression of textual images. The authors describe in detail a method involving the extraction of marks (contiguous regions of black pixels, which are presumed to be characters of the text), forming a library of marks, and representing new marks by finding the closest match in the library. If a good match is not found for a mark, it is assumed to be a new character and added to the library. Thus the page is represented by a sequence of pointers into the library and the corresponding horizontal and vertical offsets of the location of each mark on the page. Although this process is similar to Optical Character Recognition (OCR), the objective is to compress the text, not to decode the letters, and it avoids some of the imperfections of current OCR technology. The algorithm aptly illustrates the interplay between good models, good compression and good recognition. While the basic scheme is lossy (since the marks on the page would not match the marks in the library exactly), it is possible to make it lossless by also storing the residue image (the difference between the original and the reconstructed image). Even though the residue image has far fewer black pixels, it is less compressible than the original image, and applying a standard compression algorithm to the residue results in no savings relative to the original image. This is because the process of extraction of marks has extracted most of the structure out of the original image, leaving mostly noise. However, if the residue is compressed relative to the reconstructed image (which is already available to the decoder) using arithmetic coding, it is possible to compress the residue efficiently. This combination of a lossy and a lossless scheme provides a natural two-level coding scheme, where for most applications the lossy reconstruction would suffice, but the original image is available for archival purposes.
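A small data-structure sketch may help fix the representation in mind. The C program below is illustrative only (the structures and field names are invented, and the real encoder also compresses each of these streams arithmetically): a library of mark bitmaps, and the page as a sequence of library pointers with offsets, to which a residue image can be added for lossless reconstruction.

    #include <stdio.h>

    /* one entry in the library of marks extracted from the pages */
    struct bitmap {
        int width, height;
        const unsigned char *pixels;   /* 1 bit per pixel, packed row by row */
    };

    /* one mark placed on the page: a pointer into the library plus offsets */
    struct placement {
        int symbol;                    /* index of the matching library mark */
        int dx, dy;                    /* position of the mark on the page   */
    };

    int main(void)
    {
        /* a one-mark "library": a 3x3 blob standing in for a character shape */
        static const unsigned char blob[] = { 0xE0, 0xA0, 0xE0 };
        struct bitmap library[] = { { 3, 3, blob } };

        /* the page: the same library symbol placed at two invented positions */
        struct placement page[] = { { 0, 12, 40 }, { 0, 27, 40 } };
        int i, n = (int)(sizeof page / sizeof page[0]);

        for (i = 0; i < n; i++)
            printf("mark %d: symbol %d at (%d,%d), %dx%d pixels\n",
                   i, page[i].symbol, page[i].dx, page[i].dy,
                   library[page[i].symbol].width,
                   library[page[i].symbol].height);

        /* for lossless storage, a residue image (the original page minus the
         * page reconstructed from these placements) would also be coded,
         * conditioned on the reconstruction as described above */
        return 0;
    }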
Chapter 8 deals with document images that combine text with graphics or halftone images. Using different compression schemes for different parts of the page would be more effective than using a common scheme, so the authors discuss algorithms to segment and classify different regions of the page. They introduce methods such as the Hough transform, which allows one to find collinear marks on the page, and the docstrum or document spectrum, which provides the two-dimensional distribution of distances of neighboring marks. These techniques are combined to orient the page and to segment it into text and image regions.
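The Hough transform itself is compact enough to sketch. In the toy C program below (the sample points and resolution constants are invented), every mark centroid votes, for each candidate angle, for the line rho = x*cos(theta) + y*sin(theta) passing through it; collinear marks concentrate their votes in a single accumulator cell, and the angle of the winning cell indicates the orientation of the text lines.

    #include <math.h>
    #include <stdio.h>

    #define NANGLES 180            /* one-degree angle steps              */
    #define NRHO    200            /* quantised distances from the origin */

    int main(void)
    {
        static int accum[NANGLES][NRHO];      /* vote counts */
        const double PI = 3.14159265358979323846;
        /* invented mark centroids; the first four lie on the line y = x */
        const double pts[][2] = { {10,10}, {20,20}, {30,30}, {40,40}, {12,35} };
        const int npts = (int)(sizeof pts / sizeof pts[0]);
        int a, r, i, best_a = 0, best_r = 0;

        for (i = 0; i < npts; i++)
            for (a = 0; a < NANGLES; a++) {
                double theta = a * PI / NANGLES;
                double rho = pts[i][0] * cos(theta) + pts[i][1] * sin(theta);
                r = (int)(rho + NRHO / 2.0);  /* shift so negative rho fits */
                if (r >= 0 && r < NRHO)
                    accum[a][r]++;
            }

        /* the strongest line is the accumulator cell with the most votes;
         * the four collinear points give a peak near theta = 135 degrees */
        for (a = 0; a < NANGLES; a++)
            for (r = 0; r < NRHO; r++)
                if (accum[a][r] > accum[best_a][best_r]) {
                    best_a = a;
                    best_r = r;
                }

        printf("strongest line: theta about %d degrees (%d votes)\n",
               best_a, accum[best_a][best_r]);
        return 0;
    }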
Chapter 9 considers implementation issues, including the choices between the different text compression, indexing and image compression algorithms described in the book. A canonical Huffman coding algorithm is described which allows for fast decompression. Memory and time requirements for example databases are also given.
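The canonical form constrains codewords of equal length to be consecutive integers, so the decoder needs only the first code value and a symbol offset for each length, rather than an explicit code tree. The C sketch below decodes a made-up four-symbol canonical code (lengths 1, 2, 3 and 3) from a string of '0'/'1' characters; it illustrates the principle only and is not the book's implementation.

    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 3

    /* canonical code for symbols a, b, c, d with lengths 1, 2, 3, 3:
     * a = 0, b = 10, c = 110, d = 111 */
    static const int  first_code[MAXLEN + 1] = { 0, 0, 2, 6 }; /* per length  */
    static const int  count[MAXLEN + 1]      = { 0, 1, 1, 2 }; /* per length  */
    static const int  offset[MAXLEN + 1]     = { 0, 0, 1, 2 }; /* into syms[] */
    static const char syms[]                 = "abcd";  /* by (length, rank) */

    /* Decode one symbol from a string of '0'/'1' characters, from *pos on. */
    static char decode_symbol(const char *bits, size_t *pos)
    {
        int v = 0, len;
        for (len = 1; len <= MAXLEN; len++) {
            v = (v << 1) | (bits[(*pos)++] - '0');
            if (v - first_code[len] < count[len])
                return syms[offset[len] + (v - first_code[len])];
        }
        return '?';                    /* invalid code; cannot happen here */
    }

    int main(void)
    {
        const char *bits = "0101101110";   /* a, b, c, d, a coded canonically */
        size_t pos = 0;

        while (pos < strlen(bits))
            putchar(decode_symbol(bits, &pos));
        putchar('\n');                     /* prints "abcda" */
        return 0;
    }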
The last chapter provides a philosophical overview of the entire field of managing large quantities of information. The current growth of information on the Internet reminds one of the notion of a world brain, suggested by H. G. Wells more than 50 years ago as a means to provide universal access to all human knowledge. Although there are problems such as quality control and copyright for information on the Internet, one big issue is access---how does one find the information one needs? Many tools such as Archie, Gopher and the World Wide Web have made access easier, but the authors suggest that the techniques described in the book would allow textual information to be automatically indexed and compressed into less space than the uncompressed data occupies, while still allowing easy access and searching. The techniques could ultimately be extended to images as well. Although, as they admit, the techniques are still in their infancy, we can already envision a day when the ``memex'' of Vannevar Bush becomes a reality and we will have instant and comprehensive access to all the information we need!
It is not often that one comes across a book that succeeds so well in introducing a timely subject to a broad audience. Managing Gigabytes is an essential reference for anyone working with large text or image databases, CD-ROMs, digital libraries, etc. But it is also an excellent introduction for anyone who has ever grappled with the information explosion and wondered about automated means to tackle it.
References
T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice Hall, Englewood Cliffs, N.J., 1990.
P. G. Howard and J. S. Vitter. Fast and efficient lossless image compression. In J. Storer and M. Cohn, editors, Proceedings of the IEEE Data Compression Conference, pages 351--360. IEEE Computer Society Press, 1993.
Review
One seldom encounters a book that can be recommended without hesitation, but
this work belongs easily within such a category.
The three authors are all major contributors in the field of textual storage,
compression, and retrieval systems, and their collaboration has resulted in an
excellent and extremely well-structured book, capable of satisfying the needs of
a wide variety of readers.
At one level, the first and last chapters provide an easily readable overview
and historical perspective of the current information problem in its broadest
sense, its past development, present magnitude, and likely future growth. Some
comparatively recent advances in computer and telecommunications technologies
have had a twofold effect. They have made access to specific information much
easier, but have at the same time stimulated even more dramatic growth in the
total volume of information available. The central problem, then, becomes not
one of simple storage and access to the data, but the determination of what is
useful or relevant to the purpose at hand, and where it is located.
The authors argue convincingly that indexing methods based on the use of full
text retrieval, in conjunction with automatically generated concordances used as
part of an overall compression scheme, offer one promising approach to the
problem of storing and indexing such large amounts of information.
Having outlined their basic premise, the authors then use the remaining chapters
to discuss in depth the various components of such a retrieval system. There are
chapters on text compression itself, on indexing, on querying, and on index
construction. It is
expected that large document databases in general may contain more than coded
text, so subsequent chapters cover image compression, textual images (such as
scanned original documents), mixed text and images, and implementation issues
for such systems.
The chapters are extremely well structured, beginning with an excellent first
section to overview the content. This is followed by several subsections, to
break out the topic and add more detailed discussion. Within a subsection, a
further division is achieved by marking some of the text with a light grey
bar in the margin. This material is intended for use by the more technical or
more theoretically inclined reader, and can be skipped if necessary.
The end result is a book that can be used at several different levels. As an
overview of significant problems facing our society, it is excellent, yet it also
describes in very readable detail how one can implement a system capable of
solving a large part of that problem, giving well-reasoned arguments that
justify the various individual elements of the system. Finally, the book can be
used as a comprehensive reference, since it goes into sufficient depth for this
purpose in almost every topic covered.
Each chapter ends with a section on further reading, which is in most cases a
very useful literature source, full details of the referenced materials being
gathered together in a special section at the end of the book.
An appendix discusses the mg system, an experimental, public domain, full-text
retrieval system that runs under the UNIX** operating system and demonstrates
some of the key ideas described in the book. It is written in ANSI C and can be
obtained via anonymous ftp following the directions given.
Overall, this book is a pleasure to read, covers some very fundamental issues in
a manner well suited for the nonspecialist, yet has sufficient detail to serve
as a very convenient and compact reference for those working with information
retrieval, compression, or related topics.
** Trademark or registered trademark of X/Open Co. Ltd
Review
An old saying claims that no matter how big the hard disk, it will fill to
capacity and you will need more space. Never has this been more true than in
this information age. The average person is overrun with information from
books, magazines, faxes, newspapers, and electronic sources. Just keeping up
with your professional reading is difficult, let alone expanding into other
technical areas. Not only is it a struggle to keep up but, inevitably, when you
try to locate something it may take hours to find. Don't you wish there was a
program that would find that article you read about yellow widgets or would
scan through that current magazine and locate the articles you would be
interested in and save you time? This book is for those people who will
eventually write those programs - the system designer or experienced programmer
who needs to implement a full text system. I will warn you that a large part
of this book is not for the novice; however, some areas are good overviews for
managerial personnel. I found the book refreshing in that the contents went far
beyond the basic algorithms but kept a strong eye on real-world restrictions.
The book is about 450 pages and consists of ten chapters and an appendix. Each
chapter deals with a different component of a full text retrieval system and is
fairly independent from the others. The book has a logical succession of
topics, starting with an overview. Following this are chapters that cover text
compression, indexing and the use of several querying techniques. The next
three chapters cover image compression, textual images and mixed text and
graphic images. The book then winds down by covering real-world implementation
considerations and finally gives an overview of the information explosion
itself. The appendix is a user guide to mg (Managing Gigabytes), the authors'
public domain full text system.
I found that a few chapters stood out. Chapter three, which covers various
techniques of indexing and index compression, never lost sight of the need to
keep the indexing road map small and time-saving. Chapter four, which covers
various methods of querying, was clear, concise, and enlightening. I actually
lost track of time while I was reading it (much to the chagrin of my wife as I
got to sleep at 1 A.M.). Also of particular merit was chapter seven, which
covered textual images and gave information on how an OCR system identifies
individual letters and on different ways of compressing text images.
These authors built a full text system called mg. Although developing
software for researching a book is not new, what is unique is that the authors
have made their system public domain and available on the Internet with its full
source. This gives the reader of their book a useful working platform for
testing the concepts expounded in the book. The mg system is also a good
start for developing a professional full text system. Although the authors
developed mg for a UNIX platform, they were careful to develop it in ANSI 'C'
- that means that the code should port to other platforms easily. At the time I
wrote this article I did not have time to work with mg.
I liked the book, both personally and professionally. It has offered the most
practical information on compressing and indexing documents of any book I have
read. Although it may be a bit much for a novice without determination, I found
the book's depth and practicality refreshing. I have several books on data
compression and indexing in my personal library, and this is by far the best
one of the group. The authors are attentive in giving reference to other books
and papers at the end of each chapter for those who wish to do further reading.
I would recommend this book to anyone who is serious about developing a system
that requires retrieval and/or searching of text. I will be recommending
several copies for my company and requiring my top level programmers to read
it.
Mike Fleischmann is director of research and development for Optimus
Corporation in Fort Collins, CO. Optimus specializes in helping large mainframe
users put their databases on CD-ROM.
Review
From: tgl@netcom.com (Tom Lane)
This book covers the methods needed to store large document databases and
to find things in those databases. The focus is on compressing large text
collections (up to multiple gigabytes, ie, thousands of megabytes) and on
indexing such collections so that relevant material can be retrieved.
A secondary focus is on compressing scanned images of text documents.
Compression of more general image classes is discussed, but not as
thoroughly.
The book seems to be intended as either an introductory textbook, suitable
for undergrad students, or as a design/implementation guide for seasoned
programmers. The authors largely succeed in this dual goal: the book is
clear and well written throughout, and yet it manages to present quite
detailed information. There are some places where a would-be implementor
would want more information, but the extensive references should serve
that need. The worst compromise the authors had to make is in the
introductory chapters on text and image compression; these are probably
too brief for students with no prior exposure to data compression, while
compression aficionados will find nothing new in them. Students might
want to look at a general data compression textbook (such as the well
regarded Text Compression, which two of these three authors contributed
to) if they get lost in the early going.
The book's greatest strength is its clear, complete presentation of
practical methods for constructing and using indexes for large text
collections. The authors make a persuasive case for full text indexing,
in which every word of every document is indexed and can be used for
retrieval. They show how to use the index for both traditional keyword
search with Boolean operators (for example,
"data AND (compression OR retrieval)")
and more general similarity matching (Internet users are
likely to be familiar with this technique from its use in WAIS).
Remarkably, both the text and the full index can be compressed into less
than half the space that would be needed for uncompressed text alone.
Querying and retrieval are said to be practical on PC-class machines,
although initial construction of the index for a multi-gigabyte database
requires a small workstation or large patience.
A second major area is specialized techniques for compression of textual
images: bilevel (black and white) images in which much of the image is
printed characters. This is a large and growing need. The authors show
how the textual nature of an image can be exploited to give very good
lossless or lossy compression, without depending on unreliable OCR
technology, and without assuming that everything on the page is text.
This is accomplished by segmenting the image and then matching individual
marks (connected black areas) against a library of mark patterns. When
fully lossless storage is required, the small discrepancies between a mark
and its representative symbol can be coded in a separate pass.
As proof of concept, the authors have written software that implements the
compression and indexing methods recommended in the book. They have made
this software freely available --- Internet users can retrieve it via
ftp from munnari.oz.au,
file /pub/mg/mg.tar.Z (also see the
other files in that directory). Users of mg might find
Managing Gigabytes
most useful as a guide to the capabilities and design rationale
of that software.
In general, I liked the book a lot. It covers its intended topic area in
a clear, easy-to-read fashion. To judge from the references, not much of
this material will be new to real experts in the field, but it is pulled
together and presented well. My only serious complaint is that the back
cover blurb promises somewhat more than the book delivers: the blurb
mentions JBIG and
JPEG, which are covered only cursorily, and OCR, which is
not really discussed at all.
I would have liked to see more about OCR (optical character recognition,
ie, converting a scanned text image into actual text). For one thing,
OCR is an essential step if textual image databases are to be
automatically indexed using the methods of the first part of the book.
For another, the textual image compression process looks like it could
share processing with OCR. But the authors had to stop somewhere, and
OCR methods could easily fill another book.
If you have any interest in the design of full-text databases,
I think you will find Managing Gigabytes well worth your time.
Mandatory disclaimer:
This page, its content and style, are the responsibility of the
author and do not necessarily
represent the views, policies, or opinions of The University of Melbourne.
David J. Craft
IBM Development Laboratory, Austin, Texas.

Mike Fleischmann
Optimus Corporation, Fort Collins, CO.
Published in Inform (published by the Association for Information and Image Management), vol. 8, no. 4, pp. 12-14, April 1994.

Dr. Thomas G. Lane
Organizer, Independent JPEG Group.
Posted to the newsgroup comp.compression.research as "Review: Managing Gigabytes - Compressing and Indexing Documents and Images" on 27 April 1994. The posting also listed the chapter titles with brief summaries of their contents, and noted that there is an extensive bibliography, with each chapter closing with brief descriptions of the references that are relevant to that chapter. Disclaimer: "I received a free review copy courtesy of the series editor. I have no connection with the authors or publisher."
Alistair Moffat
alistair/ unimelb.edu (then .au)
28 April 1999.