SIGIR'98 posters: Automatically Locating, Extracting and Analyzing Tabular
Data
William Kornfeld
3752 Red Oak Way
Redwood City, California 94061
U.S.A.
John Wattecamps
Gaithersburg, Maryland
U.S.A.
Abstract
Information retrieval from ASCII documents generally means retrieval
based on linear patterns (1-dimensional arrays of symbols) found in the
source documents. This paper describes a 2-dimensional IR application:
recognizing and extracting tabular data, in this case financial
tables. Tables are located and extracted using a version of the LR(k)
parsing algorithm adapted for this purpose. Because tables are often
constructed sloppily, not all of them can be retrieved fully
automatically. The parsing algorithm is therefore integrated into a
user interface, analogous to a programming-language debugger, that lets
operators quickly correct defects in the source document so that
parsing can complete successfully. The application described is
deployed and currently in commercial use.
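
The abstract does not give the grammar or the parser adaptation, so the
following is only a minimal sketch of what treating ASCII text as a
2-dimensional grid can look like. It is an assumption-laden illustration
using a simple shared-blank-column heuristic, not the LR(k)-based method
the paper describes. The names column_spans and split_cells and the
min_gap parameter are hypothetical.

```python
from typing import List, Tuple


def column_spans(lines: List[str], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of table columns.

    A separator is a run of at least `min_gap` character positions that are
    blank in every line. This is a heuristic stand-in, not an LR(k) parser.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # blank[i] is True when position i holds a space in every row.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    # Keep only blank runs that are wide enough to separate columns,
    # so a single space inside a label such as "Net income" is ignored.
    separator = [False] * width
    i = 0
    while i < width:
        if blank[i]:
            j = i
            while j < width and blank[j]:
                j += 1
            if j - i >= min_gap:
                for k in range(i, j):
                    separator[k] = True
            i = j
        else:
            i += 1

    # Collect the maximal runs of non-separator positions.
    spans = []
    start = None
    for i in range(width + 1):
        is_sep = i == width or separator[i]
        if not is_sep and start is None:
            start = i
        elif is_sep and start is not None:
            spans.append((start, i))
            start = None
    return spans


def split_cells(lines: List[str], min_gap: int = 2) -> List[List[str]]:
    """Cut every line at the shared column spans and strip each cell."""
    width = max(len(line) for line in lines)
    spans = column_spans(lines, min_gap)
    return [[line.ljust(width)[a:b].strip() for a, b in spans] for line in lines]


if __name__ == "__main__":
    # Hypothetical two-row financial table, built with fixed-width formatting.
    table = [
        "{:<15}{:>5}{:>9}".format("Revenue", "1,200", "1,050"),
        "{:<15}{:>5}{:>9}".format("Net income", "310", "275"),
    ]
    for row in split_cells(table):
        print(row)
```

A heuristic like this fails on exactly the sloppily constructed tables
the abstract mentions, for example a misaligned figure that intrudes on a
separator, which is the kind of case the paper hands to an operator
through its debugger-like interface.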
SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.