SIGIR'98 posters: Automatically Locating, Extracting and Analyzing Tabular Data

Automatically Locating, Extracting and Analyzing Tabular Data


William Kornfeld
3752 Red Oak Way Redwood City, California 94061 U.S.A.

John Wattecamps
Gaithersburg, Maryland U.S.A.


Abstract

Information retrieval over ASCII documents generally means retrieval based on linear patterns (1-dimensional arrays of symbols) found in the source documents. This paper describes a 2-dimensional IR application: recognizing and extracting tabular data, in this case financial tables. Tables are located and extracted using a version of the LR(k) parsing algorithm adapted for this purpose. Because tables are often sloppily constructed, somewhat fewer than 100% of them can be retrieved fully automatically. To handle the remainder, the parsing algorithm has been integrated into a user interface, analogous to a programming-language debugger, that lets operators quickly correct defects in the source document so that parsing can complete successfully. The application described here is deployed and currently in commercial use.
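
To make the idea of locating tables in plain text concrete, here is a minimal sketch in Python. It is not the adapted LR(k) parser described above. It substitutes a much simpler heuristic: each line is classified as a coarse token, and runs of consecutive row-like lines are reported as candidate tables. The names NUMBER, split_cells, classify and find_tables, the two-space column separator, and the rule that a data row has at least two numeric cells are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for a financial figure: optional "$", digits with
# commas, optional decimals, optional parentheses (accounting negatives),
# and an optional trailing "%".
NUMBER = re.compile(r"^\(?\$?\d[\d,]*(\.\d+)?\)?%?$")


def split_cells(line):
    """Split a line into cells on runs of two or more whitespace characters."""
    return [c for c in re.split(r"\s{2,}", line.strip()) if c]


def classify(line):
    """Map one line of text to a coarse token: BLANK, ROW, or TEXT."""
    if not line.strip():
        return "BLANK"
    cells = split_cells(line)
    numeric = sum(1 for c in cells if NUMBER.match(c))
    # Assumption: a data row carries at least two numeric cells.
    return "ROW" if numeric >= 2 else "TEXT"


def find_tables(lines, min_rows=2):
    """Return (start, end) index pairs, end exclusive, for runs of ROW lines."""
    tables, start = [], None
    # A trailing empty line acts as a sentinel that flushes the last run.
    for i, line in enumerate(lines + [""]):
        if classify(line) == "ROW":
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_rows:
                tables.append((start, i))
            start = None
    return tables


sample = [
    "The following table summarizes results.",
    "",
    "Revenue        1,200    1,050",
    "Expenses        (900)    (800)",
    "Net income       300      250",
    "",
    "See notes to financial statements.",
]

print(find_tables(sample))
```

A heuristic like this fails in the cases the abstract alludes to, such as a misaligned column or a stray blank line inside a table, which is where a grammar-driven parser with operator-assisted correction would be expected to do better.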


SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.