SIGIR'98 papers: Discovering Typical Structures of Documents: A Road Map Approach

Discovering Typical Structures of Documents: A Road Map Approach


Ke Wang
Department of Information Systems and Computer Science, National University of Singapore, Singapore, 119260

Huiqing Liu
BioInformatics Center, National University of Singapore, Singapore, 119260


Abstract

The structure of a document refers to the role and hierarchy of subdocument references. Many on-line documents are similarly, though not identically, structured. We study the problem of discovering "typical" structures of a collection of such documents, where the user specifies the minimum frequency of a typical structure. We consider structural features of subdocument references such as labeling, nesting, ordering, cyclicity, and wild-card references, like those found on the Web and in digital libraries. Typical structures can serve the following purposes: (a) a table of contents for gaining general information about a source; (b) a road map for browsing and querying a source; (c) a basis for clustering documents; (d) partial schemas for building structured layers that provide standard database access methods; (e) a summary of users' or customers' interests and browsing patterns. We present a solution to the discovery problem.
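
To make the role of the user-specified minimum frequency concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes that each document is modeled as a nested dictionary mapping a label to its subdocument, that a "structure" is simply a root-to-node path of labels, and that the minimum frequency is an absolute document count. Ordering, cyclicity, and wild-card references are ignored here, and the names label_paths, typical_paths, and min_support are hypothetical.

```python
from collections import Counter


def label_paths(doc, prefix=()):
    """Yield every root-to-node label path in a nested dict of subdocument references."""
    for label, child in doc.items():
        path = prefix + (label,)
        yield path
        if isinstance(child, dict):
            yield from label_paths(child, path)


def typical_paths(docs, min_support):
    """Return the label paths that occur in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        # A set ensures each path is counted once per document.
        counts.update(set(label_paths(doc)))
    return {path: n for path, n in counts.items() if n >= min_support}


# Toy collection (illustrative data only): similarly, but not identically, structured.
docs = [
    {"title": None, "authors": {"name": None, "affiliation": None}, "abstract": None},
    {"title": None, "authors": {"name": None}, "references": None},
    {"title": None, "abstract": None, "authors": {"name": None, "email": None}},
]

result = typical_paths(docs, min_support=2)
for path in sorted(result):
    print("/".join(path), result[path])
```

With a minimum frequency of 2, the paths title, authors, authors/name, and abstract are reported, while affiliation, references, and email, each present in only one document, are dropped. Raising the threshold to 3 would also drop abstract.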


SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.