SIGIR'98 demonstrations: PWA: An Extended Probabilistic Web Algebra

PWA: An Extended Probabilistic Web Algebra


Dan Smith, School of Information Systems, University of East Anglia, Norwich, NR4 7TJ, UK

Rattasit Sukhahuta, School of Information Systems, University of East Anglia, Norwich, NR4 7TJ, UK


Abstract

PWA is an extended relational algebra that provides functionality to fetch and query information directly from intranet/internet data sources. In PWA, information extraction is combined with an extended relational algebra to deal with imprecise queries over semistructured information sources.

The information we access is either not organised as structured records or is likely to be structured in an inconvenient or unhelpful form. Our information extraction approach is based on a document abstract model in which the document is hierarchically partitioned into regions that contain concepts of interest. Each region corresponds to a limited structural and semantic domain, which simplifies concept identification and lets us adopt an effective structured approach without resorting to complex natural language understanding techniques during concept recognition. In PWA, concepts are viewed as complex values in a relational framework. Each concept has an associated set of values and weights, forming a probabilistic relation, which allows a PWA user to optionally specify minimum and maximum probability bounds for any concept.
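To make the idea of a probabilistic relation concrete, the following minimal Python sketch models a concept as a set of candidate values with weights and filters tuples by a probability range. It is an illustration only: the names `Concept`, `select`, `min_p` and `max_p`, and the sample data, are assumptions introduced here and are not PWA's actual operators or syntax, which this abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    """A concept as a complex value: candidate values mapped to weights in [0, 1]."""
    name: str
    weighted_values: Dict[str, float] = field(default_factory=dict)


# A tuple of the (hypothetical) probabilistic relation maps concept names to Concepts.
Row = Dict[str, Concept]


def select(rows: List[Row], concept: str, value: str,
           min_p: float = 0.0, max_p: float = 1.0) -> List[Row]:
    """Keep rows in which `value` is a candidate for `concept`
    with a weight inside the inclusive range [min_p, max_p]."""
    result = []
    for row in rows:
        c = row.get(concept)
        if c is None:
            continue  # this row has no such concept
        weight = c.weighted_values.get(value)
        if weight is not None and min_p <= weight <= max_p:
            result.append(row)
    return result


# Invented sample data: one extracted "city" concept per source document.
rows = [
    {"city": Concept("city", {"Norwich": 0.9, "Norfolk": 0.1})},
    {"city": Concept("city", {"Norwich": 0.4})},
    {"city": Concept("city", {"London": 0.8})},
]

confident = select(rows, "city", "Norwich", min_p=0.5)  # only the first row
any_match = select(rows, "city", "Norwich")             # first and second rows
```

Both bounds default to the full range in this sketch, mirroring the optional nature of the probability range described above: a query that omits them matches any tuple in which the value appears at all.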

The implementation of PWA demonstrates how information extraction techniques, combined with an extended relational algebra, allow users to select and manipulate relevant subsets of information from a wide variety of semistructured data sources.