Similarity to a Single Set

Lee Naish

Identifying patterns and associations in data is fundamental to discovery in science. This work investigates a very simple but important instance of the problem, where each data point consists of a vector of binary attributes. For example, each data point may correspond to a person and the attributes may be their sex, whether they smoke cigarettes, whether they have been diagnosed with lung cancer, etc. Our primary application is spectral-based fault localisation (SBFL), in which each point represents a test case for a computer program and the attributes are whether the program failed the test and whether certain parts of the program were used during the test. Measuring similarity of attributes in the data is equivalent to measuring similarity of sets. Furthermore, there is one identified "base" set and only similarity to that set is considered; the other sets are simply ranked according to how similar they are to the base set. For example, if the base set represents lung cancer sufferers, the set of smokers may well be high in the ranking. In SBFL the base set represents the failed tests and the ranking is used to help find bugs. Identifying set similarity or correlation has many uses and is often the first step in determining a relationship or causality. Set similarity is also the basis for comparing binary classifiers, such as diagnostic tests, for any data set. More than a hundred set similarity measures have been proposed in the literature but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties that similarity measures can have and identifies important forms of symmetry which have not previously been considered.
It gives alternative versions of various previously defined properties so they are no longer incompatible, defines ordering relations over similarity measures and shows how some properties of a domain can be used to help choose a similarity measure which will perform well for that domain.
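
As an illustration of ranking sets by their similarity to a single base set, the following is a minimal sketch only. It assumes the Jaccard measure, which is just one of the many measures in the literature and is not singled out by this work; the names jaccard, rank_by_similarity, failed_tests and executed_by are hypothetical, and the data is invented for the example.

```python
def jaccard(base, other):
    """Jaccard similarity |A & B| / |A | B| (assumed measure, for illustration)."""
    union = base | other
    if not union:
        # Both sets empty: treat similarity as 0.0 by convention (an assumption).
        return 0.0
    return len(base & other) / len(union)


def rank_by_similarity(base, candidates, measure=jaccard):
    """Return candidate names ordered from most to least similar to the base set."""
    return sorted(candidates,
                  key=lambda name: measure(base, candidates[name]),
                  reverse=True)


# SBFL-style toy data: tests are numbered, the base set holds the failed tests,
# and each candidate set holds the tests that executed a given statement.
failed_tests = {1, 2, 3}
executed_by = {
    "s1": {1, 2, 3, 4},
    "s2": {1, 4, 5, 6},
    "s3": {2, 3},
}

ranking = rank_by_similarity(failed_tests, executed_by)
print(ranking)
```

Only the resulting order matters here, not the similarity values themselves, so any measure that induces the same order over the candidate sets would give the same ranking.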

Keywords: binary similarity measure, set similarity, STASS, data mining, clustering, classification, diagnostic test