Observational calculi and association rules (Q2392092)

From MaRDI portal





scientific article; zbMATH DE number 6193539
Language Label Description Also known as
default for all languages
No label defined
    English
    Observational calculi and association rules
    scientific article; zbMATH DE number 6193539

      Statements

      Observational calculi and association rules (English)
      0 references
      0 references
      31 July 2013
      0 references
      The object of study of the book is defined on the basis of tables of the form \[ \begin{matrix} & f_1 & f_2 & \dots & f_K \\ \\ o_1\quad & f_1(o_1) & f_2(o_1) & \dots & f_K(o_1)\\ o_2\quad & f_1(o_2) & f_2(o_2) & \dots & f_K(o_2)\\ \vdots\quad & \vdots & \vdots & \ddots & \vdots\\ o_n\quad & f_1(o_n) & f_2(o_n) & \dots & f_K(o_n) \end{matrix} \] where \(n\geq 1\), \(K\geq 2\) and for all nonzero \(i\leq K\) and \(j\leq n\), \(f_i(o_j)\in\{1,\dots,t_i\}\) for some \(t_i\geq 2\). The intended interpretation is that \(o_1\dots o_n\) are objects with \(K\) attributes and for all nonzero \(j\leq n\), \(f_1(o_j)\dots f_K(o_j)\) are (encodings of) the values of those \(K\) attributes for object \(j\). As a particular example, \(o_1\dots o_n\) could be market baskets and for all nonzero \(i\leq K\) and \(j\leq n\), \(f_i(o_j)\) could be equal to 1 if the \(i\)th item out of the \(K\) items sold by the supermarket is in basket \(o_j\), and to 2 otherwise. The description of these tables is carried out in a logical language built over \textit{basic Boolean attributes}, that is, formulas of the form \(A_i(v_1,\dots,v_p)\) with \(i\in\{1,\dots, K\}\) and \(1\leq v_1\leq\dots\leq v_p\leq t_i\); \(A_i(v_1,\dots,v_p)\) is satisfied by \(o_j\), \(j\in\{1,\dots,n\}\), iff \(f_i(o_j)\) is one of \(v_1,\dots,v_p\). So a basic Boolean attribute lists some of the values that can be taken by the associated attribute, and an object satisfies this formula iff the value of that attribute for this object is one of the the listed values. Basic Boolean attributes can be combined using Boolean operators to yield \textit{Boolean attributes}. When information is missing, the language is extended to accept \(X\) as a possible value for ``unknown'' and the usual 3-valued logic is used. Given two Boolean attributes \(\varphi\) and \(\psi\), count the number \(a\), \(b\), \(c\) and \(d\) of objects that satisfy \(\varphi\wedge\psi\), \(\varphi\wedge\neg\psi\), \(\neg\varphi\wedge\psi\) and \(\neg\varphi\wedge\neg\psi\), respectively; in picture: \[ \begin{matrix} & \psi & \neg\psi\\ \\ \varphi\quad & a & b \\ \neg\varphi\quad & c & d \end{matrix} \] Now the purpose of the book can be defined as restricting to the case where \(\varphi\) and \(\psi\) have no attribute in common, that is, no \(A_i\), \(i\in\{1,\dots, K\}\), occurs both in \(\varphi\) and \(\psi\), and studying particular relations between \(a\), \(b\), \(c\) and \(d\), called \textit{4ft-quantifiers}; when such a 4ft-quantifier holds then an \textit{association rule} between \(\varphi\) and \(\psi\) has been found, expressed with the notation \(\varphi\approx\psi\). Here are a couple of examples of 4ft-quantifiers: {\parindent=6mm \begin{itemize}\item[{\(\bullet\)}] The 4ft-quantifier of \textit{founded implication}, function of two parameters \(p\in(0,1]\) and \(\mathrm{Base}>0\), holds iff \[ \frac{a}{a+b}\geq p\wedge a\geq\mathrm{Base}. \] This 4ft-quantifier is the one used in the early 1990s in relation to market basket analysis: a proportion of at least \(p\) objects has to satisfy \(\psi\) amongst the objects that satisfy \(\varphi\) (this is the \textit{confidence} in the association rule), with at least \(\mathrm{Base}\) objects satisfying both \(\varphi\) and \(\psi\) (which divided by \(n\), yields the \textit{support} for the association rule). \item[{\(\bullet\)}] The 4ft-quantifier of \textit{lower critical implication}, function of three parameters \(p\in(0,1]\), \(\alpha\in(0,0.5)\), and \(\mathrm{Base}>0\), holds iff \[ \sum_{i=a}^{a+b}{a+b\choose i}p^i(1-p)^{a+b-i}\leq\alpha\wedge a\geq\mathrm{Base}. \] This 4ft-quantifier is derived from the statistical binomial test, on significance level \(\alpha\), of the null hypothesis \(P(\psi|\varphi)\leq p\) against the alternative \(P(\psi|\varphi)>p\). \end{itemize}} A \textit{prime association rule} is an association rule that cannot be derived from a simpler one (with appropriate notions of ``derivation'' and ``simpler''). For instance, with respect to the 4ft-quantifier of founded implication, \(A_1(1)\approx A_2(1)\vee A_3(1)\) is not prime as it is easily seen to logically follows from \(A_1(1)\approx A_2(1)\), which is naturally simpler. The ASSOC procedure, studied in~[\textit{P. Hajek} and \textit{T. Havranek}, Mechanizing hypothesis formation. Mathematical foundations for a general theory. Universitext. Berlin-Heidelberg-New York: Springer-Verlag (1978; Zbl 0371.02002)], aims at mining prime association rules. The book essentially takes material from~[loc. cit.], going more systematically over it, and expanding on it. After an overview of these concepts and the structure of the book, followed by a formal definition of those notions, the study really starts in Chapter 4, where 29 notions of 4ft-quantifiers are defined, most of which are inspired by concepts from probability and statistics. In Chapters 5, combinatorics arguments are used to prove properties of some 4ft-quantifiers, in particular 4ft-quantifiers related to the \(\chi^2\) test. Chapters 6 to 11 form the core of the book, where 4ft-quantifiers are classified into a number of classes. For instance, a 4ft-quantifier is said to be \textit{implicational} if the relationship \(R\) that defines it is such that for all \(a\), \(b\), \(c\), \(d\), \(a'\), \(b'\), \(c'\), \(d'\), if \(R(a,b,c,d)\) holds and \(a'\geq a\) and \(b'\leq b\), then \(R(a',b',c',d')\) also holds. Many results show that some class is included in another class, or that some 4ft-quantifiers belong to this or that class but not to this or that class. Chapter 12 tackles in more detail the issue of missing information. Chapter 13 investigates whether a 4ft-quantifier can be first-order definable. Chapter 14 describes the LISp-Miner system that implements the ASSOC procedure. The book ends with two short chapters dedicated to describing research projects and open problems. The main interest of the approach described in the book is that it combines probability, statistics and logic to discover interesting relationships between data. The weakness of the framework is that attributes can only take a finite number of values, but the book demonstrates that it provides the basis for a fruitful theory open to interesting challenges. The reader will appreciate the comprehensive treatment of the subject. The large number of 4ft-quantifiers introduced in the book, which sometimes only differ slightly from each other, makes the reader feel that the list could grow even further as there is no discussion or justification on whether or why these 4ft-quantifiers or some of them are fundamental. Having a large number of 4ft-quantifiers is both a strength and a weakness, as it provides plenty of work to the researcher, but this work can easily become mechanical, as witnessed by the fact that many concepts, results and proofs have been obtained by copying, pasting and adapting from a standard representative; if the text for that standard representative has a typo, then that typo will be reoccur in all variations, which is slightly irritating\dots\ The formalism could be leaner and the typography cleaner, but the author has still done a decent job, with the exception of Chapter 13, where too many intricate definitions could have been simplified and would have greatly benefited from being illustrated by at least one example.
      0 references
      0 references
      observational calculi
      0 references
      association rules
      0 references
      4ft-quantifiers
      0 references
      data mining
      0 references
      ASSOC procedure
      0 references

      Identifiers