A tutorial on statistically sound pattern discovery
From MaRDI portal
Abstract: Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries -- patterns that are found in the sample data but do not hold in the wider population from which the sample was drawn. Statistical tests can also be applied to filter out patterns that are unlikely to be useful, removing uninformative variations of the key patterns in the data. This tutorial introduces the key statistical and data mining theory and techniques that underpin this fast-developing field. We concentrate on two general classes of patterns: dependency rules that express statistical dependencies between condition and consequent parts, and dependency sets that express mutual dependence between set elements. We clarify alternative interpretations of statistical dependence and introduce appropriate tests for evaluating the statistical significance of patterns in different situations. We also introduce special techniques for controlling the likelihood of spurious discoveries when multitudes of patterns are evaluated. The paper is aimed at a wide variety of audiences. It provides the necessary statistical background and a summary of the state of the art for any data mining researcher or practitioner wishing to enter or understand statistically sound pattern discovery research or practice. It can serve as a general introduction to the field of statistically sound pattern discovery for any reader with a general background in data sciences.
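As a rough illustration of the two ideas the abstract names, testing a single dependency rule and correcting for the number of rules evaluated, the following sketch may help. It is not taken from the paper: the function names and parameters are hypothetical, the test shown is the one-sided Fisher's exact test on a 2 × 2 table, and the Bonferroni correction is used only because it is the simplest multiple-testing adjustment.

```python
from math import comb


def fisher_exact_one_sided(n_xa, n_x, n_a, n):
    """One-sided Fisher's exact p-value for a rule X -> A (hypothetical helper).

    n_xa: rows containing both X and A
    n_x:  rows containing X
    n_a:  rows containing A
    n:    total number of rows

    Returns the probability of observing at least n_xa co-occurrences
    when X and A are independent and both margins are held fixed
    (the upper tail of a hypergeometric distribution).
    """
    upper = min(n_x, n_a)
    total = comb(n, n_a)
    tail = sum(
        comb(n_x, k) * comb(n - n_x, n_a - k)
        for k in range(n_xa, upper + 1)
    )
    return tail / total


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that stay significant after a Bonferroni correction.

    Each p-value is compared against alpha divided by the number of
    tested patterns, which bounds the familywise error rate by alpha.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Illustrative counts only: 8 rows, X in 4 of them, A in 4 of them.
    p_all = fisher_exact_one_sided(4, 4, 4, 8)
    p_three = fisher_exact_one_sided(3, 4, 4, 8)
    print(p_all, p_three)
    print(bonferroni_significant([p_all, p_three]))
```

In pattern discovery the number of candidate patterns is usually very large, so the divisor in the Bonferroni threshold should reflect every pattern considered during the search, not only the ones that are reported.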
Recommendations
- Discovering significant patterns
- A statistical significance testing approach to mining the most informative set of patterns
- Pattern Discovery and Detection: A Unified Statistical Methodology
- Significance tests for unsupervised pattern discovery in large continuous multivariate data sets
- Layered critical values: a powerful direct-adjustment approach to discovering significant patterns
Cites work
- scientific article; zbMATH DE number 1696848
- scientific article; zbMATH DE number 1817585
- scientific article; zbMATH DE number 3624650
- scientific article; zbMATH DE number 720689
- scientific article; zbMATH DE number 1096628
- scientific article; zbMATH DE number 821286
- scientific article; zbMATH DE number 838305
- scientific article; zbMATH DE number 1439198
- scientific article; zbMATH DE number 3206656
- scientific article; zbMATH DE number 3249395
- scientific article; zbMATH DE number 3046042
- A Comparison of Some Continuity Corrections for the Chi-Squared Test on 2 × 2 Tables
- A sharper Bonferroni procedure for multiple tests of significance
- A survey of exact inference for contingency tables. With comments and a rejoinder by the author
- An Application of Markov Chain Monte Carlo to Community Ecology
- Asymptotic optimality of the Westfall-Young permutation procedure for multiple testing under dependence
- Bayesian Testing and Estimation of Association in a Two-Way Contingency Table
- Bayesian inference for categorical data analysis
- Bayesian statistics. An introduction.
- Detecting group differences: Mining contrast sets
- Discovering significant patterns
- Efficient search methods for statistical dependency rules
- FDR- and FWE-controlling methods using data-driven weights
- Frequent Pattern Mining
- Frequentist Performance of Bayesian Confidence Intervals for Comparing Proportions in 2 × 2 Contingency Tables
- Genome-wide significance levels and weighted hypothesis testing
- HOW GOOD IS A NORMAL APPROXIMATION FOR RATES AND PROPORTIONS OF LOW INCIDENCE EVENTS?
- Interesting patterns
- Knowledge Discovery in Inductive Databases
- Layered critical values: a powerful direct-adjustment approach to discovering significant patterns
- Multiple Hypotheses Testing with Weights
- Multiple testing for exploratory research
- New upper bounds for tight and fast approximation of Fisher's exact test in dependency rule mining
- On permutation procedures for strong control in multiple testing with gene expression data
- Rectangular Confidence Regions for the Means of Multivariate Normal Distributions
- Redundancy, deduction schemes, and minimum-size bases for association rules
- Resampling-based multiple testing for microarray data analysis (With comments)
- Statistical significance of combinatorial regulations
- Statistics
- Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining
- THE MEANING OF A SIGNIFICANCE LEVEL
- Test of Significance for 2 × 2 Contingency Tables
- Testing Statistical Hypotheses
- The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?
- The \(2\times2\) table: A discussion from a Bayesian viewpoint.
- The control of the false discovery rate in multiple testing under dependency.
- Uses, Abuses and Misuses of Significance Tests in the Scientific Community: Won't the Bayesian Choice be Unavoidable?
Cited in (12)
- Significance tests for unsupervised pattern discovery in large continuous multivariate data sets
- A statistical significance testing approach to mining the most informative set of patterns
- Pattern Discovery and Detection: A Unified Statistical Methodology
- The minimum description length principle for pattern mining: a survey
- ROhAN: row-order agnostic null models for statistically-sound knowledge discovery
- Efficient search methods for statistical dependency rules
- Statistical inference and data mining: false discoveries control
- Discovering significant patterns
- SPEck: mining statistically-significant sequential patterns efficiently with exact sampling
- scientific article; zbMATH DE number 1943764
- Layered critical values: a powerful direct-adjustment approach to discovering significant patterns
- Robust subgroup discovery. Discovering subgroup lists using MDL
This page was built for publication: A tutorial on statistically sound pattern discovery
MaRDI item Q2218330