Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries

From MaRDI portal
Publication:6416617

arXiv2211.04612MaRDI QIDQ6416617FDOQ6416617


Authors: Matteo Sesia, Stefano Favaro, Edgar Dobriban Edit this on Wikidata


Publication date: 8 November 2022

Abstract: A flexible method is developed to construct a confidence interval for the frequency of a queried object in a very large data set, based on a much smaller sketch of the data. The approach requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals for random queries using a conformal inference approach. After achieving marginal coverage for random queries under the assumption of data exchangeability, the proposed method is extended to provide stronger inferences accounting for possibly heterogeneous frequencies of different random queries, redundant queries, and distribution shifts. While the presented methods are broadly applicable, this paper focuses on use cases involving the count-min sketch algorithm and a non-linear variation thereof, to facilitate comparison to prior work. In particular, the developed methods are compared empirically to frequentist and Bayesian alternatives, through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.




Has companion code repository: https://github.com/msesia/conformalized-sketching









This page was built for publication: Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries

Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q6416617)