Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries
From MaRDI portal
Publication:6416617
arXiv2211.04612MaRDI QIDQ6416617FDOQ6416617
Authors: Matteo Sesia, Stefano Favaro, Edgar Dobriban
Publication date: 8 November 2022
Abstract: A flexible method is developed to construct a confidence interval for the frequency of a queried object in a very large data set, based on a much smaller sketch of the data. The approach requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals for random queries using a conformal inference approach. After achieving marginal coverage for random queries under the assumption of data exchangeability, the proposed method is extended to provide stronger inferences accounting for possibly heterogeneous frequencies of different random queries, redundant queries, and distribution shifts. While the presented methods are broadly applicable, this paper focuses on use cases involving the count-min sketch algorithm and a non-linear variation thereof, to facilitate comparison to prior work. In particular, the developed methods are compared empirically to frequentist and Bayesian alternatives, through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
Has companion code repository: https://github.com/msesia/conformalized-sketching
This page was built for publication: Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q6416617)