Efficient binary embedding of categorical data using BinSketch

Computational aspects of data analysis and big data (68T09); Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science) (68P30)

Abstract: In this work, we present a dimensionality reduction algorithm, a.k.a. sketching, for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors from their sketches alone. The minimum sketch dimension that Cham requires for a good estimate depends, in theory, only on the sparsity of the data points, which makes it useful in many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world datasets, including one with over a million dimensions. We show that the Cabin and Cham duo is a fast and accurate approach for tasks such as RMSE, all-pairs similarity, and clustering, compared both with working on the full dataset and with other dimensionality reduction techniques.
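The abstract does not spell out how Cabin and Cham work, so the following is only a minimal illustration of the general idea it describes: compress a sparse categorical vector into a short binary sketch, then estimate a Hamming-type distance from two sketches alone. It is not the paper's algorithm. The hashing scheme, the convention that 0 means "attribute absent", and the names `bucket`, `sketch`, `estimate_count`, and `estimate_distance` are all assumptions made for this sketch. The estimated quantity is the size of the symmetric difference between the two sets of (attribute, category) pairs, which counts a position twice when both vectors hold different nonzero categories.

```python
import hashlib
import math


def bucket(index, value, n_bits, seed=0):
    """Hash an (attribute index, category) pair to a sketch position."""
    key = f"{seed}:{index}:{value}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_bits


def sketch(vector, n_bits, seed=0):
    """OR-aggregate the nonzero attributes of a categorical vector into n_bits bits."""
    bits = [0] * n_bits
    for index, value in enumerate(vector):
        if value != 0:  # assumption: 0 encodes "attribute absent"
            bits[bucket(index, value, n_bits, seed)] = 1
    return bits


def estimate_count(weight, n_bits):
    """Estimate how many items were hashed, given how many bits are set.

    Inverts E[weight] = n_bits * (1 - (1 - 1/n_bits) ** count).
    Requires n_bits >= 2 and an unsaturated sketch.
    """
    if weight >= n_bits:
        raise ValueError("sketch is saturated; use more bits")
    return math.log(1 - weight / n_bits) / math.log(1 - 1 / n_bits)


def estimate_distance(sketch_a, sketch_b):
    """Estimate |A symmetric-difference B| from two sketches of equal length."""
    n_bits = len(sketch_a)
    union_weight = sum(a | b for a, b in zip(sketch_a, sketch_b))
    n_a = estimate_count(sum(sketch_a), n_bits)
    n_b = estimate_count(sum(sketch_b), n_bits)
    n_union = estimate_count(union_weight, n_bits)
    # |A xor B| = 2 * |A or B| - |A| - |B|
    return 2 * n_union - n_a - n_b


# Usage: two sparse 1000-dimensional vectors that differ in one attribute.
u = [0] * 1000
u[3], u[500], u[999] = 2, 1, 4
v = list(u)
v[3] = 5
print(estimate_distance(sketch(u, 256), sketch(v, 256)))
```

In this toy version the sketch length needed for a usable estimate grows with the number of nonzero attributes rather than with the full dimension, which mirrors the sparsity dependence the abstract claims for Cham; the paper's actual bounds and estimator should be taken from the publication itself.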

Cited in (1)

On binary embedding using circulant matrices

MaRDI item Q2134033