Efficient binary embedding of categorical data using BinSketch
From MaRDI portal
Abstract: In this work, we present a dimensionality reduction algorithm, aka. sketching, for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches. The minimum dimension of the sketches required by Cham to ensure a good estimation theoretically depends only on the sparsity of the data points - making it useful for many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world data sets, including one with over a million dimensions. We show that the Cabin and Cham duo is a significantly fast and accurate approach for tasks such as RMSE, all-pairs similarity, and clustering when compared to working with the full dataset and other dimensionality reduction techniques.
Recommendations
- On binary embedding using circulant matrices
- Real-valued embeddings and sketches for fast distance and similarity estimation
- Fast binary embeddings with Gaussian circulant matrices: improved bounds
- Dimensionality reduction for \(k\)-means clustering and low rank approximation
- Selecting Sketches for Similarity Search
Cites work
- scientific article; zbMATH DE number 6381735 (Why is no real title available?)
- scientific article; zbMATH DE number 1775418 (Why is no real title available?)
- scientific article; zbMATH DE number 1775450 (Why is no real title available?)
- 10.1162/jmlr.2003.3.4-5.993
- A NEW MEASURE OF RANK CORRELATION
- A reliable effective terascale linear learning system
- Correspondence analysis and related methods in practice
- Database-friendly random projections: Johnson-Lindenstrauss with binary coins.
- Extensions of Lipschitz mappings into a Hilbert space
- Similarity estimation techniques from rounding algorithms
- The challenges of clustering high dimensional data
This page was built for publication: Efficient binary embedding of categorical data using BinSketch
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q2134033)