Rare feature selection in high dimensions
From MaRDI portal
Abstract: It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.
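The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the estimator implemented in the rare package: the word list, the counts, and the two-node grouping below are hypothetical, and the grouping is fixed by hand, whereas the abstract describes a method that aggregates in a flexible manner. The sketch only shows that summing sparse count columns beneath a tree node yields a denser feature.

```python
# Illustrative sketch only: hypothetical data and helper names,
# not the API or the estimator of the `rare` R package.

def aggregate_counts(rows, leaf_names, groups):
    """Sum the count columns of the leaves beneath each chosen tree node.

    rows: list of per-document count lists, one entry per leaf in leaf_names.
    groups: dict mapping a tree-node label to the leaf names beneath it.
    Returns the node labels and the aggregated count rows.
    """
    index = {name: j for j, name in enumerate(leaf_names)}
    node_names = list(groups)
    aggregated = [
        [sum(row[index[leaf]] for leaf in groups[node]) for node in node_names]
        for row in rows
    ]
    return node_names, aggregated


def nonzero_fraction(rows, j):
    """Fraction of rows in which column j is nonzero."""
    return sum(1 for row in rows if row[j] != 0) / len(rows)


# Hypothetical rare-word counts for six reviews.
leaf_names = ["superb", "stellar", "dreadful", "abysmal"]
rows = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

# Hypothetical tree cut: two internal nodes, each covering two similar words.
groups = {
    "positive": ["superb", "stellar"],
    "negative": ["dreadful", "abysmal"],
}

node_names, aggregated = aggregate_counts(rows, leaf_names, groups)

# Compare how often the rare word "stellar" and the aggregated
# "positive" feature are nonzero across the reviews.
print(nonzero_fraction(rows, leaf_names.index("stellar")))
print(nonzero_fraction(aggregated, node_names.index("positive")))
```

In this toy data each aggregated column is nonzero in more reviews than any single word beneath it, which is the sense in which aggregation produces denser features.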
Cites work
- scientific article; zbMATH DE number 1175490
- scientific article; zbMATH DE number 845714
- DOI: 10.1162/153244303322753670
- A Dirichlet-Tree Multinomial Regression Model for Associating Dietary Nutrients with Gut Microorganisms
- A logistic normal multinomial regression model for microbiome compositional data analysis
- Distributed optimization and statistical learning via the alternating direction method of multipliers
- Homogeneity pursuit
- Hypothesis testing for high-dimensional sparse binary regression
- Kernel-penalized regression for analysis of microbiome data
- Regression analysis for microbiome compositional data
- Sparse regression with exact clustering
- Structured subcomposition selection in regression and its application to microbiome data analysis
- The solution path of the generalized lasso
- Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping
- Variable selection and regression analysis for graph-structured covariates with an application to genomics
- Variable selection in regression with compositional covariates
Cited in (19)
- Multiresolution categorical regression for interpretable cell-type annotation
- Ensembling classification models based on phalanxes of variables with applications in drug discovery
- Multivariate Monotone Inclusions in Saddle Form
- Identifying Brain Hierarchical Structures Associated with Alzheimer's Disease Using a Regularized Regression Method with Tree Predictors
- Sentiment analysis with covariate-assisted word embeddings
- Hierarchical Regularizers for Mixed-Frequency Vector Autoregressions
- It's All Relative: Regression Analysis with Compositional Predictors
- Feature selection for high-dimensional data
- LOL selection in high dimension
- An effective framework for characterizing rare categories
- rare
- Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills
- Sparse principal component regression via singular value decomposition approach
- The geometry of monotone operator splitting methods
- Projective splitting with forward steps
- Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform
- Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data
- Word embeddings as statistical estimators
- Hedonic pricing modelling with unstructured predictors: an application to Italian fashion industry
This page was built for publication: Rare feature selection in high dimensions
MaRDI item Q149281