A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization (Q2193277)
From MaRDI portal
Latest revision as of 08:15, 23 July 2024
scientific article
Language | Label | Description | Also known as |
---|---|---|---|
English | A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization | scientific article | |
Statements
A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization (English)
25 August 2020
Text corpora in natural-language research contain billions of words and keep growing, which creates the problem of extracting smaller subsets whose semantics is minimally changed. Let \(T=\{t_1,\dots,t_n\}\) be a set of tokens (e.g. words) in an annotated text corpus with real-valued unary and binary attributes and semantic relatedness relations \(S^1\in\mathbb{R}^n\), \(S^2\in\mathbb{R}^{n\times n}\), \(S^3\in\mathbb{R}^{n\times n\times n}\), and let \(X=(x_1,\dots,x_n)\in\{0,1\}^n\) be Boolean variables that encode a subset of \(T\) (\(x_i=1\) if \(t_i\) is selected). The problem of semantic relatedness preservation in corpus subset extraction is to find an `optimal' (minimal) subset of \(T\), encoded by \(X\), that maximizes \(\sum\limits_{i=1}^n s_i^1 x_i+\sum\limits_{i,j=1}^n s_{ij}^2 x_i x_j+\sum\limits_{i,j,k=1}^n s_{ijk}^3 x_i x_j x_k\) subject to constraints on the attributes (here, one unary and one binary attribute constraint are considered). This NP-hard problem is transformed into the problem of finding the maximum flow in an equivalent graph and is solved using the discrete Lagrangian iteration method.
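To make the cubic pseudo-Boolean objective concrete, the following minimal Python sketch evaluates it on a toy instance with three tokens and finds the best feasible subset by exhaustive search. It is only an illustration of the objective: it does not implement the max-flow transformation or the discrete Lagrangian iteration used in the article. The review does not state the form of the attribute constraints, so the budget constraint \(\sum_i w_i x_i \le b\) used here, the names `objective`, `best_subset`, `weights`, `budget`, and all numerical values are assumptions made for the example.

```python
from itertools import product


def objective(x, s1, s2, s3):
    """Value of sum s1_i x_i + sum s2_ij x_i x_j + sum s3_ijk x_i x_j x_k for a 0/1 vector x."""
    # Products of Boolean variables vanish unless every index is selected,
    # so it is enough to sum over the selected indices.
    chosen = [i for i, xi in enumerate(x) if xi]
    total = sum(s1[i] for i in chosen)
    total += sum(s2[i][j] for i in chosen for j in chosen)
    total += sum(s3[i][j][k] for i in chosen for j in chosen for k in chosen)
    return total


def best_subset(s1, s2, s3, weights, budget):
    """Exhaustive search over all 2^n vectors; only usable for very small n.

    `weights` and `budget` model a hypothetical unary attribute constraint
    sum_i weights[i] * x[i] <= budget (an assumption, not taken from the article).
    """
    n = len(s1)
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=n):
        if sum(w * xi for w, xi in zip(weights, x)) > budget:
            continue  # infeasible under the assumed constraint
        val = objective(x, s1, s2, s3)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Toy relatedness data for n = 3 tokens (made-up values).
n = 3
S1 = [1.0, 2.0, 0.5]
S2 = [[0.0] * n for _ in range(n)]
S2[0][1] = 3.0    # tokens 0 and 1 are strongly related
S2[1][2] = -4.0   # selecting tokens 1 and 2 together is penalized
S3 = [[[0.0] * n for _ in range(n)] for _ in range(n)]
S3[0][1][2] = 1.0

x_star, value = best_subset(S1, S2, S3, weights=[1, 1, 1], budget=2)
print(x_star, value)
```

With at most two tokens allowed, the search selects tokens 0 and 1. Exhaustive search grows as \(2^n\), which is why a problem of this kind needs a specialized method at corpus scale.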
semantic relatedness
subset extraction
language intelligence
pseudo-Boolean optimization
discrete Lagrangian method