A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization (Q2193277)

From MaRDI portal
scientific article

    Statements

    A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization (English)
    25 August 2020
    Text corpora used in natural-language research contain billions of words and keep growing, which creates the problem of extracting smaller subsets whose semantics change as little as possible. Let \(T=\{t_1,\dots,t_n\}\) be the set of tokens (e.g. words) in an annotated text corpus with real-valued unary and binary attributes and semantic relatedness relations \(S^1\in\mathbb{R}^n\), \(S^2\in\mathbb{R}^{n\times n}\), \(S^3\in\mathbb{R}^{n\times n\times n}\), and let \(x=(x_1,\dots,x_n)\in\{0,1\}^n\) be a vector of Boolean variables, where \(x_i=1\) means that token \(t_i\) belongs to the selected subset of \(T\). The problem of semantic relatedness preservation in corpus subset extraction is to find an `optimal' (minimal) subset of \(T\), encoded by \(x\), that maximizes \(\sum\limits_{i=1}^ns_i^1{x_i}+\sum\limits_{i,j=1}^ns_{ij}^2x_ix_j+\sum\limits_{i,j,k = 1}^ns_{ijk}^3x_ix_jx_k\) under constraints on the attributes (here, one unary and one binary attribute constraint are considered). This NP-hard problem is transformed into the problem of finding a maximum flow in an equivalent graph and is solved using the discrete Lagrangian iteration method.
    semantic relatedness
    subset extraction
    language intelligence
    pseudo-Boolean optimization
    discrete Lagrangian method
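
To make the cubic pseudo-Boolean objective in the abstract concrete, the following minimal sketch evaluates it for a 0/1 selection vector and, for very small n only, finds a maximizer by exhaustive search. It is not the method of the article, which uses a maximum-flow transformation and discrete Lagrangian iteration. The function names `objective` and `brute_force`, the attribute vector `a`, and the constraint form `sum(a_i * x_i) <= budget` are assumptions made for illustration; the abstract does not state the exact form of the attribute constraints.

```python
from itertools import product


def objective(x, s1, s2, s3):
    """Evaluate sum_i s1[i] x_i + sum_ij s2[i][j] x_i x_j + sum_ijk s3[i][j][k] x_i x_j x_k."""
    n = len(x)
    unary = sum(s1[i] * x[i] for i in range(n))
    binary = sum(s2[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    ternary = sum(
        s3[i][j][k] * x[i] * x[j] * x[k]
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )
    return unary + binary + ternary


def brute_force(s1, s2, s3, a, budget):
    """Exhaustive search over all 2^n selections (tiny n only).

    Assumed, hypothetical constraint: sum(a[i] * x[i]) <= budget.
    Returns the best feasible 0/1 tuple and its objective value.
    """
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=len(s1)):
        if sum(ai * xi for ai, xi in zip(a, x)) > budget:
            continue  # selection violates the assumed unary constraint
        val = objective(x, s1, s2, s3)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```

The exhaustive search costs on the order of 2^n objective evaluations, which is why it serves only as a reference for checking a faster solver on toy instances.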