A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization (Q2193277)

Text corpora in natural-language research contain billions of words and keep growing, which creates the problem of extracting smaller subsets whose semantics is changed as little as possible. Let \(T=\{t_1,\dots,t_n\}\) be a set of tokens (e.g. words) in an annotated text corpus with real-valued unary and binary attributes and semantic relatedness relations \(S^1\in\mathcal{R}^n\), \(S^2\in\mathcal{R}^{n\times n}\), \(S^3\in\mathcal{R}^{n\times n\times n}\), and let \(X=(x_1,\dots,x_n)\in\{0,1\}^n\) be Boolean variables denoting subsets of \(T\). The problem of semantic relatedness preservation in corpus subset extraction is to find an "optimal" (minimal) subset \(X\subset T\) which maximizes \(\sum\limits_{i=1}^ns_i^1{x_i}+\sum\limits_{i,j=1}^ns_{ij}^2x_ix_j+\sum\limits_{i,j,k = 1}^ns_{ijk}^3x_ix_jx_k\) under constraints on the attributes (here, one unary and one binary attribute constraint are considered). This NP-hard problem is transformed into the problem of finding the maximum flow in an equivalent graph and solved using the discrete Lagrangian iteration method.
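
To make the cubic pseudo-Boolean objective concrete, the following minimal Python sketch evaluates it for a 0/1 vector and maximizes it by exhaustive enumeration on a toy instance. It is only an illustration: it does not implement the max-flow transformation or the discrete Lagrangian iteration described in the review. The function names `objective` and `brute_force`, the weight vector `a`, and the bound `budget` (standing in for a single unary attribute constraint of the assumed form \(\sum_i a_i x_i \le \text{budget}\)) are hypothetical and not taken from the paper.

```python
from itertools import product

import numpy as np


def objective(x, S1, S2, S3):
    """Evaluate sum_i s1_i x_i + sum_ij s2_ij x_i x_j + sum_ijk s3_ijk x_i x_j x_k."""
    x = np.asarray(x, dtype=float)
    linear = S1 @ x
    quadratic = x @ S2 @ x
    cubic = np.einsum("ijk,i,j,k->", S3, x, x, x)
    return linear + quadratic + cubic


def brute_force(S1, S2, S3, a, budget):
    """Enumerate all 2^n subsets and keep the best feasible one.

    Feasibility is the assumed (hypothetical) unary constraint
    sum_i a_i x_i <= budget. Only usable for very small n.
    """
    n = len(S1)
    best_x, best_val = None, -np.inf
    for bits in product((0, 1), repeat=n):
        x = np.array(bits)
        if a @ x > budget:
            continue  # violates the attribute constraint
        val = objective(x, S1, S2, S3)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Toy instance with three tokens (all values made up for illustration).
S1 = np.array([1.0, 2.0, 3.0])
S2 = np.zeros((3, 3))
S2[0, 1] = 4.0
S3 = np.zeros((3, 3, 3))
S3[0, 1, 2] = 5.0
a = np.ones(3)  # each token "costs" 1, so budget caps the subset size

best_x, best_val = brute_force(S1, S2, S3, a, budget=2)
print(best_x, best_val)
```

Exhaustive search takes time exponential in \(n\), which is why a problem of this kind needs a specialized method at corpus scale; the sketch is meant only to show what is being maximized.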

zbMATH Keywords

semantic relatedness

subset extraction

language intelligence

pseudo-Boolean optimization

discrete Lagrangian method
