Distance geometry and data science (Q2192022)

Language: English
Label: Distance geometry and data science
Description: scientific article

    Statements

    Distance geometry and data science (English)
    Publication date: 26 June 2020
    The aim of this extensive survey is to discuss various approaches to the basic distance geometry problem and its applications in data science, thereby visiting many current highlights of optimization research. First the field of Mathematical Programming (MP) is formalized and classified, including reformulations, relaxations and approximations. Then Distance Geometry (DG) is defined as the NP-hard problem of embedding an undirected edge-weighted graph in a Euclidean space of given dimension such that the edge weights equal the distances between the corresponding vertices, with applications in engineering, protein folding and data mining, the last of which is developed further.

    After showing how different kinds of data, such as process descriptions, text, databases and abductive inference, may be represented as weighted graphs, it is argued that the data science tasks of classification and clustering may be applied to graph data after vectorization by DG, using e.g. \(k\)-means or Artificial Neural Networks (ANN), while the vertices of a graph may be clustered directly by spectral or modularity clustering.

    Several MP methods for obtaining robust approximate solutions to DG are detailed for the usual case of perturbed weight data. An unconstrained quartic formulation, minimizing the sum of squared differences of squared distances, and two constrained variants may be tackled by a local nonlinear solver. A relaxation of DG may be cast as a semidefinite programming (SDP) problem, solvable in low dimensions by interior point methods, which for high dimensions may be further relaxed to a diagonally dominant form yielding a linear program. Some fast and exact, but very high dimensional, embeddings may be obtained from incidence vectors, the Fréchet max-norm embedding, or multidimensional scaling. The embedding dimension may then be reduced using principal component analysis, Isomap, Barvinok's probabilistic 'naive method', and finally random projections, which preserve Euclidean norms on average.

    A serious obstacle for most of these methods is the phenomenon of distance instability and concentration of distances in high dimension: it is shown that as the dimension increases, the relative difference between the smallest and largest distance among pairs of points tends to zero in probability. The survey ends with an exercise in natural language clustering by means of ANN, comparing several of the techniques described.
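    Where the review mentions spectral clustering of graph vertices, the following minimal Python sketch (our illustration, not code from the paper) shows the standard recipe: embed the vertices with the bottom eigenvectors of the graph Laplacian, then run \(k\)-means on the rows. All function and parameter names are ours.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, k):
    """A: symmetric (n, n) adjacency matrix; returns one cluster label per vertex."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                   # combinatorial graph Laplacian
    _, V = np.linalg.eigh(L)               # eigenvectors in ascending eigenvalue order
    U = V[:, :k]                           # k smallest eigenvectors as vertex features
    _, labels = kmeans2(U, k, minit="++")  # k-means on the spectral embedding
    return labels

# Two triangles joined by a single edge should split into two clusters.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(spectral_clustering(A, 2))
```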
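    The unconstrained quartic formulation described above can be sketched as follows, minimizing the sum of squared differences between squared realized distances and squared edge weights with a generic local solver (here scipy's L-BFGS-B). This illustrates the formulation under our own choices of names and starting point, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def quartic_dgp(edges, weights, n, K, seed=0):
    """Embed n vertices in R^K so that ||x_u - x_v||^2 ~ d_uv^2 on the edges."""
    rng = np.random.default_rng(seed)

    def objective(x):
        X = x.reshape(n, K)
        err = 0.0
        for (u, v), d in zip(edges, weights):
            diff = X[u] - X[v]
            err += (diff @ diff - d * d) ** 2   # quartic penalty per edge
        return err

    x0 = rng.standard_normal(n * K)             # random starting realization
    res = minimize(objective, x0, method="L-BFGS-B")
    return res.x.reshape(n, K), res.fun

# Example: a unit square in the plane (a 4-cycle plus one diagonal).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
weights = [1, 1, 1, 1, np.sqrt(2)]
X, err = quartic_dgp(edges, weights, n=4, K=2)
print("residual:", err)
```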
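    The SDP relaxation can be illustrated along these lines: search for a positive semidefinite Gram matrix whose induced squared distances match the squared edge weights, then round to dimension K by eigendecomposition. The absolute-residual objective and the solver defaults below are our assumptions, sketched with cvxpy.

```python
import numpy as np
import cvxpy as cp

def sdp_dgp(edges, weights, n, K):
    """Relaxed embedding via a PSD Gram matrix G with G_uu + G_vv - 2 G_uv ~ d_uv^2."""
    G = cp.Variable((n, n), PSD=True)
    residuals = [G[u, u] + G[v, v] - 2 * G[u, v] - d * d
                 for (u, v), d in zip(edges, weights)]
    prob = cp.Problem(cp.Minimize(sum(cp.abs(r) for r in residuals)))
    prob.solve()
    # PCA-style rounding: keep the K leading eigenvectors of the Gram matrix.
    w, V = np.linalg.eigh(G.value)
    idx = np.argsort(w)[::-1][:K]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```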
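    Classical multidimensional scaling, one of the exact but high-dimensional embeddings mentioned, admits a compact sketch: double-center the matrix of squared distances to obtain a Gram matrix, then keep the leading eigenvectors. The function name is ours.

```python
import numpy as np

def classical_mds(D, K):
    """D: (n, n) matrix of pairwise distances; returns (n, K) coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of the centered points
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:K]         # K largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```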
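    The random-projection step rests on the fact that a suitably scaled Gaussian matrix preserves Euclidean norms on average (the Johnson-Lindenstrauss phenomenon). A small empirical check, with illustrative sizes of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 200                    # points, original dim, target dim
X = rng.standard_normal((n, d))
T = rng.standard_normal((d, k)) / np.sqrt(k)  # scaled Gaussian projector
Y = X @ T

# Pairwise distances before and after projection should nearly agree.
for i, j in [(0, 1), (2, 3), (4, 5)]:
    orig = np.linalg.norm(X[i] - X[j])
    proj = np.linalg.norm(Y[i] - Y[j])
    print(f"pair ({i},{j}): ratio = {proj / orig:.3f}")
```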
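    The concentration-of-distances phenomenon is easy to observe empirically: for uniform random points, the relative gap between the largest and smallest distance to a query point shrinks as the dimension grows. A toy experiment with our own parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1_000, 10_000]:
    P = rng.random((500, dim))                # 500 uniform random points
    q = rng.random(dim)                       # a query point
    dists = np.linalg.norm(P - q, axis=1)
    gap = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:>6}: relative gap = {gap:.3f}")
```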
    Keywords: Euclidean distance; isometric embedding; random projection; mathematical programming; machine learning; artificial neural networks