A survey on tree matching and XML retrieval (Q394973)

The paper provides a useful survey of techniques for XML retrieval. The value of the paper is twofold. First, it is time to seriously summarize results achieved in XML retrieval during the more than 15 years of the existence of the XML data model. Second, the authors analyze these results from the point of view of graph theory that provides a theoretical umbrella for both XML query languages and algorithms used for query evaluation. The paper is divided into six sections. Section 2 presents some important concepts and background concerning the XML data model and two categories of XML retrieval: querying semi-structured data and structured text retrieval. The former includes basic XML query languages (XPath, XQuery), the latter uses a method known from information retrieval. Section 2 also introduces some XML query languages belonging to both retrieval categories. A possibility to integrate the functionalities of both these categories is offered by the language XQuery Full-text, which extends the XQuery language to full-text search. Section 3 lists the best known algorithms for exact and approximate tree matching. The latter is of special importance for practical querying of XML data. The authors discuss three main groups of algorithms for tree matching: tree edit distance approaches, tree inclusion and the tree alignment distance. Section 3 summarizes the most important approximate ordered tree matching algorithms and their complexities in a table. The authors emphasize that a lot of these algorithms assume the data to be processed in main memory which is not often possible in real XML databases. Approaches for XML retrieval using the tree matching algorithms described in Section 3 are presented in Section 4, which represents the core of the paper. Its first part covers the exact algorithms for finding all twig patterns in an XML database. The second part describes and illustrates in detail the approximate algorithms including their variants and improvements. All XML approaches are classified into different categories and summarized in tables. Evaluation methods of these approaches are exposed in the rather confusing Section 5. Finally, a discussion about future research directions is presented in Section 6.NEWLINENEWLINEThe approaches described in the paper use the tree representation of documents and queries, but do not use state-of-the art algorithms for tree matching. It seems that a deeper collaboration between the IR community and researchers developing new XML query languages would be beneficial. Although the development of new XML query languages was not so intensive in the last years, the paper offers a challenge for their further development.

0 references