Project:Debug WDQS

From MaRDI portal

Revision as of 09:44, 14 March 2025

The first step to fix bugs related to WDQS is to understand how WDQS and Wikibase interact.

WDQS architecture

  1. Whenever a Wikibase item is created, all its information (including the statements) is saved as a JSON object in the MediaWiki table that stores page information.
  2. Because all the statements are packed into a single JSON object inside a MySQL database, SQL queries cannot be used to select items by particular statements.
  3. To make this kind of query possible (e.g. which items have a value equal to X for property Y), all the statements are copied into another database that is not MySQL based. This other database is a graph database running Blazegraph, which accepts queries in SPARQL.
  4. Having two databases (MySQL and Blazegraph) that contain the same information requires a mechanism to keep them synchronized. This mechanism runs in the docker-wdqs-updater-1 container. It calls the MediaWiki RecentChanges API endpoint to check for recent changes to Item and Property pages (pages in the Item: and Property: namespaces). It then processes these changes and pushes them into Blazegraph in RDF format.
  5. Keep in mind that the MediaWiki MySQL table is the source of truth. When debugging an error related to WDQS, first make sure that the information is shown correctly on the Item/Property page.
  6. The information shown on the profile pages (e.g. the Person: and Publication: namespaces, among others) is often retrieved using SPARQL queries, which means that it is read from Blazegraph, not from MySQL.
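
To make step 4 more concrete, the sketch below builds the kind of RecentChanges API request that lists recent edits to Item and Property pages. It is not the updater's actual code. The API URL is a placeholder, and the namespace IDs 120 and 122 are an assumption (they are the defaults of a standalone Wikibase installation), so check both against the real wiki before using them.

```python
from urllib.parse import urlencode

# Assumptions, not taken from this page: the API URL is a placeholder and the
# namespace IDs are the defaults of a standalone Wikibase (Item: 120, Property: 122).
API_URL = "https://wiki.example.org/w/api.php"
ITEM_NS = 120
PROPERTY_NS = 122


def recent_changes_url(since: str, limit: int = 50) -> str:
    """Build a RecentChanges API URL listing Item/Property edits made after `since`.

    `since` is an ISO 8601 timestamp such as "2025-03-14T00:00:00Z".
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": f"{ITEM_NS}|{PROPERTY_NS}",  # only Item: and Property: pages
        "rcprop": "title|ids|timestamp",
        "rcstart": since,
        "rcdir": "newer",  # enumerate forward in time, starting at `since`
        "rclimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(recent_changes_url("2025-03-14T00:00:00Z"))
```

Opening the printed URL in a browser shows which Item and Property pages changed after the given timestamp, which is the same information the updater relies on to decide what to push into Blazegraph.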

WDQS containers

The entire WDQS service is based on four Docker containers:

  1. WDQS backend docker-wdqs-1: Main database container. It runs the Blazegraph instance and contains all the data.
  2. WDQS frontend mardi-wdqs-frontend: Simple frontend application to write SPARQL queries and send them to the Blazegraph database. Available at Query Service UI.
  3. WDQS updater docker-wdqs-updater-1: Container running the updater process. It queries the MediaWiki RecentChanges API and inserts the changes into the Blazegraph database.
  4. WDQS proxy docker-wdqs-proxy-1: The WDQS backend accepts POST requests to insert data. The WDQS proxy sits between the frontend and the backend and allows only GET requests (read-only). It also makes an API endpoint available to query the database.
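
The container names above can be passed to the standard Docker CLI, for example to read logs or restart the service. The following is a minimal sketch that only assembles and prints the commands. The helper names are made up for illustration, and it assumes the containers run on the host where the commands are executed.

```python
import shlex

# Container names as listed above.
CONTAINERS = [
    "docker-wdqs-1",
    "mardi-wdqs-frontend",
    "docker-wdqs-updater-1",
    "docker-wdqs-proxy-1",
]


def logs_command(container: str, tail: int = 100) -> list[str]:
    """Command that shows the last `tail` log lines of one container."""
    return ["docker", "logs", "--tail", str(tail), container]


def restart_command(containers: list[str] = CONTAINERS) -> list[str]:
    """Command that restarts all the given containers in one call."""
    return ["docker", "restart", *containers]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is restarted by accident.
    for name in CONTAINERS:
        print(shlex.join(logs_command(name)))
    print(shlex.join(restart_command()))
```

The printed lines can be pasted into a shell, or the lists can be handed to `subprocess.run` once you are sure you want to execute them.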

Debug errors

  • 1) The profile page does not show information that I see when I visit the Item page.

First, check if you can see all the information related to that item in the Query Service UI. Just use a query like:

DESCRIBE wd:Q100
  • 1.1) If the Query Service UI does not load or returns an error, the problem is in one of the four WDQS containers, most probably in the WDQS frontend. In this situation it helps to check the container logs, especially those of docker-wdqs-1. The first thing to try is to restart the four containers. After they have restarted, send the query again in the UI and see whether the problem persists. If it does, check the container logs in detail. Fixing the problem might require tweaking some of the configuration variables that are passed to one of the containers through the docker-compose.yml file. You can check the documentation on the configuration parameters here.
  • 1.2) If results are returned but they are incomplete, the Blazegraph backend and query engine are working properly, but the information has not been fully copied from MySQL to the Blazegraph database. This happens when the WDQS updater container has not been running at some point or for some period of time. In this case it is necessary to resynchronize either specific items or all the items changed since a given point in time. Follow these instructions for that.
  • 2) I have a SPARQL query that returns fewer results than expected.

This is the case described in 1.2.

  • 3) The Query UI returns an error when I run a SPARQL query.

This is the case described in 1.1.
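
The triage above can also be scripted. The sketch below asks the query endpoint whether a given statement of an item exists in Blazegraph and maps the outcome to cases 1.1 and 1.2. The endpoint URL and the function names are hypothetical, and it assumes that the endpoint predefines the wd: and wdt: prefixes, as the DESCRIBE query used in the UI suggests. It sends a GET request, which is the only kind the WDQS proxy lets through.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the API endpoint exposed by the WDQS proxy.
SPARQL_ENDPOINT = "https://query.example.org/sparql"


def ask_query(item_id: str, property_id: str) -> str:
    """SPARQL ASK query: does the item have any value for the property?"""
    return f"ASK {{ wd:{item_id} wdt:{property_id} ?value }}"


def build_request(query: str) -> urllib.request.Request:
    """GET request carrying the query, asking for SPARQL JSON results."""
    url = f"{SPARQL_ENDPOINT}?{urllib.parse.urlencode({'query': query})}"
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def statement_in_blazegraph(item_id: str, property_id: str):
    """Return True/False from the ASK query, or None if the service failed."""
    request = build_request(ask_query(item_id, property_id))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return bool(json.load(response)["boolean"])
    except (OSError, ValueError, KeyError):
        # Network or HTTP errors, timeouts, or a reply that is not SPARQL JSON.
        return None


def triage(found) -> str:
    """Map the outcome to the cases described above."""
    if found is None:
        return "1.1: the query service failed; check the logs and restart the containers"
    if not found:
        return "1.2: the statement is missing in Blazegraph; resynchronize the item"
    return "ok: the statement is in Blazegraph"
```

For example, `triage(statement_in_blazegraph("Q100", "P31"))` checks a statement that is visible on the Item page, where P31 is only a placeholder property ID. An "ok" result suggests that the problem lies in the profile page's own query and not in WDQS.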