
[Investigation] Terms storage missing most item labels after merge, resulting in item IDs being shown
Open · Needs Triage · Public · 1 Estimated Story Points · BUG REPORT

Description

image.png (73×488 px, 18 KB)

Items linking to https://www.wikidata.org/wiki/Q758891 don't display the label, falling back to show the item ID instead.

Timeboxed: 1 day


Status as of 2022-07-26

We spent some time investigating this, concluded that it doesn’t seem to be a common problem at the moment, and decided to not spend more time on the task, partially due to logging problems (see below). We may return to it in the future.

Any affected item can reliably be fixed by purging the item whose terms are missing with a links update via the API, e.g. using the API sandbox (put the item ID in titles). Any edit will also fix the item. A regular purge using the web interface is not sufficient.
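The links-update purge boils down to a single API call; a minimal sketch of just the request parameters (the HTTP client and endpoint URL are left out – forcelinkupdate is the standard action=purge parameter that re-runs the secondary data updates, which a plain purge does not):

```python
def build_links_update_purge(item_id: str) -> dict:
    """Build action=purge API parameters that force a links update.

    A plain purge (without forcelinkupdate) is NOT sufficient to rebuild
    the term store; forcelinkupdate=1 re-runs the secondary data updates.
    """
    return {
        "action": "purge",
        "titles": item_id,
        "forcelinkupdate": 1,
        "format": "json",
    }

# The resulting dict would be POSTed to https://www.wikidata.org/w/api.php
print(build_links_update_purge("Q758891"))
```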

Internally, the problem is that the term store has an incomplete record of the item’s terms. Sometimes, the wbt_item_terms.wbit_term_in_lang_id points at a wbt_term_in_lang row that no longer exists; sometimes, the wbt_term_in_lang.wbtl_text_in_lang_id points at a wbt_text_in_lang row that no longer exists. (See T309445#8055711 for an example of both.) This is probably a race condition between the same terms being removed from one item and added to another item during a merge.
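The lookup chain can be illustrated with a small in-memory model (invented data and a hypothetical label_for helper; the real reader is a set of SQL joins, as in the queries later in this task). A dangling wbit_term_in_lang_id simply drops out of the join, so the reader finds no label and the UI falls back to the item ID:

```python
# Hypothetical in-memory model of the term store tables, showing how a
# dangling reference surfaces as a missing label. Table/column names follow
# the real schema; the row data is invented.
wbt_text = {1: "Audiorekorder"}                  # wbx_id -> wbx_text
wbt_text_in_lang = {10: ("de", 1)}               # wbxl_id -> (language, wbx_text_id)
wbt_term_in_lang = {100: ("label", 10)}          # wbtl_id -> (type, wbxl_id)
wbt_item_terms = [(758891, 100), (758891, 999)]  # (wbit_item_id, wbit_term_in_lang_id)
                                                 # 999 is a dangling reference

def label_for(item_id: int, lang: str) -> str:
    """Resolve a label the way the reader joins the tables; fall back to the ID."""
    for wbit_item_id, wbtl_id in wbt_item_terms:
        if wbit_item_id != item_id or wbtl_id not in wbt_term_in_lang:
            continue  # dangling wbit_term_in_lang_id rows silently drop out
        term_type, wbxl_id = wbt_term_in_lang[wbtl_id]
        if term_type == "label" and wbxl_id in wbt_text_in_lang:
            row_lang, wbx_id = wbt_text_in_lang[wbxl_id]
            if row_lang == lang:
                return wbt_text[wbx_id]
    return f"Q{item_id}"  # no label found: UI shows the item ID

print(label_for(758891, "de"))  # Audiorekorder (intact chain)
print(label_for(758891, "en"))  # Q758891 (falls back to the item ID)
```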

We tried to add some debug logging to investigate this issue (T311307, removed again in T313039), but only with partial success. Some of the log messages mysteriously went missing (see T309445#8055711 and T309445#8072981), even though the associated code was almost certainly still running, which made the investigation difficult, almost futile.

If anyone wants to investigate this further, you can start by going through the recent merges and looking at the history of each target item to see if it’s had any edits since the merge (if yes, skip it: those edits probably fixed the problem even if it occurred), then getting its terms from the API (API sandbox – change the titles and, if necessary, the wbetlanguage) and seeing if any terms are missing.

Event Timeline

There are only two terms for this item in the term store, the German label and description (but not aliases): https://quarry.wmcloud.org/query/64885

+-------------+------------------------+---------------+
| wby_name    | wbx_text               | wbxl_language |
+-------------+------------------------+---------------+
| label       | Audiorekorder          | de            |
| description | Gerät zur Tonaufnahme  | de            |
+-------------+------------------------+---------------+

These happen to be the exact terms the item had in the last pre-merge revision (compare with current revision permalink). I suspect we’re failing to update the term store when a merge takes place, but I should also check Logstash on Monday to see if there’s anything related in there.

Nothing in logstash looks obviously related (link), I’ll check some other merged items.

Q97819206 (terms quarry, pre-merge revision) and Q90305607 (terms quarry, pre-merge revision), two items that recently received a merge, are showing the same pattern. I think this is a real bug.

Tried reproducing this locally with the following snippet in the dev tools console:

api = new mw.Api();
now = new Date();
item1 = ( await api.postWithEditToken( {
    action: 'wbeditentity',
    new: 'item',
    summary: 'T309445',
    data: JSON.stringify( { labels: { en: { language: 'en', value: `${now.toISOString()} (en)` } } } )
} ) ).entity.id;
console.log( item1 );
item2 = ( await api.postWithEditToken( {
    action: 'wbeditentity',
    new: 'item',
    summary: 'T309445',
    data: JSON.stringify( { labels: { de: { language: 'de', value: `${now.toISOString()} (de)` } } } )
} ) ).entity.id;
console.log( item2 );
await api.postWithEditToken( { action: 'wbmergeitems', fromid: item2, toid: item1, summary: 'T309445' } );

But didn’t succeed in reproducing the error, both labels are present in the terms store afterwards. It’s possible that this bug is related to some of the extra stuff that the merge gadget does.

Expected behavior:
When two items are merged, all labels are carried over and the term store is updated, so that the label (not the item ID) is displayed.

karapayneWMDE renamed this task from Terms storage missing most item labels after merge, resulting in item IDs being shown to INVESTIGATION: Terms storage missing most item labels after merge, resulting in item IDs being shown.Jun 14 2022, 10:03 AM
karapayneWMDE updated the task description.
karapayneWMDE set the point value for this task to 1.


Haven’t been able to reproduce it on Test Wikidata using that wiki’s version of the Merge gadget either. I’m not sure how to test this on Wikidata – we’d need to have a source of duplicate items that we can test the merge behavior with (using the gadget vs. using Special:MergeItems vs. using the API).

Managed to find a pair of duplicates via the recent changes tagged as sitelink change from connected wiki (the dewiki and ptwiki articles were clearly translations of one another, with a left parenthesis missing in the same place in both) and merged them using the merge gadget. I was not able to reproduce the issue.

before merge
MariaDB [wikidatawiki]> SELECT wbit_item_id, wby_name, wbx_text, wbxl_language FROM wbt_item_terms JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id JOIN wbt_type ON wbtl_type_id = wby_id JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (112664122, 112671429);
+--------------+----------+----------+---------------+
| wbit_item_id | wby_name | wbx_text | wbxl_language |
+--------------+----------+----------+---------------+
|    112664122 | label    | Vyeyra   | de            |
|    112671429 | label    | Vyeyra   | es            |
|    112671429 | label    | Vyeyra   | pt            |
+--------------+----------+----------+---------------+
3 rows in set (0.005 sec)
after merge
MariaDB [wikidatawiki]> SELECT wbit_item_id, wby_name, wbx_text, wbxl_language FROM wbt_item_terms JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id JOIN wbt_type ON wbtl_type_id = wby_id JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (112664122, 112671429);
+--------------+----------+----------+---------------+
| wbit_item_id | wby_name | wbx_text | wbxl_language |
+--------------+----------+----------+---------------+
|    112664122 | label    | Vyeyra   | de            |
|    112664122 | label    | Vyeyra   | es            |
|    112664122 | label    | Vyeyra   | pt            |
+--------------+----------+----------+---------------+
3 rows in set (0.001 sec)

Found and merged another pair, again unable to reproduce the issue:

before merge
MariaDB [wikidatawiki]> SELECT wbit_item_id, wby_name, wbx_text, wbxl_language FROM wbt_item_terms JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id JOIN wbt_type ON wbtl_type_id = wby_id JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (111854511, 112670581);
+--------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+---------------+
| wbit_item_id | wby_name    | wbx_text                                                                                                                        | wbxl_language |
+--------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+---------------+
|    111854511 | label       | Hum3D                                                                                                                           | en            |
|    111854511 | description | Hum3D sells high quality 3D models, which are all made by themselves. They have the biggest collection of cars on the Internet. | en            |
|    112670581 | label       | Hum3D                                                                                                                           | en            |
|    112670581 | label       | Hum3D                                                                                                                           | uk            |
+--------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+---------------+
4 rows in set (0.004 sec)
after merge
MariaDB [wikidatawiki]> SELECT wbit_item_id, wby_name, wbx_text, wbxl_language FROM wbt_item_terms JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id JOIN wbt_type ON wbtl_type_id = wby_id JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (111854511, 112670581);
+--------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+---------------+
| wbit_item_id | wby_name    | wbx_text                                                                                                                        | wbxl_language |
+--------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+---------------+
|    111854511 | label       | Hum3D                                                                                                                           | en            |
|    111854511 | description | Hum3D sells high quality 3D models, which are all made by themselves. They have the biggest collection of cars on the Internet. | en            |
|    111854511 | label       | Hum3D                                                                                                                           | uk            |
+--------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+---------------+
3 rows in set (0.001 sec)

(Both had the same English label, so two of the rows got merged into one. But all the terms are still there.)

Might be time to look at other recent merges and see if this issue still happens at all. Previously I was under the impression that it seemed to happen fairly reliably – I hadn’t found any merges where it didn’t happen…

Some recent merges:

  • Perdikas Palaeontological Museum: the merge added an English alias which is present in the term store.
  • Cecilio Angulo: the merge added a Dutch alias which is present in the term store.
  • José Barreto: the merge added a Spanish alias which is not present in the term store.
  • hydrosulfide: the merge added an Uzbek label which is present in the term store.
  • Kummerower See: the merge added a Serbian label which is present in the term store.

So the issue still seems to happen occasionally (José Barreto), but not all the time, and in fact not even most of the time.

Okay, but after 50 repeats of the code from earlier (i.e. 100 items, 50 merges), I still haven’t been able to reproduce the issue locally once. So now I’m back to thinking that the issue doesn’t happen with wbmergeitems alone, but is related to the combination of API requests the merge gadget makes.

But if it doesn’t happen very often, then trying to find enough merge pairs on real Wikidata to test this with might be challenging – porting the merge gadget code to work on a local wiki might be more promising.

Does any of the lists of potential merge candidates at https://www.wikidata.org/wiki/Help:Merge#Finding_items_to_merge help?

Thanks, I might try some of those later.


(Aside about localhost: I tried to make my wiki serve API requests in parallel more, because I now suspect this is a race condition. Currently, I can send requests in parallel and they’ll be processed in parallel, and after setting output_buffering = 0 in php.ini, thereby letting MediaWiki take care of buffering and send a Content-Length response header, the browser also perceives an API request as complete as soon as the main response is done, so that secondary data updates actually take place post-send as expected and can potentially run in parallel to subsequent API requests. But for some reason, a sleep() in a post-send update will still prevent later API requests from starting processing, so post-send updates still won’t overlap with requests that came in after the first request finished sending its response. It’s possible that session_write_close() would help with this, or at least that it’s related to sessions – I stopped looking into it at this point and started experimenting on Test Wikidata instead.)

It turns out the merge gadget doesn’t do as many different API things as I thought. It has some pre-merge checks and some error handling, but at the API level, the merge it does is just:

  1. wbmergeitems (with ignoreconflicts=description)
  2. purge (“// XXX: Do we even need to purge? Why?”)
  3. load target item

There is no extra API request for blanking the item or anything – if the merge makes several edits, they all come from wbmergeitems.

If you just create two new items and then merge them with wbmergeitems on Test Wikidata, and then immediately get their entityterms:

api = new mw.Api();
now = new Date().toISOString();
item1 = ( await api.postWithEditToken( {
    action: 'wbeditentity',
    new: 'item',
    summary: 'T309445',
    data: JSON.stringify( { labels: { en: { language: 'en', value: `${now} (label)` } } } )
} ) ).entity.id;
console.log( item1 );
item2 = ( await api.postWithEditToken( {
    action: 'wbeditentity',
    new: 'item',
    summary: 'T309445',
    data: JSON.stringify( { descriptions: { en: { language: 'en', value: `${now} (description)` } } } )
} ) ).entity.id;
console.log( item2 );
await api.postWithEditToken( { action: 'wbmergeitems', fromid: item2, toid: item1, summary: 'T309445' } );
await api.get( {
    action: 'query',
    prop: 'entityterms',
    wbetlanguage: 'en',
    titles: item1,
    formatversion: 2
} ).then( r => r.query.pages[ 0 ].entityterms );

Then the entityterms will fairly often just have the terms of the original item (a label), without the merged terms of the other item (a description). But this resolves itself after a second or so (once the secondary data updates of the merge finish), and if you repeat the entityterms request it’ll show all the terms as expected.
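Since the terms become consistent once the secondary data updates finish, a client that needs complete terms immediately after a merge could simply poll until they have landed. A hedged sketch with an injected fetcher (not actual gadget code; the response shape loosely mirrors entityterms):

```python
import time

def wait_for_terms(fetch_terms, expected_keys, attempts=5, delay=0.5):
    """Poll an entityterms-style fetcher until all expected term types appear.

    fetch_terms: callable returning a dict like {'label': [...], 'description': [...]}
    expected_keys: term types the merge should have produced.
    """
    for _ in range(attempts):
        terms = fetch_terms()
        if all(k in terms for k in expected_keys):
            return terms
        time.sleep(delay)
    return terms  # give up; caller decides how to handle incomplete terms

# Simulated backend: the merged description shows up only on the third poll,
# mimicking the secondary data updates finishing a moment after the merge.
responses = iter([{"label": ["x"]}, {"label": ["x"]},
                  {"label": ["x"], "description": ["y"]}])
print(wait_for_terms(lambda: next(responses), ["label", "description"], delay=0))
```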

So I tried matching the merge gadget more closely, by also purging the page and then getting its HTML:

api = new mw.Api();
now = new Date().toISOString();
item1 = ( await api.postWithEditToken( {
    action: 'wbeditentity',
    new: 'item',
    summary: 'T309445',
    data: JSON.stringify( { labels: { en: { language: 'en', value: `${now} (label)` } } } )
} ) ).entity.id;
console.log( item1 );
item2 = ( await api.postWithEditToken( {
    action: 'wbeditentity',
    new: 'item',
    summary: 'T309445',
    data: JSON.stringify( { descriptions: { en: { language: 'en', value: `${now} (description)` } } } )
} ) ).entity.id;
console.log( item2 );
await api.postWithEditToken( { action: 'wbmergeitems', fromid: item2, toid: item1, summary: 'T309445' } );
await api.post( { action: 'purge', titles: item1 } );
await fetch( `https://test.wikidata.org/wiki/${item1}` );
await api.get( {
    action: 'query',
    prop: 'entityterms',
    wbetlanguage: 'en',
    titles: item1,
    formatversion: 2
} ).then( r => r.query.pages[ 0 ].entityterms );

But with this version, the logged entityterms are actually always complete (at least every time I tried this). It looks like the purge + render gives the secondary data updates enough time to finish.

In theory, I think both the merge and the page view could attempt to write terms to the term store. (Though I’m not sure if loading an item page that isn’t in the parser cache will actually cause the secondary data updates to run again.) However, the page view should always have the merged item state, thanks to MediaWiki’s ChronologyProtector, so there shouldn’t be any race condition there. The only possible race condition I can see at the moment is if someone else happens to view the item at the critical moment (ChronologyProtector only protects the chronology of requests from one source):

  1. Client A merges the items. Client A’s secondary data updates start.
  2. Client A purges the item.
  3. Client A loads the item.
  4. Client B loads the item (randomly?) and happens to get a database replica that hasn’t seen the merge yet.
  5. Client A’s secondary data updates finish, writing post-merge terms to the term store.
  6. Client B’s secondary data updates start, based on the pre-merge revision from the replica.
  7. Client B’s secondary data updates finish, writing pre-merge terms to the term store.
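Steps 5–7 amount to a last-writer-wins overwrite; a sketch with invented data (saveTermsOfEntity here is a stand-in for the real secondary data update):

```python
# Simulation of the suspected race: two clients rebuild the term store for
# the same item, but client B works from a stale replica. Last writer wins.
term_store = {}          # item_id -> set of terms
PRE_MERGE = {"label@de"}
POST_MERGE = {"label@de", "label@en"}  # the merge added an English label

def secondary_data_update(item_id, terms_seen):
    # saveTermsOfEntity: overwrite the item's terms with whatever
    # revision this request happened to read.
    term_store[item_id] = set(terms_seen)

secondary_data_update(42, POST_MERGE)  # step 5: client A, post-merge state
secondary_data_update(42, PRE_MERGE)   # step 7: client B, stale replica read
print(term_store[42])                  # {'label@de'} - the merged label is lost
```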

This seems very unlikely, and I’m still not sure if it’s even possible (does a pageview really run the secondary data updates again?). I think this still needs more investigation; I might try to just merge enough duplicates on Wikidata, with the network panel open, and if the bug happens, then I would look at the network panel more closely and see what requests happened in what order.

I tested it: a regular index.php?action=purge or api.php?action=purge doesn’t trigger secondary data updates, not even when you visit the page afterwards; api.php?action=purge&forcelinkupdate=1 does trigger secondary data updates. The merge gadget doesn’t set forcelinkupdate, so the secondary data updates shouldn’t get triggered by it (other than through the edit).

The simplest explanation for this whole business would be that the secondary data updates somehow just don’t happen. I don’t know why this would be – of course it’s possible that they crash for some reason, but I’d expect to see something in logstash in that case, and I’m not seeing anything – but at this point, I think this is a theory worth either confirming or ruling out, by adding some logging. I suggest that we add a temporary log channel that captures messages at any level (like, for example, the DuplicateParse channel), and make ItemHandler::getSecondaryDataUpdates() and ItemTermStoreWriterAdapter::saveTermsOfEntity() log debug messages indicating that the secondary data updates were created and executed, with some basic information about the terms being stored (number of labels, descriptions, and aliases across all languages; we don’t have access to the revision ID at this point, unfortunately).

Edit: created T311307: Add debug logging for item terms storage after merging

…and while working on adding the logging, I suddenly find this:

ItemTermStoreWriterAdapter::saveTermsOfEntity()
		if ( $entity instanceof Item ) {
			try {
				$this->store->storeTerms( $entity->getId(), $entity->getFingerprint() );
				return true;
			} catch ( TermStoreException $ex ) {
				return false;
			}
		}

Small wonder there’s no trace of errors in Logstash…


This turns out to be a red herring, there’s no code that actually throws this exception type. However, we can now see uncaught exceptions in logstash – they get logged to the normal mediawiki-errors board as well, we just didn’t find them before because they don’t include the item ID anywhere in the request URL or stack trace. These seem to happen in bursts – for example, 45 of them all between 5:19 and 5:26 AM today

image.png (300×1 px, 23 KB)

– and the error seems to be consistent:

Error 1205: Lock wait timeout exceeded; try restarting transaction
Function: Wikibase\Lib\Store\Sql\Terms\DatabaseTermStoreWriterBase::acquireAndInsertTerms
Query: SELECT  wbit_term_in_lang_id AS `value`  FROM `wbt_item_terms`    WHERE wbit_item_id = 41443421   FOR UPDATE

This can happen – I think we need to wrap the deferred update in a try/catch block and reschedule the update as a job if it fails. (And then rethrow the exception, so MediaWiki knows there was a DB error.) I guess when we wrote this code we might have assumed that MediaWiki did this automatically?

(We might also reconsider if the FOR UPDATE there is necessary, but I think that would be a larger project to change.)
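A sketch of the proposed retry logic (hypothetical names; the real code would be PHP inside Wikibase’s deferred update machinery):

```python
# On a DB error in the deferred term store write, reschedule the same work
# as a job, then rethrow so MediaWiki still records the failure.
job_queue = []

class DBError(Exception):
    pass

def run_deferred_term_update(write_terms, item_id):
    try:
        write_terms(item_id)
    except DBError:
        job_queue.append(("saveTermsOfEntity", item_id))  # retry later as a job
        raise  # MediaWiki should still see the DB error

def flaky_write(item_id):
    raise DBError("Error 1205: Lock wait timeout exceeded")

try:
    run_deferred_term_update(flaky_write, 41443421)
except DBError:
    pass  # the error is reported, but the work is now queued for retry

print(job_queue)  # [('saveTermsOfEntity', 41443421)]
```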

Although I still don’t see an exception corresponding to José Barreto (from T309445#8023533), even when just looking through all the errors for that time regardless of item ID or other search terms.

I guess I should look for some missing labels from merges after we deployed this logging, and see if there are any without corresponding log events.

(If this is really due to locking errors as above, then I think in principle it could happen on any edit, but I guess merges make it much more likely that there’s a conflict around the same terms being removed somewhere and added elsewhere at the same time.)

Okay, I found another broken merge. Museo diocesano di Capua gained an Italian alias through a merge, but in the entityterms API (and in the term store) it’s missing. However, in logstash, there are only two WikibaseTerms events for this ID:

  • 09:58:28.607 UTC: schedule saveTermsOfEntity for Q21196221
  • 09:58:28.813 UTC: run saveTermsOfEntity for Q21196221

In the full log events for this request ID, we see a few more events:

  • Transaction spent 9ms in writes, under the 3s limit
  • schedule saveTermsOfEntity for Q55386461
  • run saveTermsOfEntity for Q55386461
  • schedule saveTermsOfEntity for Q21196221
  • Possibly redundant parse
  • run saveTermsOfEntity for Q21196221
  • schedule deleteTermsOfEntity for Q55386461
  • Possibly redundant parse
  • Possibly redundant parse

(In this order – i.e. the Q55386461 saveTermsOfEntity runs before the Q21196221 one is scheduled!)

There is no trace of an error here. So I think we’re looking at two (or more) separate issues here: as mentioned in T309445#8039853, sometimes writing the terms fails due to a database error, and it’s a bug that we don’t retry it; but also, unrelatedly, sometimes the terms don’t end up in the term store even though ItemTermStoreWriterAdapter::saveTermsOfEntity does not detect any error.

I wonder if this is due to the fact that we’re making several edits (remove data from old item, add data to new item, redirect old item to new item) in one API request? At least, I assume that explains why we’re seeing some secondary data updates run before others are scheduled, and I could imagine that this causes some confusion.

Coming back to José Barreto for a moment, since that item hasn’t been edited yet and is still missing the terms: The missing alias («José Alberto Barreto», Spanish) is not only missing from the term store as a whole; it’s not even present in the innermost text table:

MariaDB [wikidatawiki]> SELECT * FROM wbt_text WHERE wbx_text = 'José Alberto Barreto';
Empty set (0.001 sec)

(And just to confirm that this isn’t an input encoding issue with the é – the text without the middle name works:)

MariaDB [wikidatawiki]> SELECT * FROM wbt_text WHERE wbx_text = 'José Barreto';
+----------+---------------+
| wbx_id   | wbx_text      |
+----------+---------------+
| 18515003 | José Barreto  |
+----------+---------------+
1 row in set (0.001 sec)

But there is a trace of this alias in wbt_term_in_lang and wbt_item_terms! Success!!

MariaDB [wikidatawiki]> SELECT wbit_item_id, wbit_term_in_lang_id, wby_name, wbtl_text_in_lang_id, wbxl_id FROM wbt_item_terms LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (61827465) AND wbxl_id IS NULL;
+--------------+----------------------+----------+----------------------+---------+
| wbit_item_id | wbit_term_in_lang_id | wby_name | wbtl_text_in_lang_id | wbxl_id |
+--------------+----------------------+----------+----------------------+---------+
|     61827465 |            951539057 | alias    |            946668473 |    NULL |
+--------------+----------------------+----------+----------------------+---------+
1 row in set (0.002 sec)

So we must have some race condition here, where the wbt_text_in_lang (and wbt_text) gets cleaned up in parallel to its being used for a new wbt_term_in_lang (and wbt_item_terms) row.
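The check-then-delete interleaving could look like this (a sketch using the row IDs from the query above; cleanup_step_check is a hypothetical stand-in for the clean job’s reference check):

```python
# Sketch of the suspected cleanup race: the unused-terms cleanup checks for
# references, a concurrent writer then re-uses the row, and the delete still
# goes through, leaving a dangling wbtl_text_in_lang_id.
wbt_text_in_lang = {946668473: ("es", "José Alberto Barreto")}
wbt_term_in_lang = {}  # wbtl_id -> wbxl_id (empty: nothing references the text yet)

def cleanup_step_check(wbxl_id):
    # CleanTermsIfUnusedJob: is anything still pointing at this row?
    return not any(ref == wbxl_id for ref in wbt_term_in_lang.values())

unused = cleanup_step_check(946668473)   # True: nothing references it yet

# ...meanwhile the merge's saveTermsOfEntity re-uses the existing text row:
wbt_term_in_lang[951539057] = 946668473

if unused:                               # stale answer from before the insert
    del wbt_text_in_lang[946668473]

dangling = [t for t, x in wbt_term_in_lang.items() if x not in wbt_text_in_lang]
print(dangling)  # [951539057] - matches the orphaned row in the query above
```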

Change 810921 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Add debug logging for term storage cleanup

https://gerrit.wikimedia.org/r/810921

Change 810921 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Add debug logging for term storage cleanup

https://gerrit.wikimedia.org/r/810921

Found a new example, Still Life with Pie and Silver Ewer. The merge added a German label, one Dutch and English alias each, and a West Frisian description; the West Frisian description and English alias exist in the term store, but the German label and Dutch alias are missing, in an especially interesting manner:

MariaDB [wikidatawiki]> SELECT wbit_item_id, wbit_term_in_lang_id, wby_name, wbtl_text_in_lang_id, wbxl_id FROM wbt_item_terms LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (17524323) AND wbxl_id IS NULL;
+--------------+----------------------+----------+----------------------+---------+
| wbit_item_id | wbit_term_in_lang_id | wby_name | wbtl_text_in_lang_id | wbxl_id |
+--------------+----------------------+----------+----------------------+---------+
|     17524323 |            906991069 | NULL     |                 NULL |    NULL |
|     17524323 |            957282069 | alias    |            228497807 |    NULL |
+--------------+----------------------+----------+----------------------+---------+
2 rows in set (0.015 sec)

For the alias, the wbt_text_in_lang was cleaned up, but the wbt_term_in_lang remained behind; for the label, seemingly, even the wbt_term_in_lang is gone.

The logstash board for this is unfortunately a bit long, 84 events in total (because each term ID gets a separate clean job scheduled, don’t know why). Also, mysteriously, the “schedule deleteTermsOfEntity” log message seems to be missing – not just for this merge, but in general, there are no instances of this message after midnight UTC yesterday/today. (The last instances before then were at 23:57 UTC, so the cutoff is suspiciously close to midnight – see Logstash of this message four hours before+after midnight.)

Searching for “schedule/running/ran CleanTermsIfUnusedJob” for the two term_in_lang IDs in the SQL output above, there is only one set of messages: that for the ID 906991069 (the label). Looking for just those jobs plus all the non-job messages (link), we get this timeline, in terms of seconds after 13:24 UTC:

  • 34.756 (mw1358): schedule saveTermsOfEntity for Q28060661 (the source item)
  • 35.105: (mw1358) run saveTermsOfEntity for Q28060661
  • 35.146 (mw1448): schedule saveTermsOfEntity for Q28060661
  • 35.161 (mw1448): run saveTermsOfEntity for Q28060661
  • 35.472 (mw1358): schedule CleanTermsIfUnusedJob for Q28060661
  • 35.532 (mw1437): running CleanTermsIfUnusedJob for Q28060661
  • 35.553 (mw1437): ran CleanTermsIfUnusedJob for Q28060661
  • 35.561 (mw1358): schedule saveTermsOfEntity for Q17524323 (the target item)
  • 35.835 (mw1358): run saveTermsOfEntity for Q17524323

The source item history also shows three edits as part of the merge there, instead of the usual two (plus one on the target item) that you can see e.g. on the José Barreto source item: Between “merged item into” (remove most data) and “redirected to” (turn into redirect), there is also “cleared an item” (remove four descriptions). This comes from a property of the merge gadget that I overlooked earlier – it doesn’t always just make a wbmergeitems API request.

  • If wbmergeitems actually redirects the item, then the merge gadget is basically done, and all three total edits – “merged item into” (remove all data from source), “redirected to” (turn source into redirect), “merged item from” (add data to target) – come from the one API request.
  • If wbmergeitems doesn’t redirect the item – I think this happens when there are conflicting descriptions (even though the wbmergeitems request is made with ignoreconflicts: 'description') – then I think wbmergeitems only makes two edits, “merged item into” (remove most data from source) and “merged item from” (add data to target); the merge gadget then follows up with separate API requests: wbeditentity with clear: true (“cleared an item”, removing the descriptions from the source) and wbcreateredirect (“redirected to”, turning the source into a redirect).
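The two gadget paths can be sketched as follows (the response shape and helper are invented for illustration, not the gadget’s actual code):

```python
def gadget_merge_flow(wbmergeitems_response: dict) -> list:
    """Return the sequence of API actions the merge gadget issues, depending
    on whether wbmergeitems already redirected the source item.

    The 'redirected' key is an assumed stand-in for whatever the gadget
    actually inspects in the wbmergeitems response."""
    actions = ["wbmergeitems"]
    if not wbmergeitems_response.get("redirected"):
        # Description conflict: clear the source, then redirect it ourselves.
        actions.append("wbeditentity (clear: true)")
        actions.append("wbcreateredirect")
    actions += ["purge", "load target item"]
    return actions

print(gadget_merge_flow({"redirected": True}))
print(gadget_merge_flow({"redirected": False}))
```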

We can see this in the different mw servers in the timeline: I think that mw1358 served the original wbmergeitems request, mw1448 the wbeditentity (clear), and mw1437 ran a job. The “schedule deleteTermsOfEntity” log message that should correspond to the wbcreateredirect request seems to have gone missing, as already mentioned, otherwise we should see a fourth mw server.

This timeline claims that the final saveTermsOfEntity for the target item was scheduled and run after everything was done on the source item – including the additional API requests to clear and redirect the source item! I find this hard to believe right now: the merge gadget only makes those requests after wbmergeitems has returned, and I would think that at least the scheduling of the target item saveTermsOfEntity must happen before the API response is sent. (The mw1358 requests all have the same request ID, so it’s not just coincidentally the same backend server, it really is the same request… but I wonder if the clock can be trusted?) Also, I’m still wondering why there’s no sign of a “clean” job for the other term_in_lang ID.

(I had a quick look at mw1358 and mw1448, and their clocks seemed to be no more than 20 ms apart, not enough to fully explain the timeline being confusing. Also, we seem to be running systemd-timesyncd in production, with Wikimedia-specific time source servers. It doesn’t look like we can blame the confusing timeline on clock drift.)


This is still happening – the “schedule deleteTermsOfEntity” log messages vanished starting on the 6th of July and haven’t come back since then. Logstash

image.png (274×678 px, 13 KB)

I don’t know what to make of this. It looks like the code still runs, in that redirects get their term_in_langs deleted.


Found another merge, 3 turn. The English alias’ text_in_lang is gone from the term store:

MariaDB [wikidatawiki]> SELECT wbit_item_id, wbit_term_in_lang_id, wby_name, wbtl_text_in_lang_id, wbxl_id FROM wbt_item_terms LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (3945846) AND wbxl_id IS NULL;
+--------------+----------------------+----------+----------------------+---------+
| wbit_item_id | wbit_term_in_lang_id | wby_name | wbtl_text_in_lang_id | wbxl_id |
+--------------+----------------------+----------+----------------------+---------+
|      3945846 |            960876949 | alias    |            955984243 |    NULL |
+--------------+----------------------+----------+----------------------+---------+
1 row in set (0.006 sec)
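The LEFT JOIN … IS NULL pattern in the query above is easy to reproduce on a toy copy of the tables. A sketch using sqlite3 with a simplified schema (only the ID columns, not the full Wikibase term store):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE wbt_item_terms (wbit_item_id INT, wbit_term_in_lang_id INT);
    CREATE TABLE wbt_term_in_lang (wbtl_id INT, wbtl_text_in_lang_id INT);
    CREATE TABLE wbt_text_in_lang (wbxl_id INT, wbxl_text_id INT);

    -- A healthy term: every level of the chain exists.
    INSERT INTO wbt_item_terms VALUES (3945846, 56239649);
    INSERT INTO wbt_term_in_lang VALUES (56239649, 39332690);
    INSERT INTO wbt_text_in_lang VALUES (39332690, 15585948);

    -- The broken alias: the term_in_lang row exists,
    -- but the text_in_lang row it points at is gone.
    INSERT INTO wbt_item_terms VALUES (3945846, 960876949);
    INSERT INTO wbt_term_in_lang VALUES (960876949, 955984243);
""")

# Dangling references surface as NULL wbxl_id after the LEFT JOINs.
rows = con.execute("""
    SELECT wbit_item_id, wbit_term_in_lang_id, wbtl_text_in_lang_id
    FROM wbt_item_terms
    LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
    LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
    WHERE wbxl_id IS NULL
""").fetchall()
print(rows)  # [(3945846, 960876949, 955984243)]
```

The same WHERE clause also catches the other breakage variant (a missing wbt_term_in_lang row) if you additionally filter on wbtl_id IS NULL.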

Logstash board: only 19 messages this time. Note that the merged item (history) had had an English label and description added to it right before it was merged into the other item. (The other item conveniently didn’t receive any edits immediately before or after the merge.) The two separate edits happened at :00 and :01, whereas the merge took place after :27:

MariaDB [wikidatawiki]> SELECT rev_timestamp FROM revision WHERE rev_page = (SELECT page_id FROM page WHERE page_namespace = 0 AND page_title = 'Q20918681') ORDER BY rev_timestamp DESC;
+----------------+
| rev_timestamp  |
+----------------+
| 20220712152628 |
| 20220712152628 |
| 20220712152627 |
| 20220712152601 |
| 20220712152600 |
| 20181023231241 |
| 20181023231235 |
| 20150906094514 |
+----------------+
8 rows in set (0.001 sec)

Also, we’re again dealing with an extra API request to clear the item, due to a description conflict. There are four request IDs in logstash (note that jobs have the same request ID as the request that triggered them), which I’ll give nicknames so I can refer to them below:

  • “addlabel”: request ID 7a16bb17-58ed-48db-9cb6-1ada9c7cfe49 – added the English label
  • “adddescription”: request ID dd5ec054-d8db-4f92-b6ed-687fa76dd039 – added the English description
  • “mergeitems”: request ID 4ba56d96-123d-4047-b6b0-f01142c42e60 – tried to merge one item into the other, added the data to the target item, but did not redirect the source item due to the description conflict
  • “clearitem”: request ID 8ff656c2-dfa9-4112-8bdd-74682b150486 – cleared the source item (removing the conflicting English description) to prepare it for being redirected to the target item

I assume there must have been a fifth request, “createredirect”, but its “schedule deleteTermsOfEntity” message did not get logged, and so there’s no trace of it in the WikibaseTerms channel.

The timeline of the log messages is:

  • 15:26:01.032, “addlabel”: schedule saveTermsOfEntity for Q20918681
  • 15:26:01.076, “addlabel”: run saveTermsOfEntity for Q20918681 (2 labels, 0 descriptions, 0 aliases)
  • 15:26:01.478, “adddescription”: schedule saveTermsOfEntity for Q20918681
  • 15:26:01.498, “adddescription”: run saveTermsOfEntity for Q20918681 (2 labels, 1 description, 0 aliases)
  • 15:26:27.764, “mergeitems”: schedule saveTermsOfEntity for Q20918681
  • 15:26:27.961, “mergeitems”: run saveTermsOfEntity for Q20918681 (0 labels, 1 description, 0 aliases)
  • 15:26:27.969, “mergeitems”: schedule CleanTermsIfUnusedJob for Q20918681 (term_in_lang 56239649)
  • 15:26:27.981, “mergeitems”: schedule CleanTermsIfUnusedJob for Q20918681 (term_in_lang 960876926)
  • 15:26:28.011, “mergeitems”: running CleanTermsIfUnusedJob for Q20918681 (term_in_lang 56239649)
  • 15:26:28.017, “mergeitems”: ran CleanTermsIfUnusedJob for Q20918681 (term_in_lang 56239649)
  • 15:26:28.029, “mergeitems”: running CleanTermsIfUnusedJob for Q20918681 (term_in_lang 960876926)
  • 15:26:28.040, “mergeitems”: ran CleanTermsIfUnusedJob for Q20918681 (term_in_lang 960876926)
  • 15:26:28.077, “mergeitems”: schedule saveTermsOfEntity for Q3945846
  • 15:26:28.204, “mergeitems”: run saveTermsOfEntity for Q3945846 (7 labels, 2 descriptions, 3 aliases)
  • 15:26:28.301, “clearitem”: schedule saveTermsOfEntity for Q20918681
  • 15:26:28.321, “clearitem”: run saveTermsOfEntity for Q20918681 (0 labels, 0 descriptions, 0 aliases)
  • 15:26:28.326, “clearitem”: schedule CleanTermsIfUnusedJob for Q20918681 (term_in_lang 960876927)
  • 15:26:28.393, “clearitem”: running CleanTermsIfUnusedJob for Q20918681 (term_in_lang 960876927)
  • 15:26:28.407, “clearitem”: ran CleanTermsIfUnusedJob for Q20918681 (term_in_lang 960876927)

At least the requests that we have logs for are nicely sequential and not interleaved. Here are the mentioned term_in_lang IDs:

MariaDB [wikidatawiki]> SELECT wbit_item_id, wbit_term_in_lang_id, wbtl_id, wby_name, wbtl_text_in_lang_id, wbxl_id, wbxl_text_id, wbxl_language, wbx_id, wbx_text FROM wbt_item_terms LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbtl_id IN (56239649, 960876926, 960876927) AND wbit_item_id IN (20918681, 3945846);
+--------------+----------------------+----------+----------+----------------------+----------+--------------+---------------+----------+----------+
| wbit_item_id | wbit_term_in_lang_id | wbtl_id  | wby_name | wbtl_text_in_lang_id | wbxl_id  | wbxl_text_id | wbxl_language | wbx_id   | wbx_text |
+--------------+----------------------+----------+----------+----------------------+----------+--------------+---------------+----------+----------+
|      3945846 |             56239649 | 56239649 | label    |             39332690 | 39332690 |     15585948 | fi            | 15585948 | Kolmonen |
+--------------+----------------------+----------+----------+----------------------+----------+--------------+---------------+----------+----------+
1 row in set (0.001 sec)

Only the Finnish label is left; the other two term_in_langs were presumably cleaned up. But the term_in_lang that we saw as incomplete above – ID 960876949 – doesn’t occur in this log at all!

It looks like whatever causes the “schedule deleteTermsOfEntity” log messages to vanish also causes all messages after that point to vanish – and this even propagates across jobs? I don’t understand how that is possible; otherwise, I would expect some message in Logstash about a CleanTermsIfUnusedJob with target:960876949, and I can’t find one.

Trying something slightly different. I think the Still Life with Pie and Silver Ewer item, inspected previously (T309445#8055711), is too confusing to be of much further use in its broken state, so I decided to purge it. An API purge with forcelinkupdate set successfully fixed all the terms of the item:

MariaDB [wikidatawiki]> SELECT wbit_item_id, wbit_term_in_lang_id, wby_name, wbtl_text_in_lang_id, wbxl_id FROM wbt_item_terms LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id IN (17524323) AND wbxl_id IS NULL;
Empty set (0.002 sec)

And the partial rows from before seem to have been properly cleaned up as well; there is no trace left of either the term_in_lang IDs or the text_in_lang ID:

MariaDB [wikidatawiki]> SELECT wbtl_id, wby_name, wbtl_text_in_lang_id, wbxl_id FROM wbt_term_in_lang LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbtl_id IN (906991069, 957282069);
Empty set (0.001 sec)

MariaDB [wikidatawiki]> SELECT * FROM wbt_term_in_lang WHERE wbtl_text_in_lang_id = 228497807;
Empty set (0.001 sec)

So one thing I am confident in saying (for a change!) is that a purge with links update fixes the issue; I extrapolate from this that a subsequent edit probably fixes it as well. (Note: on other wikis, a “purge with links update” is equivalent to a “null edit”, but you can’t null edit an item. Also, a “UI purge”, via ?action=purge in the normal URL, does not include a links update, and will not fix the issue.)
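For reference, the fixing request can be built with a few lines of Python. The endpoint and parameters are the standard MediaWiki action API (the item ID is just an example); note that current MediaWiki generally requires action=purge to be sent as a POST:

```python
from urllib.parse import urlencode

# Purge-with-links-update request for an affected item (example ID).
params = {
    "action": "purge",
    "titles": "Q17524323",   # item whose terms are missing
    "forcelinkupdate": "1",  # a plain purge without this does NOT fix it
    "format": "json",
}
url = "https://www.wikidata.org/w/api.php?" + urlencode(params)
print(url)
# Send this as a POST (e.g. urllib.request.urlopen(url, data=b"")) to
# trigger the links update that rebuilds the item's term store rows.
```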

Is the "purge with links update" something we should do routinely after merges?

No, I don’t think so. The merge gadget could in theory do it, but I’m not sure that would be a good idea, since doing the purge right after the merge probably doesn’t help with the race condition.

Ah okay I see, yes that makes sense!

Thank you, Lucas, for everything that you have put into this and for documenting it so well!

Let's now do what you suggested and end the investigation at this point. I removed the task from the iteration boards so that I can keep the task open as a reference for the future.

Manuel renamed this task from INVESTIGATION: Terms storage missing most item labels after merge, resulting in item IDs being shown to [Investigation] Terms storage missing most item labels after merge, resulting in item IDs being shown.Jul 14 2022, 2:14 PM