
Include request id (if present) in a comment in DB queries
Closed, Declined · Public

Description

Came up in a conversation the other day: We assign a request ID to incoming requests, in many cases. It would be great to include that as a comment in SQL queries (like we do with class/method names), so that when we identify slow requests we can correlate the SQL queries that were generated by that request.
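
For illustration only (the comment format, the reqId value, and the query below are hypothetical, not the exact format MediaWiki would emit), the idea is that a query seen on a DB host would carry the request ID in its leading comment, so it can be matched back to the web request that generated it:

-- Hypothetical sketch: comment format and reqId value are made up for illustration.
SELECT /* WikiPage::loadPageData reqId=5e2f0d0a9c3b4e7f */ page_id, page_latest
FROM page
WHERE page_namespace = 0 AND page_title = 'Main_Page';
-- A slow-query log entry carrying this comment could then be correlated with the
-- application logs that record the same reqId.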

@jcrespo curious what you think of this...crazy? Doable? Somewhere in the middle?

Event Timeline

I don't have any thoughts about this, neither good nor bad. The mediawiki loadbalancer already adds the mediawiki function to the SQL comment to identify the origin (https://tendril.wikimedia.org/report/slow_queries?host=^db&user=wikiuser&schema=wik&hours=1), and MySQL/MariaDB already have a pseudo-standard way to summarize query patterns with a digest:

MariaDB [sys]> SELECT query, digest FROM statement_analysis LIMIT 10;
+-------------------------------------------------------------------+----------------------------------+
| query                                                             | digest                           |
+-------------------------------------------------------------------+----------------------------------+
| SELECT `GET_LOCK` (...) AS `lockstatus`                           | 9a07945c091deaabda78ba0327ec1909 |
| SELECT `el_to` FROM `externallinks` WHERE `el_from` = ?           | 3526fe2d93402243c3653986882f33de |
| SELECT `pl_namespace` , `pl_ti ... agelinks` WHERE `pl_from` = ?  | 8ee6691d3dac58c77c41f6ff9c23a165 |
| COMMIT                                                            | ab648bfa17fa36e67f82321c0eb846cb |
| SELECT `tl_namespace` , `tl_ti ... atelinks` WHERE `tl_from` = ?  | 3bfa08845f033e90dcb0377b9a597ca7 |
| SELECT `user_id` , `user_name` ... er_id` = ? LIMIT ? FOR UPDATE  | b2eaa497fdf826873c67fb2daa7b7777 |
| SHOW GLOBAL VARIABLES LIKE ?                                      | 2c2b0d40c9af970951188c70d8765d39 |
| SELECT `cl_to` , `cl_sortkey_p ... orylinks` WHERE `cl_from` = ?  | e0fb2d642fdd3c1c20fca624852b4b80 |
| SELECT `wl_user` FROM `watchli ... ificationtimestamp` IS NULL )  | 0e696d8e5978797087e66597ba766e5b |
| SELECT `rev_id` , `rev_page` , ... id` = `page_latest` ) LIMIT ?  | deddb5b0a7e3a41e0fd3c437e516eab2 |
+-------------------------------------------------------------------+----------------------------------+
10 rows in set (0.07 sec)

I personally don't need anything else for performance tuning, but I have nothing against adding extra metadata on comments if that helps your team.

If you are open to other suggestions, one that would be useful to us DBAs is mediawiki logs generating a field for the datacenter. While in most cases it can be derived from the mediawiki server issuing the query, that may not hold in the future if cross-dc queries become a thing; plus, multi-dc happens in a few months, and the field would help isolate errors happening on only one datacenter (e.g. there is lag on one database section, but only in one datacenter and not the other).

Vvjjkkii renamed this task from Include request id (if present) in a comment in DB queries to t8daaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from t8daaaaaaa to Include request id (if present) in a comment in DB queries. Jul 1 2018, 8:32 AM
Mainframe98 lowered the priority of this task from High to Medium.
Mainframe98 updated the task description. (Show Details)
Mainframe98 added a subscriber: Aklapper.

This would be super nice. Can this interfere with query caching though? (We don't reuse prepared statements so probably there's no query caching anyway, but better to be sure.)

This would be super nice. Can this interfere with query caching though? (We don't reuse prepared statements so probably there's no query caching anyway, but better to be sure.)

This is a good point to keep in mind. From what I understand, we neither employ nor plan to employ caching at the layer of raw SQL text. We do have caching at a higher level (e.g. a Memc/APC wrapper around a PHP method call), and the backend databases have their own hot caching of individual parts of tables and indexes.

We already include the viewing user's name or IP address in several SQL queries (through a prepended comment). We also include the function name of the PHP caller in many queries. While rare, there can be multiple different callers producing the same effective query.

The idea would be to extend the comment field to also include the request ID.
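
As a rough sketch (the comment contents below are illustrative, not the literal strings MediaWiki emits), today a query carries the PHP caller and the user in its leading comment, and the proposal would simply append the request ID to that same comment:

-- Illustrative only: current style, with caller and user name in the leading comment.
SELECT /* IndexPager::buildQueryInfo (contributions page) Example_user */ rev_id FROM revision LIMIT 1;
-- Illustrative only: the same comment extended with a request ID.
SELECT /* IndexPager::buildQueryInfo (contributions page) Example_user reqId=5e2f0d0a9c3b4e7f */ rev_id FROM revision LIMIT 1;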

We'll get to this next quarter, maybe?

Change 721926 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/core@master] db: Include request ID in SQL query comment

https://gerrit.wikimedia.org/r/721926

@Marostegui says we shouldn't consider these as blockers, while it would make tendril useless. I suggested these two as the blockers, but I can easily be convinced otherwise.

I am not entirely convinced it should be pushed unless it is essential. At the moment, tendril is the only tool that allows us to quickly spot slow queries, so making it useless isn't good.

The reason this is being pushed is that we had several outages where figuring out what type of request was causing the really slow queries took a lot of our time, and there was no way to connect a slow query to a request. While I think something cleaner (like using open telemetry, traffik, etc.) would be more useful, this can serve as a band-aid in the meantime.

If we can push it and, should it make tendril useless, revert and think of a better way, I am fine with that.

The reason this is being pushed is that we had several outages where figuring out what type of request was causing the really slow queries took a lot of our time, and there was no way to connect a slow query to a request. While I think something cleaner (like using open telemetry, traffik, etc.) would be more useful, this can serve as a band-aid in the meantime.

I meant pushed as "pushing the code to production" - I am totally fine with us investigating a way to narrow things down

@Ladsgroup This is not about slow queries or Logstash, but about tracing. Concretely, it proposes that, from the DB server's perspective and via ad-hoc network interception, one would be able to correlate a query with a request ID.

Is there a reason not to include the reqId in the SQL query? (e.g. in terms of functional gain/loss, or in terms of complexity/maintenance). It seems easy to do and beneficial, but I may be missing something :)

It has two issues:

  • It makes tendril useless, because it wouldn't be able to aggregate slow queries (https://tendril.wikimedia.org/report/slow_queries?host=^db&user=wikiuser&schema=wik&hours=1)
    • This one doesn't matter; we are killing tendril in favor of logstash anyway, and SREs should look into logstash during outages.
  • It would also make performance_schema useless and there is no way to actually make it work (T291420)
    • This is important and not something we can fix. We recently opened it up to devs (T195578) and I would like to build trainings/docs on how to utilize this important tool, but it wouldn't be able to aggregate if we add the request ID to the comment (query the events_statements_current table, for example; see the sketch after this list).
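
As a minimal sketch (assuming performance_schema is enabled and using the standard upstream column names), this is the kind of events_statements_current query meant above; aggregation there relies on the digest of the normalized statement text:

-- Minimal sketch, assuming performance_schema is enabled; column names as in upstream MySQL/MariaDB.
SELECT THREAD_ID, DIGEST, DIGEST_TEXT, SQL_TEXT
FROM performance_schema.events_statements_current
WHERE SQL_TEXT LIKE 'SELECT%'
LIMIT 10;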

That being said, I understand the need for tracing, but I think it should be part of a bigger initiative to have proper tracing for everything (using tools like Jaeger, for example).

OK. I've never used performance_schema but that seems interesting.

If our policy, then, is that apart from standard SQL syntax (which I guess performance_schema somehow knows how to deal with) no variable comments are allowed, then we need to make the current situation a bit less brittle, I think.

Right now, as per the comments on the unmerged https://gerrit.wikimedia.org/r/c/721926, we have code that inserts user names for web requests and hostnames for CLI. At least one of those is broken right now, but it could kick back in at any moment if there isn't a policy against it.

Maybe this ticket can be rephrased to not suggest a specific implementation then. While TransactionProfiler is a partial solution, I think there is a need to be able to correlate any SQL query to a specific request, whether during live incident response or afterwards during an investigation.

Change 739692 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Document not adding varying ids to SQL comments

https://gerrit.wikimedia.org/r/739692

Right now, as per the comments on the unmerged https://gerrit.wikimedia.org/r/c/721926, we have code that inserts user names for web requests and hostnames for CLI. At least one of those is broken right now, but it could kick back in at any moment if there isn't a policy against it.

Added a patch for it. I will also try to document it in mediawiki.org

It would also make performance_schema useless and there is no way to actually make it work

I keep hearing this, but I haven't seen any proof of this. I am not sure why you think so lowly of MySQL tooling engineers. :-( Example:

MariaDB [sys]> select * FROM statement_analysis WHERE digest = 'e788b66b07c3f3c74e4e7d8c623d623b'\G
*************************** 1. row ***************************
            query: SELECT * FROM `recentchanges` WHERE `rc_timestamp` < NOW ( ) 
               db: my_wiki
        full_scan: *
       exec_count: 1
        err_count: 0
       warn_count: 0
    total_latency: 2.08 ms
      max_latency: 2.08 ms
      avg_latency: 2.08 ms
     lock_latency: 60.00 us
        rows_sent: 1
    rows_sent_avg: 1
    rows_examined: 1
rows_examined_avg: 1
    rows_affected: 0
rows_affected_avg: 0
       tmp_tables: 0
  tmp_disk_tables: 0
      rows_sorted: 0
sort_merge_passes: 0
           digest: e788b66b07c3f3c74e4e7d8c623d623b
       first_seen: 2021-11-18 10:25:46
        last_seen: 2021-11-18 10:25:46
1 row in set (0.001 sec)

MariaDB [sys]> use my_wiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [my_wiki]> SELECT /* this is a test */ * FROM recentchanges WHERE rc_timestamp < now();
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
| rc_id | rc_timestamp   | rc_actor | rc_namespace | rc_title | rc_comment_id | rc_minor | rc_bot | rc_new | rc_cur_id | rc_this_oldid | rc_last_oldid | rc_type | rc_source | rc_patrolled | rc_ip     | rc_old_len | rc_new_len | rc_deleted | rc_logid | rc_log_type | rc_log_action | rc_params |
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
|     1 | 20210909171156 |        3 |            0 | Test     |             1 |        0 |      0 |      1 |         2 |             2 |             0 |       1 | mw.new    |            0 | 127.0.0.1 |          0 |          7 |          0 |        0 | NULL        |               |           |
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
1 row in set (0.000 sec)

MariaDB [my_wiki]> select * FROM sys.statement_analysis WHERE digest = 'e788b66b07c3f3c74e4e7d8c623d623b'\G
*************************** 1. row ***************************
            query: SELECT * FROM `recentchanges` WHERE `rc_timestamp` < NOW ( ) 
               db: my_wiki
        full_scan: *
       exec_count: 2
        err_count: 0
       warn_count: 0
    total_latency: 2.29 ms
      max_latency: 2.08 ms
      avg_latency: 1.15 ms
     lock_latency: 128.00 us
        rows_sent: 2
    rows_sent_avg: 1
    rows_examined: 2
rows_examined_avg: 1
    rows_affected: 0
rows_affected_avg: 0
       tmp_tables: 0
  tmp_disk_tables: 0
      rows_sorted: 0
sort_merge_passes: 0
           digest: e788b66b07c3f3c74e4e7d8c623d623b
       first_seen: 2021-11-18 10:25:46
        last_seen: 2021-11-18 10:27:43
1 row in set (0.002 sec)

MariaDB [my_wiki]> SELECT /* {transaction_id:666} */ * FROM recentchanges WHERE rc_timestamp < now();
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
| rc_id | rc_timestamp   | rc_actor | rc_namespace | rc_title | rc_comment_id | rc_minor | rc_bot | rc_new | rc_cur_id | rc_this_oldid | rc_last_oldid | rc_type | rc_source | rc_patrolled | rc_ip     | rc_old_len | rc_new_len | rc_deleted | rc_logid | rc_log_type | rc_log_action | rc_params |
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
|     1 | 20210909171156 |        3 |            0 | Test     |             1 |        0 |      0 |      1 |         2 |             2 |             0 |       1 | mw.new    |            0 | 127.0.0.1 |          0 |          7 |          0 |        0 | NULL        |               |           |
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
1 row in set (0.000 sec)

MariaDB [my_wiki]> select * FROM sys.statement_analysis WHERE digest = 'e788b66b07c3f3c74e4e7d8c623d623b'\G
*************************** 1. row ***************************
            query: SELECT * FROM `recentchanges` WHERE `rc_timestamp` < NOW ( ) 
               db: my_wiki
        full_scan: *
       exec_count: 3
        err_count: 0
       warn_count: 0
    total_latency: 2.50 ms
      max_latency: 2.08 ms
      avg_latency: 833.34 us
     lock_latency: 187.00 us
        rows_sent: 3
    rows_sent_avg: 1
    rows_examined: 3
rows_examined_avg: 1
    rows_affected: 0
rows_affected_avg: 0
       tmp_tables: 0
  tmp_disk_tables: 0
      rows_sorted: 0
sort_merge_passes: 0
           digest: e788b66b07c3f3c74e4e7d8c623d623b
       first_seen: 2021-11-18 10:25:46
        last_seen: 2021-11-18 10:29:29
1 row in set (0.002 sec)

Constants are digested too, not only query normalization/syntax/comments:

MariaDB [my_wiki]> SELECT /* {transaction_id:666} */ * FROM recentchanges WHERE rc_timestamp < '2022';
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
| rc_id | rc_timestamp   | rc_actor | rc_namespace | rc_title | rc_comment_id | rc_minor | rc_bot | rc_new | rc_cur_id | rc_this_oldid | rc_last_oldid | rc_type | rc_source | rc_patrolled | rc_ip     | rc_old_len | rc_new_len | rc_deleted | rc_logid | rc_log_type | rc_log_action | rc_params |
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
|     1 | 20210909171156 |        3 |            0 | Test     |             1 |        0 |      0 |      1 |         2 |             2 |             0 |       1 | mw.new    |            0 | 127.0.0.1 |          0 |          7 |          0 |        0 | NULL        |               |           |
+-------+----------------+----------+--------------+----------+---------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+--------------+-----------+------------+------------+------------+----------+-------------+---------------+-----------+
1 row in set (0.000 sec)

MariaDB [my_wiki]> use sys
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

MariaDB [sys]> select * FROM sys.statement_analysis WHERE digest = '66971d919a0d1faf6d7b3f7dc985dd04'\G
*************************** 1. row ***************************
            query: SELECT * FROM `recentchanges` WHERE `rc_timestamp` < ? 
               db: my_wiki
        full_scan: 
       exec_count: 1
        err_count: 0
       warn_count: 0
    total_latency: 286.67 us
      max_latency: 286.67 us
      avg_latency: 286.67 us
     lock_latency: 56.00 us
        rows_sent: 1
    rows_sent_avg: 1
    rows_examined: 1
rows_examined_avg: 1
    rows_affected: 0
rows_affected_avg: 0
       tmp_tables: 0
  tmp_disk_tables: 0
      rows_sorted: 0
sort_merge_passes: 0
           digest: 66971d919a0d1faf6d7b3f7dc985dd04
       first_seen: 2021-11-18 10:32:50
        last_seen: 2021-11-18 10:32:50
1 row in set (0.002 sec)

MariaDB [sys]> use my_wiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [my_wiki]> SELECT /* {transaction_id:667} */ * FROM recentchanges WHERE rc_timestamp < '2019';
Empty set (0.000 sec)

MariaDB [my_wiki]> select * FROM sys.statement_analysis WHERE digest = '66971d919a0d1faf6d7b3f7dc985dd04'\G
*************************** 1. row ***************************
            query: SELECT * FROM `recentchanges` WHERE `rc_timestamp` < ? 
               db: my_wiki
        full_scan: 
       exec_count: 2
        err_count: 0
       warn_count: 0
    total_latency: 499.86 us
      max_latency: 286.67 us
      avg_latency: 249.93 us
     lock_latency: 110.00 us
        rows_sent: 1
    rows_sent_avg: 1
    rows_examined: 1
rows_examined_avg: 1
    rows_affected: 0
rows_affected_avg: 0
       tmp_tables: 0
  tmp_disk_tables: 0
      rows_sorted: 0
sort_merge_passes: 0
           digest: 66971d919a0d1faf6d7b3f7dc985dd04
       first_seen: 2021-11-18 10:32:50
        last_seen: 2021-11-18 10:35:02
1 row in set (0.002 sec)

Tendril (which of course I am not advocating to keep) also does the right thing when generating digests. You can see the same query structure here (a lock) with different constants and comments under the same md5sum:

[Screenshot from 2021-11-18 10-50-15.png: tendril report showing the same query structure (a lock) with different constants and comments grouped under one md5sum]

As far as I remember, the IP of the user used to be written to the comment (and the digesting kept working), but it hasn't been for a while, for some reason (the actor refactoring?). That would also be useful for detecting malfunctioning bots.

I still think logstash is the right way to implement this (it is application profiling), just mentioning that it will still work on p_s and tendril, so it is not a blocker for those.

The main problem we are having isn't grouping queries; it is the fact that using only p_s we get the queries redacted, which obviously makes it impossible to troubleshoot specific queries either via explain or via optimizer trace.
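
To illustrate the point (a sketch only, assuming performance_schema is enabled with the standard digest summary table): the digest text is normalized with placeholders, so it cannot be fed to EXPLAIN or to an optimizer trace, which need a concrete statement with real literals:

-- Sketch: the aggregated view only exposes the redacted, normalized form.
SELECT DIGEST, DIGEST_TEXT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 5;
-- EXPLAIN needs the full query with real literals, e.g. (values made up):
EXPLAIN SELECT * FROM recentchanges WHERE rc_timestamp < '2022';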

using only p_s we get the queries redacted

Ah, yes, that is a big limitation of P_S (and PMM, by extension). I wish an example of a few of them were available, but unless they are still ongoing in the processlist, that is not possible. Ideally on T295706 both a digest (for grouping) and the full query would be available.

Another limitation of P_S/PMM (I know Manuel knows this, and we discussed it in our meeting, but I am mentioning it for others) is that it won't register unfinished queries, which is very important for very long queries; tendril can do that because it works through the processlist.

However, my response wasn't "P_S is the right solution" (I did mention logstash would be the preferred solution), but a push back against "this (the patch) would make P_S useless" or "this would make tendril useless", as I believe those are not accurate: those tools would be unaffected. It won't improve P_S tracing, but it won't disrupt it (it is not a blocker).

In other words, if the patch improves logstash usage, it would be a net benefit, or at worst neutral.

Change 739692 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Document not adding varying ids to SQL comments

https://gerrit.wikimedia.org/r/739692

Change 739950 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Stop calling LBFactory::setAgentName from MW wiring

https://gerrit.wikimedia.org/r/739950

using only p_s we get the queries redacted

Ah, yes, that is a big limitation of P_S (and PMM, by extension). I wish an example of a few of them were available, but unless they are still ongoing in the processlist, that is not possible. Ideally on T295706 both a digest (for grouping) and the full query would be available.

Even if the request ID is only available via the processlist, that would be a start; in past incidents the slow queries were still running and therefore still visible in show processlist.
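
As a hypothetical illustration (the threshold, query text, and comment are made up), a still-running statement keeps its full text, including any leading comment, in the processlist:

-- Hypothetical illustration: INFO holds the full text of a running query,
-- including any leading comment (and hence a reqId, if we added one).
SELECT ID, TIME, INFO
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query' AND TIME > 10;
-- INFO might look like:
--   SELECT /* SpecialContributions::execute reqId=5e2f0d0a9c3b4e7f */ ... FROM revision ...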

To be more explicit, my understanding is that we didn't want unique comments in queries because it would hurt aggregation via tendril, performance_schema, and whatever replaces tendril. However, my reading of T193050#7512577 and the other comments by @jcrespo is that this is not the case, and those tools are able to properly aggregate queries regardless of the comments.

Is that a correct understanding? If so, it seems like there's no prohibition anymore on having query comments be unique and re-introduce IP/username/request ID/etc. with the understanding that it will only be visible in limited places like processlist and not every tool like tendril, performance_schema, etc.

I feel we are mixing up two things: what tendril, performance_schema, or whatever other tool we use can come up with, and what we want to do with the comments.
Right now, tendril uses a unique method to grab the queries, which is running show processlist. While this works great for instantly knowing what is running and what is or isn't slow, it is something that only tendril does; I do not know of any other tool that provides this. Again, it is great, but it has lots of scalability issues, as we are seeing in our environment, and it has a limited lifetime. Unfortunately, we cannot continue with tendril and we need to look for something else.
And whatever we go for, if it uses performance_schema it won't use show processlist, and the query will be redacted.

So far, it looks like logstash might be able to provide something somewhat close to what tendril offers (the full slow query that was run at a given time and on a given server). It still won't provide real-time queries, but this is the closest we can get if we want to know exactly which query was slow.

If in logstash we are able to group queries (or slow queries) even if we introduce IP/username/request ID/etc, then that's fine by me and we can go ahead.

I'm sorry for the confusion around the comment; the reason was that events_statements_current stores both the normalized and the non-normalized query and shows them next to each other.

I'm still slightly in favor of not varying comments because, whether in p_s or PMM or something else in the future, I think we might run into aggregation issues, and it is going to be harder to fix later (compared to now, when we are preventing it). Not super strong on it though.

As far as logstash is concerned, there's little point in putting the IP etc. in the query when you can just store them as separate logstash fields, allowing better normalization, search, ranking etc.

There's also a privacy aspect - in general we try to avoid storing information which allows deducing reader behavior (i.e. who looked at what). Reading pages doesn't usually result in DB queries, but occasionally it might - are we concerned about that?

As far as logstash is concerned, there's little point in putting the IP etc. in the query when you can just store them as separate logstash fields, allowing better normalization, search, ranking etc.

Indeed, we store the reqId, which can be cross-referenced with the rest of logstash or hadoop, and it is in every MW log by default.

There's also a privacy aspect - in general we try to avoid storing information which allows deducing reader behavior (i.e. who looked at what). Reading pages doesn't usually result in DB queries, but occasionally it might - are we concerned about that?

The IP is usually important during (malicious or not) DoS attacks that use slow queries to overload the db: we need to find the abuser.

So I don't think we need to do more, the reqId is there.

I find this whole thread curious - I feel like people (including me) were talking past each other - so I am going to try to clarify where we are after reading through all the comments :-).

I think the main pending misunderstanding is that the reqId may not be needed: if logstash ends up being the main location for query monitoring, with the work on T295706, there will be no need to literally send it within the query, as it will be stored alongside the query in a separate field. Please note that this conclusion changed over time (it wasn't the original blocker, and hence, I think, the confusion), for me and for others. Could @Krinkle and @Ladsgroup collaborate, to the extent possible, on T295706 so that can happen, making sure the reqId is part of the logging and we have the best traceability possible for SQL? I think with that, tracing of slow queries will be possible. If (and that is very hypothetical) logstash ends up not being the right tooling, or not enough for slow queries, and something else is needed, we should re-evaluate this ticket (depending on that hypothetical extra tooling). For example, reqId + P_S is not useful without MySQL 8 query sampling, so it wouldn't work either.

There is one thing pending, however. My understanding of @Krinkle's raised issues is that T295706 will cover traceability of slow queries - but how to trace non-slow ones? E.g. the case where a DoS or a bug happens, but the queries are very fast, as happened recently at T296063 (where I believe they were fast queries). This is a new use case that wasn't covered by tendril and won't be covered by T295706, and a new solution will be needed (it doesn't have to be outside of logstash!). Even a reqId may not be enough here, because it won't appear on logstash in any other record, except maybe via sampling? I am throwing this separate need for tracing into the air; no solution is probably available for it yet.

Change 739950 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Stop calling LBFactory::setAgentName from MW wiring

https://gerrit.wikimedia.org/r/739950

I am lost in this thread. As far as performance and tracing go, I am neutral or supportive of adding the reqId in a comment to the SQL we send. I don't mind much if we can't do it, but if it won't be considered problematic or add difficulty, then I think we should do it.

Jaime seems to indicate that Tendril never had an issue with varying queries in the first place, and that it's not unusual at all for other tooling to normalize this correctly as well. My guess is that I and others were confused by the fact that comments were visible in Tendril, not realizing that its aggregation logic for similar queries was handling things just fine. And not just fine with comments, but also with whitespace, capitalisation of keywords, and aggregating variables and constants into placeholders. And not just in Tendril, but in other relevant tooling where aggregation is needed as well. Am I reading that right?

It seems to me that if a tool is going to do any kind of meaningful aggregation over queries, then normalizing is a hard, minimum requirement for it to be even remotely useful. There are lots of variables in conditional clauses that would otherwise still make most queries from the same caller differ in small ways. Once past the need for normalization over clauses and such, I think stripping comments is an easy thing to do, easier than normalizing conditional clauses. If a tool we use does one without the other, perhaps we can patch that upstream.

If someone can speak with authority having understood and read the various arguments, and thinks we can do it, please say so. Otherwise, please decline this task. I won't mind :)

Things have changed quite a bit since we last discussed this. Most notably: Tendril will be shut down next week, so the whole discussion around its aggregation is now moot. OTOH, we now have the slow queries dashboard, which actually aggregates on the comment, so adding the request ID would break that aggregation. OTOOH (I have three hands), the request ID is automatically part of the logs in logstash (as the replacement for tendril), so that already covers most of the use cases needed here.

Given the above, I suggest declining this.

Jaime asked:

There is one thing pending, however. […] how to trace non-slow ones? E.g. the case where a DoS or a bug happens, but the queries are very fast, as happened recently at T296063 […]

Thanks. While it's a valid use case, I feel this is a bit out of scope and hypothetical. If a fast query can cause issues, it must be happening so often that it would show itself somewhere else, e.g. in traffic-related logs.

My future hope and plan is to have mediawiki do random sampling of queries (1:128 maybe? or 1:1024) which would give us massive o11y insight into the system and automatically address this issue as well.

I'd like to run a few statements/questions by you to make sure I understand your choice. Do some of these sound right?

  • You don't foresee a need to access individual DB hosts to look at running queries, where a reqId might come in handy.
  • It's not worth adding it anyway, because there are (or one day might be?) tools that aggregate queries but don't support stripping comments.
  • It's not worth doing because sampled entries in Logstash will suffice, and this would be extra code to support something we don't need.

As I was quoted here, I feel I need to give context to my comment: I think a separate ticket/conversation can/should be opened for things we have "lost" with the shutdown of tendril, or that were maybe never possible, like the one I mentioned. I think it is a valid conversation to have, but I also think it is out of scope for this ticket - they are not blockers for the tendril shutdown, but "nice to haves"/improvements over the current situation, especially given that Ladsgroup is aware of this already (we can always do better :-)) and we cannot just give everything top priority. I will be glad to provide feedback on things I missed while attending an outage, especially in the context of a db overload (some arose during the conversation leading to https://wikitech.wikimedia.org/wiki/Db-kill), but none were related to a request id.

So there are a few things I hope everybody agrees:

  • There should be a way to correlate logs from the same HTTP requests, including database queries and other logs (I believe that is already possible)
  • Slow log queries should be grouped by same db/same general structure (this already happens)
  • Work is not finished (it never is), and more observability is always better (sampling, performance schema, more logs, more metrics, more dashboards, etc.), but not all of it can be implemented at the same time; it has to be ranked as concrete, actionable needs
  • The Performance team already has access to performance_schema and that could help with some of the work they are doing (they may need guidance?)

I'd like to run a few statements/questions by you to make sure I understand your choice. Do some of these sound right?

  • You don't foresee a need to access individual DB hosts to look at running queries, where a reqId might come in handy.

It's hard to predict, but I think so. Mostly because either the query is so slow that it won't show up in the slow queries dashboard (in that case, it's the code path that's more important), or it is fast enough to show up in slow queries. If it's fast but there is a flood of them, that should be tackled with sampling and is out of scope of this.

  • It's not worth adding it anyway, because there are (or one day might be?) tools that aggregate queries but don't support stripping comments.

That's correct. We are already at that point with logstash slow queries, but there might be more tools doing similar things.

  • It's not worth doing because sampled entries in Logstash will suffice, and this would be extra code to support something we don't need.

Instead of suffice, I'd say it will be complementary but yes. It adds more code to maintain that we really don't need.

So there are a few things I hope everybody agrees:

  • There should be a way to correlate logs from the same HTTP requests, including database queries and other logs (I believe that is already possible)

Yes! We got it for free by moving to logstash and mw's native logging.

  • Slow log queries should be grouped by same db/same general structure (this already happens)

Same as above.

  • Work is not finished (it never is), and more observability is always better (sampling, performance schema, more logs, more metrics, more dashboards, etc.), but not all of it can be implemented at the same time; it has to be ranked as concrete, actionable needs

Exactly. I don't think we are losing any major functionality with this but we can always improve and I already have some ideas.

  • The Performance team already has access to performance_schema and that could help with some of the work they are doing (they may need guidance?)

Let me know if you want to know more.

Thanks. That works for me.