Talk:Quarry

About this board

Previous discussion was archived at Talk:Quarry/Archive 1 on 2015-04-17. Discussion area for discussion about Quarry itself and help about individual queries.

Start a new topic

combining SQL and SPARQL query

One comment • 16:59, 19 August 2024 1 day ago

1

Jarekt (talkcontribs)

Is there a way to find Commons files that

use P625 SDC property
do not transclude c:Module:Coordinates.

Reply 16:59, 19 August 2024 1 day ago

Reply to "combining SQL and SPARQL query"

Replag for enwiki

One comment • 23:36, 15 August 2024 5 days ago

1

Summary by GTrang

No more replag

GTrang (talkcontribs)

Tracked in Phabricator
Task T367856

The enwiki database has been on replag for an entire week now. It should hopefully be fixed in the next week or so.

05:31, 11 August 2024 10 days ago

Quarry / SQL query - how to fix OperationalError('table resultsets already exists')?

4 comments • 17:16, 24 July 2024 27 days ago

4

Gluo88 (talkcontribs)

I was able to complete the above SQL query as seen in https://quarry.wmcloud.org/history/84807/911494/884555.

However, when I run the same query again at https://quarry.wmcloud.org/query/84807 . I got the error "OperationalError('table resultsets already exists')". How should I fix the error?

I guess that resultsets may be a temporary table that was produced from running my previous SQL query. I tried to execute the following:

DROP TABLE IF EXISTS resultsets;

However, I got the following information: “Access denied; you need (at least one of) the SUPER, READ_ONLY ADMIN privilege(s) for this operation”

May anyone help me on this issue?

Thanks.

Reply 21:27, 13 July 2024 1 month ago

Gluo88 (talkcontribs)

I just found that the above query is complete now. The issue looks to be automatically solved, at least for now, although I still don't know the reason of the "OperationalError('table resultsets already exists').

However, my query https://quarry.wmcloud.org/query/84817 has just failed still with the same reason of the "OperationalError('table resultsets already exists').

May anyone know how to handle "OperationalError('table resultsets already exists')?

Thanks.

Reply Edited 01:25, 15 July 2024 1 month ago

Gluo88 (talkcontribs)

The issue looks to be automatically solved again. Did someone help in background?

Thanks a lot.

Reply 02:49, 15 July 2024 1 month ago

This post was hidden by Gluo88 (history)

Reply to "Quarry / SQL query - how to fix OperationalError('table resultsets already exists')?"

Tool databases

5 comments • 23:56, 23 July 2024 28 days ago

5

Samwilson (talkcontribs)

Is there a delay between a tool database being created and it being available in Quarry? It looks like s55926__wishlist_p can not be queried (it was only created today): https://quarry.wmcloud.org/query/11263

Reply 05:19, 23 July 2024 29 days ago

TheDJ (talkcontribs)

As far as I know, tools databases are not public, and so not in quarry.

Reply 09:16, 23 July 2024 28 days ago

Samwilson (talkcontribs)

@TheDJ: That used to be the case, but I'd thought that recently (phab:T151158) it had become possible. It's only databases with names ending in _p.

Reply 13:18, 23 July 2024 28 days ago

BDavis (WMF) (talkcontribs)

Quarry reads from the mirror of the primary ToolsDB rather than the primary. The mirror is lagged by 15 hours at the moment: https://grafana.wmcloud.org/d/PTtEnEyVk/toolsdb-mariadb?orgId=1&var-server=tools-db-3

Reply 16:08, 23 July 2024 28 days ago

Samwilson (talkcontribs)

Ah that's good to know, thanks! I'll be patient. :-)

Reply 23:56, 23 July 2024 28 days ago

Reply to "Tool databases"

Quarry / SQL optimization - using DB indexes?

3 comments • 18:33, 10 July 2024 1 month ago

3

Fl.schmitt (talkcontribs)

Hi, I'm trying to optimize a simple SQL query for pages on commonswiki (table page) which start with a certain string (e.g. "SELECT * FROM page WHERE page_title LIKE "Building%" ORDER by page_title;"- see also https://quarry.wmcloud.org/query/83277 for an example with a smaller result set). The SQL Optimizer on Toolforge tells me that this query would use filesort instead of indexes which causes performance issues ("Query plan 1.1 is using filesort or a temporary table. This is usually an indication of an inefficient query. If you find your query is slow, try taking advantage of available indexes to avoid filesort."). The DB schema documentation tells me that there should be an index defined (key: page_name_title) on page_title and page_namespace columns. The MySQL docs tell me that i could use USE INDEX (page_name_title) to enforce using that index. If I add that clause to my query (SELECT * FROM page USE INDEX (page_name_title) WHERE page_title LIKE "Building%" ORDER by page_title;), the SQL optimizer complains about a "Query error: Key 'page_name_title' doesn't exist in table 'page'". At the Commons village pump, I've learned that the replicas may lack the indices. So, I'm not sure if there's a way to optimize such a query on Quarry. I would prefer using Quarry instead of the toolforge CLI because Quarry allows for linking queries and results.

Reply 17:57, 10 July 2024 1 month ago

Matěj Suchánek (talkcontribs)

This query cannot use the index since you don't have a condition on page_namespace.

Reply 18:18, 10 July 2024 1 month ago

Fl.schmitt (talkcontribs)

Aww - ok :-) - yes, with such a condition, it works like a charm - thanks a lot!

Reply 18:33, 10 July 2024 1 month ago

Reply to "Quarry / SQL optimization - using DB indexes?"

Long Query Time

2 comments • 19:47, 12 June 2024 2 months ago

2

SoySauceOnRice (talkcontribs)

Hi, I am doing some research on sockpuppets. I'm trying to use the SQL database to assist in building my dataset, but queries seem to take a long time.

For instance, this previous query ran in 196 seconds:

https://quarry.wmcloud.org/query/61732

Whereas my identical query has been going for over 9 hours:

https://quarry.wmcloud.org/query/83588

A previous instance was running over a week before I restarted it.

Am I doing something wrong, or are these running times typical?

On another note, I have been looking for complete SQL dumps to run my own instance so that I don't cause issues with excessive queries, however I can only find complete xml dumps (https://dumps.wikimedia.org/enwiki/20240601/).

Is there any way to get a full SQL dump (complete with article revision history)?

Thank you for your help.

Reply 12:23, 11 June 2024 2 months ago

Matěj Suchánek (talkcontribs)

Using Toolforge CLI, the query returns 187814 rows in 31.187 sec. Maybe that's too much for Quarry.

Reply 19:47, 12 June 2024 2 months ago

Reply to "Long Query Time"

Queued queries

One comment • 14:06, 8 June 2024 2 months ago

1

Plastikspork (talkcontribs)

Recently, some of my queries get immediately queued and won't run. If I fork the query, then it runs immediately, but now I have a bunch that are marked as queued but not running. Is there a way to kill these old ones, and any ideas why this happens?

Here is an example of a few that are still queued:

https://quarry.wmcloud.org/query/71639

https://quarry.wmcloud.org/query/83591

https://quarry.wmcloud.org/query/83590

Thanks!

Reply 14:06, 8 June 2024 2 months ago

Reply to "Queued queries"

Query for wikipedia user registrations by day

3 comments • 07:12, 8 June 2024 2 months ago

3

Felipeangelim (talkcontribs)

Hi! I was trying to get the timeseries of new user registrations in wikipedia, by day (two columns, day and count), but my query is really slow. Am I doing something wrong?

SELECT

  DATE(user_registration) AS registration_date,

  COUNT(*) AS user_count

FROM

  user

WHERE

  user_editcount > 10

GROUP BY

  DATE(user_registration)

ORDER BY

  registration_date;

Reply Edited by Matěj Suchánek 06:50, 8 June 2024 2 months ago

Matěj Suchánek (talkcontribs)

There is no index on user_registration and user_editcount, so you are doing nothing wrong, it's just impossible to use a better query plan than scanning the whole table.

Reply 06:50, 8 June 2024 2 months ago

Matěj Suchánek (talkcontribs)

You can try query Special:Log/newusers instead. For example:

SELECT LEFT(log_timestamp, 8), COUNT(*)
FROM logging_logindex
JOIN user ON user_name = REPLACE(log_title, '_', ' ')
WHERE log_type = 'newusers'
AND log_action IN ('create', 'newusers')
AND log_timestamp > '2006'
AND user_editcount > 10
GROUP BY LEFT(log_timestamp, 8);

There are even more values possible for log_action, but I'm not sure if the JOIN worked for them, too.

Reply 07:12, 8 June 2024 2 months ago

Reply to "Query for wikipedia user registrations by day"

All files in a category (for categories with million+ files)

2 comments • 09:07, 5 June 2024 2 months ago

2

Schlurcher (talkcontribs)

Hi, I'm trying to efficiently get a list of all files in a category with 1'000'000+ files. For example all files on Commons in this Category "Category:Flickr images reviewed by FlickreviewR 2". This was the most simple I could come up with, but it takes like forever to get an output: https://quarry.wmcloud.org/query/83527

Is there a better way or does this require another tool (which would that be)?

Reply 14:08, 4 June 2024 2 months ago

TheDJ (talkcontribs)

You would have to write a dedicated tool to do requests that divides the query into multiple chunks and writes them to a file. You could use the Special:MyLanguage/API:Categorymembers api for instance, and list per 500 (limit) and use the continue parameter from request to request. Or get a database dump installed locally and export them from there. A million is a lot, it is not a type of request that the wikimedia optimises for.

Also depending on what you are trying to achieve, there might be alternate/better ways to achieve that goal other than listing a million+ entries in one go.

Reply 09:07, 5 June 2024 2 months ago

Reply to "All files in a category (for categories with million+ files)"

Help finding articles with a given template

6 comments • 20:27, 1 June 2024 2 months ago

6

Wizmut (talkcontribs)

I am trying to find articles that use "Template:Auto short description" but not "Template:Short description". Is this possible?

Reply 23:00, 31 May 2024 2 months ago

Matěj Suchánek (talkcontribs)

Though it is possible using an SQL query, ~~you can also use the default search.~~

Reply Edited 17:18, 1 June 2024 2 months ago

Wizmut (talkcontribs)

Thank you. But I also need to sort by the length of the induced short description, so it has to be SQL.

Reply 10:31, 1 June 2024 2 months ago

Matěj Suchánek (talkcontribs)

I think I misunderstood your request. "Template:Short description" is used inside articles, but "Template:Auto short description" is only for documentation purposes. So you actually want articles with a template documented as "short description inducing" without "Template:Short description", right?

I made this query (hope it's correct):

SELECT page_title, LENGTH(pp_value) AS len, pp_value
FROM page_props
JOIN page ON page_id = pp_page
LEFT JOIN templatelinks ON tl_from = page_id AND tl_target_id = 137  # Short_description
WHERE pp_propname = 'wikibase-shortdesc'
AND tl_target_id IS NULL
AND page_namespace = 0
AND page_is_redirect = 0
AND EXISTS (
 SELECT 1 FROM templatelinks
 JOIN linktarget ON lt_id = tl_target_id
 JOIN page ON page_namespace = lt_namespace AND page_title = lt_title
 JOIN categorylinks ON cl_from = page_id
 WHERE tl_from = page_id AND cl_to = 'Templates_that_generate_short_descriptions'
);

It took 10 minutes and the only returned entry was for en:Main_Page.

Reply 17:18, 1 June 2024 2 months ago

Wizmut (talkcontribs)

I very much appreciate the effort. But there are probably a few thousand pages that have very long induced short descriptions. I was able to find this one: [https://en.wikipedia.org/wiki/St._Anthony_Catholic_School] by skimming through [https://quarry.wmcloud.org/query/82687]. A fellow wikipedian who adds SDs requested a script for all such auto-SDs in excess of 80 characters, although SDs tend to get visually cut off around 55 characters.

Reply 18:10, 1 June 2024 2 months ago

Wizmut (talkcontribs)

The "AND tl_target_id = 137" seems to filter in only articles with explicit SDs. But changing it to "AND tl_target_id <> 137" gives a list of template links that aren't T:short_description. I'm not sure how to make it return a list of pages instead of template links.

Reply 20:27, 1 June 2024 2 months ago

Reply to "Help finding articles with a given template"