Jump to content

Talk:Quarry

About this board

Previous discussion was archived at Talk:Quarry/Archive 1 on 2015-04-17. Discussion area for discussion about Quarry itself and help about individual queries.

combining SQL and SPARQL query

1
Jarekt (talkcontribs)
Reply to "combining SQL and SPARQL query"
Summary by GTrang

No more replag

GTrang (talkcontribs)

The enwiki database has been on replag for an entire week now. It should hopefully be fixed in the next week or so.

Quarry / SQL query - how to fix OperationalError('table resultsets already exists')?

4
Gluo88 (talkcontribs)

I was able to complete the above SQL query as seen in https://quarry.wmcloud.org/history/84807/911494/884555.  

However, when I run the same query again at https://quarry.wmcloud.org/query/84807 .   I got the error "OperationalError('table resultsets already exists')".           How should I fix the error?

I guess that resultsets may be a temporary table that was produced from running my previous  SQL query.   I tried to execute the following:

DROP TABLE IF EXISTS resultsets;

However,  I got the following information: “Access denied; you need (at least one of) the SUPER, READ_ONLY ADMIN privilege(s) for this operation”

May anyone help me on this issue?

Thanks.

Gluo88 (talkcontribs)

I just found that the above query is complete now. The issue looks to be automatically solved, at least for now, although I still don't know the reason of the "OperationalError('table resultsets already exists').


However, my query https://quarry.wmcloud.org/query/84817 has just failed still with the same reason of the "OperationalError('table resultsets already exists').

May anyone know how to handle "OperationalError('table resultsets already exists')?

Thanks.

Gluo88 (talkcontribs)

The issue looks to be automatically solved again. Did someone help in background?

Thanks a lot.

This post was hidden by Gluo88 (history)
Reply to "Quarry / SQL query - how to fix OperationalError('table resultsets already exists')?"
Samwilson (talkcontribs)

Is there a delay between a tool database being created and it being available in Quarry? It looks like s55926__wishlist_p can not be queried (it was only created today): https://quarry.wmcloud.org/query/11263

TheDJ (talkcontribs)

As far as I know, tools databases are not public, and so not in quarry.

Samwilson (talkcontribs)

@TheDJ: That used to be the case, but I'd thought that recently (phab:T151158) it had become possible. It's only databases with names ending in _p.

BDavis (WMF) (talkcontribs)
Samwilson (talkcontribs)

Ah that's good to know, thanks! I'll be patient. :-)

Reply to "Tool databases"

Quarry / SQL optimization - using DB indexes?

3
Fl.schmitt (talkcontribs)

Hi, I'm trying to optimize a simple SQL query for pages on commonswiki (table page) which start with a certain string (e.g. "SELECT * FROM page WHERE page_title LIKE "Building%" ORDER by page_title;"- see also https://quarry.wmcloud.org/query/83277 for an example with a smaller result set). The SQL Optimizer on Toolforge tells me that this query would use filesort instead of indexes which causes performance issues ("Query plan 1.1 is using filesort or a temporary table. This is usually an indication of an inefficient query. If you find your query is slow, try taking advantage of available indexes to avoid filesort."). The DB schema documentation tells me that there should be an index defined (key: page_name_title) on page_title and page_namespace columns. The MySQL docs tell me that i could use USE INDEX (page_name_title) to enforce using that index. If I add that clause to my query (SELECT * FROM page USE INDEX (page_name_title) WHERE page_title LIKE "Building%" ORDER by page_title;), the SQL optimizer complains about a "Query error: Key 'page_name_title' doesn't exist in table 'page'". At the Commons village pump, I've learned that the replicas may lack the indices. So, I'm not sure if there's a way to optimize such a query on Quarry. I would prefer using Quarry instead of the toolforge CLI because Quarry allows for linking queries and results.

Matěj Suchánek (talkcontribs)

This query cannot use the index since you don't have a condition on page_namespace.

Fl.schmitt (talkcontribs)

Aww - ok :-) - yes, with such a condition, it works like a charm - thanks a lot!

Reply to "Quarry / SQL optimization - using DB indexes?"
SoySauceOnRice (talkcontribs)

Hi, I am doing some research on sockpuppets. I'm trying to use the SQL database to assist in building my dataset, but queries seem to take a long time.

For instance, this previous query ran in 196 seconds:

https://quarry.wmcloud.org/query/61732

Whereas my identical query has been going for over 9 hours:

https://quarry.wmcloud.org/query/83588

A previous instance was running over a week before I restarted it.

Am I doing something wrong, or are these running times typical?

On another note, I have been looking for complete SQL dumps to run my own instance so that I don't cause issues with excessive queries, however I can only find complete xml dumps (https://dumps.wikimedia.org/enwiki/20240601/).

Is there any way to get a full SQL dump (complete with article revision history)?

Thank you for your help.

Matěj Suchánek (talkcontribs)

Using Toolforge CLI, the query returns 187814 rows in 31.187 sec. Maybe that's too much for Quarry.

Reply to "Long Query Time"
Plastikspork (talkcontribs)
Reply to "Queued queries"

Query for wikipedia user registrations by day

3
Felipeangelim (talkcontribs)

Hi! I was trying to get the timeseries of new user registrations in wikipedia, by day (two columns, day and count), but my query is really slow. Am I doing something wrong?

SELECT

  DATE(user_registration) AS registration_date,

  COUNT(*) AS user_count

FROM

  user

WHERE

  user_editcount > 10

GROUP BY

  DATE(user_registration)

ORDER BY

  registration_date;
Matěj Suchánek (talkcontribs)

There is no index on user_registration and user_editcount, so you are doing nothing wrong, it's just impossible to use a better query plan than scanning the whole table.

Matěj Suchánek (talkcontribs)

You can try query Special:Log/newusers instead. For example:

SELECT LEFT(log_timestamp, 8), COUNT(*)
FROM logging_logindex
JOIN user ON user_name = REPLACE(log_title, '_', ' ')
WHERE log_type = 'newusers'
AND log_action IN ('create', 'newusers')
AND log_timestamp > '2006'
AND user_editcount > 10
GROUP BY LEFT(log_timestamp, 8);

There are even more values possible for log_action, but I'm not sure if the JOIN worked for them, too.

Reply to "Query for wikipedia user registrations by day"

All files in a category (for categories with million+ files)

2
Schlurcher (talkcontribs)

Hi, I'm trying to efficiently get a list of all files in a category with 1'000'000+ files. For example all files on Commons in this Category "Category:Flickr images reviewed by FlickreviewR 2". This was the most simple I could come up with, but it takes like forever to get an output: https://quarry.wmcloud.org/query/83527

Is there a better way or does this require another tool (which would that be)?

TheDJ (talkcontribs)

You would have to write a dedicated tool to do requests that divides the query into multiple chunks and writes them to a file. You could use the Special:MyLanguage/API:Categorymembers api for instance, and list per 500 (limit) and use the continue parameter from request to request. Or get a database dump installed locally and export them from there. A million is a lot, it is not a type of request that the wikimedia optimises for.

Also depending on what you are trying to achieve, there might be alternate/better ways to achieve that goal other than listing a million+ entries in one go.

Reply to "All files in a category (for categories with million+ files)"

Help finding articles with a given template

6
Wizmut (talkcontribs)

I am trying to find articles that use "Template:Auto short description" but not "Template:Short description". Is this possible?

Matěj Suchánek (talkcontribs)

Though it is possible using an SQL query, you can also use the default search.

Wizmut (talkcontribs)

Thank you. But I also need to sort by the length of the induced short description, so it has to be SQL.

Matěj Suchánek (talkcontribs)

I think I misunderstood your request. "Template:Short description" is used inside articles, but "Template:Auto short description" is only for documentation purposes. So you actually want articles with a template documented as "short description inducing" without "Template:Short description", right?

I made this query (hope it's correct):

SELECT page_title, LENGTH(pp_value) AS len, pp_value
FROM page_props
JOIN page ON page_id = pp_page
LEFT JOIN templatelinks ON tl_from = page_id AND tl_target_id = 137  # Short_description
WHERE pp_propname = 'wikibase-shortdesc'
AND tl_target_id IS NULL
AND page_namespace = 0
AND page_is_redirect = 0
AND EXISTS (
 SELECT 1 FROM templatelinks
 JOIN linktarget ON lt_id = tl_target_id
 JOIN page ON page_namespace = lt_namespace AND page_title = lt_title
 JOIN categorylinks ON cl_from = page_id
 WHERE tl_from = page_id AND cl_to = 'Templates_that_generate_short_descriptions'
);

It took 10 minutes and the only returned entry was for en:Main_Page.

Wizmut (talkcontribs)
Wizmut (talkcontribs)

The "AND tl_target_id = 137" seems to filter in only articles with explicit SDs. But changing it to "AND tl_target_id <> 137" gives a list of template links that aren't T:short_description. I'm not sure how to make it return a list of pages instead of template links.

Reply to "Help finding articles with a given template"