Page MenuHomePhabricator

Store page IDs to speed up queries and reports
Closed, ResolvedPublic8 Estimated Story Points

Description

Many of our queries look at pages that were edited/created in the time frame by the given participants, etc. This is done as a subquery.

With the new reports we'll be doing more such queries, and would have to fetch the pages all over again. It may be more efficient to get the page IDs in a separate query, store them and use a rev_page IN (...) clause.

We want to store them in our database rather than cache because organizers may create reports long after the event has ended, yet the list of pages would remain unchanged. We could store the page IDs as a comma-separated list, compressed, and this wouldn't require much for storage.

When fetching page IDs, the code should look at the event end date. If it's before the end date, run the query to get page IDs and store it, and use this for subsequent queries. If it's after the end date, we don't need to fetch page IDs again.

For titles, just do something like SELECT page_title FROM page WHERE page_id IN (...). This should be fast. When getting pageviews for instance, we may want to use fetch() on the aforementioned query instead of fetchAll(), at least in chunks, so all the titles don't have to be in memory at once.

Related Objects

StatusSubtypeAssignedTask
ResolvedMusikAnimal
ResolvedMusikAnimal
ResolvedMusikAnimal
InvalidNone
Resolved Prtksxna
ResolvedNiharika
OpenNone
ResolvedMusikAnimal
ResolvedMaxSem
ResolvedMusikAnimal
ResolvedMusikAnimal
ResolvedSamwilson
ResolvedMaxSem
ResolvedMusikAnimal
InvalidNone
ResolvedMaxSem
ResolvedMusikAnimal

Event Timeline

jmatazzoni renamed this task from Store page IDs to speed up queries and reports to Store page IDs to speed up queries and reports [timebox 8 hours].Oct 30 2018, 11:23 PM

Mental note: we should fetch page titles too in this part of the code because we need them to fetch pageviews

jmatazzoni subscribed.

We timeboxed this, so I am moving it to "Ready" on the Sprint board.

Preliminary PR: https://github.com/wikimedia/grantmetrics/pull/139

This does contain logic to detect when new page IDs should be fetched versus using the ones that already saved (saving that for a separate PR). Instead it re-fetches page IDs every time you regenerate the stats.

Also, I know this was a spike but it's one of those things you just have to implement to see how it fairs. As far as I can tell this page ID system helps, so let's move forward with it.

jmatazzoni renamed this task from Store page IDs to speed up queries and reports [timebox 8 hours] to Store page IDs to speed up queries and reports .Nov 7 2018, 12:07 AM
jmatazzoni set the point value for this task to 8.

The core work here has been merged, however we still need to add the logic to only update page IDs if the current date is before the end date. That's basically a performance improvement, that isn't required to move forward with other tasks dependent on page IDs. I've created T210682 to track this. I think this task can be considered resolved.