
Repeated deployment-mediawiki-07 socket timeouts
Closed, Resolved, Public

Description

https://commons.wikimedia.beta.wmflabs.org/wiki/

If you report this error to the Wikimedia System Administrators, please include the details below.

Request from - via deployment-cache-text05.deployment-prep.eqiad.wmflabs, ATS/8.0.5
Error: 502, Next Hop Connection Failed at 2020-03-01 15:29:36 GMT

Earlier today I got this error:

Upload error
Unable to modify the file "6f2dbfewe.jpg" because the file repository "local" is in read-only mode.
The system administrator who locked it offered this explanation: "".

That's no explanation!

Event Timeline

I get this also:

image.png (653×1 px, 42 KB)

Output of tracert command on Windows:

C:\Users\Kizulee>tracert commons.wikimedia.beta.wmflabs.org

Tracing route to commons.wikimedia.beta.wmflabs.org [185.15.56.36]
over a maximum of 30 hops:

  1    <1 ms    <1 ms     2 ms  homerouter.cpe [192.168.8.1]
  2     *        *        *     Request timed out.
  3    32 ms    33 ms    29 ms  192.168.118.22
  4     *        *        *     Request timed out.
  5    33 ms    27 ms    21 ms  217.65.201.129
  6    29 ms    25 ms    28 ms  217.65.201.146
  7    26 ms    29 ms    28 ms  te0-1-0-7-0.ccr51.beg03.atlas.cogentco.com [149.14.236.49]
  8    47 ms    38 ms    37 ms  be3464.ccr52.vie01.atlas.cogentco.com [154.54.59.189]
  9    46 ms    38 ms    38 ms  ae-14.r01.vienat01.at.bb.gin.ntt.net [129.250.9.129]
 10   146 ms   139 ms   138 ms  ae-1.r00.vienat01.at.bb.gin.ntt.net [129.250.2.36]
 11     *        *        *     Request timed out.
 12   148 ms     *      153 ms  ae-8.r22.asbnva02.us.bb.gin.ntt.net [129.250.4.96]
 13   145 ms   148 ms   139 ms  ae-1.r05.asbnva02.us.bb.gin.ntt.net [129.250.2.20]
 14   152 ms   146 ms   142 ms  ae-0.a03.asbnva02.us.bb.gin.ntt.net [129.250.5.194]
 15   148 ms   140 ms   138 ms  xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net [129.250.204.190]
 16   159 ms   150 ms   140 ms  instance-deployment-cache-text05.deployment-prep.wmflabs.org [185.15.56.36]
 17   150 ms   141 ms   137 ms  instance-deployment-cache-text05.deployment-prep.wmflabs.org [185.15.56.36]

Trace complete.

C:\Users\Kizulee>

15:14:04 <shinken-wm> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
15:14:13 <shinken-wm> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
15:16:57 <shinken-wm> PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds

Related? enwp beta seems down as well fwiw
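
For anyone who wants to reproduce this kind of check by hand: the alerts above report a socket timeout after 10 seconds. Below is a minimal Python sketch of a similar probe. The function name `probe`, the injectable `opener` parameter and the exact status strings are assumptions made for illustration; this is not how shinken itself is implemented.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return a short status string for one HTTP GET against `url`.

    `opener` is injectable (hypothetical parameter) so the function can be
    exercised without touching the network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return f"OK - HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 502 or 503.
        return f"CRITICAL - HTTP {exc.code}"
    except (socket.timeout, TimeoutError):
        # No answer arrived within `timeout` seconds.
        return f"CRITICAL - Socket timeout after {timeout:g} seconds"
    except urllib.error.URLError as exc:
        # urllib can wrap connect-phase timeouts in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return f"CRITICAL - Socket timeout after {timeout:g} seconds"
        return f"CRITICAL - {exc.reason}"
```

Calling `probe("https://commons.wikimedia.beta.wmflabs.org/wiki/")` during an outage window would be expected to return one of the CRITICAL strings, and an OK string once the app server is back.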

Commons:

Request from 86.160.232.199 via deployment-cache-text05 frontend, Varnish XID 37126363
Error: 503, Backend fetch failed at Sun, 01 Mar 2020 15:54:24 GMT

Enwp:

ERR_CONNECTION_TIMED_OUT
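
The error footers pasted in this task come in two shapes (ATS and Varnish) with different date formats. To line the reports up against the alert timestamps, a small parser along these lines may help. It is only a sketch: the name `parse_error_footer` and the regular expression are my own and cover just the two formats quoted here.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Matches e.g. "Error: 502, Next Hop Connection Failed at 2020-03-01 15:29:36 GMT"
FOOTER = re.compile(r"Error: (?P<code>\d{3}), (?P<reason>.+?) at (?P<when>.+ GMT)$")


def parse_error_footer(line):
    """Return (status_code, reason, UTC datetime) or None if the line does not match."""
    match = FOOTER.search(line.strip())
    if match is None:
        return None
    when = match.group("when")
    try:
        # ATS style: "2020-03-01 15:29:36 GMT"
        stamp = datetime.strptime(when, "%Y-%m-%d %H:%M:%S GMT").replace(tzinfo=timezone.utc)
    except ValueError:
        # Varnish style (RFC 2822-like): "Sun, 01 Mar 2020 15:54:24 GMT"
        stamp = parsedate_to_datetime(when)
    return int(match.group("code")), match.group("reason"), stamp
```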

16:06:47 <shinken-wm> RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 92302 bytes in 1.110 second response time

Looks back to me.

@AlexisJazz, @Zoranzoki21: what about you?

RhinosF1 renamed this task from "Commons Beta-Cluster is not loading" to "deployment-mediawiki-07 socket timeout 2020-03-01 15:14-16:09 UTC". Mar 1 2020, 4:10 PM
RhinosF1 claimed this task.

Marking as resolved per conversation in releng channel. Reopen if issue continues.

RhinosF1 removed a project: User-RhinosF1.

Unassigning, I didn’t do anything.

Works, thanks!

AlexisJazz reopened this task as Open. Edited Mar 2 2020, 2:26 PM

Just getting timeouts now.

Request from - via deployment-cache-text05.deployment-prep.eqiad.wmflabs, ATS/8.0.5
Error: 502, Next Hop Connection Failed at 2020-03-02 14:26:00 GMT

In T246577#5933010, @Zoranzoki21 wrote:

I can't reproduce this.

Now it works again. Maybe leave this open for a day or so, to see if it remains up this time.

Checked the logs and it seemed to be ~14:10 to ~15:10.

RhinosF1 renamed this task from "deployment-mediawiki-07 socket timeout 2020-03-01 15:14-16:09 UTC" to "deployment-mediawiki-07 socket timeout 2020-03-01/02". Mar 2 2020, 4:02 PM

If you're in Europe, it's a Wikimedia-wide outage.

Services should be back now :)

There are many people in Europe you know! ;) Including me. Alright, let's hope things keep working now. I noticed regular enwiki wasn't responding very well either, but that didn't seem to outright break.

thcipriani triaged this task as Medium priority. Mar 2 2020, 5:24 PM

Hrm. Right around 14:00 in deployment-mediawiki-07:/var/log/php7.2-fpm/error.log I see a lot of:

Mar  2 14:00:00 deployment-mediawiki-07 php7.2-fpm: PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 8388608 bytes) in /srv/mediawiki/php-master/includes/cache/localisation/LCStoreStaticArray.php on line 136
Mar  2 14:00:21 deployment-mediawiki-07 php7.2-fpm: PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 8388608 bytes) in /srv/mediawiki/php-master/includes/cache/localisation/LCStoreStaticArray.php on line 136
Mar  2 14:00:35 deployment-mediawiki-07 php7.2-fpm: PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/php-master/cache/l10n/koi.l10n.php on line 4

This looks related to the new PHP l10n work (T99740); IIRC @Jdforrester-WMF enabled this on beta sometime last week.
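
To see which code paths are hitting the limit, the fatals can be tallied per file and line. The following is a rough Python sketch, assuming the log format shown in the lines quoted above; `summarize_oom` and the regular expression are illustrative names of mine, not an existing tool.

```python
import re
from collections import Counter

# Matches the "Allowed memory size ... exhausted" fatals as they appear in the php7.2-fpm error log.
OOM = re.compile(
    r"PHP Fatal error:\s+Allowed memory size of (?P<limit>\d+) bytes exhausted "
    r"\(tried to allocate (?P<alloc>\d+) bytes\) in (?P<file>\S+) on line (?P<line>\d+)"
)


def summarize_oom(lines):
    """Count memory-exhaustion fatals per (file, line) and collect the limits seen."""
    sites = Counter()
    limits = set()
    for text in lines:
        match = OOM.search(text)
        if match:
            sites[(match.group("file"), int(match.group("line")))] += 1
            limits.add(int(match.group("limit")))
    return sites, limits
```

For reference, the limit in these messages, 692060160 bytes, is 660 MiB, and the failed 8388608-byte allocation is 8 MiB.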

I’m in Europe as well! Just wanted to point out that the issue at that point was unrelated to the others reported.

Here's another odd thing that may or may not be related: https://commons.wikimedia.beta.wmflabs.org/wiki/File:6f2dbfewe.jpg

Sure, I'm a butcher as I'm trying to create a mass-undeletion tool. But how did I manage to duplicate my own upload? The file has 4 revisions now, all identical. I have no idea how I did that.

Now it's got 8 revisions! It had 2 at some point, then 4, now 8.. I can see where this is going. Your server is going to run out of space rather soon.

Made a new task for this: T246695

Here's another odd thing that may or may not be related: https://commons.wikimedia.beta.wmflabs.org/wiki/File:6f2dbfewe.jpg

Probably not related.

Also, https://upload.beta.wmflabs.org/wikipedia/commons/archive/0/0e/20200302174125%216f2dbfewe.jpg returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public.0e/archive/0/0e/20200302174125%216f2dbfewe.jpg"

FYI beta broke at 21:03:57ish today and I restored it around 21:14:53ish (by restarting php7.2-fpm on deployment-mediawiki-07).

Mar  2 20:38:20 deployment-mediawiki-07 php7.2-fpm: PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 8388608 bytes) in /srv/mediawiki/php-master/includes/cache/localisation/LCStoreStaticArray.php on line 136
Mar  2 21:16:56 deployment-mediawiki-07 php7.2-fpm: PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 1048576 bytes) in /srv/mediawiki/php-master/includes/cache/localisation/LCStoreStaticArray.php on line 136

The only thing between these two log lines is a JSON version of the second one.
Around the time it was broken, we instead had a lot of logs like this:

Mar  2 21:07:45 deployment-mediawiki-07 php7.2-fpm[642]: [WARNING] [pool www] child 651, script '/srv/mediawiki/docroot/standard-docroot/w/api.php' (request: "GET /w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&meta=notifications&notformat=model&notlimit=max&notwikis=wikidatawiki%7Ccommonswiki%7Cenwiki&notfilter=%21read") executing too slow (18.202071 sec), logging
Mar  2 21:07:45 deployment-mediawiki-07 php7.2-fpm[642]: [NOTICE] child 651 stopped for tracing
Mar  2 21:07:45 deployment-mediawiki-07 php7.2-fpm[642]: [NOTICE] about to trace 651
Mar  2 21:07:45 deployment-mediawiki-07 php7.2-fpm[642]: [NOTICE] finished trace of 651
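
To find out which requests were stalling, the "executing too slow" warnings can be pulled apart into child PID, script, duration and API parameters. This is a hedged Python sketch based only on the warning format quoted above; `parse_slow` and its return shape are assumptions for illustration.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the php-fpm slow-request warning as it appears in the log excerpt.
SLOW = re.compile(
    r"child (?P<child>\d+), script '(?P<script>[^']+)' "
    r"\(request: \"(?P<request>[^\"]*)\"\) "
    r"executing too slow \((?P<seconds>\d+(?:\.\d+)?) sec\)"
)


def parse_slow(line):
    """Return a dict describing one slow-request warning, or None if the line does not match."""
    match = SLOW.search(line)
    if match is None:
        return None
    # The request field looks like "GET /w/api.php?...": split method from target.
    method, _, target = match.group("request").partition(" ")
    return {
        "child": int(match.group("child")),
        "script": match.group("script"),
        "method": method,
        "params": parse_qs(urlsplit(target).query),
        "seconds": float(match.group("seconds")),
    }
```

Applied to the warning above, this shows child 651 spending about 18.2 seconds on an `action=query&meta=notifications` call spanning wikidatawiki, commonswiki and enwiki.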

Unable to modify the file "76f2dbef4cc9_7.jpg" because the file repository "local" is in read-only mode.

The system administrator who locked it offered this explanation: "".

There we go again..

And saving preferences often results in:

The Wikimedia Commons database is temporarily in read-only mode for the following reason:
The database is read-only until replication lag decreases.

Preferences are saved though.

Tried to import https://commons.wikimedia.org/wiki/Commons:Undeletion_requests/Archive/2020-02 on betacommons, which actually worked, even though I was given this:

Request from - via deployment-cache-text05.deployment-prep.eqiad.wmflabs, ATS/8.0.5
Error: 502, Next Hop Connection Failed at 2020-03-07 19:05:57 GMT

This comment was removed by AlexisJazz.
RhinosF1 renamed this task from "deployment-mediawiki-07 socket timeout 2020-03-01/02" to "repeated deployment-mediawiki-07 socket timeouts". Mar 20 2020, 10:04 PM
Jdforrester-WMF renamed this task from "repeated deployment-mediawiki-07 socket timeouts" to "Repeated deployment-mediawiki-07 socket timeouts ("Next Hop Connection Failed")". Apr 13 2020, 5:44 PM
RhinosF1 renamed this task from "Repeated deployment-mediawiki-07 socket timeouts ("Next Hop Connection Failed")" to "Repeated deployment-mediawiki-07 socket timeouts". Apr 20 2020, 9:07 AM

The error isn't always "Next Hop Connection Failed", as stated last time and as seen today.

In T246577#6991866, @Majavah wrote:

Is this still happening?

I don't think so.

Thanks! Closing in that case.