The guy improving the security posture and watching South Park. In daily life I am a Computer Science student specialising in cybersecurity and open-source infrastructure, networks and software.
User Details
- User Since
- Oct 12 2014, 7:12 AM (513 w, 4 d)
- Availability
- Available
- IRC Nick
- Southparkfan
- LDAP User
- Southparkfan
- MediaWiki User
- Southparkfan
Mon, Aug 12
Follow-up from IRC: Wikimedia uses Hosted RPKI, but we assume the ARIN portal simply doesn't support anything other than ROAs. There is an ASPA record for AS11358, an ASN controlled by ARIN, and we think they use Hybrid RPKI, where ARIN still hosts the RPKI objects through their Repository Publication Service. Technically, Krill could both act as the CA and create ASPA records, and it is possible that AS11358 and others are doing this.
Fri, Aug 9
Wed, Jul 24
sessionstorage04 is no longer.
Tue, Jul 23
Couldn't upgrade Cassandra to 4.x on Buster, because there are no packages in buster-wikimedia. Installing Cassandra was a rather interesting process.
I didn't get a response in -sre, but Andrew has provided me with extra information.
Mon, Jul 22
Puppet fails to install the Cassandra instance:
Error: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0]
Error: /Stage[main]/Cassandra/Cassandra::Instance[default]/Exec[install-/var/lib/cassandra/data]/returns: change from 'notrun' to ['0'] failed: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0] (corrective)
Error: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0]
Error: /Stage[main]/Cassandra/Cassandra::Instance[default]/Exec[install-/var/lib/cassandra/data]/returns: change from 'notrun' to ['0'] failed: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0] (corrective)
Had to delete sessionstorage05 (bookworm) due to T357791; will replace it with a bullseye instance for Cassandra.
Sat, Jul 20
Great, can't ssh into my new instance:
$ ssh deployment-sessionstore05.deployment-prep.eqiad1.wikimedia.cloud
Connection closed by UNKNOWN port 65535
@Jgiannelos hey! Is deployment-restbase-bullseye (created by you last year) ready to take over the work from restbase04? Other than changing the references to restbase04 in Horizon hiera and LabsServices.php, and in the changeprop Chart (deployment-charts), it should be possible to switch, although the restbase service is not listening on port 7231 on -bullseye. Any idea what's wrong?
@hnowlan I see you have created deployment-maps-master02. Other than possibly replacing the old master in https://github.com/wikimedia/maps-kartotherian-deploy/blob/master/scap/environments/beta/targets, is there anything needed before deleting master01?
As soon as the above changes have been merged, urldownloader03 can be deleted.
@BTullis I see you have created deployment-snapshot05 (Bullseye), although this new host was not part of https://gerrit.wikimedia.org/r/c/operations/dumps/scap/+/1008451, and neither is it part of the mediawiki-installation dsh group. Do we have to add snapshot05 to your 'scap' repository as well, or is it fine to just add it to the dsh group?
Given that Puppet does not have a flag to stop the periodic MediaWiki jobs, I had to disable Puppet on mwmaint03 and kill the jobs myself (just like the DC switchover cookbooks do). Can be re-enabled as soon as mwmaint02 is gone (deleted + removed from dsh groups).
It looks like the shellbox container is broken on the new host:
root@deployment-shellbox01:~# /usr/bin/docker run --rm=true --env-file /etc/shellbox/env -p 8081:8081 -v shellbox:/etc/shellbox -v /run/shared:/run/shared -v /srv/shellbox/config/:/srv/app/config -v /srv/shellbox/src:/srv/app/src --name spftest docker-registry.wikimedia.org/wikimedia/mediawiki-libs-shellbox:2024-06-13-133425-video --nodaemonize
[20-Jul-2024 14:43:13] ERROR: unable to bind listening socket for address '/run/shared/fpm-www.sock': Permission denied (13)
[20-Jul-2024 14:43:13] ERROR: unable to bind listening socket for address '/run/shared/fpm-www.sock': Permission denied (13)
[20-Jul-2024 14:43:13] ERROR: FPM initialization failed
[20-Jul-2024 14:43:13] ERROR: FPM initialization failed
deployment-parsoid14 has been installed with a Bullseye image.
Upgrade to bullseye/bookworm blocked due to T332015. @MoritzMuehlenhoff, can I help you to get poolcounter-prometheus-exporter imported to bullseye and/or bookworm (preferably both)?
^ after merging this change, deployment-push-notifications01 can be replaced with a Bookworm instance.
mediawiki11 and mediawiki12 are no longer in use, but still receive scap deployments. As soon as the two changes above have been merged, we can delete these instances.
@Jgiannelos I see you have created mobileapps02 with a Bullseye image. Is mobileapps01 ready for removal?
Fri, Jul 19
Done (volume has been deleted as well)
Done :)
deployment-jobrunner04 has been shut down. As soon as https://gerrit.wikimedia.org/r/1055394 and https://gerrit.wikimedia.org/r/1055412 are merged, we can delete that instance.
Instance is offline, seems to be superseded by deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud per T357476#9540192. @Urbanecm_WMF, do you agree we can delete deployment-docker-cpjobqueue01?
Instance does not exist anymore?
After merging https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1055306, the irc.beta.wmflabs.org RRset can be removed, and the floating IP can be removed from deployment-ircd03 as well; if you want to test the IRC server, you can run irssi on a Cloud VPS instance.
Thu, Jul 18
Loop fixed by setting profile::base::remove_python2_on_bullseye: false on prefix level, also done in production.
role::mw_rc_irc seems to work fine on a Bullseye box, except for a loop.
Jun 14 2024
Relevant: T127717#9671526 (i.e. can be deleted without issues, but preferably only before other instances are deleted)
May 10 2024
Thanks for your help, Riccardo! Given current time constraints, I'm afraid most of this work will take multiple months, but nevertheless, to see whether kea_python still works with the Kea packages provided by Debian, I felt it was time to bite the bullet and build the bindings manually.
Apr 17 2024
Thank you for your reply! My comments:
As I understand it, no server in production VLANs (that is: starting with {analytics,private,public} - excluding frack infrastructure?) should rely on DHCP for any purpose other than reimaging, because the IPv4 address will be set statically in d-i. For that reason, I can see why we would like to refuse DHCP requests if no syslinux path is provided by NetBox. I wouldn't classify it as a security measure against malevolent administrators, but rather as a failsafe to mitigate the impact of operator error.
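The failsafe described above can be sketched as a small decision function. This is only an illustration under assumptions: `HostRecord`, its `syslinux_path` field and `should_answer_dhcp` are hypothetical names, not part of NetBox, Kea or kea_python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostRecord:
    """Hypothetical view of what the inventory knows about one host."""
    hostname: str
    # Assumption: a boot path is only present when a reimage is intended.
    syslinux_path: Optional[str] = None


def should_answer_dhcp(host: Optional[HostRecord]) -> bool:
    """Answer a DHCP request only for a known host with a boot path set."""
    if host is None:
        return False  # unknown client: refuse
    return bool(host.syslinux_path and host.syslinux_path.strip())
```

With this shape, a server whose address is configured statically and that has no pending reimage never gets a lease by accident, which is the operator-error case the failsafe targets.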
Apr 9 2024
@ayounsi and I have discussed my first findings, and we thought it made sense to share them here.
Mar 28 2024
Mar 27 2024
Haven't made a lot of progress on this, unfortunately. Scheduled for April.
Nov 20 2023
I'll work on this.
Nov 15 2023
Production migration from the gnutls driver to the openssl driver can be tracked in T324623.
Nov 3 2023
Oct 13 2023
Alternative to consider: injecting REDIRECTs for traffic meant for a VIP. See the second section at http://www.linuxvirtualserver.org/docs/arp.html. I haven't tested it and it requires some sort of Netfilter implementation on the realservers, but it avoids MTU-related issues (when tunneling traffic). Never mind, the ARP problem is solved at Wikimedia by not announcing ARP. MTU is a challenge when using any type of encapsulation (in this case IPIP), but that's a different issue :)
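To make the MTU remark concrete, here is a small worked calculation. It assumes a 1500-byte link MTU and IPv4 and TCP headers without options (20 bytes each); the function names are illustrative only.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options


def ipip_inner_mtu(link_mtu: int) -> int:
    """Largest inner IPv4 packet that still fits after adding the outer IPv4 header."""
    return link_mtu - IPV4_HEADER


def tcp_mss(mtu: int) -> int:
    """TCP payload per segment for an IPv4 packet of the given size."""
    return mtu - IPV4_HEADER - TCP_HEADER


link_mtu = 1500  # assumed Ethernet MTU
inner_mtu = ipip_inner_mtu(link_mtu)
print(inner_mtu, tcp_mss(inner_mtu))  # 1480 1440
```

Any inner packet larger than 1480 bytes would have to be fragmented or dropped on such a link, which is why encapsulation usually goes hand in hand with a lowered MTU or MSS clamping.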
Oct 3 2023
Aug 5 2023
May 12 2023
Feb 1 2023
I have expanded https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Auth_logging. The 'known limitations' section shows there is enough work to do, but to avoid a never-ending task, I am fine with resolving this task when T127717#8505600 has been applied to Cloud VPS. I find the lack of monitoring to be a blocker too, though.
Dec 14 2022
Standalone puppetmasters are also affected by this Git update:
$ git push -f project_puppetmaster HEAD:production
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
remote: fatal: detected dubious ownership in repository at '/var/lib/git/operations/puppet'
remote: To add an exception for this directory, call:
remote:
remote: 	git config --global --add safe.directory /var/lib/git/operations/puppet
Dec 7 2022
I have tested https://gerrit.wikimedia.org/r/c/operations/puppet/+/865731 by using rsyslog-openssl on one syslog client and one syslog server running buster + one syslog client and one syslog server running bullseye. All works as expected.
Status: we chose #3 (Let's Encrypt via acme-chief). We've gotten stuck on a bug in the gnutls driver for rsyslog: T324623
Background: for T127717, we went with Let's Encrypt certificates. Unlike the rather simple chain of trust for the Puppet CA (leaf certificate -> root certificate (Puppet CA)), Let's Encrypt certificates have an intermediate certificate in between. 'Because TLS' (certificates are terrible, I know), the clients need to receive all certificates but the root certificate (because that is in /etc/ssl/certs/ca-certificates.crt).
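As a rough illustration of "all certificates but the root", the sketch below assembles the PEM bundle a server would present: leaf first, then the intermediate(s). It is not the acme-chief or rsyslog implementation; the helper names are hypothetical, and it assumes the caller passes only intermediates (no root) in the second argument.

```python
import re
from typing import List

PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem(bundle: str) -> List[str]:
    """Return each PEM certificate block found in the text, in order."""
    return PEM_BLOCK.findall(bundle)


def chain_to_send(leaf_pem: str, intermediates_pem: str) -> str:
    """Leaf first, then intermediates; the root is deliberately left out
    because clients already have it in their local trust store."""
    certs = split_pem(leaf_pem)[:1] + split_pem(intermediates_pem)
    return "\n".join(certs) + "\n"
```

With the Puppet CA there is no intermediate, so the bundle degenerates to the leaf alone; with Let's Encrypt it contains at least two blocks.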
Dec 5 2022
@Andrew and I have spent this evening on the initial setup of two WMCS-wide syslog servers. Those work fine. However, this setup is broken for all Cloud VPS instances that do not use the central puppetmaster.
Jul 27 2022
I understand, eliminating $wgRequest was low-hanging fruit here. DI is still better than using ::getMain().
I'm happy to accept it, but keep the task and checkbox open after this lands.
If you'd like to resolve the use of global WebRequest in ::inDebugMode(), I can recommend two approaches to try.
Now that we're at it, I'll try to get a durable solution; after all, I would like to reduce technical debt, not move it somewhere else. Assistance is needed to get me started here, though :-). Your advice is welcome.
- Perhaps create a setDebugMode() method marked @internal that we'd call from load.php and possibly OutputPage.php. It could take WebRequest and Config as parameters. Doing this would highlight an issue: we appear to be reading the cookie even on load.php, which is potentially a problem even today. The cookie should instruct OutputPage to make load.php?debug=true requests; there is no need for load.php to read it directly. If it does, it might actually poison the cache.
Assumptions
- load.php (RL\Context) only needs to know if ?debug=<value> exists; cookies and config don't matter here.
- index.php emits HTML elements (sourcing resources from load.php) that may or may not contain ?debug, depending on the sixth argument of ResourceLoader::makeLoaderQuery(). This entry point is not interested in the presence of a ?debug parameter - inverse of load.php.
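The two assumptions above can be modelled in a few lines. This is a language-neutral sketch written in Python, not MediaWiki's PHP code: `load_php_debug`, `make_loader_url` and the `/w/load.php` path are hypothetical stand-ins for RL\Context and ResourceLoader::makeLoaderQuery().

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit


def load_php_debug(url: str) -> bool:
    """load.php side: only the ?debug=<value> query parameter is consulted.
    No cookie and no site config is read here."""
    return "debug" in parse_qs(urlsplit(url).query)


def make_loader_url(modules: List[str], page_debug: bool) -> str:
    """index.php side: the debug decision (cookie, config, ...) is made once
    by the caller and then encoded in the URL that the page emits."""
    params = {"modules": "|".join(modules)}
    if page_debug:
        params["debug"] = "true"
    return "/w/load.php?" + urlencode(params)
```

Because the URL is the only input on the load.php side, two requests for the same URL always produce the same mode, which is what keeps a cookie from poisoning a shared cache.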
Jul 21 2022
Cool! Not sure what the relationship with a DDoS is, though :). If you have time, you can review the patch above. I wasn't too sure about the locations of the default hiera: some things have to be defined in cloud.yaml, others in common/, ...
Jul 20 2022
(wrong task)
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/815776/ would alleviate the 'usage of $wgRequest global' concern, but a second pair of eyes is needed:
- [\MediaWiki\ResourceLoader\ResourceLoader]::inDebugMode() is a static function, so neither the WebRequest exposed via $this->getRequest() nor [\MediaWiki\ResourceLoader\Context]->getRequest() is available. Unless extensions can/should fetch a 'ResourceLoader object' (and we therefore convert this function into a non-static one), I'm not sure how to refrain from this "last resort".
- Apart from OutputPage and OutputPageTest, [\MediaWiki\ResourceLoader\ResourceLoader]::makeCombinedStyles() is the only caller of OutputPage::transformCssMedia(), hence I couldn't refrain from using RequestContext::getMain() here either.
- According to T165176, RequestContext::getMain() can cause side-effects? Will that affect the ResourceLoader Context too?