User Details
- User Since: Oct 3 2014, 8:40 AM (512 w, 18 h)
- Roles: Administrator
- Availability: Available
- IRC Nick: akosiaris
- LDAP User: Alexandros Kosiaris
- MediaWiki User: AKosiaris (WMF) [ Global Accounts ]
Yesterday
Change merged; give it some 30 minutes to propagate everywhere. Resolving this; feel free to reopen in case things don't function as expected. Thanks everyone!
I've also performed a NOOP deployment from deploy1003 today. It worked slowly (20 minutes) due to having to build the images, but was otherwise OK.
No significant changes in the last month, proceeding with the drop to 66%.

The situation is indeed known, see also T309772 and T357950. Some efforts to modernize the codebase did happen; however, as far as I know, they haven't yet materialized into something (at least something that can be recommended). I should note that service-template-node and service-runner are related but also distinct. service-template-node was meant to be what the name suggests: just a template for how to structure code that uses the service-runner framework. @sbassett is correct in characterizing it as dated and unmaintained. However, it isn't in itself plagued by security issues; it's the outdated packages defined in package.json (and their dependency trees) that are. The actual template code is barely 700 lines, and it's meant as a template with best practices (well, not anymore, given it's dated and unmaintained) for using service-runner.
Good point. Do we have some more information on why the automatic rollback didn't happen or failed?
Thu, Jul 25
Move to this server from deploy1002 scheduled for Monday 2024-07-29 09:00 UTC
Switching to low as all we can do now is wait.
OK, I've grouped them into one, named it Wikimedia Citation Bot, and submitted the two different User-Agent headers from the links above, as well as 2 match patterns that would always match, namely ZoteroTranslationServer/WMF and Citoid/WMF. I've also provided a link to https://www.mediawiki.org/wiki/Citoid, copying its first sentence into the short description field.
Wed, Jul 24
Just armed keyholder; everything looks OK right now. I'll send a notification to wikitech-l and engineering in Slack for the deployment server move. Not much different from what we do for the switchover.
OK, thanks, I can see that too now.
Adding some more info: I've gone to https://dash.cloudflare.com/?to=/:account/:zone/security/bots with a personal free account I have, and of course there is no section to tell them about my bot as the blog suggests. Maybe an account with more privileges than a free account is required.
There is already T370118 for this and discussion is ongoing; I suggest closing this as a duplicate of that task and continuing there.
Tue, Jul 23
Couple of notes here:
Moving from SRE to serviceops-radar and subscribing the people that can approve this (same as in T339102). On the SRE side, it's not difficult to implement this.
What @Legoktm suggested. If you already have a JSON input for that command and expect back an SVG (it looks this way judging from https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Chart/+/refs/heads/master/cli/src/command.ts), it's way better architecturally to expose an HTTP endpoint via a nodejs service: feed the JSON via an HTTP POST, get back the SVG, and insert it into whatever content you were planning to insert it into. It can probably even be done asynchronously via the JobQueue if you don't want it in the parsing/rendering request hot path. So, my suggestion would be to keep the CLI part for quick dev/debugging, but also add an express dependency (or, even better, service-runner, with the caveat that an effort is underway to modernize it), expose the functionality over an HTTP route, enable the pipeline, and get a proper nodejs service that can be monitored and reasoned about with all the standard tooling we have.
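To illustrate the shape of this, here's a minimal sketch of such a wrapper using express. The renderChart helper, the route, and the port are hypothetical placeholders standing in for the existing CLI rendering logic, not the extension's actual API:

```typescript
import express from "express";

// Hypothetical stand-in for the existing CLI rendering logic
// (the real thing would call into the code behind cli/src/command.ts).
async function renderChart(definition: unknown): Promise<string> {
  // Placeholder: return a trivial SVG so the sketch is self-contained.
  return `<svg xmlns="http://www.w3.org/2000/svg"><text y="20">chart</text></svg>`;
}

const app = express();
app.use(express.json({ limit: "1mb" })); // chart definitions arrive as JSON bodies

// Same contract as the CLI — JSON in, SVG out — but over HTTP.
app.post("/v1/chart/render", async (req, res) => {
  try {
    const svg = await renderChart(req.body);
    res.type("image/svg+xml").send(svg);
  } catch (err) {
    res.status(400).json({ error: String(err) });
  }
});

// Port is illustrative only.
app.listen(6927, () => console.log("chart renderer listening on :6927"));
```

A client (or a JobQueue job) would then simply POST the chart definition JSON to the route and receive image/svg+xml back, which makes the service easy to probe, monitor, and rate-limit with the standard tooling.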
Tue, Jul 9
T361724 is a follow-up (and I think the only one) of this one, so I guess it's best not to resolve this yet.
Mon, Jul 8
Fri, Jul 5
All fluentbit images have (once more) been deleted from the registry using https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images
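For context, deleting an image through a standard Docker registry's v2 HTTP API boils down to resolving the tag to its manifest digest and DELETE-ing that digest. A minimal sketch, assuming direct authenticated access to the registry; the host, image name, and tag are illustrative, and the documented Wikimedia procedure may use different tooling:

```typescript
// Sketch: untag an image from a Docker registry via the v2 HTTP API.
const REGISTRY = "https://docker-registry.example.org"; // illustrative host
const MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json";

async function deleteImage(name: string, tag: string): Promise<void> {
  // Resolve the tag to its content-addressable manifest digest.
  const head = await fetch(`${REGISTRY}/v2/${name}/manifests/${tag}`, {
    method: "HEAD",
    headers: { Accept: MANIFEST_V2 },
  });
  const digest = head.headers.get("Docker-Content-Digest");
  if (!head.ok || !digest) throw new Error(`cannot resolve ${name}:${tag}`);

  // Deleting the manifest removes the tag; the registry returns 202 Accepted.
  const del = await fetch(`${REGISTRY}/v2/${name}/manifests/${digest}`, {
    method: "DELETE",
  });
  if (del.status !== 202) throw new Error(`delete failed: ${del.status}`);
}

deleteImage("fluent-bit", "latest").catch(console.error); // illustrative image
```

Note that the DELETE only unlinks the manifest; the underlying blobs are reclaimed when the registry's garbage collection runs.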
My git grep above was wrongly also matching the deploy-service user, which is indeed being used in a number of cases, not just the group. The merged patch drops just the group and is correct. I'll resolve this.
While it's a bit early to gauge this:
I am gonna be bold and lower this to "High".
Wed, Jul 3
For posterity's sake, a summary follows:
Hasn't this already been done in T355020?
Tue, Jul 2
Apologies, I failed to anticipate that consequence. I've merged a change to remove deploy1003 from the list of scap masters.
I am resolving the task given the comments from 4 years ago. However, I'll repeat that the functionality added in the course of this task 4 years ago is going to be removed, since it's unused and causes a maintenance burden.
Summarizing from a discussion in #wikimedia-tracing for posterity's sake.
4 years later, we don't see any data flowing in the kafka topic created back then. This feature apparently has never been used, but it is costing us maintenance effort, as the image is on buster and we wanna remove those images from the registry. Hence, after some discussions in the #wikimedia-serviceops IRC channel, we have decided to disable the functionality in api-gateway and delete the fluentbit docker image from our repo, as this pipeline is its only user. If anyone ever reaches this task and is interested in the functionality implemented during it, it can always be resurrected, assuming it's properly resourced.
Mon, Jul 1
Host is imaged; the rest of the work is ongoing in T364417
- python3-imagecatalog published and gerrit repo updated
- php72 component made conditional
I've applied the role and am now working through packaging python3-imagecatalog for bullseye
For posterity's sake
Fri, Jun 28
Thu, Jun 27
T352956 is related (possibly a duplicate) and I've been mulling it over for a few months now. I think we need to have a larger in-person discussion regarding this. There are some things I wanna understand on the kubernetes side before we move forward. I'll send invites.
Jun 26 2024
Just to point out that this is probably not from the network. We don't have network rate limiting on either of these machines (nor, actually, anywhere), and 5 MB/s is less than 5% of the capacity of a 1 Gbps link, which is the lowest common denominator in our infrastructure.
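For the record, the back-of-the-envelope arithmetic behind that claim:

```typescript
// 5 MB/s expressed as a fraction of a 1 Gbps link.
const observedBytesPerSec = 5 * 1_000_000; // 5 MB/s
const linkBitsPerSec = 1_000_000_000;      // 1 Gbps
const utilization = (observedBytesPerSec * 8) / linkBitsPerSec;
console.log(`${(utilization * 100).toFixed(1)}% of link capacity`); // "4.0%"
```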
Jun 25 2024
Jun 21 2024
Thanks for this, and thanks for documenting the selection process in T362999. It's probably worth updating the summary of that task with a quick note about the conclusion and the chosen solution.
Patch merged and deployed.
Jun 18 2024
Yeah, I confirm. The older hosts in the clusters, kafka-main[12]00[0-5], have 2 TB of free space left, so 100 GB isn't an issue. The newer hosts have smaller disks (budget reasons), but they aren't in service yet.
Jun 17 2024
I've removed
Jun 13 2024
In the Kubernetes Special Interest Group, we recently re-evaluated the approach of running Docker outside of Kubernetes. The findings have been documented at https://wikitech.wikimedia.org/wiki/Docker; the discussion is at T363558.
Jun 12 2024
Given the above, I am gonna close this as Declined in the interest of not having lingering tasks, but feel free to reopen.
This is a first; we've never had to implement something like that in the past. Historically, we did have some ingress rate-limiting functionality in RESTBase (it ended up abandonware), and we very recently added some rate-limiting functionality to the service mesh, but that's for internal requests only. Rate limiting of egress traffic from our infrastructure has never been implemented, to my knowledge at least.
Jun 11 2024
Jun 10 2024
And I am still not sure.
I am not sure what this task asks for, to be honest. Care to add a bit more information as to what the problem is?
Jun 6 2024
Jun 5 2024
Jun 4 2024
Jun 1 2024
kafka-main1010, after 2 rounds of imaging (1 with the normal recipe and 1 with the reuse recipe), imaged successfully. I am resolving this. Thanks for all the work, everyone!
May 31 2024
Hmm, this complicates things.
May 30 2024
kafka-main1009 has been fully and successfully imaged
May 29 2024
The failure for kafka-main1009 is expected with the current recipe, btw. Let me have a quick look.
I've gone ahead and created the following dashboard today: T366094
May 24 2024
This is no longer needed. Thanks to the work of @Joe and others, copy-by-url is now asynchronous and no longer suffers from this.