
akosiaris (Alexandros Kosiaris)
Site Reliability Engineer, Administrator

Projects (26)

User Details

User Since: Oct 3 2014, 8:40 AM (512 w, 18 h)
Roles: Administrator
Availability: Available
IRC Nick: akosiaris
LDAP User: Alexandros Kosiaris
MediaWiki User: AKosiaris (WMF)

Recent Activity

Yesterday

akosiaris closed T344678: Allow Wikimedia Maps usage on wikidata.pl as Resolved.

Change merged; give it about 30 minutes to propagate everywhere. Resolving this; feel free to reopen if things don't function as expected. Thanks everyone!

Fri, Jul 26, 12:56 PM · serviceops-radar, Maps
akosiaris closed T339102: Allow Wikimedia Maps usage on vikidia.org as Resolved.
Fri, Jul 26, 12:55 PM · serviceops-radar, Maps
akosiaris added a comment to T339102: Allow Wikimedia Maps usage on vikidia.org.

Change merged; give it about 30 minutes to propagate everywhere. Resolving this; feel free to reopen if things don't function as expected. Thanks everyone!

Fri, Jul 26, 12:53 PM · serviceops-radar, Maps
akosiaris added a comment to T364417: deploy1003 implementation tracking.

I've also performed a NOOP deployment from deploy1003 today. It was slow (20 minutes) due to having to build the images, but otherwise OK.

Fri, Jul 26, 11:13 AM · Patch-For-Review, serviceops
akosiaris added a comment to T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts.

No significant changes in the last month per the attached graph (image.png), proceeding with the drop to 66%

Fri, Jul 26, 10:30 AM · MW-on-K8s, Scap, serviceops, Patch-For-Review
akosiaris added a comment to T370739: Figure out how a shellbox instance for the Chart extension would work.

The situation is indeed known, see also T309772, T357950. Some efforts did happen to modernize the codebase; however, as far as I know, they haven't materialized into something yet (at least nothing that can be recommended). I should note that service-template-node and service-runner are related but distinct. service-template-node was meant to be what the name suggests: just a template on how to structure code to use the service-runner framework. @sbassett is correct in characterizing it as dated and unmaintained. However, it isn't in itself plagued by security issues; it's the outdated packages defined in package.json (and their dependency trees) that are. The actual template code is barely 700 lines and it's meant as a template with best practices (well, not anymore given it's dated and unmaintained) in using service-runner.

Fri, Jul 26, 8:34 AM · serviceops, SRE, Shellbox, Charts
akosiaris added a comment to T371069: Add helm rollback functionality to scap.

Good point. Do we have some more information on why the automatic rollback didn't happen/failed?

Fri, Jul 26, 7:56 AM · Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap
akosiaris updated the task description for T371069: Add helm rollback functionality to scap.
Fri, Jul 26, 7:56 AM · Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap
akosiaris added a comment to T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.

There are only 18 bots on the list at https://radar.cloudflare.com/traffic/verified-bots. Hopefully that isn't a sign of a slow or difficult application process.

Fri, Jul 26, 7:54 AM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid

Thu, Jul 25

akosiaris added a comment to T364417: deploy1003 implementation tracking.

Move to this server from deploy1002 scheduled for Monday 2024-07-29 09:00 UTC

Thu, Jul 25, 2:41 PM · Patch-For-Review, serviceops
akosiaris triaged T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare as Low priority.

Switching to low as all we can do now is wait.

Thu, Jul 25, 1:42 PM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris added a comment to T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.

OK, I've grouped them into 1, named it Wikimedia Citation Bot, submitting the two different User-Agent headers from the links above, as well as 2 match patterns that would always match, that is ZoteroTranslationServer/WMF and Citoid/WMF. I've also provided a link to https://www.mediawiki.org/wiki/Citoid, copying its first sentence into the short description field.

Thu, Jul 25, 1:38 PM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid

Wed, Jul 24

akosiaris updated the task description for T341555: Allow running periodic jobs for mw on k8s.
Wed, Jul 24, 2:31 PM · serviceops, MW-on-K8s
akosiaris added a comment to T364417: deploy1003 implementation tracking.

Just armed keyholder, everything looks OK right now. I'll send a notification to wikitech-l and engineering in Slack for a deployment server move. Not much different from what we do for the switchover.

Wed, Jul 24, 2:26 PM · Patch-For-Review, serviceops
akosiaris added a comment to T369898: Reduce the number of resource_change and resource_purge events emitted due to template changes.

The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times

I'm curious about the problem that this causes. Too many jobs inserted for the job queue to handle quickly enough? Too many purge requests at once?

Wed, Jul 24, 2:26 PM · Essential-Work, MW-1.43-notes (1.43.0-wmf.16; 2024-07-30), serviceops, Performance Issue, MediaWiki-Engineering, MediaWiki-Core-HTTP-Cache, ChangeProp
akosiaris added a comment to T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.

@ppelberg, @DLynch @zoe. The verified bot form requires entering some input we need your help on.

Wed, Jul 24, 12:03 PM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris added a comment to T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.

OK, thanks, I can see that too now.

Wed, Jul 24, 11:40 AM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris merged T370808: Consider registering citoid as a verified or friendly bot with Cloudflare into T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.
Wed, Jul 24, 11:27 AM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris merged task T370808: Consider registering citoid as a verified or friendly bot with Cloudflare into T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.
Wed, Jul 24, 11:25 AM · Infrastructure-Foundations, Citoid, Editing-team
akosiaris renamed T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare from Register Citoid as a "friendly bot" with Cloudflare to Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.
Wed, Jul 24, 10:50 AM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris added a comment to T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.

Adding some more info: I went to https://dash.cloudflare.com/?to=/:account/:zone/security/bots with a personal free account I have and of course there is no section to tell them about my bot as the blog suggests. Maybe an account with more privileges than a free account is required.

Wed, Jul 24, 10:35 AM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris added a comment to T370808: Consider registering citoid as a verified or friendly bot with Cloudflare.

There is already T370118 for this and discussion is ongoing. I suggest closing this as a duplicate of that task and continuing there.

Wed, Jul 24, 10:32 AM · Infrastructure-Foundations, Citoid, Editing-team

Tue, Jul 23

akosiaris added a comment to T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.

Couple of notes here:

Tue, Jul 23, 3:45 PM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris edited projects for T370650: Allow Wikimedia Maps usage on mwoffliner, added: serviceops-radar; removed SRE.

Moving from SRE to serviceops-radar and subscribing the people that can approve this (same as in T339102). On the SRE side, it's not difficult to implement this.

Tue, Jul 23, 3:24 PM · serviceops-radar, affects-Kiwix-and-openZIM, Maps
akosiaris edited projects for T344678: Allow Wikimedia Maps usage on wikidata.pl, added: serviceops-radar; removed SRE.

Moving from SRE to serviceops-radar and subscribing the people that can approve this (same as in T339102). On the SRE side, it's not difficult to implement this.

Tue, Jul 23, 3:19 PM · serviceops-radar, Maps
Ladsgroup awarded T364417: deploy1003 implementation tracking a Barnstar token.
Tue, Jul 23, 12:11 PM · Patch-For-Review, serviceops
akosiaris added a comment to T370739: Figure out how a shellbox instance for the Chart extension would work.

What @Legoktm suggested. If you already have a JSON input for that command and expect back an SVG (it looks this way judging from https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Chart/+/refs/heads/master/cli/src/command.ts), it's way better architecturally to expose an HTTP endpoint via a nodejs service, feeding the JSON via an HTTP POST, getting back the SVG and inserting it in whatever content you were planning to insert it into. It can probably even be done async via the JobQueue if you don't want it in the parsing/rendering request hot path. So, my suggestion would be to keep the CLI part for quick dev/debugging, but also add an express dependency (or even better service-runner, with the caveat that effort is underway to modernize it) and expose the functionality over an HTTP route, enable the pipeline and get a proper nodejs service that can be monitored and reasoned about with all the standard tooling we have.
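
For illustration only, the request/response shape described above (POST a JSON spec, get an SVG back) can be sketched as follows. The comment proposes a nodejs service; this sketch uses Python's standard library purely to show the shape of the exchange. The `/render` route, the `values` field, and the bar-chart rendering are hypothetical and not taken from the Chart extension.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_svg(spec: dict) -> str:
    """Render a list of numbers as a tiny SVG bar chart (hypothetical spec format)."""
    values = spec.get("values", [])
    bars = "".join(
        f'<rect x="{i * 20}" y="{100 - v}" width="15" height="{v}"/>'
        for i, v in enumerate(values)
    )
    width = max(len(values) * 20, 1)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="100">'
        f"{bars}</svg>"
    )


class ChartHandler(BaseHTTPRequestHandler):
    """Accepts POST /render with a JSON body and replies with image/svg+xml."""

    def do_POST(self):
        if self.path != "/render":  # hypothetical route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            spec = json.loads(self.rfile.read(length))
            body = render_svg(spec).encode("utf-8")
        except (ValueError, TypeError, AttributeError):
            self.send_error(400, "invalid chart spec")
            return
        self.send_response(200)
        self.send_header("Content-Type", "image/svg+xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


# To serve locally (blocks until interrupted):
#   HTTPServer(("127.0.0.1", 8080), ChartHandler).serve_forever()
```

Keeping the rendering in a plain function (`render_svg`) means the same code path can still back a CLI for quick debugging, as the comment suggests.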

Tue, Jul 23, 9:18 AM · serviceops, SRE, Shellbox, Charts

Tue, Jul 9

akosiaris updated the task description for T359423: Migrate charts to Calico Network Policies.
Tue, Jul 9, 1:56 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
akosiaris added a comment to T361706: 2024-04-03 calico/typha down.

T361724 is a followup (and I think the only one) of this one, so I guess it's best to not resolve yet.

Tue, Jul 9, 9:41 AM · Prod-Kubernetes, Wikimedia-Incident
akosiaris added subtasks for T368714: kafka-main replacement nodes don't fit kafka-main (storage wise): Unknown Object (Task), Unknown Object (Task).
Tue, Jul 9, 9:18 AM · serviceops

Mon, Jul 8

akosiaris raised the priority of T368892: Function evaluations are often failing on Wikifunctions.org with "gateway timeout" or “service unavailable” from High to Unbreak Now!.

I am gonna be bold and lower this to "High".

UBN per https://www.mediawiki.org/wiki/Phabricator/Project_management is

Unbreak Now! – Something is broken and needs to be fixed immediately, setting anything else aside. This should meet the requirements for issues that hold the train.

Per https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions, v1/evaluate sees traffic on the order of 0.1 to 0.2 requests per second and, per the description and comments in this task, only a currently not well estimated/unknown (correct me if I am wrong) ratio of those is affected. This doesn't look like something that needs to be fixed immediately, setting anything else aside. Nor is it holding the train of course.

Something like 80% of the functionality of Wikifunctions has been offline for over a week now. We don't understand what is actually broken (it seems to be between MW and the k8s cluster, or in handling the request at the boundary, or otherwise), and its errors are challenging to parse. I think this definitely counts as UBN still.

Mon, Jul 8, 3:18 PM · Abstract Wikipedia team (25Q1 (Jul–Sep))
akosiaris added a comment to T364400: map the /api/ prefix to /w/rest.php.

Could we implement this remapping at the ATS layer rather than the Apache one, in a manner that would mean that when we need to cache we only need to store each effective URL once?

It would be preferable to not do so. The caching gains would be minimal, but more importantly: we hope to minimize the details of the application layer that are spread into the cache configuration (there will always be necessary cases, but the more we avoid it, the easier things are in the future).

Even more to @BBlack's comment, I would just have apache funnel anything under /api it receives to an endpoint in mediawiki, and do routing there.

Mon, Jul 8, 1:44 PM · serviceops, Traffic, MW-Interfaces-Team

Fri, Jul 5

akosiaris added a comment to T251812: System administrator reviews API usage by client.

All fluentbit images have (once more) been deleted from the registry using https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images

Fri, Jul 5, 8:23 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
akosiaris closed T340165: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? as Resolved.

My git grep above was wrongly also matching the deploy-service user, which is indeed being used in a number of cases, not just the group. The merged patch drops just the group and it's correct. I'll resolve this.

Fri, Jul 5, 1:30 PM · serviceops-radar, SRE
akosiaris added a comment to T360403: Helm deployment of MediaWiki now takes 6 minutes.

The sync-prod-k8s step went from ~ 420 seconds to 180 seconds at some point between May 27th and May 29th:

scap_sync-prod-k8s.png (569×1 px, 84 KB)

Fri, Jul 5, 1:20 PM · serviceops-radar, Release-Engineering-Team (Radar), MW-on-K8s
akosiaris added a comment to T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts.

While it's a bit early to gauge this:

Fri, Jul 5, 12:00 PM · MW-on-K8s, Scap, serviceops, Patch-For-Review
akosiaris added a comment to T368238: Wikifeeds' tls proxy cpu usage heavily increased in April.

To keep archives happy - this is the result after some days:

Screenshot from 2024-07-05 10-39-25.png (2×1 px, 370 KB)

I'd like to lower the concurrency again to see if we get more benefits.

Fri, Jul 5, 10:20 AM · Wikifeeds, serviceops
akosiaris lowered the priority of T368892: Function evaluations are often failing on Wikifunctions.org with "gateway timeout" or “service unavailable” from Unbreak Now! to High.

I am gonna be bold and lower this to "High".

Fri, Jul 5, 6:56 AM · Abstract Wikipedia team (25Q1 (Jul–Sep))

Wed, Jul 3

akosiaris added a comment to T366819: Enable PCS to send resource change events to handle URL purges.

For posterity's sake, a summary follows:

Wed, Jul 3, 10:51 AM · Patch-For-Review, RESTBase Sunsetting, Content-Transform-Team-WIP, Wikipedia-iOS-App-Backlog, RESTBase, serviceops
akosiaris added a comment to T369144: Upgrade thumbor Docker images.

Hasn't this already been done in T355020 ?

Wed, Jul 3, 10:30 AM · serviceops, Infrastructure-Foundations

Tue, Jul 2

akosiaris added a comment to T364417: deploy1003 implementation tracking.

Apologies, I failed to anticipate that consequence. I've merged a change to remove deploy1003 from the list of scap masters.

Tue, Jul 2, 4:22 PM · Patch-For-Review, serviceops
akosiaris closed T251812: System administrator reviews API usage by client as Resolved.

I am resolving the task given comments from 4 years ago. However, repeating that the functionality added in the course of this task 4 years ago is going to be removed since it's unused and causes maintenance burden.

Tue, Jul 2, 4:21 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
akosiaris closed T251812: System administrator reviews API usage by client, a subtask of T255034: Wikimedia API Gateway Long-term Use, as Resolved.
Tue, Jul 2, 4:19 PM · serviceops, Platform Engineering Roadmap, Epic, Platform Team Workboards (Epics), Core Platform Team Initiatives (API Gateway)
akosiaris added a comment to T363407: Proper service names in trace data.

Summarizing from a discussion in #wikimedia-tracing for posterity's sake.

Tue, Jul 2, 4:15 PM · Patch-For-Review, Observability-Tracing
akosiaris added a comment to T251812: System administrator reviews API usage by client.

4 years later, we don't see any data flowing in the kafka topic created back then. This feature apparently has never been used. But it is costing us in maintenance efforts as the image is on buster and we want to remove those images from the registry. Hence, after some discussions in the #wikimedia-serviceops IRC channel, we have decided to disable the functionality from api-gateway and delete the fluentbit docker image from our repo as this pipeline is the only user of it. If anyone ever reaches this task and comment and is interested in the functionality implemented during work on this task, it can always be resurrected, assuming it's properly resourced.

Tue, Jul 2, 4:07 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API

Mon, Jul 1

akosiaris closed T364416: Q4:rack/setup/install deploy1003 as Resolved.

Host is imaged, rest of the work is ongoing in T364417

Mon, Jul 1, 4:50 PM · SRE, serviceops, ops-eqiad, DC-Ops
akosiaris added a comment to T364417: deploy1003 implementation tracking.
  • python3-imagecatalog published and gerrit repo updated
  • php72 component made conditional
Mon, Jul 1, 4:48 PM · Patch-For-Review, serviceops
akosiaris updated the task description for T364416: Q4:rack/setup/install deploy1003.
Mon, Jul 1, 4:48 PM · SRE, serviceops, ops-eqiad, DC-Ops
akosiaris added a comment to T364417: deploy1003 implementation tracking.

I've applied the role and am now working through packaging python3-imagecatalog for bullseye

Mon, Jul 1, 2:33 PM · Patch-For-Review, serviceops
akosiaris added a comment to T366819: Enable PCS to send resource change events to handle URL purges.

For posterity's sake

Mon, Jul 1, 12:12 PM · Patch-For-Review, RESTBase Sunsetting, Content-Transform-Team-WIP, Wikipedia-iOS-App-Backlog, RESTBase, serviceops

Fri, Jun 28

akosiaris added a comment to T361728: SwaggerProbeHasFailures for citoid (due to Zotero failures) after upgrading to node 18.
Fri, Jun 28, 2:41 PM · Patch-For-Review, serviceops-radar, Citoid
akosiaris added a comment to T361728: SwaggerProbeHasFailures for citoid (due to Zotero failures) after upgrading to node 18.

Latest graph of failures: https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=1719399913415&to=1719401713415&forceLogin=&var-dc=codfw+prometheus%2Fk8s&var-service=citoid&viewPanel=46

Fri, Jun 28, 1:50 PM · Patch-For-Review, serviceops-radar, Citoid

Thu, Jun 27

akosiaris added a comment to T328036: MCS decommission (2023).

After some back and forth with kiwix folks, it looks like end of July is reasonable to keep the MCS endpoints available for mw-offliner.
They have already migrated to other endpoints and are just finishing some details.

Thu, Jun 27, 2:50 PM · Patch-For-Review, Essential-Work, Content-Transform-Team-WIP, Mobile-Content-Service
akosiaris added a comment to T368544: IPIP encapsulation considerations for low-traffic services.

T352956 is related (possibly a duplicate) and I've been mulling it over for a few months now. I think we need to have a larger in-person discussion regarding this. There are some things I wanna understand on the kubernetes side before we move forward. I'll send invites.

Thu, Jun 27, 2:20 PM · Infrastructure-Foundations, serviceops, netops, Traffic

Jun 26 2024

akosiaris added a comment to T341441: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes.

Just to point out that this is probably not from the network. We don't have network rate limiting on either of these machines (nor actually anywhere) and 5MB/s is less than 5% of the capacity of a 1Gbps link, which is the lowest common denominator in our infrastructure.
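
The "less than 5%" figure can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 Gbps = 1000 Mbit/s, 8 bits per byte) and ignores protocol overhead; the function name is made up for illustration.

```python
def link_utilisation(observed_mb_per_s: float, link_gbps: float) -> float:
    """Fraction of a link's raw capacity used by a transfer.

    Assumes decimal units: 1 Gbps = 1000 Mbit/s, and 8 bits per byte.
    """
    link_mb_per_s = link_gbps * 1000 / 8  # 1 Gbps is 125 MB/s
    return observed_mb_per_s / link_mb_per_s


# 5 MB/s on a 1 Gbps link is 5 / 125 of the raw capacity.
print(f"{link_utilisation(5, 1):.0%}")  # prints "4%"
```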

Jun 26 2024, 4:12 PM · Release-Engineering-Team, serviceops, Scap, MW-on-K8s

Jun 25 2024

akosiaris added a comment to T364126: Disable Chrome Private Prefetch Proxy.

The description describes CP3 as used to «automatically prefetch top-ranked search results when the user views a Google search result page»

while https://developer.chrome.com/blog/private-prefetch-proxy/ states

Note: At this moment, to allow other sites to preload navigations through Google servers, users need to select the "Extended preloading" mode in Chrome's preload settings. We are looking for interested parties as a catalyst for further improvements to this initial approach.

Jun 25 2024, 1:54 PM · Movement-Insights, Traffic

Jun 21 2024

akosiaris added a comment to T364797: Create a helm chart for the cloudnativepg postgresql operator.

Thanks for this and thanks for documenting the selection process in T362999. It's probably worth it to update the summary of that task with a quick note about the conclusion and chosen solution.

Jun 21 2024, 1:59 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Infrastructure-Foundations, serviceops, Patch-For-Review
akosiaris updated the task description for T364797: Create a helm chart for the cloudnativepg postgresql operator.
Jun 21 2024, 12:22 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Infrastructure-Foundations, serviceops, Patch-For-Review
akosiaris added a comment to T368052: Allow connections from PCS to eventgate.

Patch merged and deployed.

Jun 21 2024, 8:13 AM · Patch-For-Review, RESTBase Sunsetting, Content-Transform-Team-WIP, Wikipedia-iOS-App-Backlog, RESTBase, serviceops

Jun 18 2024

akosiaris updated the task description for T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts.
Jun 18 2024, 2:47 PM · MW-on-K8s, Scap, serviceops, Patch-For-Review
akosiaris added a comment to T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split).

I do have one question though. Why 4 weeks retention? Is there some business reason or could it be dropped to a smaller duration?

We need 4 weeks to be able to backfill after an import, from the time the wikidata dump process starts, through the time required to shuffle the data around (compression, hdfs-rsync to hdfs) and until the end of the import into blazegraph; see the initial lag column in T241128 for past import times. Perhaps 3 weeks would be manageable but we went to 4 weeks to have extra room.

Jun 18 2024, 8:46 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
akosiaris added a comment to T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split).

Yeah, I confirm. The older hosts in the clusters, kafka-main[12]00[0-5] have 2TB free space left, so 100GB isn't an issue. The newer hosts have smaller disks (budget reasons) but they aren't in service yet.

Jun 18 2024, 8:27 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata

Jun 17 2024

akosiaris updated the task description for T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK).
Jun 17 2024, 1:54 PM · Infrastructure-Foundations, IPv6, User-jbond, netbox
akosiaris closed T271142: Some Service Operations clusters apparently do not support IPv6 as Resolved.

I've removed

Jun 17 2024, 1:53 PM · Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools
akosiaris closed T271142: Some Service Operations clusters apparently do not support IPv6, a subtask of T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK), as Resolved.
Jun 17 2024, 1:52 PM · Infrastructure-Foundations, IPv6, User-jbond, netbox
akosiaris updated the task description for T271142: Some Service Operations clusters apparently do not support IPv6.
Jun 17 2024, 1:50 PM · Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools
akosiaris added a comment to T367544: Cloud VPS "packaging" project Buster deprecation.

builder-envoy-03.packaging.eqiad1.wikimedia.cloud

Any objections to just removing the VM since we moved to (re-)packaging upstream (https://wikitech.wikimedia.org/wiki/Envoy#Building_envoy_for_WMF)?

Jun 17 2024, 12:48 PM · collaboration-services, Cloud-VPS (Debian Buster Deprecation)
akosiaris added a comment to T367544: Cloud VPS "packaging" project Buster deprecation.

packager02.packaging.eqiad1.wikimedia.cloud

According to the etherpad upgrade docs this host is used to build the etherpad Debian package. I also used the host in the past to build the etherpad package. The dedicated host is used because "etherpad builds fetches npm modules during the build time".

Jun 17 2024, 11:41 AM · collaboration-services, Cloud-VPS (Debian Buster Deprecation)

Jun 13 2024

akosiaris added a comment to T364900: Add enhanced logging to Citoid.

I deployed a change today for this. It is not everything we hoped, alas :).

We don't actually seem to log info, only warns. However, it may look like we were because of the presence of NOTICE levels in logstash. Which brings me to a second issue: these are in fact NOTICEs caused by parsing failures of what are actually WARNs. Some of them are just total garbage: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=TmlhDJAB_bxzc-Pm7TGI. Some of these NOTICEs are a single line of a stack trace. Guessing from this one, I am thinking... the inclusion of the HTML of the response in the message of the error might be the cause of this. It's maybe actually not sanitising it and it's escaping it somehow? Yikes :P And then we're getting like 20 NOTICEs for parts of the response, each separately. I.e. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=TdVkDJABDD1VxFBSzVVI

Jun 13 2024, 9:38 AM · Platform Engineering, Patch-For-Review, VisualEditor, VisualEditor-MediaWiki-References, Citoid
akosiaris added a comment to T275551: Using docker in WMF production network outside of kubernetes.

In the Kubernetes Special Interest Group, we recently re-evaluated the approach of running Docker outside of Kubernetes. The findings have been documented at https://wikitech.wikimedia.org/wiki/Docker, the discussion is at T363558.

Jun 13 2024, 9:10 AM · Data-Engineering-Icebox, serviceops, Analytics-Radar, Machine-Learning-Team
akosiaris updated the task description for T360636: Phase out cergen for ServiceOps services.
Jun 13 2024, 8:39 AM · Patch-For-Review, serviceops, Epic, SRE

Jun 12 2024

akosiaris closed T367013: 2030.wikimedia.org is a double redirect as Declined.

Given the above, I am gonna close this as Declined in the interest of not having lingering tasks, but feel free to reopen.

Jun 12 2024, 10:14 AM · serviceops, Wikimedia-Apache-configuration
akosiaris added a comment to T367194: Citoid/Zotero: Create rate limiting configurable on a per site basis.

This is a first. We never had to implement something like that in the past. Historically, we did have some ingress rate-limiting functionality in RESTBase (it ended up abandonware) and we very recently did add some rate-limiting functionality to the service-mesh, but it's internal requests only. Egress rate-limiting functionality from our infrastructure has never been implemented, to my knowledge at least.

Jun 12 2024, 10:12 AM · VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
akosiaris renamed T367194: Citoid/Zotero: Create rate limiting configurable on a per site basis from Create rate limiting configurable on a per site basis to Citoid/Zotero: Create rate limiting configurable on a per site basis.
Jun 12 2024, 9:13 AM · VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid

Jun 11 2024

akosiaris added a comment to T367013: 2030.wikimedia.org is a double redirect.

Thanks for the historical aspect @Dzahn. Given that perspective and the fact that the double redirect apparently isn't considered a problem, my opinion is that it is probably prudent to keep the redirect as was asked in T264797.

Jun 11 2024, 10:33 AM · serviceops, Wikimedia-Apache-configuration

Jun 10 2024

akosiaris added a comment to T367013: 2030.wikimedia.org is a double redirect.

And I am still not sure.

Jun 10 2024, 1:25 PM · serviceops, Wikimedia-Apache-configuration
akosiaris added a comment to T367013: 2030.wikimedia.org is a double redirect.

I am not sure what this task asks, to be honest. Care to add a bit more information as to what the problem is?

Jun 10 2024, 1:14 PM · serviceops, Wikimedia-Apache-configuration

Jun 6 2024

akosiaris updated the task description for T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts.
Jun 6 2024, 1:59 PM · MW-on-K8s, Scap, serviceops, Patch-For-Review
akosiaris moved T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts from Backlog to In Progress on the MW-on-K8s board.
Jun 6 2024, 9:07 AM · MW-on-K8s, Scap, serviceops, Patch-For-Review
akosiaris moved T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts from Incoming 🐫 to Doing 😎 on the serviceops board.
Jun 6 2024, 9:07 AM · MW-on-K8s, Scap, serviceops, Patch-For-Review
akosiaris added projects to T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts: serviceops, Scap, MW-on-K8s.
Jun 6 2024, 9:06 AM · MW-on-K8s, Scap, serviceops, Patch-For-Review
akosiaris created T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts.
Jun 6 2024, 9:01 AM · MW-on-K8s, Scap, serviceops, Patch-For-Review

Jun 5 2024

akosiaris updated the task description for T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3.
Jun 5 2024, 9:47 AM · netops, Infrastructure-Foundations, SRE

Jun 4 2024

akosiaris added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

I am still unclear on when metrics are sent directly to statsd.eqiad.wmnet or to the prometheus-statsd-exporter through envoy. Is this a transitional state towards using prometheus to collect metrics and both are required for now, or are some metrics sent through one path and others through the second? I think the scope of this task is to solve the second path only (MediaWiki -> statsd-exporter -> prometheus).
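
As an illustration of what "UDP fire-and-forget" means in this discussion, here is a minimal sketch of a statsd-style counter being sent over UDP. The helper names are hypothetical and the metric name is made up; only the `name:value|c` counter line format comes from the statsd protocol. The sender gets no acknowledgement, so a dropped datagram or a missing listener goes unnoticed, which is why keeping the hop local to the node is attractive.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Build a statsd counter line, e.g. b"mw.requests:1|c"."""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload: bytes, host: str, port: int) -> int:
    """Send one datagram and return immediately.

    No connection is set up and no reply is awaited: the return value is
    only the number of bytes handed to the kernel, not proof of delivery.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


# Example usage (hypothetical metric name, local listener assumed on 8125):
#   send_metric(format_counter("mw.requests"), "127.0.0.1", 8125)
```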

Jun 4 2024, 10:08 AM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
akosiaris added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.
Jun 4 2024, 8:59 AM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics

Jun 1 2024

akosiaris closed T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 as Resolved.

kafka-main1010, after 2 rounds of imaging (1 with the normal recipe and 1 with the reuse recipe) imaged successfully. I am resolving this. Thanks for all the work everyone!

Jun 1 2024, 1:42 PM · SRE, serviceops, ops-eqiad, DC-Ops

May 31 2024

akosiaris added projects to T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G: SRE-OnFire, Sustainability (Incident Followup).
May 31 2024, 3:26 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad
akosiaris added projects to T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G: Sustainability (Incident Followup), SRE-OnFire.
May 31 2024, 3:25 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
akosiaris added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

Hmm, this complicates things.

May 31 2024, 2:09 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
akosiaris added a comment to T364900: Add enhanced logging to Citoid.

Splitting from the main thread, to inquire about this specifically:

I'll have a deeper look tomorrow.

urldownloaders don't have the visibility needed to look at URLs since most sites are accessed over HTTPS. So the only thing they do see is URL domains, not paths. We need Citoid to log out what requests it sees.

I'm implementing logging in citoid now, but - don't we mostly need to know the domain, not the path, anyway? As it lets us sort by hosts that are blocking us? I don't think we actually need the full path...

May 31 2024, 1:34 PM · Platform Engineering, Patch-For-Review, VisualEditor, VisualEditor-MediaWiki-References, Citoid
akosiaris added a comment to T364253: Metrics api response sometimes returns cached 301 (from kubernetes ??).

I have never seen a 301 when requesting anything from any AQS service. I have no idea why that happened. In my case, all responses include envoy as the server value, even when there was a hit on the cache.

May 31 2024, 12:48 PM · serviceops, AQS2.0, Data Products (Data Products Sprint 14)

May 30 2024

akosiaris added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

kafka-main1009 is now fully and successfully imaged

May 30 2024, 4:43 PM · SRE, serviceops, ops-eqiad, DC-Ops

May 29 2024

akosiaris added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

The failure for kafka-main1009 is expected with the current recipe btw. Let me have a quick look.

May 29 2024, 3:57 PM · SRE, serviceops, ops-eqiad, DC-Ops
akosiaris added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

The failure for kafka-main1009 is expected with the current recipe btw. Let me have a quick look.

May 29 2024, 3:55 PM · SRE, serviceops, ops-eqiad, DC-Ops
akosiaris added a comment to T366094: k8s master capacity issues.

Mentioned in SAL (#wikimedia-sre) [2024-05-29T11:23:04Z] <akosiaris> T366094 re-undeploy otel-collector, it being around increased traffic to the API >50%

FWIW I think this traffic increase will not be an issue once we move to the three new control plane nodes

May 29 2024, 2:13 PM · serviceops, SRE
akosiaris added a comment to T366094: k8s master capacity issues.
May 29 2024, 2:09 PM · serviceops, SRE
akosiaris added a comment to T366094: k8s master capacity issues.

I've gone ahead and created the following dashboard today T366094

May 29 2024, 2:01 PM · serviceops, SRE
akosiaris added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

@akosiaris re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035769/1/modules/profile/data/profile/installserver/preseed.yaml

I think since that is Bash globbing and not regex you can't use the round brackets as in "kafka-main10(09|10)". Or at least it's the only line in preseed.yaml that does that.
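
A small sketch of the difference, using Python's `fnmatch` module as a stand-in for shell-style globbing (an assumption: how preseed.yaml patterns are actually matched is not shown here). The extra host name and the list of explicit patterns are hypothetical.

```python
import re
from fnmatch import fnmatchcase

hosts = ["kafka-main1009", "kafka-main1010", "kafka-main1006"]
alternation = "kafka-main10(09|10)"

# As a glob, "(", "|" and ")" are literal characters, so no host matches.
glob_hits = [h for h in hosts if fnmatchcase(h, alternation)]

# As a regex, the same string is an alternation and matches two hosts.
regex_hits = [h for h in hosts if re.fullmatch(alternation, h)]

# One glob-safe alternative: list each host as its own pattern.
explicit_globs = ["kafka-main1009", "kafka-main1010"]
explicit_hits = [
    h for h in hosts if any(fnmatchcase(h, g) for g in explicit_globs)
]

print(glob_hits)
print(regex_hits)
print(explicit_hits)
```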

May 29 2024, 8:20 AM · SRE, serviceops, ops-eqiad, DC-Ops

May 24 2024

akosiaris added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

I was able to correct the DHCP issue for kafka-main1010, but imaging still fails

[Attached screenshot: Screenshot 2024-05-24 at 8.50.08 AM.png]
@akosiaris did you have this issue with other servers?

May 24 2024, 1:04 PM · SRE, serviceops, ops-eqiad, DC-Ops
akosiaris closed T302430: <Tech Initiative> Commons Copy-by-URL Image Uploads Slowdown (Shellbox) as Invalid.

This is no longer needed. Thanks to the work of @Joe and others, copy by URL is now asynchronous and no longer suffers from this.

May 24 2024, 11:23 AM · Foundational Technology Requests