Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f3-eqiad
Open, Medium, Public

Description

Moved forward one week: 2024-07-23, 15:00 UTC
Hosts on this rack: https://netbox.wikimedia.org/dcim/racks/91/

  • db1195 - s1
  • db1202 - s7
  • db1203 - s8
  • db1205 - backup
  • moss-be1003
  • ms-be1075
  • an-presto1015
  • an-worker1156
  • an-worker1146
  • kafka-jumbo1015
  • kafka-stretch1002
  • elastic1100
  • elastic1101
  • elastic1102
  • wdqs1016
  • dse-k8s-worker1008
  • ml-serve1008
  • kubernetes1025
  • kubernetes1026
  • kubernetes1052
  • kubernetes1053
  • kubernetes1054
  • kubernetes1055
  • kubernetes1056
  • mw1496

Teams Involved: Data Persistence, Data Platform, Search, Machine Learning, Service Ops

Expected outage: 15-30 minutes

Please use the below sheet to detail any actions that are required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
MatthewVernon

Event Timeline

Swift-wise, we just need to check that the cluster is happy afterwards.

moss-be1003 is part of the apus Ceph cluster, which should be in production by end of this quarter (i.e. before this work is due to happen), and will need a bit of care. Should just be a case of putting it into maintenance mode beforehand, but it's 1/3 of the cluster capacity.

cmooney triaged this task as Medium priority.
cmooney updated the task description.

db1205 is the secondary media backups metadata db server, usually just a standby to db1204. Unless it is acting as the active server because the primary is unavailable, we only need to check that replication restarts correctly after the maintenance.

Icinga downtime and Alertmanager silence (ID=6a298ae5-e736-4051-8220-9ec4f352950a) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e3-eqiad

lsw1-e3-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=39fcbcd0-8c16-4208-ac06-f4b442e55a54) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e3-eqiad

lsw1-e3-eqiad,lsw1-e3-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=2a5cb43e-793c-4103-9499-369354315479) set by cmooney@cumin1002 for 0:40:00 on 27 host(s) and their services with reason: JunOS upgrade lsw1-e3-eqiad

an-presto1010.eqiad.wmnet,an-worker1154.eqiad.wmnet,backup1009.eqiad.wmnet,cephosd1003.eqiad.wmnet,db[1192,1198-1199,1204].eqiad.wmnet,druid1010.eqiad.wmnet,dse-k8s-worker1006.eqiad.wmnet,elastic[1093-1095].eqiad.wmnet,kafka-jumbo1012.eqiad.wmnet,kafka-stretch1001.eqiad.wmnet,kubernetes[1047-1051,1061].eqiad.wmnet,ml-serve1006.eqiad.wmnet,ms-be1074.eqiad.wmnet,mw[1491-1493].eqiad.wmnet,wdqs1015.eqiad.wmnet
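The host list above uses a compact bracket notation, e.g. `db[1192,1198-1199,1204].eqiad.wmnet`. As a quick way to sanity-check which hosts such an expression covers, here is a minimal Python sketch that expands it. This is an illustration only, not the parser that Cumin or any other tool actually uses; the function names `split_top_level` and `expand_hosts` are hypothetical, and the sketch assumes at most one bracket group per host pattern.

```python
import re


def split_top_level(expr):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_hosts(expr):
    """Expand patterns like 'db[1192,1198-1199].eqiad.wmnet' into host names.

    Assumption: each pattern contains at most one [...] group holding
    comma-separated numbers or inclusive numeric ranges.
    """
    hosts = []
    for part in split_top_level(expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if match is None:
            hosts.append(part)  # plain host name, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number is a range of one
            width = len(low)    # keep any zero padding
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    example = "db[1192,1198-1199,1204].eqiad.wmnet,mw[1491-1493].eqiad.wmnet"
    for host in expand_hosts(example):
        print(host)
```

Applied to the full expression in the silence above, this expansion yields 27 names, which matches the "27 host(s)" reported in the downtime message.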

Mentioned in SAL (#wikimedia-operations) [2024-07-09T15:04:20Z] <topranks> rebooting lsw1-e3-eqiad to install updated JunOS version T365998

This comment was removed by cmooney.

Mentioned in SAL (#wikimedia-operations) [2024-07-18T14:47:54Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365998 - depooling db1195 - s1 db1202 - s7 db1203 - s8', diff saved to https://phabricator.wikimedia.org/P66816 and previous config saved to /var/cache/conftool/dbconfig/20240718-144754-arnaudb.json

data-persistence hosts handled, ready whenever you are @cmooney

Mentioned in SAL (#wikimedia-operations) [2024-07-19T12:23:20Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365998 - depooling db1195 - s1 db1202 - s7 db1203 - s8', diff saved to https://phabricator.wikimedia.org/P66843 and previous config saved to /var/cache/conftool/dbconfig/20240719-122320-arnaudb.json

@Marostegui and I will be absent on Tuesday; the hosts have been depooled and are ready.

Thanks @ABran-WMF, enjoy your weekend and/or break!

elastic110[0-2] are banned and ready, as is wdqs1016.

Mentioned in SAL (#wikimedia-operations) [2024-07-23T10:39:11Z] <claime> Cordoning kubernetes1025.eqiad.wmnet kubernetes1026.eqiad.wmnet kubernetes1052.eqiad.wmnet kubernetes1053.eqiad.wmnet kubernetes1054.eqiad.wmnet kubernetes1055.eqiad.wmnet kubernetes1056.eqiad.wmnet mw1496.eqiad.wmnet for T365998

Mentioned in SAL (#wikimedia-operations) [2024-07-23T13:05:17Z] <claime> Cordoning dse-k8s-worker1008.eqiad.wmnet for T365998