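Original link: https://www.yuque.com/erik.zhao/trouble/lhhzukyw9zf8fxoc?singleDoc# "Layer 2 loop when connecting a data center site's Layer 2 network to Cisco over Eth-Trunk"
Layer 2 loop when connecting a data center site's Layer 2 network to Cisco over Eth-Trunk
CE12800 M-LAG networking, connected to Cisco over a Layer 2 Eth-Trunk, with PVST as the loop-prevention protocol.
After the CE12800 was connected to Cisco over the Eth-Trunk, the member interfaces of Eth-Trunk17 on SHN-P-MCA-SW01 were found in the Unselected state. The member interfaces of Eth-Trunk17 are 10GE7/0/19 and 10GE7/0/20, and the peer Cisco interfaces are Gi1/14 and Gi1/15. At that time, Gi1/14 was physically up with line protocol down, and Gi1/15 was physically down with line protocol down. Later investigation showed that Gi1/15 was cabled incorrectly. While troubleshooting the LACP negotiation state, the reseller removed the member ports from the Port-Channel on the Cisco switch one at a time. Once the correctly connected port had been removed, services were interrupted.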
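21:13:00: Eth-Trunk member ports on MCA01 found in the Unselected state; root-cause analysis started;
21:13:00~21:19:00: the reseller continued removing member ports in an attempt to fix the Eth-Trunk interworking problem;
21:22:00: a continuous ping to the Loopback address of the SA02 switch started failing;
21:22:00: immediately began shutting down the interconnect interfaces between the SA01 switch and the MCA switches;
21:23:20: reported to management that the operation of opening up Layer 2 between the switches was affecting services;
21:22:00~21:24:30: shut down, one after another, the 8 interfaces interconnecting the SA01 switch and the MCA switches;
21:24:35: the continuous ping to the SA02 Loopback address recovered; the network was restored.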
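As a quick check of the timeline above, the observed outage window can be computed from the two ping timestamps. This is a minimal Python sketch; the helper name `seconds_between` is ours and is not part of any tool used on site. It covers only the window in which the ping was seen failing, so the real impact may have started earlier.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Return the seconds elapsed from start to end (same day, HH:MM:SS)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the timeline above.
ping_lost = "21:22:00"      # continuous ping to SA02 Loopback starts failing
ping_restored = "21:24:35"  # continuous ping recovers

print(f"Observed ping outage: {seconds_between(ping_lost, ping_restored)} s")
```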
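The customer's business side reported that services in two zones were affected, along with 6 UnionPay transactions.
# 20:50:33 Cables connected; Eth-Trunk17 and Eth-Trunk18 on MCA01 both came up: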
Nov 5 2022 20:50:33+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=Eth-Trunk18, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
Nov 5 2022 20:50:33+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=Eth-Trunk17, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk17)
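# 20:53:19 LACP on MCA01 Eth-Trunk17 went down because member port 10GE7/0/19 received inconsistent remote interface information (the remote port key differed from that of the other members):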
Nov 5 2022 20:53:19+08:00 SHN-P-MCA-SW01 %LACP/4/LACP_STATE_DOWN(l):CID=0x804804ba;The LACP state is down. (PortName=10GE7/0/19, TrunkName=Eth-Trunk17, LastReceivePacketTime=[2022-11-05 20:53:18:605+08:00], Reason=The remote portkey in the LACPDU received from this interface was different from other members. Please check the remote members bandwidths, duplex modes, or Eth-Trunk IDs.)
Nov 5 2022 20:53:19+08:00 SHN-P-MCA-SW01 %LACP/2/hwLacpNegotiateFailed_active(l):CID=0x807a0405-alarmID=0x09360000;The member of LAG negotiation failed. (TrunkIndex=5, PortIfIndex=77, TrunkId=17, TrunkName=Eth-Trunk17, PortName=10GE7/0/19, Reason=A link fault occurred or negotiation information synchronization failed.)
# 21:05:49 The reseller bounced Cisco port Gi1/14 (shutdown, then no shutdown); CE12800 Eth-Trunk17 member port 10GE7/0/19 went physically down and then came back up:
Nov 5 2022 21:05:49+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=10GE7/0/19, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk17)
Nov 5 2022 21:05:47+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_active(l):CID=0x807a0405-alarmID=0x08520003;The interface status changes. (ifName=10GE7/0/19, AdminStatus=UP, OperStatus=DOWN, Reason=Interface physical link is down, mainIfname=Eth-Trunk17)
# 21:12:04 The reseller bounced Cisco port Gi1/15 (shutdown, then no shutdown); CE12800 Eth-Trunk17 member port 10GE7/0/20 went physically down and then came back up:
Nov 5 2022 21:12:04+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=10GE7/0/20, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk17)
Nov 5 2022 21:11:54+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_active(l):CID=0x807a0405-alarmID=0x08520003;The interface status changes. (ifName=10GE7/0/20, AdminStatus=UP, OperStatus=DOWN, Reason=Interface physical link is down, mainIfname=Eth-Trunk17)
# 21:13:19 After the reseller bounced the Eth-Trunk17 member ports again, LACP was still down (it never came up):
Nov 5 2022 21:13:19+08:00 SHN-P-MCA-SW01 %LACP/2/hwLacpNegotiateFailed_active(l):CID=0x807a0405-alarmID=0x09360000;The member of LAG negotiation failed. (TrunkIndex=5, PortIfIndex=77, TrunkId=17, TrunkName=Eth-Trunk17, PortName=10GE7/0/19, Reason=A link fault occurred or negotiation information synchronization failed.)
Nov 5 2022 21:13:18+08:00 SHN-P-MCA-SW01 %LACP/4/LACP_STATE_DOWN(l):CID=0x804804ba;The LACP state is down. (PortName=10GE7/0/19, TrunkName=Eth-Trunk17, LastReceivePacketTime=[2022-11-05 21:13:18:728+08:00], Reason=The remote interface was not selected. Please check the remote interface s status and configurations.)
Nov 5 2022 21:12:08+08:00 SHN-P-MCA-SW01 %LACP/2/hwLacpNegotiateFailed_clear(l):CID=0x807a0405-alarmID=0x09360000-clearType=service_resume;Link negotiation failure is resumed. (TrunkIndex=5, PortIfIndex=78, TrunkId=17, TrunkName=Eth-Trunk17, PortName=10GE7/0/20, Reason=The link fault was rectified and negotiation information was synchronized.)
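# LACP negotiation failed because the remote interface had not joined the aggregation (the Aggregation flag in the received LACPDU was 0):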
Nov 5 2022 21:13:18.713+08:00 SHN-P-MCA-SW01 %LACP/6/PDU_STE_CHANGE(D):CID=0x804804ba;The Actor_State of received PDU packets changed. (PortName=10GE7/0/19, ReceivedPDUActorstate=10000000, ReceivedPDUPartnerstate=10111100, LocalActorstate=10111100)
Nov 5 2022 21:13:18.713+08:00 SHN-P-MCA-SW01 %LACP/6/MUX_STE_CHANGE(D):CID=0x804804ba;The state in the MUX state machine changes. (TrunkName=Eth-trunk17, PortName=10GE7/0/19, MuxOldStatus=COLLECTING_DISTRIBUTING, MuxNewStatus=DETACHED)
Nov 5 2022 21:13:18.718+08:00 SHN-P-MCA-SW01 %LACP/7/LACP_SELECT_REASON(D):CID=0x807a0405;The state of Eth-trunk17 s port 10GE7/0/19 is changed from SELECTED to UNSELECTED for the reason of RemoteWontAgg.
Nov 5 2022 21:13:18.728+08:00 SHN-P-MCA-SW01 %LACP/6/PDU_STE_CHANGE(D):CID=0x804804ba;The Actor_State of received PDU packets changed. (PortName=10GE7/0/19, ReceivedPDUActorstate=01000000, ReceivedPDUPartnerstate=00101100, LocalActorstate=10100000)
Nov 5 2022 21:13:19.685+08:00 SHN-P-MCA-SW01 %OPS/6/OPS_DIAG_USERDEFINED_INFORMATION(D):CID=0x80c2272b;2022-11-05 21:13:18+08:00;LACP negotiation failed because the remote interface was not selected. Please check the remote interface s status and configurations. (Interface=10GE7/0/19, Eth-Trunk17) (user="_lacp_mtp.py", session=52)
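The state strings in the PDU_STE_CHANGE logs above can be decoded flag by flag. The sketch below is a minimal Python illustration, not a Huawei tool. The flag names come from the LACP Actor/Partner State octet in IEEE 802.1AX. The bit order (leftmost character = Activity, the least significant bit) is our assumption. It is consistent with the note above that the Aggregation flag is 0, and with a local port that was in COLLECTING_DISTRIBUTING.

```python
# Flag names of the LACP Actor/Partner State octet (IEEE 802.1AX), bit 0 first.
# ASSUMPTION: the log prints bit 0 (Activity) as the leftmost character.
LACP_FLAGS = (
    "activity", "timeout", "aggregation", "synchronization",
    "collecting", "distributing", "defaulted", "expired",
)


def decode_lacp_state(bits: str) -> dict:
    """Map an 8-character 0/1 string from the log to named LACP flags."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError(f"expected 8 binary digits, got {bits!r}")
    return {name: bit == "1" for name, bit in zip(LACP_FLAGS, bits)}


# Values copied from the 21:13:18.713 log line above.
for label, bits in (("ReceivedPDUActorstate", "10000000"),
                    ("LocalActorstate", "10111100")):
    flags_set = [name for name, on in decode_lacp_state(bits).items() if on]
    print(f"{label}={bits}: flags set = {flags_set}")
```

Under that assumption, the received actor state has only Activity set, so the Cisco port is advertising itself as not aggregatable. This matches the RemoteWontAgg reason in the LACP_SELECT_REASON log.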
# 21:14:33 The reseller shut down Cisco ports Gi2/14 and Gi2/15; CE12800 Eth-Trunk18 member ports 10GE8/0/19 and 10GE8/0/20 went physically down:
Nov 5 2022 21:14:33+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_active(l):CID=0x807a0405-alarmID=0x08520003;The interface status changes. (ifName=10GE8/0/20, AdminStatus=UP, OperStatus=DOWN, Reason=Interface physical link is down, mainIfname=Eth-Trunk18)
Nov 5 2022 21:14:33+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_active(l):CID=0x807a0405-alarmID=0x08520003;The interface status changes. (ifName=10GE8/0/19, AdminStatus=UP, OperStatus=DOWN, Reason=Interface physical link is down, mainIfname=Eth-Trunk18)
# 21:19:39 The reseller re-enabled Cisco ports Gi2/14 and Gi2/15; CE12800 Eth-Trunk18 member ports 10GE8/0/19 and 10GE8/0/20 came physically up and Eth-Trunk18 came up:
Nov 5 2022 21:19:39+08:00 SHN-P-MCA-SW01 %LACP/2/hwLacpTotalLinkLoss_clear(l):CID=0x807a0405-alarmID=0x09360001-clearType=service_resume;Link bandwidth lost totally is resumed. (TrunkIndex=6, TrunkIfIndex=238, TrunkId=18, TrunkName=Eth-Trunk18, Reason=Link is selected.)
Nov 5 2022 21:19:38+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=Eth-Trunk18, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
Nov 5 2022 21:19:33+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=10GE8/0/20, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
Nov 5 2022 21:19:29+08:00 SHN-P-MCA-SW01 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=10GE8/0/19, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
# 20:50:41 Cables connected; Eth-Trunk17 and Eth-Trunk18 on MCA02 both came up:
Nov 5 2022 20:50:41+08:00 SHN-P-MCA-SW02 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=Eth-Trunk17, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk17)
Nov 5 2022 20:50:41+08:00 SHN-P-MCA-SW02 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=Eth-Trunk18, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
# 21:14:33~21:19:33 The reseller shut down Cisco ports Gi1/14 and Gi1/5 and then re-enabled both; Eth-Trunk18 then came up and its members were selected by LACP:
Nov 5 2022 21:14:33+08:00 SHN-P-MCA-SW02 %IFNET/2/linkDown_active(l):CID=0x807a0405-alarmID=0x08520003;The interface status changes. (ifName=10GE8/0/20, AdminStatus=UP, OperStatus=DOWN, Reason=Interface physical link is down, mainIfname=Eth-Trunk18)
Nov 5 2022 21:14:33+08:00 SHN-P-MCA-SW02 %IFNET/2/linkDown_active(l):CID=0x807a0405-alarmID=0x08520003;The interface status changes. (ifName=10GE8/0/19, AdminStatus=UP, OperStatus=DOWN, Reason=Interface physical link is down, mainIfname=Eth-Trunk18)
……
Nov 5 2022 21:19:33+08:00 SHN-P-MCA-SW02 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=10GE8/0/20, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
Nov 5 2022 21:19:33+08:00 SHN-P-MCA-SW02 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=10GE8/0/19, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
Nov 5 2022 21:19:38+08:00 SHN-P-MCA-SW02 %LACP/2/hwLacpTotalLinkLoss_clear(l):CID=0x807a0405-alarmID=0x09360001-clearType=service_resume;Link bandwidth lost totally is resumed. (TrunkIndex=6, TrunkIfIndex=236, TrunkId=18, TrunkName=Eth-Trunk18, Reason=Link is selected.)
Nov 5 2022 21:19:38+08:00 SHN-P-MCA-SW02 %IFNET/2/linkDown_clear(l):CID=0x807a0405-alarmID=0x08520003-clearType=service_resume;The interface status changes. (ifName=Eth-Trunk18, AdminStatus=UP, OperStatus=UP, Reason=Interface physical link is up, mainIfname=Eth-Trunk18)
# 21:20:49~21:22:00 MAC flapping occurred between Eth-Trunk17 and Eth-Trunk18 on MCA02 (all flapping MACs belong to Cisco devices):
Nov 5 2022 21:22:00+08:00 SHN-P-MCA-SW02 %FEI_COMM/4/hwMflpVlanLoopAlarm_active(l):CID=0x807f0485-alarmID=0x095e0012;MAC flapping detected, VlanId = 133, Original-Port = Eth-Trunk18, Flapping port 1 = Eth-Trunk17, port 2 = -. Check the network connected to the interface learning a flapping MAC address : 0000-0c07-ac01.
Nov 5 2022 21:22:00+08:00 SHN-P-MCA-SW02 %FEI_COMM/4/hwMflpVlanLoopAlarm_active(l):CID=0x807f0485-alarmID=0x095e0012;MAC flapping detected, VlanId = 131, Original-Port = Eth-Trunk18, Flapping port 1 = Eth-Trunk17, port 2 = -. Check the network connected to the interface learning a flapping MAC address : 0000-0c07-ac01.
Nov 5 2022 21:21:41+08:00 SHN-P-MCA-SW02 %FEI_COMM/4/hwMflpVlanLoopAlarm_active(l):CID=0x807f0485-alarmID=0x095e0012;MAC flapping detected, VlanId = 132, Original-Port = Eth-Trunk18, Flapping port 1 = Eth-Trunk17, port 2 = -. Check the network connected to the interface learning a flapping MAC address : 0000-0c07-ac01.
Nov 5 2022 21:21:40+08:00 SHN-P-MCA-SW02 %FEI_COMM/4/hwMflpVlanLoopAlarm_active(l):CID=0x807f04e4-alarmID=0x095e0012;MAC flapping detected, VlanId = 152, Original-Port = Eth-Trunk18, Flapping port 1 = Eth-Trunk17, port 2 = -. Check the network connected to the interface learning a flapping MAC address : 0000-0c07-ac01.
Nov 5 2022 21:20:49+08:00 SHN-P-MCA-SW02 %FEI_COMM/4/hwMflpVlanLoopAlarm_active(l):CID=0x807f0485-alarmID=0x095e0012;MAC flapping detected, VlanId = 130, Original-Port = Eth-Trunk18, Flapping port 1 = Eth-Trunk17, port 2 = -. Check the network connected to the interface learning a flapping MAC address : ac1f-6b22-b8b8.
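To see at a glance which VLANs and MAC addresses were involved, the hwMflpVlanLoopAlarm lines can be grouped with a short script. This is a minimal Python sketch. The regular expression is written against the message text shown above, and the function name `summarize_flaps` and the shortened sample lines are ours for illustration.

```python
import re
from collections import defaultdict

# Matches the variable part of the "MAC flapping detected" message shown above.
PATTERN = re.compile(
    r"VlanId = (?P<vlan>\d+), Original-Port = (?P<orig>[\w-]+), "
    r"Flapping port 1 = (?P<flap>[\w-]+).*?"
    r"flapping MAC address : (?P<mac>[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4})"
)


def summarize_flaps(lines):
    """Return {mac: sorted list of VLAN IDs} for all matching log lines."""
    by_mac = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            by_mac[match["mac"]].add(int(match["vlan"]))
    return {mac: sorted(vlans) for mac, vlans in by_mac.items()}


# Two log lines from above, shortened to the message body.
sample = [
    "MAC flapping detected, VlanId = 133, Original-Port = Eth-Trunk18, "
    "Flapping port 1 = Eth-Trunk17, port 2 = -. Check the network connected "
    "to the interface learning a flapping MAC address : 0000-0c07-ac01.",
    "MAC flapping detected, VlanId = 130, Original-Port = Eth-Trunk18, "
    "Flapping port 1 = Eth-Trunk17, port 2 = -. Check the network connected "
    "to the interface learning a flapping MAC address : ac1f-6b22-b8b8.",
]

for mac, vlans in summarize_flaps(sample).items():
    print(mac, vlans)
```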
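While troubleshooting the unselected Eth-Trunk member interfaces, the reseller, relying on past experience, removed the member ports from the Eth-Trunk (Port-Channel) on the Cisco devices. Because the Layer 2 physical ports stayed up, they formed a loop in VLAN 1 with the original Layer 2 Eth-Trunk interface (VLAN 1 was carrying customer services). This operation triggered err-disable on the interfaces between the two Cisco switches.
While troubleshooting the Unselected state, the reseller, relying on experience from previous changes, removed the member ports from the aggregation interface on the Cisco side one by one without shutting them down first. This was outside the approved change plan. It triggered err-disable on the interfaces between the two Cisco switches, which affected routing for the networks attached below the Cisco switches and ultimately interrupted customer services.
As an emergency measure, all interfaces involved in the change were shut down, and services returned to normal.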
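1. VLAN 1 is a special VLAN; avoid running services on it.
2. When adding a member port to or removing one from an Eth-Trunk, shut the member port down first, then perform the add/remove operation.
3. During an RFC (change) operation, do not perform any action outside the approved change plan.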
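Recommendation 2 can be turned into a simple pre-check on a planned command list. The following is a hypothetical Python sketch, not an existing tool. It assumes Cisco IOS style command strings (`interface ...`, `shutdown`, `no shutdown`, `channel-group ...`, `no channel-group`), and it assumes every interface is up unless the plan itself shuts it down. The function name `unsafe_channel_changes` is ours.

```python
def unsafe_channel_changes(commands):
    """Return interfaces whose channel-group membership changes while up.

    ASSUMPTION: an interface is treated as up until the plan issues
    'shutdown' under it; 'no shutdown' marks it as up again.
    """
    current = None
    is_shut = {}
    offenders = []
    for raw in commands:
        line = raw.strip()
        cmd = line.lower()
        if cmd.startswith("interface "):
            current = line.split(None, 1)[1]
            is_shut.setdefault(current, False)
        elif current is None:
            continue  # ignore commands issued outside interface context
        elif cmd == "shutdown":
            is_shut[current] = True
        elif cmd == "no shutdown":
            is_shut[current] = False
        elif cmd.startswith(("channel-group", "no channel-group")):
            if not is_shut[current] and current not in offenders:
                offenders.append(current)
    return offenders


# Hypothetical plans for illustration only.
risky_plan = ["interface Gi1/14", "no channel-group"]
safe_plan = ["interface Gi1/14", "shutdown", "no channel-group"]

print("risky plan offenders:", unsafe_channel_changes(risky_plan))
print("safe plan offenders:", unsafe_channel_changes(safe_plan))
```

A check like this only validates the ordering within the plan. It does not replace verifying the actual interface state on the device before the change.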