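S7712 MPU Receives a Large Number of TC Packets and Restarts Abnormally Due to Frequent ARP Entry Refreshes
Problem Description
I. Fault Symptoms
A customer reported that the S7712 core switch of their LAN was abnormal: the RUN indicators on all service boards blinked fast, all port indicators were off, and downstream services were interrupted. Service was restored in an emergency by power-cycling the device.
II. Device Version/Patch Information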
V200R003C00SPC500+Null
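III. Service Impact
This switch is the core switch of the customer's LAN. Nearly 2,000 users are attached to it, including critical services such as the customer's portal website, so the service impact was significant.
Troubleshooting Process
1. The on-site engineer confirmed that during the fault the port indicators on all service boards were off and the system indicators on all interface boards were blinking green fast.
2. The logs show that the MPU in slot 13 was the active MPU and the MPU in slot 14 was the standby MPU. At about 14:57:28 the slot 14 MPU became the active MPU and reported a fault on its interconnect channel with the slot 13 MPU, so the slot 13 MPU is suspected to have reset abnormally at that time. Because the device was powered off, some of the logs from just before the power-off were not recorded.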
Aug 31 2017 14:57:28+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[637]:No.0 channel from slot 1/14 to slot 1/13 is faulty.
Aug 31 2017 14:57:28+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[638]:All channels from slot 1/14 to slot 1/13 are faulty.
3. After the slot 14 MPU became the active MPU, all HG ports interconnecting it with the interface boards went DOWN at 14:58:04.
%2017-Aug-31 14:58:04.900.1+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1415]:Slot 14 layer DRV module AV level INFO: unit 0 hg5 change to down.
%2017-Aug-31 14:58:04.900.2+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1416]:Slot 14 layer DRV module AV level INFO: unit 0 hg8 change to down.
%2017-Aug-31 14:58:04.900.3+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1417]:Slot 14 layer DRV module AV level INFO: unit 0 hg12 change to down.
%2017-Aug-31 14:58:05.910.1+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1418]:Slot 14 layer DRV module AV level INFO: unit 0 hg13 change to down.
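4. The logs show that the communication channels between all interface boards and the MPU failed. The interface boards could not receive heartbeat packets from the MPU and reset, which interrupted services.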
Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[639]:No.0 channel from slot 1/14 to slot 1/1 is faulty.
Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[640]:All channels from slot 1/14 to slot 1/1 are faulty.
Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[641]:No.0 channel from slot 1/14 to slot 1/8 is faulty.
Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[642]:All channels from slot 1/14 to slot 1/8 are faulty.
Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[643]:No.0 channel from slot 1/14 to slot 1/3 is faulty.
Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[644]:All channels from slot 1/14 to slot 1/3 are faulty.
Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[645]:No.0 channel from slot 1/14 to slot 1/12 is faulty.
Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[646]:All channels from slot 1/14 to slot 1/12 are faulty.
5. Both before and after the fault, the device frequently received large numbers of TC packets.
Aug 31 2017 13:31:53+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12968]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:33:22+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12973]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:33:56+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12979]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:35:09+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12986]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:35:14+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12989]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:35:24+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12993]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:36:08+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12996]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:36:22+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13002]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:36:43+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13005]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:37:24+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13009]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:37:50+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13014]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:38:27+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13017]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:39:51+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13024]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:39:55+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13027]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:39:59+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13030]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:40:31+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13035]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
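To judge how often TC BPDUs arrive and on which port, the RECEIVE_MSTITC entries can be counted offline. The following is a minimal Python sketch, not a vendor tool: the function name count_tc_events and the regular expression are assumptions based only on the log format shown in this case, and the sample contains four entries copied from the excerpt.

```python
import re
from collections import Counter

# Four entries copied from the log excerpt in this case; a real run would
# read the complete exported log file instead.
SAMPLE_LOG = """\
Aug 31 2017 13:35:09+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12986]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:35:14+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12989]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:35:24+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12993]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:36:08+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12996]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
"""

# Assumed pattern: timestamp (captured to minute precision), hostname,
# the RECEIVE_MSTITC mnemonic, and the port name before the final period.
TC_ENTRY = re.compile(
    r"(?P<minute>[A-Z][a-z]{2} +\d{1,2} \d{4} \d{2}:\d{2}):\d{2}\S* \S+ "
    r"%%01MSTP/6/RECEIVE_MSTITC\(l\)\[\d+\]:.*?port name is (?P<port>\S+?)\.(?:\s|$)"
)


def count_tc_events(log_text):
    """Return (per-port counts, per-minute counts) of received TC BPDUs."""
    per_port = Counter()
    per_minute = Counter()
    for match in TC_ENTRY.finditer(log_text):
        per_port[match.group("port")] += 1
        per_minute[match.group("minute")] += 1
    return per_port, per_minute


if __name__ == "__main__":
    ports, minutes = count_tc_events(SAMPLE_LOG)
    for port, count in ports.most_common():
        print(port, count)
    for minute, count in sorted(minutes.items()):
        print(minute, count)
```

A single port dominating the per-port count, as GigabitEthernet1/1/10 does in this case, points to the downstream segment whose topology is flapping.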
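6. Before this incident, while the slot 14 MPU was acting as the active MPU, it had already restarted abnormally after frequently receiving large numbers of TC packets.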
Aug 30 2017 07:50:46+08:00 2_L_YCS_C_S7712-0001 %%01ALML/4/ENTRESET(l)[4973]:MPU frame[1] board[14] is reset. The reason is: VRP reset selfboard because of find deadloop.
7. The MPU recorded multiple exception records in which TC-triggered ARP refreshes led to a dead loop.
============ Task Infinite Loop Information Begin ============
Dopra Version = DOPRA V100R006C09CP0671
Application Version = VRPV500R013C00SPC295-GR
Task Infinite Loop Type = Task overrun
Task Infinite Loop Handle = Suspend Task
Task Infinite Loop CpuId = 13
Overrun Task Name = L2IF
Overrun Task VOS ID = 214
Overrun Task Osal ID = 0x0883fba0
Task Overrun Threshold = 20000 (ms)
Task Has-run Time = 20000 (ms)
Task Infinite Loop Occur Time = [2017.08.29 07:31:08]
Task Infinite Loop Occur Cputick = [0x0000187c, 0xd11773bb]
Task switch trace info before task infinite loop:
From cputick [0,0] to cputick [0x187c,0xd11773bb]
-------------------------------------------------------------
No. TaskName VosTID OsalTID Prio RunTime[s, ns]
No Task switch trace info!!!
============ Task Infinite Loop Information Begin ============
Dopra Version = DOPRA V100R006C09CP0671
Application Version = VRPV500R013C00SPC295-GR
Task Infinite Loop Type = Task overrun
Task Infinite Loop Handle = Suspend Task
Task Infinite Loop CpuId = 13
Overrun Task Name = L2IF
Overrun Task VOS ID = 214
Overrun Task Osal ID = 0x08841300
Task Overrun Threshold = 20000 (ms)
Task Has-run Time = 20000 (ms)
Task Infinite Loop Occur Time = [2017.08.29 17:57:23]
Task Infinite Loop Occur Cputick = [0x000017e5, 0x932a814c]
Task switch trace info before task infinite loop:
From cputick [0,0] to cputick [0x17e5,0x932a814c]
-------------------------------------------------------------
No. TaskName VosTID OsalTID Prio RunTime[s, ns]
No Task switch trace info!!!
Corresponding task call stack info:
-------------------------------------------------------------
============ Task Infinite Loop Information Begin ============
Dopra Version = DOPRA V100R006C09CP0671
Application Version = VRPV500R013C00SPC295-GR
Task Infinite Loop Type = Task overrun
Task Infinite Loop Handle = Suspend Task
Task Infinite Loop CpuId = 13
Overrun Task Name = L2IF
Overrun Task VOS ID = 214
Overrun Task Osal ID = 0x088444c0
Task Overrun Threshold = 20000 (ms)
Task Has-run Time = 20000 (ms)
Task Infinite Loop Occur Time = [2017.08.29 20:18:33]
Task Infinite Loop Occur Cputick = [0x0000055e, 0x196312b2]
Task switch trace info before task infinite loop:
From cputick [0,0] to cputick [0x55e,0x196312b2]
-------------------------------------------------------------
No. TaskName VosTID OsalTID Prio RunTime[s, ns]
No Task switch trace info!!!
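When a diagnostic dump holds many such records, listing which task overran and when makes it easier to line the dead loops up with the TC bursts in the system log. The sketch below is an illustration only: the helper name parse_overrun_records is hypothetical, the field names are taken from the records in this case, and the sample is abbreviated to the two fields that are parsed.

```python
import re
from datetime import datetime

# Abbreviated sample: only the parsed fields of the three records are kept.
SAMPLE_DUMP = """\
============ Task Infinite Loop Information Begin ============
Overrun Task Name = L2IF
Task Infinite Loop Occur Time = [2017.08.29 07:31:08]
============ Task Infinite Loop Information Begin ============
Overrun Task Name = L2IF
Task Infinite Loop Occur Time = [2017.08.29 17:57:23]
============ Task Infinite Loop Information Begin ============
Overrun Task Name = L2IF
Task Infinite Loop Occur Time = [2017.08.29 20:18:33]
"""

RECORD_HEADER = re.compile(r"=+ Task Infinite Loop Information Begin =+")
TASK_NAME = re.compile(r"Overrun Task Name = (\S+)")
OCCUR_TIME = re.compile(
    r"Task Infinite Loop Occur Time = \[(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2})\]"
)


def parse_overrun_records(text):
    """Return a list of (task name, occurrence time) for each record."""
    records = []
    # Everything before the first header is not a record, so skip element 0.
    for chunk in RECORD_HEADER.split(text)[1:]:
        name = TASK_NAME.search(chunk)
        when = OCCUR_TIME.search(chunk)
        if name and when:
            occurred = datetime.strptime(when.group(1), "%Y.%m.%d %H:%M:%S")
            records.append((name.group(1), occurred))
    return records


if __name__ == "__main__":
    for task, occurred in parse_overrun_records(SAMPLE_DUMP):
        print(task, occurred.isoformat(sep=" "))
```

Because the patterns search within each record rather than line by line, the same function also works on a dump whose fields have been flattened onto one line.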
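Root Cause
1. The active MPU in slot 13 received a large number of TC packets and refreshed its ARP entries frequently, which triggered a known issue and caused the MPU to restart abnormally. After the slot 14 MPU became the active MPU, a hardware fault on that board caused its communication channels to all interface boards to fail. The interface boards received no heartbeats and reset, which interrupted all downstream services.
2. The known issue is: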
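Solution
1. The device has no patch loaded. Load patch V200R003SPH022 to prevent the problem from recurring.
2. Configure the following optimization commands on the device:
1) stp tc-protection. When the device receives TC packets frequently, this command ensures that it processes at most one entry refresh in each 2-second period, which reduces the excessive CPU load caused by frequent MAC and ARP entry refreshes.
2) arp topology-change disable and mac-address update arp. By default, a device that receives a TC packet clears its MAC entries and ages its ARP entries. When the device holds many ARP entries, relearning them generates excessive ARP traffic on the network. With these two commands configured, when the network topology changes the device updates the outbound interface of each ARP entry according to the change in the outbound interface of the corresponding MAC address entry, which avoids a large number of unnecessary ARP entry refreshes.
3. Replace the MPU in slot 14. The board information is as follows: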
[Board Properties]
BoardType=ES02SRUA
BarCode=030MQS10E3000365
Item=03030MQS
Description=Quidway S7700,ES02SRUA,Quidway S7706/S7712,Main Control Unit A
Manufactured=2014-03-16
VendorName=Huawei
IssueNumber=00
CLEICode=
BOM=
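The rate limiting that step 2 attributes to stp tc-protection (at most one entry refresh per 2-second period) can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the device implementation: it uses fixed 2-second windows aligned to time zero, and the function name and parameters are hypothetical.

```python
def refreshes_with_tc_protection(tc_times, window=2.0, threshold=1):
    """Return the TC arrival times (seconds) that trigger an entry refresh.

    Simplified model: time is cut into fixed windows of `window` seconds and
    at most `threshold` TC packets per window cause a MAC/ARP refresh.
    The real device may handle excess TC packets differently.
    """
    handled = []
    current_window = None
    handled_in_window = 0
    for t in sorted(tc_times):
        index = int(t // window)
        if index != current_window:
            # A new window starts, so the per-window budget is reset.
            current_window = index
            handled_in_window = 0
        if handled_in_window < threshold:
            handled.append(t)
            handled_in_window += 1
    return handled


if __name__ == "__main__":
    burst = [0.0, 0.3, 0.9, 1.5, 2.1, 2.2, 5.0]
    print(refreshes_with_tc_protection(burst))
```

In this model, a burst of seven TC packets arriving at 0.0, 0.3, 0.9, 1.5, 2.1, 2.2 and 5.0 seconds triggers only three refreshes (at 0.0, 2.1 and 5.0) instead of seven.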
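Suggestions and Summary
1. Update device software versions and patches regularly to prevent known issues from recurring.
2. Configure optimization settings that suit the characteristics of the user's network, so that unexpected events do not cause device faults.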