S7712 MPU resets abnormally after receiving a large number of TC packets and frequently refreshing ARP entries

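Problem Description

I. Fault symptoms

A customer reported that the S7712 core switch of their LAN malfunctioned: the system RUN indicators on all service cards blinked fast, all port indicators were off, and downstream services were interrupted. Service was restored as an emergency measure by power-cycling the device.

II. Device version and patch information

V200R003C00SPC500 (no patch loaded)

III. Service impact

This switch is the core switch of the customer's LAN. It serves nearly 2000 downstream users and carries important services such as the customer's portal website, so the service impact was significant.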

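Handling Process

1. Confirmed with the on-site engineer that when the fault occurred, the port indicators on all service boards were off and the system indicators on all interface boards were blinking green fast.

2. The logs show that the MPU in slot 13 was the active MPU and the MPU in slot 14 was the standby MPU. At about 14:57:28 the slot 14 MPU became active and reported that its interconnect channel to the slot 13 MPU was faulty, which suggests that the slot 13 MPU reset abnormally at that moment. Because the device was then powered off, some of the logs from just before the power-off were not recorded.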

Aug 31 2017 14:57:28+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[637]:No.0 channel from slot 1/14 to slot 1/13 is faulty.

Aug 31 2017 14:57:28+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[638]:All channels from slot 1/14 to slot 1/13 are faulty.

3. After the slot 14 MPU took over as the active MPU, all HG ports interconnecting it with the interface boards went DOWN at 14:58:04.

%2017-Aug-31 14:58:04.900.1+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1415]:Slot 14 layer DRV module AV level INFO: unit 0 hg5 change to down.
%2017-Aug-31 14:58:04.900.2+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1416]:Slot 14 layer DRV module AV level INFO: unit 0 hg8 change to down.
%2017-Aug-31 14:58:04.900.3+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1417]:Slot 14 layer DRV module AV level INFO: unit 0 hg12 change to down.
%2017-Aug-31 14:58:05.910.1+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1418]:Slot 14 layer DRV module AV level INFO: unit 0 hg13 change to down.

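4. The logs show that the communication channels between the MPU and all interface boards were faulty. The interface boards stopped receiving heartbeat packets from the MPU and reset, which interrupted services.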

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[639]:No.0 channel from slot 1/14 to slot 1/1 is faulty.

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[640]:All channels from slot 1/14 to slot 1/1 are faulty.

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[641]:No.0 channel from slot 1/14 to slot 1/8 is faulty.

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[642]:All channels from slot 1/14 to slot 1/8 are faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[643]:No.0 channel from slot 1/14 to slot 1/3 is faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[644]:All channels from slot 1/14 to slot 1/3 are faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[645]:No.0 channel from slot 1/14 to slot 1/12 is faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[646]:All channels from slot 1/14 to slot 1/12 are faulty.
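When the log is long, the affected slots in step 4 can be pulled out programmatically. The following is a minimal Python sketch, not a vendor tool. The regular expression is an assumption modeled only on the ALL_CHANNEL_FAULTY lines shown above, and the function name `faulty_peers` is hypothetical.

```python
import re

# Pattern modeled on the ALML/0/ALL_CHANNEL_FAULTY lines above (an assumption).
ALL_FAULTY_RE = re.compile(
    r"ALML/0/ALL_CHANNEL_FAULTY.*All channels from slot (\S+) to slot (\S+) are faulty"
)


def _slot_key(slot):
    """Sort key so that '1/12' sorts after '1/8' (numeric, not lexicographic)."""
    return tuple(int(part) for part in slot.split("/"))


def faulty_peers(lines):
    """Map each reporting slot to the sorted peer slots it lost all channels to."""
    peers = {}
    for line in lines:
        m = ALL_FAULTY_RE.search(line)
        if m:
            src, dst = m.groups()
            peers.setdefault(src, set()).add(dst)
    return {src: sorted(dsts, key=_slot_key) for src, dsts in peers.items()}
```

Applied to the log lines in step 4, this groups interface-board slots 1/1, 1/3, 1/8 and 1/12 under the reporting slot 1/14, which matches the reading that the slot 14 MPU lost contact with every interface board.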

5. Before and after the fault, the device frequently received large numbers of TC packets.

Aug 31 2017 13:31:53+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12968]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:33:22+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12973]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:33:56+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12979]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:35:09+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12986]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:35:14+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12989]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:35:24+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12993]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:36:08+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12996]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:36:22+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13002]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:36:43+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13005]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:37:24+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13009]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:37:50+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13014]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:38:27+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13017]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:39:51+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13024]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:39:55+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13027]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:39:59+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13030]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:40:31+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13035]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
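To judge how often TC BPDUs arrive and on which port, the RECEIVE_MSTITC entries can be tallied. The following Python sketch is illustrative only. The regular expression is an assumption based on the log format shown above, and `summarize_tc` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Matches the MSTP TC log format shown above; the pattern itself is an assumption.
TC_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2})\S* .*"
    r"MSTP/6/RECEIVE_MSTITC.*port name is (?P<port>\S+?)\.?$"
)


def summarize_tc(lines):
    """Return ({port: TC count}, {port: mean seconds between TC events})."""
    times = {}
    for line in lines:
        m = TC_RE.match(line.strip())
        if not m:
            continue
        # The timezone suffix (+08:00) is dropped; only relative spacing is used.
        ts = datetime.strptime(m.group("ts"), "%b %d %Y %H:%M:%S")
        times.setdefault(m.group("port"), []).append(ts)

    counts = {port: len(ts_list) for port, ts_list in times.items()}
    mean_gap = {}
    for port, ts_list in times.items():
        if len(ts_list) > 1:
            ts_list.sort()
            span = (ts_list[-1] - ts_list[0]).total_seconds()
            mean_gap[port] = span / (len(ts_list) - 1)
    return counts, mean_gap
```

For the 16 entries listed in step 5 (13:31:53 to 13:40:31), all on GigabitEthernet1/1/10, this gives an average gap of roughly 35 seconds between TC events, which points to that port as the place to start looking for the source of the topology changes.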

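6. Before this fault, while the slot 14 MPU was acting as the active MPU, it had already reset abnormally once after frequently receiving large numbers of TC packets.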

Aug 30 2017 07:50:46+08:00 2_L_YCS_C_S7712-0001 %%01ALML/4/ENTRESET(l)[4973]:MPU frame[1] board[14] is reset. The reason is: VRP reset selfboard because of find deadloop.

7. The MPU holds multiple exception records of dead loops caused by TC-triggered ARP refreshes.

============ Task Infinite Loop Information Begin ============

Dopra Version                    = DOPRA V100R006C09CP0671

Application Version              = VRPV500R013C00SPC295-GR

Task Infinite Loop Type          = Task overrun

Task Infinite Loop Handle        = Suspend Task

Task Infinite Loop CpuId         = 13

Overrun Task Name                = L2IF

Overrun Task VOS ID              = 214

Overrun Task Osal ID             = 0x0883fba0

Task Overrun Threshold           = 20000 (ms)

Task Has-run Time                = 20000 (ms)

Task Infinite Loop Occur Time    = [2017.08.29  07:31:08]

Task Infinite Loop Occur Cputick = [0x0000187c, 0xd11773bb]

 

Task switch trace info before task infinite loop:

From cputick [0,0] to cputick [0x187c,0xd11773bb]

-------------------------------------------------------------

No. TaskName        VosTID  OsalTID     Prio  RunTime[s, ns]

No Task switch trace info!!!

 

============ Task Infinite Loop Information Begin ============

Dopra Version                    = DOPRA V100R006C09CP0671

Application Version              = VRPV500R013C00SPC295-GR

Task Infinite Loop Type          = Task overrun

Task Infinite Loop Handle        = Suspend Task

Task Infinite Loop CpuId         = 13

Overrun Task Name                = L2IF

Overrun Task VOS ID              = 214

Overrun Task Osal ID             = 0x08841300

Task Overrun Threshold           = 20000 (ms)

Task Has-run Time                = 20000 (ms)

Task Infinite Loop Occur Time    = [2017.08.29  17:57:23]

Task Infinite Loop Occur Cputick = [0x000017e5, 0x932a814c]

Task switch trace info before task infinite loop:

From cputick [0,0] to cputick [0x17e5,0x932a814c]

-------------------------------------------------------------

No. TaskName        VosTID  OsalTID     Prio  RunTime[s, ns]

No Task switch trace info!!!

 

Corresponding task call stack info:

-------------------------------------------------------------

 

============ Task Infinite Loop Information Begin ============

Dopra Version                    = DOPRA V100R006C09CP0671

Application Version              = VRPV500R013C00SPC295-GR

Task Infinite Loop Type          = Task overrun

Task Infinite Loop Handle        = Suspend Task

Task Infinite Loop CpuId         = 13

Overrun Task Name                = L2IF

Overrun Task VOS ID              = 214

Overrun Task Osal ID             = 0x088444c0

Task Overrun Threshold           = 20000 (ms)

Task Has-run Time                = 20000 (ms)

Task Infinite Loop Occur Time    = [2017.08.29  20:18:33]

Task Infinite Loop Occur Cputick = [0x0000055e, 0x196312b2]

Task switch trace info before task infinite loop:

From cputick [0,0] to cputick [0x55e,0x196312b2]

-------------------------------------------------------------

No. TaskName        VosTID  OsalTID     Prio  RunTime[s, ns]

No Task switch trace info!!!

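Root Cause

1. The active MPU in slot 13 received a large number of TC packets and refreshed ARP entries frequently, which triggered a known issue and caused an abnormal reset. After the slot 14 MPU became active, a hardware fault on that board caused its communication channels to all interface boards to fail. All interface boards lost the heartbeat and reset, which in turn interrupted all downstream services.

2. The known issue is:

Solution

1. The device has no patch loaded. Load patch V200R003SPH022 to prevent the problem from recurring.

2. Configure the following optimization commands on the device:

1) stp tc-protection. When the device frequently receives TC packets, this command ensures that it processes at most one table refresh in each 2-second period, which reduces the CPU load caused by frequent MAC and ARP entry refreshes.

2) arp topology-change disable and mac-address update arp. By default, when the device receives a TC packet it clears MAC entries and ages ARP entries. When the device holds many ARP entries, relearning them floods the network with ARP packets. With these two commands configured, when the network topology changes the device updates the outbound interface of an ARP entry according to the change in the outbound interface of the corresponding MAC address, which avoids a large number of unnecessary ARP entry refreshes.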
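The effect of the TC protection described in 1) can be pictured with a small simulation. The Python sketch below is a simplified model written for illustration, not the device's actual implementation. It assumes fixed windows starting at t=0 and simply does not act on excess TC events within a window. The function name `throttled_flushes` and its parameters are hypothetical, and the defaults (2-second interval, one refresh per interval) are taken from the description above.

```python
def throttled_flushes(tc_times, interval=2.0, threshold=1):
    """Count table flushes when at most `threshold` TC events are acted on
    per `interval` seconds.

    Simplified model (an assumption): fixed windows starting at t=0, and
    excess TC events inside a window trigger no flush in this model.
    """
    handled_per_window = {}
    flushes = 0
    for t in tc_times:
        window = int(t // interval)
        handled = handled_per_window.get(window, 0)
        if handled < threshold:
            handled_per_window[window] = handled + 1
            flushes += 1
    return flushes


# 100 TC events spaced 0.1 s apart (10 seconds of continuous TC traffic).
burst = [i / 10 for i in range(100)]
unprotected = len(burst)              # every TC event triggers a flush
protected = throttled_flushes(burst)  # at most one flush per 2 s window
print(unprotected, protected)         # 100 5
```

In this model a 10-second burst of 100 TC events results in 5 table flushes instead of 100, which illustrates why rate-limiting TC handling keeps MAC and ARP refresh work bounded during a TC storm.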

3. Replace the MPU in slot 14. Board information:

[Board Properties]

BoardType=ES02SRUA

BarCode=030MQS10E3000365

Item=03030MQS

Description=Quidway S7700,ES02SRUA,Quidway S7706/S7712,Main Control Unit A

Manufactured=2014-03-16

VendorName=Huawei

IssueNumber=00

CLEICode=

BOM=

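Recommendations and Summary

1. Update the device version and patches regularly to prevent known issues from recurring.

2. Configure optimization settings suited to the characteristics of the customer's network so that unexpected events do not cause device faults.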
