S9703: Handling High ICMP Reply Latency

Problem Description

1) Devices and versions involved:

Device type   Version             Patch
S9703         V200R007C00SPC500   V200R007C00SPH003

2) Network topology:

 

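Two Huawei S9703 switches are deployed with VRRP. Two encryption devices from another vendor are connected below the S9703s and run as an active/standby pair. Both the active and the standby encryption device use ping to check connectivity to the S9703 VRRP virtual address. If the VRRP master switch does not answer a ping request within 2 s, or the reply takes longer than 2 s, the encryption devices perform an active/standby switchover, which affects services.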
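The detection rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this case, not the encryption device's actual implementation: the function names are hypothetical, and the sketch assumes that a single failed probe is enough to trigger a switchover, which the case description does not state explicitly.

```python
from typing import Optional, Sequence

PROBE_TIMEOUT_S = 2.0  # reply deadline taken from the case description


def probe_failed(rtt_s: Optional[float], timeout_s: float = PROBE_TIMEOUT_S) -> bool:
    """Return True when a probe counts as failed.

    rtt_s is the measured round-trip time in seconds, or None when no
    reply arrived at all.
    """
    return rtt_s is None or rtt_s > timeout_s


def first_switchover(rtts: Sequence[Optional[float]],
                     timeout_s: float = PROBE_TIMEOUT_S) -> Optional[int]:
    """Return the index of the first probe that would trigger a switchover.

    Assumption: one failed probe is sufficient. Returns None when every
    probe was answered in time.
    """
    for index, rtt in enumerate(rtts):
        if probe_failed(rtt, timeout_s):
            return index
    return None
```

With hypothetical samples such as `[0.001, 0.5, 2.3, 0.002]`, the third probe (index 2) exceeds the 2 s deadline even though the reply did eventually arrive, which is the situation this case is about.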

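3) Fault description:

Recently the customer's network management system has frequently reported large numbers of alarms together with network interruptions. Each time, troubleshooting showed the same cause: the active encryption device's ping probe to the S9703 virtual address timed out, which triggered an active/standby switchover of the encryption devices.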

 

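Alarm Information

Handling Process

Packets were captured on the uplink interface of the encryption device and on the inbound interface of the S9703. The analysis showed:

1) On the encryption device, the captured packets show that ICMP replies arrive slowly.

2) On the switch, the captured packets likewise show that ICMP replies are slow.

3) The switch logs contain no abnormal records at the times when ping replies were slow. The cpu-defend statistics show that, historically, a large number of packets have been sent to the CPU.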

<ZJ_HUAWEI_9703_1>dis cpu-d  stat all

Warning: This feature is not supported on slot mainboard.
Statistics on slot 1:
--------------------------------------------------------------------------------
Packet Type          Pass(Packet/Byte)   Drop(Packet/Byte)  Last-dropping-time
--------------------------------------------------------------------------------
arp-miss                       7343836              960106  2016-09-14 04:10:06
                             891503214           325494473
arp-reply                      4876208               23516  2016-08-29 18:20:02
                             334333392             1599104
arp-request                  232496454             4233354  2016-04-09 11:21:47
                           15804222740           287865652
bgp                            3181563                 440  2016-08-03 00:51:46
                             371346685              260423
fib-hit                         161734               72703  2016-09-10 23:40:05
                              29547653           107473804
ftp                               2102                   0  -
                                138596                   0
gre-keepalive                        0                   0  -
                                     0                   0
http                            274865                1212  2016-07-14 02:51:49
                              31326864              137559
https                            34238                   0  -
                               3527913                   0
hw-tacacs                      2716845                   0  -
                             212097479                   0
icmp                         287418486                  53  2016-08-23 17:30:00
                           22470675492               44572
isis                          45789391               40256  2015-09-30 06:21:42
                           26074346660            60818641
lnp                            2043171                   0  -
                             138935628                   0
mpls-fib-hit                  32739182                   0  -
                            3874680204                   0
mpls-ldp                      11232207                   0  -
                             856159527                   0
ntp                             115109                   0  -
                              11281711                   0
portal                               3                   0  -
                                  1379                   0
                                  1841                   0
snmp                          95751077                  53  2015-11-27 17:11:42
                           15349243345                5837
ssh                            1203605                   0  -
                             186181287                   0
tcp                             731760                1196  2016-08-09 00:00:00
                              51531196              206724
telnet                        13985488                4188  2016-09-06 16:00:04
                             905218755              281034
ttl-expired                   33014539              389563  2016-09-14 10:30:06
                            3707542127           415230981
vbst                         633733009                   0  -
                           43545519488                   0
vrrp                          79982015               12425  2016-08-12 16:40:00
                            5168775986              795200
wapi                                 0                   0  -
                                     0                   0
--------------------------------------------------------------------------------
Statistics on slot 2:
--------------------------------------------------------------------------------
Packet Type          Pass(Packet/Byte)   Drop(Packet/Byte)  Last-dropping-time
--------------------------------------------------------------------------------
arp-miss                        880393                 642  2016-06-26 22:51:49
                              74735581              194991
arp-reply                        53388                   0  -
                               3416832                   0
arp-request                    1543151                   0  -
                              98761664                   0
bgp                                179                   0  -
                                 11492                   0
fib-hit                          54886                  51  2016-06-02 15:41:49
                               7495294               23376
ftp                               1161                   0  -
                                 78740                   0
http                             13253                   0  -
                                888195                   0
https                             2870                   0  -
                                190520                   0
hw-tacacs                            0                   0  -
                                     0                   0
icmp                          33472905                  11  2015-12-21 06:31:42
                            3247336545               13846
ntp                               2380                   0  -
                                278338                   0
snmp                               931                   0  -
                                100534                   0
ssh                              11396                   0  -
                                770188                   0
tcp                             256108                   0  -
                              17004307                   0
telnet                          128531                   0  -
                              10159550                   0
ttl-expired                       1689                   0  -
                                111392                   0

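4) Conclusion from the information above: when ping replies are slow, the switch is not dropping the packets. The cpu-defend statistics show that many kinds of protocol packets are sent to the CPU. The analysis is that, at the moments of high latency, a large volume of protocol packets is being sent to the CPU, and the switch handles ICMP packets at a relatively low priority. Because low-priority protocol packets are not scheduled first, the ICMP replies are slow.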
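To see which protocols dominate the CPU-bound traffic, the statistics output can be ranked by passed packets. The following is a minimal sketch that assumes the two-line-per-protocol layout seen in this case (a packet row followed by a byte row); the function names are hypothetical, and the sample rows are copied from the slot 1 output above. Note that these counters are cumulative, so they indicate long-term volume rather than the rate at the moment of a slow reply.

```python
import re

# A packet row: type name, passed packets, dropped packets, last drop time or "-".
ROW = re.compile(
    r"^([a-z][a-z0-9-]*)\s+(\d+)\s+(\d+)\s+"
    r"(-|\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$"
)

SAMPLE = """
arp-request                  232496454             4233354  2016-04-09 11:21:47
                           15804222740           287865652
icmp                         287418486                  53  2016-08-23 17:30:00
                           22470675492               44572
vbst                         633733009                   0  -
                           43545519488                   0
vrrp                          79982015               12425  2016-08-12 16:40:00
                            5168775986              795200
"""


def parse_cpu_defend(text):
    """Extract packet rows; byte rows, headers and rulers do not match ROW."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append({
                "type": match.group(1),
                "passed": int(match.group(2)),
                "dropped": int(match.group(3)),
            })
    return rows


def top_by_passed(text, count):
    """Return the `count` packet types with the most passed packets."""
    rows = parse_cpu_defend(text)
    return sorted(rows, key=lambda r: r["passed"], reverse=True)[:count]


if __name__ == "__main__":
    for row in top_by_passed(SAMPLE, 3):
        print(row["type"], row["passed"], row["dropped"])
```

On the sample rows, vbst, icmp and arp-request come out as the three largest contributors, in that order.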

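Root Cause

Fault location and analysis showed that, when a large number of protocol packets are sent to the CPU for processing, the switch handles ICMP packets at a relatively low priority. Because of the CPU scheduling mechanism, ICMP packets are not scheduled first, so ICMP replies are slow and the delay jitter is large. This makes the encryption device's ping probe fail and triggers the active/standby switchover.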
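The effect of low-priority scheduling can be illustrated with a toy strict-priority queue. This is a deliberately simplified model, not a description of how the S9703 actually schedules CPU-bound packets: it assumes one packet is served per tick, that a lower priority number is served first, and that all background protocol packets arrive at the same moment as the ICMP request. All names and numbers are hypothetical.

```python
import heapq


def serve(packets):
    """Serve packets under strict priority, one packet per tick.

    packets: list of (arrival_tick, priority, label) with unique labels;
    a lower priority number is served first.
    Returns {label: ticks from arrival until service completes}.
    """
    pending = sorted(packets)  # ordered by arrival tick
    queue = []                 # heap of (priority, sequence, arrival, label)
    delays = {}
    tick = 0
    index = 0
    sequence = 0               # keeps FIFO order within one priority
    while index < len(pending) or queue:
        # Admit everything that has arrived by the current tick.
        while index < len(pending) and pending[index][0] <= tick:
            arrival, priority, label = pending[index]
            heapq.heappush(queue, (priority, sequence, arrival, label))
            sequence += 1
            index += 1
        if queue:
            _, _, arrival, label = heapq.heappop(queue)
            delays[label] = tick + 1 - arrival
        tick += 1
    return delays


def icmp_delay(background, icmp_priority=5):
    """Delay of one ICMP request competing with `background` packets
    of priority 1 that arrive at the same tick."""
    packets = [(0, 1, f"proto{i}") for i in range(background)]
    packets.append((0, icmp_priority, "icmp"))
    return serve(packets)["icmp"]
```

In this model `icmp_delay(0)` is 1 tick, while `icmp_delay(100)` is 101 ticks: the ICMP request is never dropped, it simply waits behind every higher-priority packet, which matches the observation of slow replies without loss.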

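Solution

Enable the ICMP fast reply function on the switch (this does not affect live-network services). At the same time, increase the probe failure time on the encryption devices; try 15 s and observe the result.

Command to enable ICMP fast reply on the S9703: icmp-reply fast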

Recommendations and Summary
