100.114 NTP alarm not cleared after swact

Bug #1834071 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bin Qian

Bug Description

Brief Description
-----------------
After a host-swact, the 100.114 alarm "NTP configuration does not contain any valid or reachable NTP servers" was raised and had still not cleared by the end of the test suite.

Severity
--------
Minor

Steps to Reproduce
------------------
As described in the Brief Description.

TC-name: mtc/test_swact.py::test_swact_controller_platform

Expected Behavior
------------------
The 100.114 alarm should be cleared.

Actual Behavior
----------------
The 100.114 alarm was still active at the end of the test suite.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Lab-name: WCP_71_75

Branch/Pull Time/Commit
-----------------------
stx master as of 20190623T233000Z

Last Pass
---------
Lab: WCP_71_75
Load: 20190620T013000Z

Timestamp/Logs
--------------
[2019-06-24 10:02:14,234] 268 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-06-24 10:02:15,608] 387 DEBUG MainThread ssh.expect :: Output:

[sysadmin@controller-0 ~(keystone_admin)]$

[2019-06-24 10:02:22,555] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-06-24 10:24:35,412] 268 DEBUG MainThread ssh.send :: Send 'sudo ntpq -pn'
[2019-06-24 10:24:35,528] 387 DEBUG MainThread ssh.expect :: Output:
Password:
[2019-06-24 10:24:35,529] 268 DEBUG MainThread ssh.send :: Send 'Li69nux*'
[2019-06-24 10:24:35,667] 387 DEBUG MainThread ssh.expect :: Output:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.204.4   206.108.0.133    2 u  125  128  366    0.015    4.705   0.397
+209.115.181.102 206.108.0.131    2 u  292 1024  377   43.119    6.158   3.426
+208.81.1.244    200.98.196.212   2 u  271 1024  377   43.886   -2.374   2.927
controller-0:~$

[2019-06-24 10:25:12,100] 268 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-06-24 10:25:13,439] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| UUID                                 | Alarm ID | Reason Text                                                            | Entity ID             | Severity | Time Stamp                 |
+--------------------------------------+----------+------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| e8b116bb-2f8e-4d5c-929d-4bed6dfd4edb | 100.114  | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major    | 2019-06-24T10:02:28.015494 |
+--------------------------------------+----------+------------------------------------------------------------------------+-----------------------+----------+----------------------------+
controller-1:~$

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
summary: - 100.114 alarm "NTP configuration does not contain any valid or reachable
- NTP servers" not cleared after swact
+ 100.114 NTP alarm not cleared after swact
Ghada Khalil (gkhalil) wrote :

Assigning to Kristine to triage

tags: added: stx.config
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Kristine Bujold (kbujold)
status: New → Triaged
Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced with load 20190706T013000Z

Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 gating until further investigation. This bug was seen more than once as per the latest comment above.

tags: added: stx.2.0
Bin Qian (bqian20)
Changed in starlingx:
assignee: Kristine Bujold (kbujold) → Bin Qian (bqian20)
Bin Qian (bqian20) wrote :

ntpq reported that two external servers were reachable, but the peer controller was the selected server. The NTP plugin that monitors the NTP state does not clear the alarm when the peer controller is the selected server.
We need to figure out why ntpd selected the peer controller instead of either of the two external servers.

Bin Qian (bqian20)
Changed in starlingx:
status: Triaged → In Progress
Bin Qian (bqian20) wrote :

ntpd on controller-0 selected the peer as its best NTP server after controller-1 synced to 206.108.0.133, a stratum 1 server (refid .PTP0.), and therefore became a stratum 2 server itself. Because controller-1 has a much better round-trip delay (0.003 ms vs. about 43 ms for the external servers), ntpd selected it. The logs below show controller-0 before the change, then controller-1 and controller-0 after it, when ntpd on controller-0 switched to controller-1 as its reference.

===============
2019-06-24T09:52:27.997 controller-0 collectd[96608]: info NTP query plugin server list: ['0.pool.ntp.org', '1.pool.ntp.org', '2.pool.ntp.org']
2019-06-24T09:52:28.014 controller-0 collectd[96608]: info NTPQ: +192.168.204.4 127.0.0.1 12 u 27 64 376 0.222 4.744 0.874
2019-06-24T09:52:28.014 controller-0 collectd[96608]: info NTPQ: *209.115.181.102 206.108.0.131 2 u 472 512 377 43.039 11.603 5.741
2019-06-24T09:52:28.014 controller-0 collectd[96608]: info NTPQ: +208.81.1.244 200.98.196.212 2 u 437 512 377 43.919 -1.642 8.695

2019-06-24T10:02:26.265 controller-1 collectd[13966]: info NTP query plugin server list: ['0.pool.ntp.org', '1.pool.ntp.org', '2.pool.ntp.org']
2019-06-24T10:02:26.273 controller-1 collectd[13966]: info ptp plugin PTP Service Disabled
2019-06-24T10:02:26.284 controller-1 collectd[13966]: info NTPQ: 192.168.204.3 192.168.204.4 3 u 14 64 377 0.056 -3.783 0.211
2019-06-24T10:02:26.284 controller-1 collectd[13966]: info NTPQ: *206.108.0.133 .PTP0. 1 u 37 64 377 7.866 1.445 0.195
2019-06-24T10:02:26.284 controller-1 collectd[13966]: info NTPQ: +208.81.1.244 200.98.196.212 2 u 33 64 377 43.840 -6.853 0.417
2019-06-24T10:02:26.284 controller-1 collectd[13966]: info NTPQ: +35.183.57.169 128.59.0.245 2 u 31 64 377 10.330 3.485 0.284

2019-06-24T10:02:27.997 controller-0 collectd[96608]: info NTP query plugin server list: ['0.pool.ntp.org', '1.pool.ntp.org', '2.pool.ntp.org']
2019-06-24T10:02:28.015 controller-0 collectd[96608]: info NTPQ: *192.168.204.4 206.108.0.133 2 u 40 64 376 0.003 3.739 0.190
2019-06-24T10:02:28.015 controller-0 collectd[96608]: info NTPQ: +209.115.181.102 206.108.0.131 2 u 5 512 377 43.039 11.603 4.837
2019-06-24T10:02:28.015 controller-0 collectd[96608]: info NTPQ: +208.81.1.244 200.98.196.212 2 u 516 512 377 43.919 -1.642 7.028
================

Our current NTP monitor (ntpq.py) raises the alarm whenever the peer controller is the selected NTP server. As shown above, that is not always correct. ntpq.py should raise the alarm only when the controller is not connected, directly or indirectly, to a reliable NTP source, because only then may the system clock be incorrect.
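
To make the before/after comparison concrete, here is a minimal Python sketch that parses the controller-0 peer rows logged above and reports which peer ntpd selected. It is not the ntpq.py plugin code; the names Peer, parse_peer and selected_peer are made up for this illustration, and it assumes numeric `ntpq -pn` rows with the usual ten columns.

```python
from typing import List, NamedTuple, Optional

# ntpq peer selection markers; '*' marks the selected system peer.
TALLY_CODES = "*+-#ox."


class Peer(NamedTuple):
    tally: str
    remote: str
    refid: str
    stratum: int
    delay_ms: float


def parse_peer(line: str) -> Peer:
    """Parse one peer row of `ntpq -pn` output (header rows must be skipped)."""
    if line[0] in TALLY_CODES:
        tally, rest = line[0], line[1:]
    else:
        tally, rest = " ", line
    fields = rest.split()
    # Columns: remote refid st t when poll reach delay offset jitter
    return Peer(tally, fields[0], fields[1], int(fields[2]), float(fields[7]))


def selected_peer(rows: List[str]) -> Optional[Peer]:
    """Return the peer marked '*', or None if ntpd has selected nothing."""
    return next((p for p in map(parse_peer, rows) if p.tally == "*"), None)


# controller-0 rows copied from the collectd log above.
before = [
    "+192.168.204.4 127.0.0.1 12 u 27 64 376 0.222 4.744 0.874",
    "*209.115.181.102 206.108.0.131 2 u 472 512 377 43.039 11.603 5.741",
    "+208.81.1.244 200.98.196.212 2 u 437 512 377 43.919 -1.642 8.695",
]
after = [
    "*192.168.204.4 206.108.0.133 2 u 40 64 376 0.003 3.739 0.190",
    "+209.115.181.102 206.108.0.131 2 u 5 512 377 43.039 11.603 4.837",
    "+208.81.1.244 200.98.196.212 2 u 516 512 377 43.919 -1.642 7.028",
]

print("before:", selected_peer(before))
print("after: ", selected_peer(after))
```

On this data the selected peer moves from the external server 209.115.181.102 to the peer controller 192.168.204.4, whose stratum dropped from 12 to 2 and whose refid changed from 127.0.0.1 to 206.108.0.133.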

tags: added: stx.regression
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673528

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/673528
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=e5bf093cc8081b58bce14b111cf4cce702835f2c
Submitter: Zuul
Branch: master

commit e5bf093cc8081b58bce14b111cf4cce702835f2c
Author: Bin Qian <email address hidden>
Date: Thu Jul 25 11:52:22 2019 -0400

    Use ntpq refid to tell if peer controller reaches reliable time source

    This is a partial fix only for ipv4.

    The ntpq.py verify if a valid source is the reference of
    peer controller when the peer controller is selected as
    time server.
    This change will avoid raising false alarm when a
    controller uses peer controller as time server while
    the peer uses a reliable time source (e.g, external time
    server, or accurate time device).

    Partial-Bug: 1834071

    Change-Id: I9140e14b79cb09088c8061a06fae22df97526a70
    Signed-off-by: Bin Qian <email address hidden>
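
The idea in this commit message can be sketched roughly as follows. This is not the code merged in review 673528; it is a hedged illustration that assumes IPv4 only, and the function name, the controller_ips parameter and the set of "unsynchronized" refid codes are assumptions made for the example.

```python
import ipaddress
from typing import Set

# Assumption: refid codes treated here as "peer has no real upstream source".
UNSYNCED_REFIDS = {".INIT.", ".STEP.", ".LOCL."}


def peer_has_reliable_reference(refid: str, controller_ips: Set[str]) -> bool:
    """Guess from the peer's refid whether the peer follows a reliable source."""
    if refid in UNSYNCED_REFIDS:
        return False
    try:
        addr = ipaddress.IPv4Address(refid)
    except ValueError:
        # Not an IPv4 address: treat dotted codes such as .PTP0. as a
        # reference clock (an "accurate time device").
        return refid.startswith(".") and refid.endswith(".")
    if addr.is_loopback or addr.is_unspecified:
        # The peer is running on its own local clock.
        return False
    # A refid pointing back at a controller means the two only follow each other.
    return str(addr) not in controller_ips


controllers = {"192.168.204.3", "192.168.204.4"}
print(peer_has_reliable_reference("206.108.0.133", controllers))
print(peer_has_reliable_reference("127.0.0.1", controllers))
```

With the refids from the logs in this report, 206.108.0.133 counts as a reliable reference for the peer, while 127.0.0.1 (the peer running on its local clock) does not.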

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/676742

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/676742
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=6218755c3defc5fbde81b4e59c4be472d3f9c69a
Submitter: Zuul
Branch: master

commit 6218755c3defc5fbde81b4e59c4be472d3f9c69a
Author: Bin Qian <email address hidden>
Date: Wed Aug 14 10:51:12 2019 -0400

    Soften NTP alarm language for syncing with peer

    In IPv6 setup, NTP refid is hash result of reference's IPv6 address.
    In such case, do not try to intepret the refid to tell if the peer
    has a reliable source.

    When the NTP service uses peer controller as reference, the alarm
    is a reminder to the admin user instead of reporting an issue. This
    is a minor alarm.

    Closes-Bug: 1834071
    Change-Id: Ia2770ba7ed77640e58e8c35254a504b57487ff8f
    Signed-off-by: Bin Qian <email address hidden>
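
As background for why the refid cannot be interpreted in an IPv6 setup: RFC 5905 describes the reference ID of an IPv6 source as the first four octets of the MD5 hash of its IPv6 address. The Python sketch below reproduces that calculation; the function name and the example address fd00::4 are hypothetical and not taken from this report.

```python
import hashlib
import ipaddress


def ipv6_refid(address: str) -> str:
    """Return the dotted-quad refid derived from an IPv6 source address."""
    packed = ipaddress.IPv6Address(address).packed  # 16 raw address bytes
    digest = hashlib.md5(packed).digest()
    # Only the first four octets of the hash are kept.
    return ".".join(str(octet) for octet in digest[:4])


# Hypothetical management address of a peer's time source.
print(ipv6_refid("fd00::4"))
```

The result looks like an IPv4 address but is only a truncated hash, so it cannot be mapped back to the source and tells the monitor nothing about whether the peer follows an external server.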

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

@Bin, please cherrypick to the r/stx.2.0 branch before 2019-08-23

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/677250

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/677251

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.2.0)

Reviewed: https://review.opendev.org/677250
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=444b0a08648ad4d4a6bcb8788de9dd42beb7a4ff
Submitter: Zuul
Branch: r/stx.2.0

commit 444b0a08648ad4d4a6bcb8788de9dd42beb7a4ff
Author: Bin Qian <email address hidden>
Date: Thu Jul 25 11:52:22 2019 -0400

    Use ntpq refid to tell if peer controller reaches reliable time source

    This is a partial fix only for ipv4.

    The ntpq.py verify if a valid source is the reference of
    peer controller when the peer controller is selected as
    time server.
    This change will avoid raising false alarm when a
    controller uses peer controller as time server while
    the peer uses a reliable time source (e.g, external time
    server, or accurate time device).

    Partial-Bug: 1834071

    Change-Id: I9140e14b79cb09088c8061a06fae22df97526a70
    Signed-off-by: Bin Qian <email address hidden>

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/677251
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=dc375dc6cf5d7d926383e4a78ce2668dc1de1840
Submitter: Zuul
Branch: r/stx.2.0

commit dc375dc6cf5d7d926383e4a78ce2668dc1de1840
Author: Bin Qian <email address hidden>
Date: Wed Aug 14 10:51:12 2019 -0400

    Soften NTP alarm language for syncing with peer

    In IPv6 setup, NTP refid is hash result of reference's IPv6 address.
    In such case, do not try to intepret the refid to tell if the peer
    has a reliable source.

    When the NTP service uses peer controller as reference, the alarm
    is a reminder to the admin user instead of reporting an issue. This
    is a minor alarm.

    Depends-On: I9140e14b79cb09088c8061a06fae22df97526a70
    Closes-Bug: 1834071
    Change-Id: Ia2770ba7ed77640e58e8c35254a504b57487ff8f
    Signed-off-by: Bin Qian <email address hidden>

Ghada Khalil (gkhalil)
tags: added: in-r-stx20
Paulina Flores (paulina-flores) wrote :

Tested on a Standard system by swacting between controllers and observing the alarms. Alarms "NTP cannot reach external time source; syncing with peer controller only" and "NTP configuration does not contain any valid or reachable NTP servers" both appeared. I waited about twenty minutes after the swact to see whether the alarms would clear, but neither did.

Alarm "NTP cannot reach external time source; syncing with peer controller only" appears as minor in severity, while "NTP configuration does not contain any valid or reachable NTP servers" appears as major.

Bin Qian (bqian20) wrote :

The alarms may be valid. To verify them, please run the command below on both controllers:
ntpq -pn

and check whether the output looks like this:
*192.168.204.4 206.108.0.133 2 u 125 128 366 0.015 4.705 0.397
+209.115.181.102 206.108.0.131 2 u 292 1024 377 43.119 6.158 3.426
+208.81.1.244 200.98.196.212 2 u 271 1024 377 43.886 -2.374 2.927

The line with the leading '*' is the selected NTP time source. If there is no such line, the major alarm "NTP configuration does not contain any valid or reachable NTP servers" is raised.
In IPv4, if a controller has the peer controller as its selected time source and the peer has the above major alarm, the minor alarm "NTP cannot reach external time source; syncing with peer controller only" is raised.
In IPv6, if a controller has the peer controller as its selected time source, the minor alarm "NTP cannot reach external time source; syncing with peer controller only" is raised.
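
The three rules above can be written down as a small decision function. This is only a sketch of the rules as stated in this comment, not the plugin source; the function and parameter names are made up for the illustration.

```python
from typing import Optional

MAJOR_TEXT = "NTP configuration does not contain any valid or reachable NTP servers"
MINOR_TEXT = "NTP cannot reach external time source; syncing with peer controller only"


def expected_alarm(selected: Optional[str], peer_ip: str,
                   ipv6: bool, peer_has_major_alarm: bool) -> Optional[str]:
    """Return 'major', 'minor' or None for one controller.

    selected is the remote on the '*' line of `ntpq -pn`, or None if no
    line carries a '*'.
    """
    if selected is None:
        return "major"   # no selected time source at all
    if selected != peer_ip:
        return None      # synced to something other than the peer controller
    if ipv6 or peer_has_major_alarm:
        return "minor"   # syncing with the peer controller only
    return None          # IPv4, and the peer has a usable source


# controller-0 in the original report: the peer is selected, the peer is healthy.
print(expected_alarm("192.168.204.4", "192.168.204.4", False, False))
# A controller with no '*' line in its ntpq output.
print(expected_alarm(None, "192.168.204.4", False, False))
```

Under these rules the ntpq output from the original report (peer selected, peer synced to 206.108.0.133) should raise no alarm, while a controller with no '*' line gets the major alarm.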

Paulina Flores (paulina-flores) wrote :

Hi, I tried the command and this is the result:

controller-0:~$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.10.56.4      10.10.56.3      13 u   26   64  377    0.110   -0.353   0.825

Bin Qian (bqian20) wrote :

Judging from the ntpq result, the alarms look valid, as there is no selected time source. I am guessing 10.10.56.4 is the mgmt IP of controller-1 and 10.10.56.3 is the mgmt IP of controller-0. I read this as controller-1 using controller-0 as its time source while controller-0 has no valid time source reachable. So controller-0 raises the major alarm and controller-1 raises the minor alarm; both alarms are expected.
