Failed to cast message to Nailgun: RabbitMQ was dead

Bug #1461562 reported by Alexander Kurenyshev
This bug affects 4 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Medium
Assigned to: Vladimir Sharshov

Bug Description

Steps to reproduce:
Scenario:
1. Create a cluster in HA mode with 1 controller
2. Add 1 node with controller role
3. Add 1 node with compute role
4. Deploy cluster

Expected result:
Deployment passes

Actual result:
Deployment fails

In the Astute log:
2015-06-03T07:33:03 info: [663] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"a20646c6-40e0-4d8d-b3a5-c123ce4cda1d", "status"=>"ready", "progress"=>100}}
2015-06-03T07:33:03 warning: [663] Trying to reconnect to message broker. Retry 5 sec later...

In the RabbitMQ log:
=ERROR REPORT==== 3-Jun-2015::07:31:33 ===
closing AMQP connection <0.350.0> (10.109.15.2:39783 -> 10.109.15.2:5672):
{heartbeat_timeout,running}

=INFO REPORT==== 3-Jun-2015::07:33:08 ===
accepting AMQP connection <0.1203.0> (10.109.15.2:40972 -> 10.109.15.2:5672)

Fuel used:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "497"
  build_id: "2015-06-02_16-28-25"
  nailgun_sha: "3830bdcb28ec050eed399fe782cc3dd5fbf31bde"
  python-fuelclient_sha: "4fc55db0265bbf39c369df398b9dc7d6469ba13b"
  astute_sha: "cbae24e9904be2ff8d1d49c0c48d1bdc33574228"
  fuel-library_sha: "d757cd41e4f8273d36ef85b8207e554e5422c5b0"
  fuel-ostf_sha: "f899e16c4ce9a60f94e7128ecde1324ea41d09d4"
  fuelmain_sha: "bcc909ffc5dd5156ba54cae348b6a07c1b607b24"

Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :
Andrey Maximov (maximov)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Matthew Mosesohn (raytrac3r)
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

According to the logs, Astute keeps casting messages to the deploy_resp topic in RabbitMQ, but they never get collected by receiverd.
One suggestion would be to kill receiverd if it can't connect (so that the supervisor restarts it).
Note that there are only a couple of messages received on provision_resp, and no deploy_resp messages at all.
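
For illustration only (not Astute's actual code, which at the time used the amqp gem): "casting a message to Nailgun" boils down to publishing a JSON payload like the one in the Astute log above to the queue that Nailgun's receiverd consumes. A minimal Ruby sketch with the Bunny client, assuming the queue name "nailgun" (the :broker_publisher_queue value in the Astute settings pasted further down in this report); connection credentials are placeholders:

require "bunny"
require "json"

conn = Bunny.new("amqp://naily:secret@10.20.0.2:5672")  # placeholder credentials
conn.start
channel = conn.create_channel

message = {
  "method" => "deploy_resp",
  "args"   => { "task_uuid" => "a20646c6-40e0-4d8d-b3a5-c123ce4cda1d",
                "status"    => "ready", "progress" => 100 }
}

# Declare the report queue (default options assumed) and publish. If receiverd
# never consumes from it, messages like this simply pile up in the queue.
channel.queue("nailgun")
channel.default_exchange.publish(JSON.dump(message), routing_key: "nailgun")
conn.close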

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Fuel Python Team (fuel-python)
tags: added: scale
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dmitry Ilyin (idv1985)
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

This bug is not about scale. Nastya will create a new bug report for the scale story. Hiding scale comments.

Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Fuel Python Team (fuel-python)
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Igor Kalnitsky (ikalnitsky)
Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

I have no idea what's really going on. Nailgun's receiver has reconnection code; moreover, in case of a reconnection or an error it should print an error message to the logs. Still, the logs are clean, and it looks like the receiver is waiting for messages as usual.

From Astute's point of view, we see some AMQP connection problems, but then everything goes fine.

If we take a closer look at Astute's log, we find unusual messages: a huge run of consecutive "casting message" calls. Normally there should be no consecutive calls: we execute some task and then cast a message to Nailgun with the updated progress.

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Actually, I saw a similar unusual Astute log in another bug.

https://bugs.launchpad.net/fuel/+bug/1459289

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Do we know how to reproduce it?

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

No, we don't. This is a floating bug, and it seems impossible to fix without a "bad" env where you can try to debug the issue.

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Since I can't say anything more, I'm moving it to Incomplete.

P.S.: If someone catches this issue, PLEASE do not destroy the env; give it to me so I can debug it with strace/gdb/etc.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Moving it to 7.0, because it's incredibly rare and we don't know the repro steps.

Changed in fuel:
milestone: 6.1 → 7.0
Revision history for this message
vikas (vd44690) wrote :

I got the same error; how can I resolve it? Attaching the Astute log below.

/var/log/docker-logs/astute/astute.log

# Logfile created on 2015-07-05 13:12:42 +0000 by logger.rb/44203
2015-07-05T13:12:42 info: [514] Starting with settings
--- !ruby/hash:SymbolTable
:puppet_timeout: 5400
:puppet_deploy_interval: 2
:puppet_fade_timeout: 120
:mc_retries: 10
:mc_retry_interval: 1
:puppet_fade_interval: 30
:provisioning_timeout: 5400
:reboot_timeout: 240
:dump_timeout: 3600
:keys_src_dir: "/var/lib/fuel/keys"
:puppet_ssh_keys:
- neutron
- nova
- ceph
- mysql
:puppet_keys:
- mongodb
:keys_dst_dir: "/var/lib/astute"
:max_nodes_per_call: 50
:max_nodes_to_provision: 50
:ssh_retries: 5
:ssh_retry_timeout: 30
:max_nodes_per_remove_call: 10
:nodes_remove_interval: 10
:max_nodes_net_validation: 10
:dhcp_repeat: 3
:iops: 54.25
:splay_factor: 180
:agent_nodiscover_file: "/etc/nailgun-agent/nodiscover"
:broker_host: 10.20.0.2
:broker_port: 5672
:broker_rest_api_port: 15672
:broker_username: naily
:broker_password: aBukMmzx
:broker_service_exchange: naily_service
:broker_queue: naily
:broker_publisher_queue: nailgun
:broker_exchange: nailgun

2015-07-05T13:12:42 info: [514] => Starting astute with 7 worker(s)
2015-07-05T13:12:42 info: [514] master process ready
2015-07-05T13:12:42 info: [516] => Starting worker (PID: 516, ID: 0)
2015-07-05T13:12:42 info: [516] Worker initialization
2015-07-05T13:12:42 info: [520] => Starting worker (PID: 520, ID: 1)
2015-07-05T13:12:42 info: [520] Worker initialization
2015-07-05T13:12:42 info: [523] => Starting worker (PID: 523, ID: 2)
2015-07-05T13:12:42 info: [523] Worker initialization
2015-07-05T13:12:42 info: [527] => Starting worker (PID: 527, ID: 3)
2015-07-05T13:12:42 info: [527] Worker initialization
2015-07-05T13:12:42 info: [534] => Starting worker (PID: 534, ID: 5)
2015-07-05T13:12:42 info: [534] Worker initialization
2015-07-05T13:12:42 info: [531] => Starting worker (PID: 531, ID: 4)
2015-07-05T13:12:42 info: [531] Worker initialization
2015-07-05T13:12:42 info: [538] => Starting worker (PID: 538, ID: 6)
2015-07-05T13:12:42 info: [538] Worker initialization
2015-07-05T20:59:49 info: [493] Starting with settings
--- !ruby/hash:SymbolTable
:puppet_timeout: 5400
:puppet_deploy_interval: 2
:puppet_fade_timeout: 120
:mc_retries: 10
:mc_retry_interval: 1
:puppet_fade_interval: 30
:provisioning_timeout: 5400
:reboot_timeout: 240
:dump_timeout: 3600
:keys_src_dir: "/var/lib/fuel/keys"
:puppet_ssh_keys:
- neutron
- nova
- ceph
- mysql
:puppet_keys:
- mongodb
:keys_dst_dir: "/var/lib/astute"
:max_nodes_per_call: 50
:max_nodes_to_provision: 50
:ssh_retries: 5
:ssh_retry_timeout: 30
:max_nodes_per_remove_call: 10
:nodes_remove_interval: 10
:max_nodes_net_validation: 10
:dhcp_repeat: 3
:iops: 222.28
:splay_factor: 180
:agent_nodiscover_file: "/etc/nailgun-agent/nodiscover"
:broker_host: 10.20.0.2
:broker_port: 5672
:broker_rest_api_port: 15672
:broker_username: naily
:broker_password: aBukMmzx
:broker_service_exchange: naily_service
:broker_queue: naily
:broker_publisher_queue: nailgun
:broker_exchange: nailgun

2015-07-05T20:59:49 info: [493] => Starting astute with 7 worker(s)
2015-07-05T20:59:49 info: [493] master process...

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Looks like the same problem was seen here: https://bugs.launchpad.net/fuel/6.1.x/+bug/1467320

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

I have reproduced this bug on

[root@nailgun ~]# curl -sS http://10.108.0.2:8000/api/version/ | python -m json.tool
{
    "api": "1.0",
    "astute_sha": "e63709d16bd4c1949bef820ac336c9393c040d25",
    "auth_required": true,
    "build_id": "263",
    "build_number": "263",
    "feature_groups": [
        "mirantis"
    ],
    "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
    "fuel-library_sha": "1556601b9b7503285714d7d1e02cc0807b1c68f0",
    "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
    "fuel-ostf_sha": "582a81ccaa1e439a3aec4b8b8f6994735de840f4",
    "fuelmain_sha": "4dc6799370da4cddf06c04e4ecb7646102298535",
    "nailgun_sha": "b564ae20116297750bf6402b3a017e219bf4b468",
    "openstack_version": "2015.1.0-7.0",
    "production": "docker",
    "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
    "release": "7.0",
    "release_versions": {
        "2015.1.0-7.0": {
            "VERSION": {
                "api": "1.0",
                "astute_sha": "e63709d16bd4c1949bef820ac336c9393c040d25",
                "build_id": "263",
                "build_number": "263",
                "feature_groups": [
                    "mirantis"
                ],
                "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
                "fuel-library_sha": "1556601b9b7503285714d7d1e02cc0807b1c68f0",
                "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
                "fuel-ostf_sha": "582a81ccaa1e439a3aec4b8b8f6994735de840f4",
                "fuelmain_sha": "4dc6799370da4cddf06c04e4ecb7646102298535",
                "nailgun_sha": "b564ae20116297750bf6402b3a017e219bf4b468",
                "openstack_version": "2015.1.0-7.0",
                "production": "docker",
                "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
                "release": "7.0"
            }
        }
    }
}

Changed in fuel:
status: Incomplete → Confirmed
importance: High → Critical
tags: added: module-astute rabbitmq
removed: scale
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Also reproduced in this report: https://bugs.launchpad.net/fuel/+bug/1491375

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Bug https://bugs.launchpad.net/fuel/+bug/1491375 is reproduced many times, on almost half of the 7.0 swarm jobs.

For example, CI job:
https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.ha_destructive_ceph_neutron/45/console

##### rabbitmq log:

=ERROR REPORT==== 3-Sep-2015::04:25:24 ===
closing AMQP connection <0.345.0> (10.109.0.2:40620 -> 10.109.0.2:5672):
{heartbeat_timeout,running}

=INFO REPORT==== 3-Sep-2015::04:25:31 ===
accepting AMQP connection <0.2438.0> (10.109.0.2:44064 -> 10.109.0.2:5672)
################

This particular issue was most probably caused by the huge size of a message that was cast from Astute to Nailgun.

Scenario of the issue:

1) Deploy cluster

2) When the waiting timeout was reached in the system test (cluster deployment was still in progress!), the system test started a task to make a diagnostic snapshot (time: 04:20)

3) The diagnostic snapshot was successfully created, and then the message was cast to Nailgun (>4 MB in size, see the attachment) (time: 04:25:20)

4) RabbitMQ closed the connection (time: 04:25:24)

5) Astute, while processing the deployment that was still in progress, reconnected to RabbitMQ and continued the deployment:
## astute.log
2015-09-03T04:25:26 warning: [679] Trying to reconnect to message broker. Retry 5 sec later...

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

For details, the /var/log folder from the job with the failed diagnostic snapshot is attached here.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/220146

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Well, today I tried to test Astute's connection error handling, and it looks good. I tried to:

* close the Astute->RabbitMQ socket using gdb while Astute was processing a message from Nailgun (before sending back the result)
* drop outgoing traffic

In both cases Astute tried to reconnect to the message broker almost immediately.

So, I've prepared a patch that introduces more verbose logging https://review.openstack.org/#/c/220146/1 and I'm going to build an ISO and ask QA to help me reproduce this issue.

I also want to note that perhaps we started facing this issue because Astute started using almost 100% CPU. For instance, Dennis' case could be:

* Astute sends a message to RabbitMQ (4 MB, huge enough)
* the CPU is busy with other workers
* sending the message therefore takes some time
* there are no heartbeats from Astute to RabbitMQ (because it's busy sending the message)
* RabbitMQ closes the connection
* Astute reconnects, but afterwards it fails to resend the original message, and there are apparently no retries (see the sketch below)
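
A minimal sketch (not the actual Astute implementation, which at that time still used the amqp gem) of what "resend instead of drop" could look like with the Bunny client. The queue name and the 5-second pause mirror the Astute settings and the "Retry 5 sec later..." log line quoted earlier; everything else is an assumption:

require "bunny"
require "json"

# Publish a report to Nailgun's queue, retrying a few times if the broker
# connection drops mid-publish instead of silently losing the message.
def cast_to_nailgun(channel, payload, retries: 3)
  attempts = 0
  begin
    channel.default_exchange.publish(JSON.dump(payload), routing_key: "nailgun")
  rescue Bunny::Exception, IOError
    attempts += 1
    raise if attempts > retries
    sleep 5   # matches the "Retry 5 sec later..." interval in the Astute log
    retry     # republish once the (auto-recovered) connection is back
  end
end

conn = Bunny.new("amqp://naily:secret@10.20.0.2:5672",   # placeholder credentials
                 automatically_recover: true)
conn.start
cast_to_nailgun(conn.create_channel,
                { "method" => "deploy_resp",
                  "args"   => { "status" => "ready", "progress" => 100 } })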

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/220196

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

OK, Dennis' case is the following:

* Astute takes 100% CPU usage (due to bug #1491794)
* Astute is sending a huge message, and it takes quite a while due to the heavy load.

As a result, Astute is unable to respond to RabbitMQ's heartbeats, so the connection is dropped and the message is lost.

That means we need to remove heartbeats and implement resending the message in such cases.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/220196
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=c9b54daf1e60fdf528dc02818bc113f1cf78c11a
Submitter: Jenkins
Branch: master

commit c9b54daf1e60fdf528dc02818bc113f1cf78c11a
Author: Igor Kalnitsky <email address hidden>
Date: Thu Sep 3 15:22:03 2015 +0000

    Revert "Specify heartbeat for connection to RabbitMQ"

    Unfortunately, such a small timeout for heartbeats causes a serious
    regression in Astute <-> RabbitMQ connectivity.

    Basically, if Astute is sending a huge message to RabbitMQ and for some
    reason we aren't able to do it within 2 seconds (the heartbeat timeout),
    the connection will be dropped by the RabbitMQ server.

    In fact, that's what happened on SWARM while

    * Astute takes almost 100% CPU usage (bug #1491794)
    * Astute is sending the diagnostic snapshot result message to Nailgun
      (more than 4 MB of data).

    Since heartbeats don't solve the problem we tried to fix, let's
    revert the original commit e24ca066bf6160bc1e419aaa5d486cad1aaa937d.

    Change-Id: I8734541c026b86506a6f735b1eb55371b393921c
    Related-Bug: #1461562
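
For context (this is not the reverted Astute code, just an illustration with the Bunny client): the heartbeat timeout is an ordinary connection option, and a 2-second value is tight when a multi-megabyte report has to be pushed out by a CPU-starved process:

require "bunny"

# With heartbeat: 2 the client has to get a frame onto the wire roughly every
# two seconds; a slow multi-MB publish from a busy process can miss that window,
# and the broker then closes the connection with {heartbeat_timeout,running}.
conn = Bunny.new("amqp://naily:secret@10.20.0.2:5672",   # placeholder credentials
                 heartbeat: 2)                            # 0 disables heartbeats
conn.start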

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Using patch https://review.openstack.org/#/c/220146/ on ISO #265, I tried to make a diagnostic snapshot from the CLI: 'fuel snapshot'.

Here is the result (lines with a lot of data are truncated for readability): http://paste.openstack.org/show/444537/

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

OK, according to Sergii G. and Dennis D., the two reverts helped us with this issue. If there are no more reproductions, I'd close this issue or move it to 8.0 (because we faced it before the two reverted patches).

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: Critical → Medium
status: In Progress → Confirmed
assignee: Igor Kalnitsky (ikalnitsky) → Fuel Python Team (fuel-python)
milestone: 7.0 → 8.0
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Arthur Svechnikov (asvechnikov) wrote :
Changed in fuel:
importance: Medium → High
Revision history for this message
Mike Scherbakov (mihgen) wrote :

If the issue occurs once in a thousand installations (and we should do no less than that in a month's timeframe), why do we think it's High? Arthur, can you please justify the importance of this bug?

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

This bug is connected with this fix: https://review.openstack.org/#/c/234665/

Dmitry Pyzhov (dpyzhov)
tags: added: team-enh
tags: added: team-enhancements
removed: team-enh
Revision history for this message
Arthur Svechnikov (asvechnikov) wrote :

It seems I had incorrect statistics about the bug frequency. I'm moving it back to Medium due to the rare reproduction.

Changed in fuel:
importance: High → Medium
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/234665
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=b60624ee2c5f1d6d805619b6c27965a973508da1
Submitter: Jenkins
Branch: master

commit b60624ee2c5f1d6d805619b6c27965a973508da1
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Oct 12 19:25:00 2015 +0300

    Move from amqp-gem to bunny

    Differences:

    - a separate, independent channel for outgoing reports;
    - a solid way to redeclare already existing queues;
    - auto-recovery mode in case of network problems, by default;
    - a more solid, modern, and simple library for AMQP.

    Also:

    - implement an asynchronous logger for event callbacks.

    A few words from both gems' authors:

    amqp gem brings in a fair share of EventMachine complexity which
    cannot be fully eliminated. Event loop blocking, writes that
    happen at the end of loop tick, uncaught exceptions in event
    loop silently killing it: it's not worth the pain unless
    you've already deeply invested in EventMachine and
    understand how it works.

    Closes-Bug: #1498847
    Closes-Bug: #1487397
    Closes-Bug: #1461562
    Related-Bug: #1485895
    Related-Bug: #1483182

    Change-Id: I52d005498ccb978ada158bfa64b1c7de1a24e9b0
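
A minimal sketch of the Bunny features the commit message lists: automatic network recovery and a dedicated channel for outgoing reports. Illustrative only, not the Astute code itself; credentials are placeholders and the queue name is the :broker_publisher_queue value from the Astute settings quoted earlier:

require "bunny"

conn = Bunny.new("amqp://naily:secret@10.20.0.2:5672",
                 automatically_recover:     true,   # reconnect after network failures
                 network_recovery_interval: 5)      # seconds between recovery attempts
conn.start

# Separate, independent channels: one for consuming tasks, one for publishing reports.
task_channel   = conn.create_channel
report_channel = conn.create_channel

# Declaring a queue that already exists with the same options is idempotent,
# which gives a solid way to "redeclare" queues on startup or after recovery.
report_queue = report_channel.queue("nailgun")

report_channel.default_exchange.publish('{"method":"deploy_resp"}',
                                        routing_key: report_queue.name)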

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (master)

Change abandoned by Vladimir Sharshov (<email address hidden>) on branch: master
Review: https://review.openstack.org/220146
Reason: We moved from ruby-amqp to bunny, so we need to refactor this change. I've created a new bug for that purpose: https://bugs.launchpad.net/fuel/+bug/1521816

tags: added: on-verification
Revision history for this message
Sergey Novikov (snovikov) wrote :

Verified on MOS 8.0 RC#1.

I executed the steps to reproduce described above several times, and the bug was not reproduced. I also collected information about the CPU workload, and there are no issues.

Changed in fuel:
status: Fix Committed → Fix Released