ceph-radosgw restart fails

Bug #1477225 reported by Andreas Hasenack
This bug affects 4 people
Affects                 Status         Importance   Assigned to   Milestone
ceph (Ubuntu)           Fix Released   High         James Page
  Trusty                Fix Released   High         Liam Young
  Vivid                 Fix Released   High         James Page
  Wily                  Fix Released   High         James Page

Bug Description

Upstream Bug: http://tracker.ceph.com/issues/11140

[Impact]

On 14.04 the restart target of the sysvinit script brings the service down
but sometimes fails to bring it back up again. There is a race between stop and start: in the failure case the attempt to bring the service back up runs before the old daemon has finished stopping, so the start sees the service as already running and the start command is never actually issued.

The proposed fix updates /etc/init.d/radosgw so that the stop target
waits for up to 30 seconds for the service to stop cleanly.
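
For illustration, a minimal sketch of the kind of wait loop the fix adds to the
stop path of /etc/init.d/radosgw; the pidfile location and variable names here
are assumptions made for the sketch, not the exact upstream patch:

    # Sketch only: wait up to 30s for the old radosgw daemon to exit before
    # the stop target returns, so a following start cannot race against it.
    pid=$(cat /var/run/ceph/radosgw.pid 2>/dev/null)   # assumed pidfile path
    if [ -n "$pid" ]; then
        kill "$pid" 2>/dev/null
        timeout=30
        while [ "$timeout" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
            sleep 1
            timeout=$((timeout - 1))
        done
    fi

With something like this in place, 'service radosgw restart' only reaches the
start step once the old process is really gone (or the 30s grace period has
elapsed), which is what the test case below exercises.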

[Test Case]

Bundle:

openstack-services:
  services:
    mysql:
      branch: lp:~openstack-charmers/charms/trusty/percona-cluster/next
      constraints: mem=1G
      options:
        dataset-size: 50%
    ceph:
      branch: lp:~openstack-charmers/charms/trusty/ceph/next
      num_units: 3
      constraints: mem=1G
      options:
        monitor-count: 3
        fsid: 6547bd3e-1397-11e2-82e5-53567c8d32dc
        monitor-secret: AQCXrnZQwI7KGBAAiPofmKEXKxu5bUzoYLVkbQ==
        osd-devices: /dev/vdb
        osd-reformat: "yes"
        ephemeral-unmount: /mnt
    keystone:
      branch: lp:~openstack-charmers/charms/trusty/keystone/next
      constraints: mem=1G
      options:
        admin-password: openstack
        admin-token: ubuntutesting
    ceph-radosgw:
      branch: lp:~openstack-charmers/charms/trusty/ceph-radosgw/next
      options:
        use-embedded-webserver: True
  relations:
    - [ keystone, mysql ]
    - [ ceph-radosgw, keystone ]
    - [ ceph-radosgw, ceph ]
# kilo
trusty-kilo:
  inherits: openstack-services
  series: trusty
  overrides:
    openstack-origin: cloud:trusty-kilo
    source: cloud:trusty-kilo
trusty-icehouse:
  inherits: openstack-services
  series: trusty

$ juju-deployer -c next.yaml trusty-icehouse
$ juju ssh ceph-radosgw/0
$ sudo su -
# service radosgw status
/usr/bin/radosgw is running.
# service radosgw restart
Starting client.radosgw.gateway...
/usr/bin/radosgw already running.
/usr/bin/radosgw is running.
# service radosgw status
/usr/bin/radosgw is not running.
# apt-cache policy radosgw
radosgw:
  Installed: 0.80.10-0ubuntu0.14.04.1
  Candidate: 0.80.10-0ubuntu0.14.04.1
  Version table:
 *** 0.80.10-0ubuntu0.14.04.1 0
        500 http://nova.clouds.archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     0.79-0ubuntu1 0
        500 http://nova.clouds.archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages
root@juju-lytrusty-machine-4:~#

[Regression Potential]

 * The only change in behaviour introduced by this fix is that running the
   stop target in the init script will wait for up to 30s before exiting
   rather than returning immediately. I cannot think of any use case where
   this would be an issue.

[Original Bug Report]
job handler:
Jul 22 16:03:44 job-handler-1 ERR Failed to execute job: PUT request for http://10.96.4.129:80/swift/v1/simplestreams failed with code 500 Internal Server Error: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n <email address hidden> to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n</body></html>\n'#012Traceback (most recent call last):#012 File "/opt/canonical/landscape/canonical/landscape/model/activity/jobrunner.py", line 38, in run#012 yield self._run_activity(account_id, activity_id)#012HTTPError: PUT request for http://10.96.4.129:80/swift/v1/simplestreams failed with code 500 Internal Server Error: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n <email address hidden> to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n</body></html>\n'

Other logs attached.

Andreas Hasenack (ahasenack)
tags: removed: kanban
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

ceph-radosgw just died. Last log entries from /var/log/ceph/radosgw.log:
2015-07-22 15:01:33.303237 7f46bd7fa700 1 handle_sigterm
2015-07-22 15:01:33.396803 7f46e14aa7c0 1 final shutdown

And nothing after that. Landscape got the first error at 15:03:57, and failed continuously until the end.

I logged in on the unit, and there was no radosgw process running. I started one by running the contents of /var/www/s3gw.fcgi:
exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway

And then it worked.

The object-internal-error.tar.xz has the inner logs in landscape-0-inner-logs/. You can find the /var/log contents from the ceph-radosgw/0 unit in landscape-0-inner-logs/ceph-radosgw-0/var/log/ for example.

summary: - Internal server error when uploading to object store (ceph-radosgw)
+ ceph-radosgw died during deployment
information type: Proprietary → Public
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Re: ceph-radosgw died during deployment

Changing project to the ceph-radosgw charm

affects: landscape → ceph-radosgw (Juju Charms Collection)
Revision history for this message
Nobuto Murata (nobuto) wrote :

FWIW, I'm also getting frequent 500s with 'FastCGI: incomplete headers (0 bytes) received from server "/var/www/s3gw.fcgi"'. After doing `juju set ceph-radosgw use-embedded-webserver=true` (i.e. bypassing Apache + mod-fastcgi), the issue went away.

I'm using cloud:trusty-kilo.

Revision history for this message
Alberto Donato (ack) wrote :

I had a similar issue with a ceph/ceph OSA deploy using current stable charms (specifically, cs:trusty/ceph-radosgw-15).

The autopilot fails while trying to upload simplestreams:

Aug 13 16:02:32 job-handler-1 INFO PUT http://10.1.48.88:80/swift/v1/simplestreams headers={'X-Container-Read': '.r:*'} auth_retry_attempts=0 blind_retry_attempts=0

Last entry in radosgw.log shows the server was stopped:

2015-08-13 15:39:21.500670 7f09d10b47c0 0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process radosgw, pid 9937
2015-08-13 15:39:24.407231 7f09d10b47c0 0 framework: civetweb
2015-08-13 15:39:24.407246 7f09d10b47c0 0 framework conf key: port, val: 70
2015-08-13 15:39:24.407270 7f09d10b47c0 0 starting handler: civetweb
2015-08-13 15:39:28.187979 7f09ad7fa700 -1 failed to list objects pool_iterate returned r=-2
2015-08-13 15:39:28.187990 7f09ad7fa700 0 ERROR: lists_keys_next(): ret=-2
2015-08-13 15:39:28.187995 7f09ad7fa700 0 ERROR: sync_all_users() returned ret=-2
2015-08-13 15:40:19.341212 7f09acff9700 1 handle_sigterm
2015-08-13 15:40:19.341248 7f09acff9700 1 handle_sigterm set alarm for 120
2015-08-13 15:40:19.341251 7f09d10b47c0 -1 shutting down
2015-08-13 15:40:19.458224 7f09acff9700 1 handle_sigterm
2015-08-13 15:40:19.458252 7f09acff9700 1 handle_sigterm set alarm for 120
2015-08-13 15:40:20.046138 7f09d10b47c0 1 final shutdown

Nobuto Murata (nobuto)
tags: added: cpec
Revision history for this message
Liam Young (gnuoy) wrote :

This is not a charm bug. It looks like an issue in the radosgw init script:

# service radosgw status
/usr/bin/radosgw is not running.
# service radosgw start
Starting client.radosgw.gateway...
/usr/bin/radosgw is running.
# service radosgw status
/usr/bin/radosgw is running.
# service radosgw restart
Starting client.radosgw.gateway...
/usr/bin/radosgw already running.
/usr/bin/radosgw is running.
# service radosgw status
/usr/bin/radosgw is not running.

Changed in ceph-radosgw (Juju Charms Collection):
status: New → Invalid
Liam Young (gnuoy)
summary: - ceph-radosgw died during deployment
+ ceph-radosgw restart fails
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu):
status: New → Confirmed
Liam Young (gnuoy)
affects: ceph-radosgw (Ubuntu) → ceph (Ubuntu)
Changed in ceph (Ubuntu):
status: New → Confirmed
Liam Young (gnuoy)
description: updated
description: updated
description: updated
James Page (james-page)
Changed in ceph (Ubuntu Wily):
status: Confirmed → Fix Released
Changed in ceph (Ubuntu Trusty):
status: New → Triaged
importance: Undecided → High
Changed in ceph (Ubuntu Wily):
importance: Undecided → High
Liam Young (gnuoy)
description: updated
Liam Young (gnuoy)
description: updated
Liam Young (gnuoy)
description: updated
James Page (james-page)
Changed in ceph (Ubuntu Wily):
status: Fix Released → Triaged
Changed in ceph (Ubuntu Vivid):
status: New → Triaged
importance: Undecided → High
James Page (james-page)
Changed in ceph (Ubuntu Wily):
assignee: nobody → James Page (james-page)
Changed in ceph (Ubuntu Vivid):
assignee: nobody → James Page (james-page)
Changed in ceph (Ubuntu Trusty):
assignee: nobody → Liam Young (gnuoy)
status: Triaged → In Progress
Changed in ceph (Ubuntu Vivid):
status: Triaged → In Progress
Changed in ceph (Ubuntu Wily):
status: Triaged → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 0.94.3-0ubuntu2

---------------
ceph (0.94.3-0ubuntu2) wily; urgency=medium

  * d/ceph.install: Drop ceph-deploy manpage from packaging, provided
    by ceph-deploy itself (LP: #1475910).

 -- James Page <email address hidden> Mon, 07 Sep 2015 14:42:03 +0100

Changed in ceph (Ubuntu Wily):
status: In Progress → Fix Released
tags: added: landscape-release-29
Revision history for this message
Chad Smith (chad.smith) wrote :

Will need to confirm once we have 0.94.3-0ubuntu2 available for deployment.

lp:1468335 seems very likely related.

Revision history for this message
Chris J Arges (arges) wrote :

This is blocked in the unapproved queue because bug 1475247 and bug 1477174 have not yet been verified. Please test those bugs first.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Andreas, or anyone else affected,

Accepted ceph into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/0.94.3-0ubuntu0.15.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ceph (Ubuntu Vivid):
status: In Progress → Fix Committed
tags: added: verification-needed
David Britton (dpb)
tags: added: kanban-cross-team
David Britton (dpb)
tags: removed: landscape-release-29
Revision history for this message
Chris J Arges (arges) wrote :

Hello Andreas, or anyone else affected,

Accepted ceph into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/0.80.10-0ubuntu1.14.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
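
For anyone verifying on trusty, a rough sketch of one way to enable -proposed and exercise the failing restart path (the mirror URL, file name and loop count below are illustrative only, not part of the official SRU procedure):

    # Illustrative only: enable trusty-proposed, pull in the fixed radosgw,
    # then restart it repeatedly to check the stop/start race is gone.
    echo "deb http://archive.ubuntu.com/ubuntu trusty-proposed main" | \
        sudo tee /etc/apt/sources.list.d/trusty-proposed.list
    sudo apt-get update
    sudo apt-get install radosgw
    for i in 1 2 3 4 5; do
        sudo service radosgw restart
        sleep 2
        sudo service radosgw status
    done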

Changed in ceph (Ubuntu Trusty):
status: In Progress → Fix Committed
Changed in ceph (Ubuntu Vivid):
status: Fix Committed → Fix Released
James Page (james-page)
Changed in ceph (Ubuntu Vivid):
status: Fix Released → Fix Committed
Revision history for this message
James Page (james-page) wrote :

Tested from trusty proposed - restarts of radosgw are reliable post upgrade.

tags: added: verification-done verification-needed-vivid
removed: verification-needed
Revision history for this message
James Page (james-page) wrote :

Also verified OK on vivid - restarts under systemd are now consistent.

tags: removed: verification-needed-vivid
Revision history for this message
Free Ekanayaka (free.ekanayaka) wrote :

@James: is there a plan to upload the fix to the kilo/liberty trusty cloud archive too? That'd be the only way the Landscape OpenStack Autopilot could get it, I think.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 0.80.10-0ubuntu1.14.04.3

---------------
ceph (0.80.10-0ubuntu1.14.04.3) trusty; urgency=medium

  * d/p/ceph-radosgw-init.patch: Cherry pick patch from upstream VCS to
    ensure that restarts of the radosgw wait an appropriate amount of time
    for the existing daemon to shutdown (LP: #1477225).

ceph (0.80.10-0ubuntu1.14.04.2) trusty; urgency=medium

  * Switch to two step 'zapping' of disks, ensuring that disks with invalid
    metadata don't cause hangs and are fully cleaned and initialized prior
    to use (LP: #1475247).

 -- Liam Young <email address hidden> Mon, 07 Sep 2015 16:00:31 +0100

Changed in ceph (Ubuntu Trusty):
status: Fix Committed → Fix Released
Revision history for this message
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 0.94.3-0ubuntu0.15.04.1

---------------
ceph (0.94.3-0ubuntu0.15.04.1) vivid; urgency=medium

  [ James Page ]
  * New upstream point release (LP: #1492227).
  * d/ceph.install: Drop ceph-deploy manpage from packaging, provided
    by ceph-deploy itself (LP: #1475910).

  [ Liam Young ]
  * d/p/ceph-radosgw-init.patch: Cherry pick patch from upstream VCS to
    ensure that restarts of the radosgw wait an appropriate amount of time
    for the existing daemon to shutdown (LP: #1477225).

 -- James Page <email address hidden> Mon, 07 Sep 2015 16:01:46 +0100

Changed in ceph (Ubuntu Vivid):
status: Fix Committed → Fix Released
Mathew Hodson (mhodson)
affects: ceph-radosgw (Juju Charms Collection) → ubuntu-translations
no longer affects: ubuntu-translations