local provider is very slow to tranistion from agent-status: pending

Bug #1322302 reported by Daniel Westervelt
20
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
High
Katherine Cox-Buday
juju-core (Ubuntu)
Fix Released
High
Unassigned

Bug Description

ubuntu@trusty-installer:~$ juju version
1.18.3-trusty-amd64
ubuntu@trusty-installer:~$ dpkg -l juju-core
ii juju-core 1.18.3-0ubuntu1~1 amd64 Juju is devops distilled - client
ubuntu@trusty-installer:~$ dpkg -l juju-local
ii juju-local 1.18.3-0ubuntu1~1 all dependency package for the Juju local provider

juju add-machine --constraints "mem=3G cpu-cores=3 root-disk=20G"
has a huge delay in agent-state pending especially when the kvm is up and running in fairly short order

kvm comes up in ~1 minutes, agent-status is stuck in pending for at least another 8-9 minutes after that

<hazmat> danwest, its normally async to actually coming up based on pinger/heartbeat.. but that length of time sounds like a worst case..

Revision history for this message
Daniel Westervelt (danwest) wrote :
Revision history for this message
Daniel Westervelt (danwest) wrote :

machnice log for kvm created by juju

ubuntu@ubuntu:/var/log/juju$ pastebinit machine-1.log
http://paste.ubuntu.com/7502559/

Revision history for this message
Daniel Westervelt (danwest) wrote :

machine-1.log (log for kvm created by juju - hard copy of pastebin from previous comment)

Nate Finch (natefinch)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Ian Booth (wallyworld) wrote :

When Juju uses LXC containers, the very first container is always slow to come up because it needs to download the image. We added clone support to make subsequent containers start much faster.

There are a couple of options to make the initial template download faster.
1. Juju supports an http-proxy environment config option (as well as apt-http-proxy).
https://juju.ubuntu.com/docs/howto-proxies.html
These proxy settings are propagated to the shell which calls lxc-create, which in turn calls wget to fetch the image.
So if you were to use a squid cache of some sort to provide the lxc images, and set up the proxy config, that would be a solution.

2. Have the Orange boxes configured to use a transparent proxy.
I believe Kapil is talking to Dustin about this option.

Having said all that, we also need to improve juju status to show more than just Pending. That's on the todo list.

If you do in fact have a cache all set up and it's still really slow, then we need to look deeper. The status changes from pending to started when the machine comes up and phones home. That can take a little time based on how fast the machine starts.

Can you advise how it behaves once you have a proxy all set up?

Revision history for this message
Daniel Westervelt (danwest) wrote :

It is KVM, not LXC and the image download is not the problem as the VM is up and running within one minute. I have even logged into it directly to verify . The problem is that juju sits in an agent-status: pending for up to nine minutes after that.

Revision history for this message
Daniel Westervelt (danwest) wrote :

another run, same issue but this time with debug level logging (attached)

Revision history for this message
Daniel Westervelt (danwest) wrote :

Also, in order to get rid of the one minute image download overhead I have done runs where I first synced the image locally with:

uvt-simplestreams-libvirt sync release=trusty arch=amd64

this increased the speed to which the KVM started but still had the post KVM start delay of 8-9 minutes.

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

Thanks for uploading the debug logs.. it looks like a 7m gap from the start of provisioning the kvm instance, till the agent first starts up. Can you attach the /var/log/cloud-init-output.log .. that will help identify whats taking the additional time.

Revision history for this message
Daniel Westervelt (danwest) wrote :

http://paste.ubuntu.com/7505203/ <- /var/log/cloud-init.log from the slow KVM

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

This seems to be causing most of the time loss. Is there a local proxy/mirror ? Is the image out of date?

May 22 20:57:25 ubuntu [CLOUDINIT] util.py[DEBUG]: apt-upgrade [eatmydata apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet dist-upgrade] took 353.061 seconds

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

confirmed this is not a juju issue. seems to be a nested virt perf or other local setup issue.

Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :

If this is not a juju issue can we close/reject the bug?

Revision history for this message
Ian Booth (wallyworld) wrote :

Marking as incomplete unless we identify a problem we can fix in juju

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Daniel Westervelt (danwest) wrote :

slowness not attributed to juju but to uvt-tool and cloudinit.

affects: juju-core → uvtool
Changed in uvtool:
assignee: nobody → Canonical Server Team (canonical-server)
Revision history for this message
Robie Basak (racb) wrote :

I think investigation needs to start from a qemu performance perspective. It may be that uvtool needs to tell libvirt to tell qemu to do something differently, but in that case I need to know what.

affects: uvtool → qemu
affects: qemu → qemu (Ubuntu)
Changed in qemu (Ubuntu):
status: Incomplete → New
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Can someone show the actual dpkg term log to show how much needed upgrading?

Can you confirm that kvm is being run with acceleration?

Since you're using uvtool I assume you're also using a qcow2, which further degrades performance...

Revision history for this message
Adam Stokes (adam-stokes) wrote :

Reproducer steps:

Create a trusty virtual machine using the defaults provided with uvt tool.
ssh into that virtual machine and setup juju environments.yaml to be a local provider with a container-type: kvm

juju bootstrap
juju add-machine

Revision history for this message
Adam Stokes (adam-stokes) wrote :

environments.yaml:

default: local

environments:
  local:
    type: local
    container: kvm
    lxc-use-clone: true

Revision history for this message
Ryan Harper (raharper) wrote :

It takes 164 seconds in cloud init to bring up the first machine.

What's the outer VM config (resources), likewise, what's the host machine resource level?

Revision history for this message
Ryan Harper (raharper) wrote :
Download full text (4.2 KiB)

Summary
-------

Add-machine: Wed May 28 14:31:56 UTC 2014 (0)
KVM Ssh'able: Wed May 28 14:33:15 UTC 2014 (+79)
Cloud-init done: Wed May 28 14:35:14 UTC 2014 (+119)
Juju status started: Wed May 28 14:35:38 UTC 2014 (+24)

Total time: 222 seconds: 3 min, 40 seconds.

----------------------------------------
ubuntu@kvmjuju:~$ date; juju add-machine --constraints "cpu-cores=4 root-disk=20G"
Wed May 28 14:31:56 UTC 2014
ubuntu@kvmjuju:~$ date; juju status
Wed May 28 14:32:02 UTC 2014
environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.18.3.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "5":
    instance-id: pending
    series: trusty
services: {}
ubuntu@kvmjuju:~$ date; juju status
Wed May 28 14:32:24 UTC 2014
environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.18.3.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "5":
    agent-state: pending
    instance-id: ubuntu-local-machine-5
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=512M root-disk=20480M
services: {}
ubuntu@kvmjuju:~$ juju ssh 5 date
ERROR machine "5" has no public address
ubuntu@kvmjuju:~$ date; juju status
Wed May 28 14:33:15 UTC 2014
environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.18.3.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "5":
    agent-state: pending
    instance-id: ubuntu-local-machine-5
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=512M root-disk=20480M
services: {}
ubuntu@kvmjuju:~$ date; juju status
Wed May 28 14:34:36 UTC 2014
environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.18.3.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "5":
    agent-state: pending
    instance-id: ubuntu-local-machine-5
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=512M root-disk=20480M
services: {}
ubuntu@kvmjuju:~$ date; juju status
Wed May 28 14:35:14 UTC 2014
environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.18.3.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "5":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.18.3.1
    dns-name: 10.0.3.67
    instance-id: ubuntu-local-machine-5
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=512M root-disk=20480M
services: {}
ubuntu@kvmjuju:~$ date; juju status
Wed May 28 14:35:19 UTC 2014
environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.18.3.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "5":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.18.3.1
    dns-name: 10.0.3.67
    instance-id: ubuntu-local-machine-5
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=512M root-disk=20480M
services: {}
ubuntu@kvmjuju:~$ date; juju status
Wed May 28 14:35:21 UTC 2014
environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.18.3.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "5":
    agent-state: down
    agent-state-info: (start...

Read more...

Revision history for this message
Ryan Harper (raharper) wrote :

One modification to the outer KVM that may help here:

adding --unsafe-cache to your first level KVM will use host ram to cache guest writes to the qcow2 image -- this means what it means, if kill your VMs or host machine has power failure, you'll lose data. With that in place the same test resulted in add-machine to status: started of: 149 seconds.

Revision history for this message
Ian Booth (wallyworld) wrote :

A note from the juju side of the fence - Juju currently just syncs the latest released image when calling uvt-simplestreams-libvirt. It could be configured to add a --source arg to get a daily image if required. But configuration adds complexity so if the released kvm images were updated more often, I guess that would be preferable.

Revision history for this message
Daniel Westervelt (danwest) wrote :

looks like part of the problem was that we were running with constraints on cpu greater then the level1 KVM had allocated to it.

base KVM had 1 cpu, juju was asking for three:

juju add-machine --constraints "mem=3G root-disk=20G cpu-cores=3"

if we modify the base KVM to have 3 cpus the time drops from 13.5 miutes to 4.5 minutes.

Adding cache='unsafe' further reduces the time down to 3 minutes.

That said, I still think we need to have an option to tell juju to pull the latest daily images in order to reduce the number of updates applied via cloudinit.

uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily release=precise arch=amd64

^^ This would be the uvtool cmdline to sync the latest daily

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1322302] Re: local provider is very slow to tranistion from agent-status: pending

So it looks like this should be reclassified against juju again?

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

using latest dailies images and/or hacking juju to not blindly issue
apt-get update/upgrade is something i often to do safe a minute per unit
(x100s). its also the only way to get kernel updates sans rebooting a
machine post deploy. a daily should get us to the same place as the last
release plus upgrades, however it does not get qa'd by the cloud images
team.

On Thu, May 29, 2014 at 11:33 AM, Serge Hallyn <email address hidden>
wrote:

> So it looks like this should be reclassified against juju again?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1322302
>
> Title:
> local provider is very slow to tranistion from agent-status: pending
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1322302/+subscriptions
>

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Ok, then it sounds like the bug should be either invalid or wontfix - be it against qemu or juju. Probably both, since it is the case for both.

As I've been repeatedly told, nested qemu (especially on intel) is a convenience/developer feature, and all but unsupported upstream. So we either need to take it upon ourselves to move resources to address it ourselves (as much as possible), or accept facts as they are when designing on top of it.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Actually the first two suggestions in comment #23 seem perfectly valid against juju - do not overprovision, at least, and perhaps, conditionally, use cache=unsafe. I would assume that in a developer situation, if power is lost during a juju deploy, all vms would get wiped and the charms re-deployed from scratch anyway.

Revision history for this message
Daniel Westervelt (danwest) wrote :

cache=unsafe only helps a litte. we either need to support daily or more frequent images by default, mainly to avoid the cost of kernel updates or support a feature in juju to allow us to configure what images we want to pull.

as a side note, I am happy to get ride of nested kvm if we can get all Openstack services running in containers.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Daniel Westervelt (<email address hidden>):
> cache=unsafe only helps a litte.

In comment #23 you said it brought the time from 4.5 to 3 minutes, a 33%
improvement.

> we either need to support daily or more
> frequent images by default, mainly to avoid the cost of kernel updates
> or support a feature in juju to allow us to configure what images we
> want to pull.

That still all falls under the purvue of juju. All I am pointing out is
that the bug should be marked affecting juju.

> as a side note, I am happy to get ride of nested kvm if we can get all
> Openstack services running in containers.

One step in this direction would be if we could use mountlo, which is
fuse-based, in a container. I haven't had a chance to test it, but it
is packaged and in theory it should work.

But that is unrelated to the fact that this bug's status against
qemu is either invalid (upstream) or wontfix (ubuntu package).

affects: qemu (Ubuntu) → juju-core (Ubuntu)
Changed in juju-core (Ubuntu):
assignee: Canonical Server Team (canonical-server) → nobody
Ian Booth (wallyworld)
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
milestone: none → next-stable
Revision history for this message
Ian Booth (wallyworld) wrote :

The ability to specify a different image stream to allow daily images to be used query simpleastreams images was added in 1.17.2, but this was for cloud nodes started by the provider directly in the cloud, not images fetched by uvt-simplestreams-libvirt as is done when creating a kvm container.

We can extend the kvm provisioning component of Juju to also honour the "image-stream" configuration setting.

Additionally, there is now also agreement that we are to introduce configuration settings to disable apt update/upgrade steps at bootstrap, reducing provisioning time even further.

Changed in juju-core:
milestone: next-stable → 1.21-alpha1
Changed in juju-core:
assignee: nobody → Katherine Cox-Buday (cox-katherine-e)
status: Triaged → In Progress
Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

The first half of this fix has been landed into trunk. When users specify a stream in the "image-stream" config setting, Juju will use this to determine which simple stream to sync. By specifying "daily", new machines will receive a more up-to-date image thus shortening the time it takes to perform apt-get update/upgrade.

The second half which is designed to make OS update behavior configurable is in the works.

Revision history for this message
Robie Basak (racb) wrote :

Note that bug 1317680 may be a potential issue here. If you end up in the situation that you've previously synced the same release from both a daily and a released stream, then an attempt to create a container will be unambiguous and thus fail. In that bug the issue was that the user had manually done it locally. Here, the problem may be that Juju trips itself up by ending up in that situation, if Juju mixes use of released and daily streams.

I suppose a wider issue here is that we don't currently have a good answer inside simplestreams for syncing images from multiple streams into the same pool.

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

Thanks for the information, Robie. I am assigned to 1317680 as well, so I'll keep your comments in mind.

Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

Oops, looks like I was mistaken. I'm not assigned. I see you cross-linked your comment, so whomever gets assigned should have enough to go on.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
Changed in juju-core (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.