[SRU] dnsmasq fails at leasing issues when using vlan mode

Bug #1006898 reported by Chuck Short
This bug affects 5 people

Affects            Status         Importance   Assigned to   Milestone
dnsmasq (Ubuntu)   Fix Released   Medium       Unassigned
Precise            Won't Fix      High         Unassigned

Bug Description

** Issue **

There is an issue with the way nova uses dnsmasq in VLAN mode. It starts
up a single copy of dnsmasq for each vlan on the network host (or on
every host in multi_host mode). The problem is in the way that dnsmasq
binds to an ip address and port[2]. Both copies can respond to broadcast
packets, but unicast packets can only be answered by one of the copies.

In nova this means that guests from only one project will get responses
to their unicast dhcp renew requests. Unicast requests from guests in
other projects get ignored. What happens next is different depending on
the guest os. Linux generally will send a broadcast packet out after
the unicast fails, and so the only effect is a small (tens of ms) hiccup
while the interface is reconfigured. It can be much worse than that,
however. I have seen cases where Windows just gives up and ends up with
a non-configured interface.

This bug was first noticed by some users of openstack who rolled their
own fix. Basically, on Linux, if you set the SO_BINDTODEVICE socket
option, it will allow different daemons to share the port and respond to
unicast packets, as long as they listen on different interfaces. I
managed to communicate with Simon Kelley, the maintainer of dnsmasq, and
he has integrated a fix[3] for the issue in the current version[1] of
dnsmasq.

[3] http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=9380ba70d67db6b69f817d8e318de5ba1e990b12
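
The SO_BINDTODEVICE behaviour described above can be sketched in a few lines of Python. This is only an illustration of the socket option, not dnsmasq's actual code: the helper names (device_option, open_dhcp_socket) and the interface names in the usage note are hypothetical, and the fallback value 25 is the usual Linux value of the constant, used only in case the Python build does not expose it.

```python
import socket

# Linux-only socket option. The fallback value 25 is an assumption that
# holds for common Linux architectures.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)
DHCP_SERVER_PORT = 67


def device_option(ifname):
    """Encode an interface name the way setsockopt(SO_BINDTODEVICE) expects."""
    raw = ifname.encode("ascii")
    # IFNAMSIZ is 16 on Linux, including the trailing NUL byte.
    if not 0 < len(raw) < 16:
        raise ValueError("invalid interface name: %r" % (ifname,))
    return raw + b"\0"


def open_dhcp_socket(ifname, port=DHCP_SERVER_PORT):
    """Open a UDP socket that only receives traffic arriving on `ifname`.

    Needs root (or equivalent capabilities) to set the option and to
    bind to a port below 1024.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Let several daemons bind the same address and port ...
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        # ... and tie this socket to one interface, so unicast packets
        # arriving on that interface are delivered to this socket.
        sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, device_option(ifname))
        sock.bind(("", port))
    except OSError:
        sock.close()
        raise
    return sock
```

Under these assumptions, one process would call open_dhcp_socket("vlan100") and another open_dhcp_socket("vlan101") (both names made up for the example), and each would then see the unicast renewals arriving on its own VLAN interface.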

** Development Fix **

This has been fixed in quantal with the newer version of dnsmasq.

** Stable Fix **

I have backported the patch which fixes this issue, and I have attached the debdiff and the build log.

** Test Case **

1. Install openstack with vlan mode.
2. Watch instances lose their IP addresses.

** Regression Potential **

Minimal; most installations don't use this type of networking.

Revision history for this message
Scott Moser (smoser) wrote :

This looks like something we should pull in.
Since Ubuntu has the unmodified Debian package, and the Debian maintainer is the upstream maintainer, we should probably let the quantal package get synced from Debian. Then we can patch the 12.04 Ubuntu version in an SRU.

@Simon,
  If you're reading this, do you have plans for a 2.62 release and subsequent 2.62-1 upload soon?

Changed in dnsmasq (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Scott Moser (smoser)
Changed in dnsmasq (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Simon Kelley (simon-thekelleys) wrote : Re: [Bug 1006898] Re: [SRU] dnsmasq fails at leasing issues when using vlan mode

On 31/05/12 14:57, Scott Moser wrote:
> This looks like something we should pull in.
> Since Ubuntu has the unmodified Debian package, and the Debian maintainer is the upstream maintainer, we should probably let the quantal package get synced from Debian. Then we can patch the 12.04 Ubuntu version in an SRU.
>
> @Simon,
> If you're reading this, do you have plans for a 2.62 release and subsequent 2.62-1 upload soon?

I do. There are a few nasty bugs in 2.61 in the new DHCPv6 and router
advertisement code; I plan to release 2.62 to address these in the next
few days.

Cheers,

Simon.

James Page (james-page)
Changed in dnsmasq (Ubuntu Precise):
milestone: none → ubuntu-12.04.1
Revision history for this message
Thierry Carrez (ttx) wrote :

2.62 is in Quantal

Changed in dnsmasq (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
ROCHE (guyroche08-6) wrote : re: [[Bug 1006898] Re: [SRU] dnsmasq fails at leasing issues when using vlan mode]

Hi,
Thanks for your work. It is very bad not to have sound in this version of Ubuntu (kernel 3 2 025 and version 12.10 (Quantal)). This problem came with the update from 3 2 024 to 3 2 025; 3 2 025 seems to leave out ALSA.

For the record:
Matching subscriptions: No Audio after update kernel 3 2 025 in Ubuntu 12 04 32bits. Good in kernel 3 2 024 ! Alsa 1 025 not compiled in kernel 3 2 025. It's the same problem with ubuntu 12 10 alpha I tried today too !(but kernel 3 4 ...). A+
> https://bugs.launchpad.net/bugs/1006898

One more piece of information:
Run this line in the console:
"cat /proc/asound/version"
If that gives you something like this, then the problem lies elsewhere:
Advanced Linux Sound Architecture Driver Version 1.0.25.
Compiled on Mar 9 2012 for kernel 3 2 025-generic PAE
If it returns only this line:
Advanced Linux Sound Architecture Driver Version 1.0.24.
then there is no need to look any further: your sound cannot work, because ALSA is not compiled for the kernel you are running at that moment.

I get only "Advanced Linux Sound Architecture Driver Version 1.0.24." when I type "cat /proc/asound/version", even though the installed sound driver is version 1.0.25!

Best regards.

Guy Roche

mail <email address hidden>
mail <email address hidden>

home 0324 376446
mobile 0619 178018


Revision history for this message
Chuck Short (zulcss) wrote :

** Issue **

There is an issue with the way nova uses dnsmasq in VLAN mode. It starts
up a single copy of dnsmasq for each vlan on the network host (or on
every host in multi_host mode). The problem is in the way that dnsmasq
binds to an ip address and port[2]. Both copies can respond to broadcast
packets, but unicast packets can only be answered by one of the copies.

In nova this means that guests from only one project will get responses
to their unicast dhcp renew requests. Unicast requests from guests in
other projects get ignored. What happens next is different depending on
the guest os. Linux generally will send a broadcast packet out after
the unicast fails, and so the only effect is a small (tens of ms) hiccup
while the interface is reconfigured. It can be much worse than that,
however. I have seen cases where Windows just gives up and ends up with
a non-configured interface.

This bug was first noticed by some users of openstack who rolled their
own fix. Basically, on Linux, if you set the SO_BINDTODEVICE socket
option, it will allow different daemons to share the port and respond to
unicast packets, as long as they listen on different interfaces. I
managed to communicate with Simon Kelley, the maintainer of dnsmasq, and
he has integrated a fix[3] for the issue in the current version[1] of
dnsmasq.

[3] http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=9380ba70d67db6b69f817d8e318de5ba1e990b12

** Development Fix **

This has been fixed in quantal with the newer version of dnsmasq.

** Stable Fix **

I have backported the patch which fixes this issue, and I have attached the debdiff and the build log.

** Test Case **

1. Install openstack with vlan mode.
2. Watch instances lose their IP addresses.

** Regression Potential **

Minimal; most installations don't use this type of networking.

Revision history for this message
Chuck Short (zulcss) wrote :
Revision history for this message
Chuck Short (zulcss) wrote :
Revision history for this message
Chris Halse Rogers (raof) wrote :

This seems like an important bug to fix, but I have reservations about changing dnsmasq's behaviour in a stable update. When you say ‘most installations don't use this type of networking’, what do you mean by ‘most’, is it plausible that someone has relied on this behaviour, and if someone had relied on this behaviour how would this change affect them?

Could this be more safely worked-around in openstack?

Revision history for this message
Christian Parpart (trapni) wrote :

Hey,

sorry, with "most" I meant "the documentation recommends using VlanManager" (that is, VLAN mode) for networking.

Although, you cannot "rely" on such a behaviour, IMHO, because it absolutely makes no sense to let hosts (that send a DHCPREQUEST) not receive their DHCPACK.

Revision history for this message
Christian Parpart (trapni) wrote :

> Could this be more safely worked-around in openstack?

I forgot to comment on this one. I am no OpenStack expert; however, OpenStack nova-network relies on dnsmasq for propagating IP addresses via DHCP to its (KVM/...) instances, and OpenStack supports both simple networking (without VLANs) and VLAN networking. Thus, I don't see how OpenStack could work around this except by using different software than dnsmasq (something that actually works), or by not using VLANs at all.

Revision history for this message
Chuck Short (zulcss) wrote :

RAOF,

By "most" I mean that we don't recommend that people use VLAN mode, but some people do use it, and they are not able to use VLANs with the dnsmasq in precise without this fix.

Regards
chuck

Revision history for this message
Chris Halse Rogers (raof) wrote :

Well, what I meant was: the code that you're touching is in the dnsmasq-base package, and dnsmasq-base is installed on *all* Ubuntu systems, as a dependency of network-manager. It seems that the worst-case regression potential is that we break DNS on all Ubuntu systems, which would be bad :)

lxc and libvirt have run into the same problems, and they added their network interfaces to the global dnsmasq blacklist, which at least means that the behaviour is only changed for users who install lxc or libvirt.

Revision history for this message
Christian Parpart (trapni) wrote :

And that means what?

Will you (Ubuntu) ignore the bug and leave the patching up to the libvirt/lxc Ubuntu users?

I am confused. :-)

Revision history for this message
Steve Langasek (vorlon) wrote :

Chuck, please put SRU information in the bug description, not in a comment - it becomes hard to find this information when there are a dozen more comments from testers.

description: updated
Revision history for this message
Steve Langasek (vorlon) wrote :

Please also complete the test case with explicit information about how users can verify the *fix* for this bug.

Revision history for this message
Steve Langasek (vorlon) wrote :

I'm afraid I also don't understand this problem statement:

> There is an issue with the way nova uses dnsmasq in VLAN mode. It starts
> up a single copy of dnsmasq for each vlan on the network host (or on
> every host in multi_host mode). The problem is in the way that dnsmasq
> binds to an ip address and port[2]. Both copies can respond to broadcast
> packets, but unicast packets can only be answered by one of the copies.

What exactly is the network configuration that allows this to happen? Does the host have multiple vlan interfaces using the same IP address?

That's the only scenario I see in which SO_BINDTODEVICE should make a difference; but I don't understand why you would be using the same IP address on multiple interfaces, virtual or otherwise.

Revision history for this message
Steve Langasek (vorlon) wrote :

... and now I've reviewed the debdiff, and found it to not match the upstream commit. This part of the patch to src/network.c is missing:

@@ -254,6 +261,7 @@ static int iface_allowed(struct irec **irecp, int if_index,
       iface->addr = *addr;
       iface->netmask = netmask;
       iface->tftp_ok = tftp_ok;
+      iface->dhcp_ok = dhcp_ok;
       iface->mtu = mtu;
       iface->dad = dad;
       iface->done = 0;

This means the value of dhcp_ok on each interface is *undefined*, and this SRU would cause dnsmasq to *randomly* stop doing DHCP on configured interfaces.

Rejecting from the queue.

Revision history for this message
Steve Langasek (vorlon) wrote :

Before the SRU team will reconsider an SRU for this, based on the above I would also expect to see a regression test plan that accounts for making sure dnsmasq continues to work correctly in configurations other than the openstack one.

Revision history for this message
Chuck Short (zulcss) wrote :

I'll fix this up as requested.

Changed in dnsmasq (Ubuntu Precise):
assignee: nobody → Stéphane Graber (stgraber)
Revision history for this message
Stéphane Graber (stgraber) wrote :

Assigned this bug to myself when going through the buglist as it was in my usual package list, though based on past comments, I'm now re-assigning to Chuck as he's more familiar with the issue.

I'll be interested in looking at the diff before it gets pushed to our users though. As Steve said, we have dnsmasq running on most Ubuntu systems (all desktops have it by default) and we really don't want to risk a regression for these.

Changed in dnsmasq (Ubuntu Precise):
assignee: Stéphane Graber (stgraber) → Chuck Short (zulcss)
James Page (james-page)
Changed in dnsmasq (Ubuntu Precise):
milestone: ubuntu-12.04.1 → precise-updates
Revision history for this message
Luc (gmi68745) wrote :

Was this update to dnsmasq released in 12.04.1 ?

root@ubuntu:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.1 LTS
Release: 12.04
Codename: precise
root@ubuntu:~# apt-cache policy dnsmasq
dnsmasq:
  Installed: 2.59-4
  Candidate: 2.59-4
  Version table:
 *** 2.59-4 0
        500 http://us.archive.ubuntu.com/ubuntu/ precise/universe amd64 Packages
        100 /var/lib/dpkg/status

Revision history for this message
Stéphane Graber (stgraber) wrote :

No

Revision history for this message
Michael S. Moody (michael-sykosoft) wrote :

This is hitting a lot of openstack users who chose 12.04 due to the announcement of backporting openstack to precise for 3 years: https://wiki.ubuntu.com/ServerTeam/CloudArchive

We fall into that category. It's pretty much impossible to use stock versions of dnsmasq and openstack in 12.04. This adds a significant burden to our team. This is a pretty serious problem that needs to find its way into 12.04 (precise), sooner rather than later, especially since it affects the recommended openstack essex configuration (vlan). It's causing our Windows Server instances to fatally lose their IP address config. This essentially makes our cloud unpredictably unstable, with instances coming and going randomly. Yes, there are potential workarounds (at least for openstack), but they're ugly:

https://lists.launchpad.net/openstack/msg11696.html

nova.conf
# release leases immediately on terminate
force_dhcp_release=true
# one week lease time
dhcp_lease_time=604800
# two week disassociate timeout
fixed_ip_disassociate_timeout=1209600

Again, this needs to be fixed considering this is LTS, and considering that precise is supposed to be a solid foundation upon which to build an openstack cloud.
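
For readers unfamiliar with the units, the lease values in the nova.conf workaround above are in seconds. The short sketch below converts them to days and computes the default DHCP renewal time T1, which RFC 2131 sets to half the lease duration. The constant and function names are made up for this illustration, and the reading that longer leases simply make the failing unicast renewals less frequent is an assumption, not something stated in the linked post.

```python
# Lease-related values from the nova.conf workaround quoted above (seconds).
DHCP_LEASE_TIME = 604800
FIXED_IP_DISASSOCIATE_TIMEOUT = 1209600

SECONDS_PER_DAY = 24 * 60 * 60


def days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


def default_renewal_time(lease_seconds):
    """Default T1 per RFC 2131: clients start renewing at half the lease."""
    return lease_seconds * 0.5


print(days(DHCP_LEASE_TIME))                        # 7.0
print(days(FIXED_IP_DISASSOCIATE_TIMEOUT))          # 14.0
print(days(default_renewal_time(DHCP_LEASE_TIME)))  # 3.5
```

With these settings a guest would attempt its first unicast renewal roughly every three and a half days instead of far more often, which reduces how frequently the bug is triggered but does not remove it.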

Revision history for this message
Soren Hansen (soren) wrote :

Chuck, this is still assigned to you. Is it going anywhere?

Revision history for this message
Chuck Short (zulcss) wrote :

No, we are probably going to be backporting it to the cloud archive.

Revision history for this message
Michael S. Moody (michael-sykosoft) wrote :

Did this make it into 12.04.2 LTS? We still experience breakage here, and must manually apply a newer dnsmasq out of band (which causes all sorts of other administration burdens). I don't see it in cloud-archive either according to the most recent comment 1 month ago.

James Page (james-page)
Changed in dnsmasq (Ubuntu Precise):
importance: Medium → High
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

The attached patch applies upstream commit 9380ba70d67db6b69f817d8e318de5ba1e990b12 to precise.

Revision history for this message
Chris J Arges (arges) wrote :

Jorge, can you reformat the debdiff so it just adds the debian patch and doesn't also modify the original source? Thanks

Revision history for this message
Sebastien Bacher (seb128) wrote :

Jorge, Chuck, can you get that updated/uploaded?

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Attached patch for precise.

Changed in dnsmasq (Ubuntu Precise):
assignee: Chuck Short (zulcss) → Jorge Niedbalski (niedbalski)
status: Triaged → In Progress
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

@arges,

I re-uploaded the patch, rebasing with current -updates and fixing the issue you detected. Please sponsor.

Thanks.

Revision history for this message
Martin Pitt (pitti) wrote :

Note that the patch is a no-op. The source package format is 1.0 and it doesn't call quilt explicitly. The precise package uses inline patches. I tried to apply it manually, but it doesn't fit at all.

Can you please backport the patch to the precise version and change it inline?

Changed in dnsmasq (Ubuntu Precise):
status: In Progress → Incomplete
Revision history for this message
Martin Pitt (pitti) wrote :

Unsubscribing sponsors, please re-subscribe when you attach a working patch. Thanks!

Changed in dnsmasq (Ubuntu Precise):
assignee: Jorge Niedbalski (niedbalski) → nobody
Revision history for this message
Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release

Changed in dnsmasq (Ubuntu Precise):
status: Incomplete → Won't Fix
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Christian,
I saw both, the close and your question.
I'd assume that Steve is updating those in a semi-automated fashion, there is no way he could read all those bugs.

Per bug status this one is marked fixed and only had a bug task open to backport to Precise.
In fact the upstream commit that fixes this:

commit 9380ba70d67db6b69f817d8e318de5ba1e990b12
Author: Simon Kelley <email address hidden>
Date: Mon Apr 16 14:41:56 2012 +0100

    Set SO_BINDTODEVICE on DHCP sockets when doing DHCP on one interface
    only. Fixes OpenSTack use-case.

Has landed in 2.61.

And all Ubuntu releases are >2.61 nowadays:

 dnsmasq | 2.68-1 | trusty | source
 dnsmasq | 2.68-1 | trusty/universe | all
 dnsmasq | 2.68-1ubuntu0.2 | trusty-security | source
 dnsmasq | 2.68-1ubuntu0.2 | trusty-security/universe | all
 dnsmasq | 2.68-1ubuntu0.2 | trusty-updates | source
 dnsmasq | 2.68-1ubuntu0.2 | trusty-updates/universe | all
 dnsmasq | 2.75-1 | xenial | source
 dnsmasq | 2.75-1 | xenial/universe | all
 dnsmasq | 2.75-1ubuntu0.16.04.10 | xenial-security | source
 dnsmasq | 2.75-1ubuntu0.16.04.10 | xenial-security/universe | all
 dnsmasq | 2.75-1ubuntu0.16.04.10 | xenial-updates | source
 dnsmasq | 2.75-1ubuntu0.16.04.10 | xenial-updates/universe | all
 dnsmasq | 2.79-1 | bionic | source
 dnsmasq | 2.79-1 | bionic/universe | all
 dnsmasq | 2.79-1ubuntu0.4 | bionic-security | source
 dnsmasq | 2.79-1ubuntu0.4 | bionic-security/universe | all
 dnsmasq | 2.79-1ubuntu0.4 | bionic-updates | source
 dnsmasq | 2.79-1ubuntu0.4 | bionic-updates/universe | all
 dnsmasq | 2.79-1ubuntu0.5 | bionic-proposed | source
 dnsmasq | 2.79-1ubuntu0.5 | bionic-proposed/universe | all
 dnsmasq | 2.80-1.1ubuntu1 | focal | source
 dnsmasq | 2.80-1.1ubuntu1 | focal/universe | all
 dnsmasq | 2.80-1.1ubuntu1.4 | focal-security | source
 dnsmasq | 2.80-1.1ubuntu1.4 | focal-security/universe | all
 dnsmasq | 2.80-1.1ubuntu1.4 | focal-updates | source
 dnsmasq | 2.80-1.1ubuntu1.4 | focal-updates/universe | all
 dnsmasq | 2.84-1ubuntu2 | hirsute | source
 dnsmasq | 2.84-1ubuntu2 | hirsute/universe | all
 dnsmasq | 2.84-1ubuntu2.1 | hirsute-security | source
 dnsmasq | 2.84-1ubuntu2.1 | hirsute-security/universe | all
 dnsmasq | 2.84-1ubuntu2.1 | hirsute-updates | source
 dnsmasq | 2.84-1ubuntu2.1 | hirsute-updates/universe | all
 dnsmasq | 2.85-1ubuntu2 | impish | source
 dnsmasq | 2.85-1ubuntu2 | impish/universe | all

Therefore, yes the assumption would be that it is fixed in all remaining active releases.
I've not done a practical check with a full testbed setup, but code-wise it should indeed be good now.
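
The code-wise check above can be approximated with a small script that reads a "package | version | suite | arch" listing like the one in this comment and compares each upstream version against 2.61. This is a rough sketch only: it compares the leading numeric upstream part and does not implement full Debian version ordering, the function names are made up for the example, and the precise line in the sample is reconstructed from the 2.59-4 version reported earlier in this thread rather than copied from a real listing.

```python
import re

# Upstream release named in this thread as containing the fix.
FIXED_IN = (2, 61)


def upstream_version(deb_version):
    """Return the numeric upstream part of a Debian version as a tuple."""
    # Drop an optional epoch ("1:2.61-1") and the Debian/Ubuntu revision,
    # which follows the last hyphen.
    no_epoch = deb_version.split(":", 1)[-1]
    upstream = no_epoch.rsplit("-", 1)[0] if "-" in no_epoch else no_epoch
    return tuple(int(n) for n in re.findall(r"\d+", upstream))


def has_fix(deb_version):
    """True if the upstream version is at least the release with the fix."""
    return upstream_version(deb_version) >= FIXED_IN


def parse_listing(text):
    """Yield (suite, version) pairs from 'pkg | version | suite | arch' lines."""
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 4 and fields[0] == "dnsmasq":
            yield fields[2], fields[1]


# Sample input: the first line is a hypothetical entry for precise,
# the others are taken from the listing above.
SAMPLE = """\
 dnsmasq | 2.59-4 | precise/universe | all
 dnsmasq | 2.68-1ubuntu0.2 | trusty-updates | source
 dnsmasq | 2.80-1.1ubuntu1.4 | focal-updates | source
 dnsmasq | 2.85-1ubuntu2 | impish | source
"""

for suite, version in parse_listing(SAMPLE):
    print(suite, version, "fixed" if has_fix(version) else "NOT fixed")
```

Only the precise entry is reported as not fixed, which matches the bug status shown for that release.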

Revision history for this message
Christian Parpart (christianparpart) wrote :

Many thanks Chris.

I just felt triggered by this closing because I remember how much pain this bug caused me while working in a data center back in the days. :)
