ipconfig does not honour user-requested timeouts in some cases

Bug #1947099 reported by Khaled El Mously
20
This bug affects 1 person
Affects Status Importance Assigned to Milestone
klibc (Ubuntu)
Won't Fix
Medium
Unassigned
Bionic
Fix Released
Medium
Khaled El Mously

Bug Description

[Impact]
In some cases, ipconfig can take a longer time than the user-specified timeouts, causing unexpected delays.

[Test Plan]

- Check that the ipconfig utility is able to obtain an IP through dhcp:

# ip l set dev ens9 down
# date; /usr/lib/klibc/bin/ipconfig ens9; date

- Check if it respects the timeout (i.e. 2 seconds in the example below):

# ip l set dev ens9 down
# date; /usr/lib/klibc/bin/ipconfig -t 2 ens9; date

- The original issue is that timeout is being ignored when not receiving any reply from a DHCP Server, so for this test you have to stop your DHCP Server (i.e. dnsmasq) and then test the timeouts:

# ip l set dev ens9 down
# date; /usr/lib/klibc/bin/ipconfig -t 2 ens9; date

# ip l set dev ens9 down
# date; /usr/lib/klibc/bin/ipconfig -t 5 ens9; date

Make sure it times out respectivelly in 2 and 5 seconds.

[racb: pending agreement with the SRU team; please see comment 37]

Any situation where ipconfig encounters an error sending the DHCP packet, it will automatically set a delay of 10 seconds, which could be longer than the user-specified timeout. It can be reproduced by creating a dummy interface and attempting to run ipconfig on it with a timeout value of less than 10:

# ip link add eth1 type dummy
# date; /usr/lib/klibc/bin/ipconfig -t 2 eth1; date
Thu Nov 18 04:46:13 EST 2021
IP-Config: eth1 hardware address ae:e0:f5:9d:7e:00 mtu 1500 DHCP RARP
IP-Config: no response after 2 secs - giving up
Thu Nov 18 04:46:23 EST 2021

^ Notice above, ipconfig thinks that it waited 2 seconds, but the timestamps show an actual delay of 10 seconds.

[Where problems could occur]
Please see reproduction steps above. We are seeing this in production too (see comment #2).

[Other Info]
A patch to fix the issue is being proposed here. It is a safe fix - it only checks before going into sleep that the timeout never exceeds the user-requested value.

[Original Description]

In some cases, ipconfig can take longer than the user-specified timeouts, causing unexpected delays.

in main.c, in function loop(), the process can go into process_timeout_event() (or process_receive_event() ) and if it encounters an error situation, will set an attempt to "try again later" at time equal now + 10 seconds by setting

s->expire = now + 10;

This can happen at any time during the main event loop, which can end up extending the user-specified timeout if "now + 10" is greater than "start_time + user-specified-timeout".

I believe a patch like the following is needed to avoid this problem:

--- a/usr/kinit/ipconfig/main.c
+++ b/usr/kinit/ipconfig/main.c
@@ -437,6 +437,13 @@ static int loop(void)

                        if (timeout > s->expire - now.tv_sec)
                                timeout = s->expire - now.tv_sec;
+
+ /* Compensate for already-lost time */
+ gettimeofday(&now, NULL);
+ if (now.tv_sec + timeout > start + loop_timeout) {
+ timeout = loop_timeout - (now.tv_sec - start);
+ printf("Lowered timeout to match user request = (%d s) \n", timeout);
+ }
                }

I believe the current behaviour is buggy. This is confirmed when the following line is executed:

                        if (loop_timeout >= 0 &&
                            now.tv_sec - start >= loop_timeout) {
                                printf("IP-Config: no response after %d "
                                       "secs - giving up\n", loop_timeout);
                                rc = -1;
                                goto bail;
                        }

'loop_timeout' is the user-specified time-out. With a value of 2, in case of error, this line prints:

IP-Config: no response after 2 secs - giving up

So it thinks that it waited 2 seconds - however, in reality it had actually waited for 10 seconds.

The suggested code-change ensures that the timeout that is actually used never exceeds the user-specified timeout.

[ Regression potential ]

This change ensures that user-specified timeouts are never exceeded, which is a problem that appears to happen only in case of interface errors.
It may be that someone is relying on current behaviour where they receive DHCP offers after their specified timeout (but within the 10-second error timeout). However, 1) that is buggy behaviour and should be exposed. Such a user would need to update their specified timeout to make it long enough to receive the DHCP offer (setting the timeout to 10 would keep the existing behaviour). 2) I think it is unlikely that such a scenario exists at all. The 10-second timeout problem happens when there are problems with the interface that prevent it from even sending out the DHCP request. I think it is very unlikely (or even, impossible) that DHCP offers would be received on a dead interface.

Based on the above points, I consider the regression potential to be very low for this change. I do not expect anyone who is currently using ipconfig successfully to notice this change.

I believe the only difference introduced by this is the reduction of delays caused by dead or problematic network interfaces. Those error delays are shortened such that they never exceeed user-specified timeouts.

description: updated
Revision history for this message
Khaled El Mously (kmously) wrote :

Adding a debdiff of the patch as well. I am not sure if I generated it correctly though..

Revision history for this message
Khaled El Mously (kmously) wrote :

For some background info, this issue is affecting one of the instance types of a cloud partner. The instance has 18 physical ethernet interfaces (bonded), but only one of which has a media connection. On boot-up, ipconfig is called to do DHCP for all 18 devices. 17 of them fail to send a DHCP request (because no media is connected), and so each one ends up taking 10 seconds, even though the user-specified timeout for each interface was only 2. This delays the entire boot-up sequence by about 3 minutes.

Mathew Hodson (mhodson)
Changed in klibc (Ubuntu):
importance: Undecided → Medium
Changed in klibc (Ubuntu Bionic):
importance: Undecided → Medium
Revision history for this message
Eric Desrochers (slashd) wrote :

@Khaled El Mously (kmously)

I don't see this piece of code in the klibc upstream project[0]

Is this something you only reproduce in Bionic/18.04LTS ?
Since your fix is not found anywhere else, I would be tempted to assume that it could be affecting other releases of Ubuntu and possibly more.

Could you please erport a bug and submit your patch by making sure the upstream klibc project adopts it ? It would help get some traction in order to SRU this patch into Ubuntu.

If it needs to be a UBUNTU_SAUCE only, please provide rationale.

- Eric

[0] - https://git.kernel.org/pub/scm/libs/klibc/klibc.git/about/

Eric Desrochers (slashd)
tags: added: sts
Eric Desrochers (slashd)
description: updated
Revision history for this message
Khaled El Mously (kmously) wrote :

Hi @Eric - thanks for the follow-up and sorry for the delayed reply.

> I don't see this piece of code in the klibc upstream project[0]

Are you referring to the piece of code that I am suggesting as a fix? If so, it makes sense that it doesn't exist elsewhere, which is why I am suggesting it be incorporated into klibc. I have checked upstream klibc 2.0.4 which is the version that Bionic uses and it applies cleanly there.

I am not sure that it would help us to get the fix upstreamed in a new version of klibc since the problem is only affecting Bionic, and Bionic uses klibc 2.0.4 + some Ubuntu changes. I assumed we can just add the additional fix as another Ubuntu-specific change on top ?

Revision history for this message
Eric Desrochers (slashd) wrote :

Khaled El Mously (kmously),

Thanks for the update. I'll review this and talk with sil2100, an SRU verification member.

Could you please help to fill the SRU template in the description above.
Extra documentations can be found here: https://wiki.ubuntu.com/StableReleaseUpdates#SRU_Bug_Template

What make Bionic more susceptible to this particular problem ? Bionic kernel version in use ? else ?

Revision history for this message
Eric Desrochers (slashd) wrote :

I still believe, it would be a good practice to submit the patch to upstream.

Revision history for this message
Eric Desrochers (slashd) wrote :

And let them judge if it's worth an upstream adoption or not as a first exercise. Then we can take a decision for Ubuntu.

Revision history for this message
Khaled El Mously (kmously) wrote :

Hello @Eric.

> What make Bionic more susceptible to this particular problem ? Bionic kernel version in use ? else ?

I believe klibc/ipconfig itself is susceptible to the problem in Bionic and Focal and elsewhere. However, Focal (which uses klibc 2.0.7) uses dhclient for networking initialization instead of ipconfig, therefore I am not really concerned about fixing the ipconfig issue in Focal since it is not causing any noticeable problems.

I have checked the HEAD of the klibc project and it appears that ipconfig is still susceptible to the timeout problem. Sure I can propose the patch upstream if you like - however, even if they accept it I believe we will still need to apply it separately for Bionic because I do not expect that Bionic will receive any major updates for klibc, so it will not get any upstream fixes. It will remain on 2.0.4 as far as I know. It is currently version 2.0.4-9ubuntu2 in Bionic which means we are carrying (2?) Ubuntu-specific patches on top of 2.0.4. I believe this fix can be added as well to produce 2.0.4-9ubuntu3.

I will attempt to upstream this fix to klibc, but I believe the change to Bionic should happen in parallel/independently since the upstream patch will not make its way back to Bionic (which is stuck at 2.0.4, as mentioned above).

Revision history for this message
Khaled El Mously (kmously) wrote :

(updated the bug description with further details according to the SRU template, including reproduction steps)

description: updated
description: updated
Revision history for this message
Eric Desrochers (slashd) wrote :

> I will attempt to upstream this fix to klibc, but I believe the change to Bionic should happen in parallel/independently since the upstream patch will not make its way back to Bionic (which is stuck at 2.0.4, as mentioned above).

Yes that is the plan to SRU the fix on top of what is currently found in Bionic's klibc.
And I'll gladly sponsor your patch in Bionic.

The idea with upstream is to make our contribution available to other who may suffer from the same issue, make sure it won't re-occur in a future release of Ubuntu as we will sync from debian which sync from upstream, ... and also have some kind of upstream adoption/approval/review of your proposal fix.

Let's give a few days after the submission for the upstream to have some time to review your patch. If it takes too long for them to get back to us, I'll sponsor the patch as-is, knowing that it was submitted to upstream with a reasonable amount of time given to them to review.

- Eric

Changed in klibc (Ubuntu Bionic):
assignee: nobody → Khaled El Mously (kmously)
status: New → In Progress
Revision history for this message
Brian Murray (brian-murray) wrote :

I'm unsubscribing sponsors given that slashd is handling this.

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hey guys, sorry for taking so long to comment. I agree with Eric here that we should get this patch upstreamed. Since we're only seeing this issue on bionic, I think it's safe to only backport it to bionic - but preferably after it's submitted upstream and some upstream people eyeballing it. It's always useful to get upstream's input on the changes we want to propose here + this will not be a dangling bionic-only change that way.

Revision history for this message
Khaled El Mously (kmously) wrote :
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Since the patch has been submitted upstream and there was no hard objection to it, I think you can proceed in preparing the SRU and getting it uploaded for SRU review. Just be sure to include the upstream link in the Forwarded: section of the patch. Thanks!

Revision history for this message
Dariusz Gadomski (dgadomski) wrote (last edit ):

I have prepared a patch for bionic and tested it:
# ip link add eth3 type dummy
# date; /usr/lib/klibc/bin/ipconfig -t 2 eth3; date
śro, 16 lut 2022, 15:20:26 CET
IP-Config: eth3 hardware address ba:e2:3c:e9:db:e4 mtu 1500 DHCP RARP
Lowered timeout to match user request = (2 s)
IP-Config: no response after 2 secs - giving up
śro, 16 lut 2022, 15:20:28 CET

I have sponsored the patch for review (2.0.4-9ubuntu2.18.04.1).

Revision history for this message
Robie Basak (racb) wrote :

> However, Focal (which uses klibc 2.0.7) uses dhclient for networking initialization instead of ipconfig...

If no users in Focal could be affected by the bug then I'm fine with accepting the fix to Bionic only. But could you please expand on the above statement? Under what conditions does Focal use dhclient instead of ipconfig? Presumably through initramfs-tools and ip=/ip6=? Are there any other situations where users could reasonably be using klibc/ipconfig on Focal? Is it possible for them to be doing this but not using initramfs-tools?

It would help me understand if you could present a complete user story to explain the user impact. Right now the bug description doesn't explain the full scenario under which they would be impacted by this bug.

> [Where problems could occur]

I don't think you've really answered this question. What different scenarios, from a user perspective (ie. user stories), can reasonably be predicted that will hit this ipconfig DHCP-related timeout case?

Robie Basak (racb)
Changed in klibc (Ubuntu Bionic):
status: In Progress → Incomplete
Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :
Download full text (4.7 KiB)

Hi Robie,

The user story here is about improving the time it takes to boot a Bionic instance on Oracle Cloud in a specific bare metal shape, called BM.GPU4.8. This is a pretty large instance, with 18x Ethernet controller [0200]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]:

https://www.oracle.com/uk/cloud/compute/gpu.html

Comparing the time it takes to boot Bionic with a CentOS instance we can see a big delta both in kernel and userspace:

CentOS:
Startup finished in 4min 3.548s (firmware) + 2min 23.525s (loader) + 3min 17.984s (kernel) + 1min 37.825s (initrd) + 38.723s (userspace) = 12min 1.608s
multi-user.target reached after 38.714s in userspace

Ubuntu:
Startup finished in 4min 621ms (firmware) + 2min 23.174s (loader) + 7min 34.767s (kernel) + 5min 48.219s (userspace) = 19min 46.782s
graphical.target reached after 5min 47.854s in userspace

The userspace difference is a cloud-init problem that is being handled by other launchpad bugs. Here we are trying to focus on the kernel difference, which further was narrowed down to be related to klibc and it is what we are trying to address with this bug. Some details:

Looking into details, here are the dmesg output from CentOS and Ubuntu, for kernel comparison:

Ubuntu: https://pastebin.canonical.com/p/X7GFVjdV3b/
CentOS: https://pastebin.canonical.com/p/gNhFwJjyjw/

Looking there, one area that looks promising is around mlx5_core driver, where Ubuntu spends 218 seconds, while CentOS spends 44 seconds

Khalid did a nice job investigating and found that the problem came from ipconfig from initramfs. Here are the details:

The delay from the kernel side is due to the fact that every interface (18 in total) takes over 10 seconds each to initialize.

The interfaces are brought up and used for DHCP as part of the iscsi initialization, during the initramfs stage. Specifically the script ```/usr/share/initramfs-tools/scripts/local-top/iscsi``` which is included in the initramfs and is installed as part of open-iscsi

To initialize the network interfaces this script calls other helper functions that are part of the initramfs, specifically "configure_networking" in ```/usr/share/initramfs-tools/scripts/functions``` which is installed as part of initramfs-tools-core.

The configure_networking function initializes networking by looping over all interfaces and issuing this call:

```
ipconfig -t 2 <iface>
```

where it is calling the ipconfig utility (part of klibc-utils) which also gets bundled in the initramfs. The -t 2 specifies a timeout of 2 seconds.

This ipconfig tool attempts to perform DHCP on the given interface, within the specified timeout (always 2 seconds for each interface)

Normally, ipconfig does not perform only one DHCP attempt - it performs multiple DHCP attempts, each attempt waiting for an exponentially increasing amount of time (1s, 2s, 4s, ..etc) up until the total amount of time has passed that is equal the user-specified timeout (2 in this case, so it should wait 1 second, followed by 1 second, for a total of 2 seconds).

However, it appears ipconfig has a bug. In case it encounters an error sending a DHCP request on an interface (which is the case for 17 ou...

Read more...

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Thank you for the detailed explanation Fabio, the impact of the bug is now much better understood. Can I ask for an update of the Impact field in the SRU description above? In a summary, but also getting into a bit more details about how this differs from focal+ etc.

One thing I worry about in this SRU is basically what has been mentioned by Ben Hutchings as a follow up to the patch forwarded upstream:
https://lists.zytor.com/archives/klibc/2021-December/004635.html

I worry that since this issue in ipconfig has been around for *so long* that people could have actually started depending on the behavior being as-is. I don't think klibc's ipconfig is too widely used (as klibc-utils only has like 2 reverse-dependencies), but it's still something to consider.
I think I'm not generally against this SRU. Similar to Robie I would like some regression potential analysis done and included in the [Where problems could occur] section - one of them being what I mentioned above. Can you think of any other issues this change could cause? In integration or in different parts of the code? Can we mitigate this somehow by performing some additional testing?

If you could modify the two sections of the SRU description, I'd be ready to accept this into bionic-proposed.

Revision history for this message
Khaled El Mously (kmously) wrote :

Hello @Łukasz Zemczak

I understand your concern about breaking existing behaviour.

I would like to clarify 2 things:
 1) The change introduced here only ensures that user-requested timeouts are never exceeded. That means that if something is currently accepting DHCP offers after the specified timeout has expired, then that is buggy behaviour. It would make sense that such behaviour would change after this change. If such a user exists, they would need to change their timeout values so as to be long enough to receive the DHCP offer that they want. I think it makes sense that such buggy behaviour would get exposed with this fix.
 2) That said, I consider it very unlikely for that scenario to even exist. The main reason is that the bug (going over the specified timeout) happens in the case where there is a problem bringing up the interface, so the ipconfig logic simply says "try again in 10 seconds" in that case, even if that 10 seconds is longer than the entire specified timeout. Since this happens when there are interface problems, it can be safely assumed that no DHCP offers will be received anyway on that problematic interface (in fact, no DHCP request was even sent out). These 10 seconds are just dead time, basically.

So I would consider it unlikely that anyone who is successfully using ipconfig today would even notice a change. This patch should only affect those 10-seconds delays caused by problematic interfaces not normal functioning interfaces, (and even then, it would only affect them if the user-specified timeout is less than 10 seconds).

I am not sure what you mean by how it differs from focal+. The last time I checked, the timeout-bug was affecting ipconfig in both focal and bionic (and upstream HEAD). However, we are only interested in Bionic since Focal does not even use ipconfig it uses dhclient instead.

I will add a regression-potential section to the description with more info. Thanks.

I will try to expand the regression-potential section with additional info about this.

Revision history for this message
Khaled El Mously (kmously) wrote :

A [ Regression potential ] section was added to the description.

@Łukasz Zemczak let me know if there are other details you would like. Thank you!

description: updated
Revision history for this message
Robie Basak (racb) wrote :

Thank you for the detailed additional information. I think I understand the situation properly now, together with the need for the fix.

> That means that if something is currently accepting DHCP offers after the specified timeout has expired, then that is buggy behaviour.

We sometimes run into the case where we're concerned about users being regressed by the very fix for the buggy behaviour itself. I think we need to take a pragmatic view in these cases. But unless there's some specific scenario we're concerned about, I think it's OK to not worry about the general case. As you say, if there's a user who is specifying a 2 second timeout but then relying on a DHCP reply coming after 3 seconds, then we will regress them but I think this is reasonable under the circumstances. I think it's sufficient to have considered and documented this case carefully, as you now have done (thanks!)

We also need to be careful to ensure that the code that we ship in the archive isn't itself already making such an assumption. But I think you've sufficiently covered that already.

It might be worth verifying that this is exactly the result of the change being landed here though. During SRU verification, in addition to the test plan you've already specified, would it be practical and/or useful to verify that, with a timeout of 2s, a DHCP reply sent after 1.5s works, but a DHCP reply sent after 2.5s does not? This would help us ensure that we are introducing the exact change we intend to introduce.

I would also like you to verify your actual failing user story please - that your boot time does actually speed up with that hardware.

With those two requests added to your Test Plan I am in principle fine to proceed. I still need to review your patch in detail but I've unfortunately run out of time for that today. I'd also like Łukasz's ack please.

If another SRU team member looks again before I do, feel free to take over and accept without my further input if you feel that's appropriate.

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :
Download full text (7.7 KiB)

I've setup a Lab with dnsmasq acting as DHCP Server, which I can use the dhcp-reply-delay option to introduce a delay between the DHCPDISCOVER and DHCPOFFER, as in the example below:

Mar 30 18:26:34 focal-dhcpsrv dnsmasq-dhcp[2470]: DHCPDISCOVER(ens3) 52:54:00:d7:10:13
Mar 30 18:26:34 focal-dhcpsrv dnsmasq-dhcp[2470]: 4195928115 reply delay: 3
Mar 30 18:26:37 focal-dhcpsrv dnsmasq-dhcp[2470]: DHCPOFFER(ens3) 192.168.150.119 52:54:00:d7:10:13

Just minor note that the delay only happens between DHCPDISCOVER and DHCPOFFER, but not between DHCPREQUEST and DHCPACK:

Mar 30 18:22:13 focal-dhcpsrv dnsmasq-dhcp[2470]: DHCPREQUEST(ens3) 192.168.150.119 52:54:00:d7:10:13
Mar 30 18:22:13 focal-dhcpsrv dnsmasq-dhcp[2470]: DHCPACK(ens3) 192.168.150.119 52:54:00:d7:10:13 ubuntu

So, if this is a new interface (new mac) looking for an IP, the delay will be added, but if this is an interface which was previously configured and is asking to re-use the same IP (DHCPREQUEST), there won't be any delays.

I believe this Lab will help reproducing the issue with and without the proposed patch.

1. In a ideal scenario, where no delays were added (dhcp-reply-delay is commented out), ipconfig will take approximately to do its job: it sends a DHCPREQUEST, then DHCPDISCOVER and then another DHCPREQUEST in this process:

- From the ipconfig perspective:

root@ubuntu:/etc/dhcp# date; /usr/lib/klibc/bin/ipconfig ens9; date
Wed Mar 30 19:25:44 UTC 2022
IP-Config: ens9 hardware address 52:54:00:f3:4e:0e mtu 1500 DHCP RARP
IP-Config: ens9 complete (dhcp from 192.168.150.2):
 address: 192.168.150.105 broadcast: 192.168.150.255 netmask: 255.255.255.0
 gateway: 192.168.150.1 dns0 : 192.168.150.2 dns1 : 0.0.0.0
 domain : sts.lab
 rootserver: 192.168.150.2 rootpath:
 filename :
Wed Mar 30 19:25:54 UTC 2022

- From the dhcp server perspective:

Mar 30 19:25:45 focal-dhcpsrv dnsmasq-dhcp[2656]: DHCPREQUEST(ens3) 192.168.150.105 52:54:00:f3:4e:0e
Mar 30 19:25:45 focal-dhcpsrv dnsmasq-dhcp[2656]: DHCPACK(ens3) 192.168.150.105 52:54:00:f3:4e:0e ubuntu
Mar 30 19:25:54 focal-dhcpsrv dnsmasq-dhcp[2656]: DHCPDISCOVER(ens3) 52:54:00:f3:4e:0e
Mar 30 19:25:54 focal-dhcpsrv dnsmasq-dhcp[2656]: DHCPOFFER(ens3) 192.168.150.105 52:54:00:f3:4e:0e
Mar 30 19:25:54 focal-dhcpsrv dnsmasq-dhcp[2656]: DHCPREQUEST(ens3) 192.168.150.105 52:54:00:f3:4e:0e
Mar 30 19:25:54 focal-dhcpsrv dnsmasq-dhcp[2656]: DHCPACK(ens3) 192.168.150.105 52:54:00:f3:4e:0e

2. Adding a 5 seconds delay in dnsmasq (dhcp-reply-delay=5) and without enforcing a timeout in ipconfig:

- From ipconfig perspective:

root@ubuntu:/etc/dhcp# date; /usr/lib/klibc/bin/ipconfig ens9; date
Wed Mar 30 19:34:09 UTC 2022
IP-Config: ens9 hardware address 52:54:00:f3:4e:0e mtu 1500 DHCP RARP
IP-Config: ens9 complete (dhcp from 192.168.150.2):
 address: 192.168.150.105 broadcast: 192.168.150.255 netmask: 255.255.255.0
 gateway: 192.168.150.1 dns0 : 192.168.150.2 dns1 : 0.0.0.0
 domain : sts.lab
 rootserver: 192.168.150.2 rootpath:
 filename :
Wed Mar 30 19:34:27 UTC 202...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Fabio,

Thanks for the lab setup for testing this!

> Anyway, if someone can build a klibc-utils package for me with the proposed patch,
> I can test the ipconfig behavior in this Lab.

Sure, there you go!

sudo add-apt-repository ppa:mfo/lp1947099
sudo apt install klibc-utils

> For testing the boot time optimization, I believe I will need a linux-oracle kernel built
> with the patch, since initramfs will get the ipconfig binary from bin/ipconfig inside the initrd.

If you update the initramfs after installing klibc-utils
(I guess this should be done automatically via triggers,
but apparently it isn't) then it picks up the new binary.

cheers,
Mauricio

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

I tried using klibc-utils from the ppa (containing the patch):

root@ubuntu:~# sudo apt-cache policy klibc-utils
klibc-utils:
  Installed: 2.0.4-9ubuntu2.18.04.1
  Candidate: 2.0.4-9ubuntu2.18.04.1
  Version table:
 *** 2.0.4-9ubuntu2.18.04.1 500
        500 http://ppa.launchpad.net/mfo/lp1947099/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.4-9ubuntu2 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages

[1] But now when I try to obtain an IP using dhcp without specifying any timeout, it dumps lots of "Lowered timeout to match user request" messages. Is that expected?

https://pastebin.ubuntu.com/p/5Tpc5Rwdkq/

[2] Other than that, it worked well. We can see from the pastebin below that the timeout is getting respected:

https://pastebin.ubuntu.com/p/TJbCzx7hFC/

[3] And also the main issue was the timeout being ignored when not receiving any reply from a DHCP Server, which is now also getting honored:

https://pastebin.ubuntu.com/p/BvKk6fJJDc/

I will try to test this package in a OCI BM.GPU4.8 instance to see the improvements with the main optimization, but it would be good to get clarification on [1]

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

I've tested the klibc-utils patch using Mauricio's ppa:

sudo add-apt-repository ppa:mfo/lp1947099
sudo apt install klibc-utils
sudo update-initramfs -u -k all

And I can confirm that it does improve the boot time in more than 3 minutes, without causing any noticeable issues.

- Without the patch:

https://pastebin.ubuntu.com/p/6M7M2FfCGQ/

root@ubuntu1804:~# sudo systemd-analyze time
Startup finished in 4min 2.098s (firmware) + 2min 23.684s (loader) + 6min 226ms (kernel) + 38.766s (userspace) = 13min 4.776s
graphical.target reached after 38.374s in userspace

We can see it spent 6min 226ms in kernel, and looking through the serial console (or dmesg) it was running the ipconfig commands for each of the Mellanox NICs from 211.972259 up to 386.155116 = 174.182857 seconds

- With the patch:

https://pastebin.ubuntu.com/p/JxDs8WPqc4/

root@ubuntu1804:~# sudo systemd-analyze time
Startup finished in 4min 890ms (firmware) + 2min 23.278s (loader) + 3min 46.413s (kernel) + 40.734s (userspace) = 10min 51.317s
graphical.target reached after 40.354s in userspace

We can see the kernel time decreased to a bit more than 3 minutes and we spent only 41 seconds (as opposed to 174 seconds) in the ipconfig commands:

210.675050 to 252.432441 = 41.757391 seconds

That definitely has helped to resolve the issue we are chasing here.

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

@Łukasz / @Robie, do you think the above comments are enough to proceed with this SRU?

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Fabio,

> [1] But now when I try to obtain an IP using dhcp without specifying any timeout,
> it dumps lots of "Lowered timeout to match user request" messages.
> Is that expected?
>
> https://pastebin.ubuntu.com/p/5Tpc5Rwdkq/

No, this is a regression without `-t`, and has to be fixed before SRU.

...

Without `-t`, loop_timeout == -1, which makes the new check pass/true.

(because `start` ~= `now` (seconds range), even for the theoretically
lowest possible value of `timeout` (zero; note it's not loop_timeout).

We need a gating `loop_timeout >= 0` check, as used in the inner loop.

- if (now.tv_sec + timeout > start + loop_timeout) {
+ if (loop_timeout >= 0 && now.tv_sec + timeout > start + loop_timeout) {

...

While checking the code in detail, I was a bit concerned that,
in the first pass, the `prev = now` assignment in the inner loop
now takes the newly used `now` (this patch), not the one from before
the while loop.

However, that shouldn't matter much, as it's only used for time/delta
reporting (disabled in non-debug build; but just for correctness) and
even so, for the first pass the time difference between the first/start
`now` and that assignment in the new check doesn't seem to be significant.

...

That one line change worked fine in a quick test here.

Could you test that, by, I guess (sorry, I have to EOD):

... unpacking the source package from the unapproved queue
(dpkg-source -x *.dsc && cd klibc-*),
create a new patch with `quilt new ipconfig-timeout-fix-fix.patch`,
add file with `quilt add usr/kinit/ipconfig/main.c`,
change that file,
update patch with `quilt refresh`,
update changelog `DEB_EMAIL="Your Name <email address hidden>" TZ=UTC dch -l+test 'd/p/ipcfg-tmo-f-f.patch: Test fix'`
sudo apt build-dep -y klibc
dpkg-buildpackage -us -uc

Thanks!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Fabio is out today, but he mentioned yesterday that the test patch worked fine locally, and he'd get access to an oracle test system again, for more tests.

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

Thank you, Mauricio, for the build process details and for adding the update here. I'm including some evidence of my tests showing that the patch you suggested did work well:

Details of the build process:

https://pastebin.ubuntu.com/p/dmVWH2fxpy/

Test package installed:

https://pastebin.ubuntu.com/p/qxyWBGrm3P/

Working well without specifying a timeout:

https://pastebin.ubuntu.com/p/byftnFk33C/

Timeout is also being honored:

https://pastebin.ubuntu.com/p/spFsTRhXTz/

Testing the original issue (timeout being ignored when not receiving any reply from a DHCP Server), which is now also getting honored:

https://pastebin.ubuntu.com/p/KHVRj7BdRw/

I'm working to get access to another instance to test that the boot time optimiizations are still fine.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey, Fabio!

That looks good; thanks!

I have built/uploaded to ppa:mfo/lp1947099v2 and for SRU,
as the tests on Oracle have `-t` in, thus no code change
from previous tests, effectively.

Per our chat, for completeness/correctness you'll test on
oracle again, and then we can move the SRU upload forward.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

We'll have to respin/re-upload with a different version, due to the recent security update [1].

Let's do that after the testing on Oracle, if at all possible, as that may be asked by the SRU team for accepting it, and some members may be busy with the jammy release, so let's save one request. :)

I have asked for the existing uploads to be removed (#ubuntu-release).

[1] https://ubuntu.com/security/notices/USN-5379-1

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

I've tested the new patch from ppa:mfo/lp1947099v2 and I can confirm it resolves the problem:

- Without the patch:

https://pastebin.ubuntu.com/p/RksNcBGSzn/

It took 396,940865−220,447147 = 176,493718 seconds in the IP-Config section. Total boot time:

ubuntu@gpu48-ubuntu18:~$ sudo systemd-analyze time
Startup finished in 4min 1.355s (firmware) + 2min 24.433s (loader) + 6min 8.464s (kernel) + 41.466s (userspace) = 13min 15.719s
graphical.target reached after 41.068s in userspace

- With the patch:

https://pastebin.ubuntu.com/p/46nVYCYyDZ/

It took 246,556749−212,019368 = 34,537381 seconds in the IP-Config section. Total boot time:

ubuntu@gpu48-ubuntu18:~$ sudo systemd-analyze time
Startup finished in 4min 1.246s (firmware) + 2min 24.170s (loader) + 3min 42.915s (kernel) + 39.010s (userspace) = 10min 47.343s
graphical.target reached after 38.634s in userspace

ubuntu@gpu48-ubuntu18:~$ sudo apt-cache policy klibc-utils
klibc-utils:
  Installed: 2.0.4-9ubuntu2.18.04.1
  Candidate: 2.0.4-9ubuntu2.18.04.1
  Version table:
 *** 2.0.4-9ubuntu2.18.04.1 500
        500 http://ppa.launchpad.net/mfo/lp1947099v2/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.4-9ubuntu2.1 500
        500 http://me-dubai-1-ad-1.clouds.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     2.0.4-9ubuntu2 500
        500 http://me-dubai-1-ad-1.clouds.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Thanks, Fabio!

With the regression fixed and Oracle test case working as expected,
I've updated the versioning and patch series, and uploaded for SRU.

tags: added: sts-sponsor-mfo
Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

Should this bug be changed to Fix Committed at this point?

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Fabio,

That happens when the package lands in -proposed.

It looks like the SRU vanguards this Wed/Thu are Robie/Lukasz,
which is good as they are familiar with the discussion/changes,
so I'll ask if they have SRU review cycles to for this upload.

Thanks for following up!

Changed in klibc (Ubuntu Bionic):
status: Incomplete → Confirmed
Revision history for this message
Robie Basak (racb) wrote :

It's not clear to me if upstream have accepted the patch. If not, https://lists.zytor.com/archives/klibc/2021-December/004635.html sounds like it's a deliberate upstream design decision not to.

In Ubuntu, we might decide to maintain the patch as a delta but then drop that delta in subsequent releases when we no longer need that functionality. This is already the case for >18.04, and for 18.04 we're adding the delta in an SRU.

Therefore, Won't Fix for Ubuntu for the development release I think is appropriate. It it does get fixed upstream then our status can be updated, but either way it's not really relevant to Ubuntu any more.

In comment 22 I asked:

> ...would it be practical and/or useful to verify that, with a timeout of 2s, a DHCP reply sent after 1.5s works, but a DHCP reply sent after 2.5s does not?

Have you covered this in your test plan please? I don't see this discussed anywhere.

Please could you also update the Test Plan section in the SRU description to cover the testing you are committing to do?

Apart from that, +1, so I'll accept now to save us time. We do need to continue discussion to resolve the Test Plan though. That needs to be resolved to the SRU team's satisfaction before we will release the build to the updates pocket.

description: updated
Changed in klibc (Ubuntu):
status: New → Won't Fix
Changed in klibc (Ubuntu Bionic):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Khaled, or anyone else affected,

Accepted klibc into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/klibc/2.0.4-9ubuntu2.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

I've tested the package from -proposed and I can confirm it fixes the problem:

Installed from -proposed:

root@ubuntu:~# apt-cache policy klibc-utils
klibc-utils:
  Installed: 2.0.4-9ubuntu2.2
  Candidate: 2.0.4-9ubuntu2.18.04.1
  Version table:
     2.0.4-9ubuntu2.18.04.1 500
        500 http://ppa.launchpad.net/mfo/lp1947099/ubuntu bionic/main amd64 Packages
 *** 2.0.4-9ubuntu2.2 500
        500 http://archive.ubuntu.com/ubuntu bionic-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.4-9ubuntu2.1 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     2.0.4-9ubuntu2 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages

Working well without timeout:

root@ubuntu:~# ip l set dev ens9 down
root@ubuntu:~# date; /usr/lib/klibc/bin/ipconfig ens9; date
Thu May 12 12:54:48 UTC 2022
IP-Config: ens9 hardware address 52:54:00:f3:4e:0e mtu 1500 DHCP RARP
IP-Config: ens9 complete (dhcp from 192.168.150.2):
 address: 192.168.150.105 broadcast: 192.168.150.255 netmask: 255.255.255.0
 gateway: 192.168.150.1 dns0 : 192.168.150.2 dns1 : 0.0.0.0
 domain : sts.lab
 rootserver: 192.168.150.2 rootpath:
 filename :
Thu May 12 12:55:01 UTC 2022

Timeout being honored:

root@ubuntu:~# ip l set dev ens9 down
root@ubuntu:~# date; /usr/lib/klibc/bin/ipconfig -t 2 ens9; date
Thu May 12 12:55:44 UTC 2022
IP-Config: ens9 hardware address 52:54:00:f3:4e:0e mtu 1500 DHCP RARP
Lowered timeout to match user request = (2 s)
IP-Config: no response after 2 secs - giving up
Thu May 12 12:55:46 UTC 2022

Testing the original issue (timeout being ignored when not receiving any reply from a DHCP Server), which is now also getting honored:

root@ubuntu:~# ip l set dev ens9 down
root@ubuntu:~# date; /usr/lib/klibc/bin/ipconfig -t 2 ens9; date
Thu May 12 12:56:32 UTC 2022
IP-Config: ens9 hardware address 52:54:00:f3:4e:0e mtu 1500 DHCP RARP
Lowered timeout to match user request = (2 s)
IP-Config: no response after 2 secs - giving up
Thu May 12 12:56:34 UTC 2022

root@ubuntu:~# ip l set dev ens9 down
root@ubuntu:~# date; /usr/lib/klibc/bin/ipconfig -t 5 ens9; date
Thu May 12 12:56:43 UTC 2022
IP-Config: ens9 hardware address 52:54:00:f3:4e:0e mtu 1500 DHCP RARP
Lowered timeout to match user request = (5 s)
IP-Config: no response after 5 secs - giving up
Thu May 12 12:56:48 UTC 2022

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

Hello Robie,

I've validated that the package from -proposed works well, testing in my VM based environment. I haven't tested it on Oracle bare metal (where the original issue happened) as that is a type of instance hard to get access to. Given that the test packages had proven to fix the original problem that we were targetting to address, I believe the package from -proposed should also work well. Let me know if it is necessary to also test that it addresses the original issue with Oracle BM host and I'll get access to one and validate.

Regarding your question below:

> ...would it be practical and/or useful to verify that, with a timeout of 2s, a DHCP reply sent after 1.5s works, but a DHCP reply sent after 2.5s does not?

That has been addressed in comment #30 (and now also in #39 with the package from -proposed).

Regarding your question if upstream had accepted the patch, I would have to defer that to Khaled, but I also agree that it seems a deliberate upstream design decision not to accept it.

Regarding your comment below:

> In Ubuntu, we might decide to maintain the patch as a delta but then drop that delta in subsequent releases when we no longer need that functionality.

That is indeed just needed for Bionic. To confirm/clarify my understanding, if this moves to -updates, in theory this is not going to be reverted (from Bionic) in the future, right? Once it lands there, it is expected that any newer klibc-utils packages that gets released to Bionic will continue to carry this patch?

Thank you in advance!

Regards,
Fabio Martins

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

I've also updated the [Test Plan] section of the bug description

description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Setting verification-done-bionic per comments 39/40.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package klibc - 2.0.4-9ubuntu2.2

---------------
klibc (2.0.4-9ubuntu2.2) bionic; urgency=medium

  [ Khaled Elmously ]
  * d/p/lp1947099-honour-user-requested-timeouts-in-all-cases.patch:
    Honour user-specified timeouts even in error cases. (LP: #1947099)

  [ Mauricio Faria de Oliveira ]
  * d/p/lp1947099-fix-for-no-timeout-specified.patch: Check for an
    user-specified timeout before checking/adjusting timeout values.

 -- Mauricio Faria de Oliveira <email address hidden> Mon, 25 Apr 2022 11:39:01 -0300

Changed in klibc (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for klibc has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.