local resolver stub fails to handle multiple TCP dns queries

Bug #1811471 reported by Dan Streetman
This bug affects 13 people
Affects              Status        Importance  Assigned to          Milestone
systemd              Fix Released  Unknown
resolvconf (Ubuntu)  Triaged       Undecided   Unassigned
  Bionic             Triaged       Undecided   Unassigned
  Cosmic             Triaged       Undecided   Unassigned
  Disco              Won't Fix     Undecided   Unassigned
systemd (Ubuntu)     Fix Released  High        Dimitri John Ledkov
  Bionic             Fix Released  High        Dan Streetman
  Cosmic             Fix Released  High        Dan Streetman
  Disco              Fix Released  High        Dimitri John Ledkov

Bug Description

[Impact]

The systemd local 'stub' resolver handles all local DNS queries (in Ubuntu's default configuration), and essentially proxies all requests to its configured upstream DNS resolvers.

Most local DNS resolution by applications uses glibc's getaddrinfo() function. This function is configured in various ways by the /etc/resolv.conf file, which tells glibc what nameserver/resolver to contact as well as how to talk to the name server.

By default, glibc performs UDP DNS queries, with a single DNS query per UDP packet. The DNS spec limits a UDP DNS message to 512 bytes (unless EDNS0 is in use). For some DNS lookups, a 512 byte UDP packet is not large enough to contain the entire response - for example, an A record lookup that returns a large number (e.g. 30) of A record addresses, which is possible in some cases of load balancing. When the response is larger than 512 bytes, the server puts as much of the response as it can into the UDP packet and sets the "truncated" (TC) flag. This lets glibc know that the DNS UDP packet did not contain the entire response for all the A records.
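As a rough check of the 30-record example above, the following sketch estimates the size of a UDP response. It assumes standard name compression (each answer name is a 2-byte pointer) and no authority or additional records; the function name is made up for illustration and is not part of glibc or systemd.

```python
def a_response_size(qname: str, num_a_records: int) -> int:
    """Estimate the wire size of a DNS response carrying A records.

    Assumes every answer name is a 2-byte compression pointer and that the
    response has no authority or additional records.
    """
    header = 12
    # Encoded QNAME: one length byte per label, the label bytes, and a root byte.
    labels = [label for label in qname.rstrip(".").split(".") if label]
    encoded_name = sum(1 + len(label) for label in labels) + 1
    question = encoded_name + 4  # QTYPE + QCLASS
    # Per answer: name pointer (2) + TYPE (2) + CLASS (2) + TTL (4)
    # + RDLENGTH (2) + IPv4 address (4) = 16 bytes.
    answers = 16 * num_a_records
    return header + question + answers


UDP_LIMIT = 512
for count in (29, 30):
    size = a_response_size("testing.irongiantdesign.com", count)
    print(count, size, "fits" if size <= UDP_LIMIT else "truncated")
```

Under these assumptions 29 A records come to 509 bytes and 30 come to 525 bytes, so 30 records is about where a response for this name stops fitting in a plain 512 byte UDP packet.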

When glibc sees a UDP response that is "truncated", by default it ignores the contents of that response and issues a new DNS query, using TCP instead of UDP. TCP DNS messages have a higher size limit (though see bug 1804487, which is a bug in systemd's maximum size for TCP DNS packets), and so *should* allow glibc to receive the entire DNS response.
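For reference, the truncation check amounts to testing one bit in the DNS header. A minimal sketch, assuming the RFC 1035 header layout; the helper name and the sample headers are illustrative and are not glibc's code:

```python
import struct

TC_BIT = 0x0200  # "truncated" bit in the 16-bit DNS header flags word


def is_truncated(response: bytes) -> bool:
    """Return True if the DNS response header has the TC bit set."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_BIT)


# Hand-built headers: QR, RD and RA set, with and without TC.
truncated = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 0, 0, 0)
complete = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 0, 0, 0)

for name, message in (("truncated", truncated), ("complete", complete)):
    action = "retry over TCP" if is_truncated(message) else "use this answer"
    print(name, "->", action)
```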

However, glibc issues DNS queries for both A and AAAA records. When it uses UDP, those DNS queries are separate (i.e. one UDP DNS packet with a single A query, and one UDP DNS packet with a single AAAA query). When glibc uses TCP, it puts both DNS queries into a single TCP DNS packet - the RFC refers to this as "pipelining" (https://tools.ietf.org/html/rfc7766#section-6.2.1.1) and states that clients SHOULD do this, and that servers MUST expect to receive pipelined queries and SHOULD respond to all of them. (Technically pipelining can be separate DNS queries, one per TCP packet, but both using the same TCP connection - but the clear intention of pipelining is to improve TCP performance, and putting both DNS queries into a single TCP packet is clearly more performant than using separate TCP packets).
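To make the wire format concrete, here is a sketch of two pipelined queries as they might appear in one TCP write. It assumes the standard DNS-over-TCP framing (each message prefixed by its 2-byte big-endian length); the helper names, message IDs and the example.com name are placeholders, and this is not glibc's actual code.

```python
import struct

TYPE_A, TYPE_AAAA = 1, 28


def build_query(msg_id: int, qname: str, qtype: int) -> bytes:
    """Build a minimal DNS query: one question, recursion desired."""
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, 0)
    name = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00"
    return header + name + struct.pack("!HH", qtype, 1)  # class IN


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte big-endian length."""
    return struct.pack("!H", len(message)) + message


# Both queries back to back, as a single buffer for one send() call.
payload = (
    frame_for_tcp(build_query(1, "example.com", TYPE_A))
    + frame_for_tcp(build_query(2, "example.com", TYPE_AAAA))
)
print(len(payload), payload.hex())
```

A server that reads only the first length-prefixed message from this buffer leaves the AAAA query unread in its socket.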

Unfortunately, systemd's local stub resolver has only very basic support for TCP DNS, and it handles TCP DNS queries almost identically to UDP DNS queries - it reads the DNS query 2-byte header (containing the length of the query data), reads in the single DNS query data, performs lookup and sends a response to that DNS query, and closes the TCP connection. It does not check for "pipelined" queries in the TCP connection.
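By contrast, a reader that supports pipelining keeps consuming length-prefixed messages until the buffer is exhausted. The following is only a sketch of that loop over an in-memory buffer; it is not systemd's implementation, and the function name is hypothetical.

```python
import struct
from typing import List, Tuple


def split_tcp_stream(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Split a TCP byte stream into complete DNS messages plus leftover bytes."""
    messages = []
    offset = 0
    while len(buffer) - offset >= 2:
        (length,) = struct.unpack_from("!H", buffer, offset)
        if len(buffer) - offset - 2 < length:
            break  # this message has not fully arrived yet
        messages.append(buffer[offset + 2 : offset + 2 + length])
        offset += 2 + length
    return messages, buffer[offset:]


# Two complete framed messages followed by the start of a third.
stream = b"\x00\x03abc" + b"\x00\x02de" + b"\x00\x05fg"
messages, leftover = split_tcp_stream(stream)
print(messages)
print(leftover)
```

The leftover bytes would be kept and retried once more data arrives on the connection, rather than the connection being closed after the first message.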

That would be bad enough, as glibc is (rightly) expecting a response to both its A and AAAA queries; however, what glibc actually gets is a TCP connection-reset error. That is because the local systemd stub resolver closes its TCP socket while input data is still pending (i.e. it never even reads the second pipelined DNS query). When a TCP socket is closed with unread input bytes, the kernel sends a TCP RST to the peer (i.e. glibc); when the peer's socket receives the RST, the kernel discards all data in that socket's buffer and passes the ECONNRESET error up to the application. So glibc gets nothing besides a connection reset error.

Note also that even if the systemd local stub resolver flushed its socket's input buffer before closing the TCP connection (which would avoid the TCP RST), glibc would still expect responses to both its A and AAAA queries before systemd closes the connection. A simple change to systemd to flush the input buffer is therefore not enough to fix the bug, since glibc would never get the AAAA response.

[Test Case]

This can be reproduced on any system using a local systemd stub resolver, when using an application that uses getaddrinfo() - such as ssh, telnet, ping, etc - or with a simple C program that uses getaddrinfo(). The DNS name looked up must have enough A records to overflow the 512 byte maximum for a UDP DNS packet; e.g.:

$ ping testing.irongiantdesign.com
ping: testing.irongiantdesign.com: Temporary failure in name resolution
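If no C compiler is handy, a short Python script should exercise the same path, since CPython's socket.getaddrinfo() calls the C library's getaddrinfo(). This is a sketch; the resolve() helper is made up for illustration, and the numeric address below is only there so the example runs without any DNS traffic.

```python
import socket


def resolve(host):
    """Return the sorted, de-duplicated addresses getaddrinfo() reports for host."""
    try:
        infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    except socket.gaierror as err:
        # On an affected system the lookup failure surfaces here.
        print(f"{host}: {err}")
        return []
    return sorted({info[4][0] for info in infos})


# A numeric address never needs a DNS query, so this always succeeds.
print(resolve("127.0.0.1"))
```

Calling resolve() with a name that has a large A record set, such as the one in the ping example, would be expected to print the resolution error and return an empty list on an affected system.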

Alternately, and trivially, glibc can be forced to always use TCP DNS queries by editing the /etc/resolv.conf file and adding:
options use-vc

With that option, glibc will fail to look up 100% of DNS names, since all lookups will use TCP to talk to the local systemd stub resolver, which as explained above fails to ever correctly answer glibc's pipelined TCP DNS queries.

Note that in default Ubuntu installs, /etc/resolv.conf is a symlink to ../run/systemd/resolve/stub-resolv.conf, which systemd thinks it owns 100% - so any manual changes to the file may be overwritten at any time. There is no way (that I can find) to tell systemd to add any resolv.conf options (like 'use-vc') to its managed stub-resolv.conf file, so this test case requires re-editing the /etc/resolv.conf file intermittently, each time systemd overwrites it.

Note also that the patch used to work around this (see Other Info below) will fix the case of lookup failures for names with very large A record sets; but the workaround will not help at all with the test case of using 'options use-vc'. That test case will continue to fail for 100% of DNS lookups.

[Regression Potential]

To work around this, the patch enables edns0 in systemd's stub resolver resolv.conf file. This could cause problems for any system code that does not expect the resolv.conf file to include a new line/option, or could introduce problems with edns0 lookups, since glibc was not previously using edns0.

[Other Info]

This bug exists upstream, with proposed patches to add dns tcp pipeline support:
https://github.com/systemd/systemd/pull/11512

The specific bug of TCP DNS fallback not working for DNS responses larger than 512 bytes can be worked around by editing the /etc/resolv.conf file to add:
options edns0

The edns0 option causes glibc to send its UDP queries with EDNS0, which allows a larger maximum packet size than the default 512 byte UDP DNS limit, so responses like the one above can fit in a single UDP packet without falling back to TCP. The systemd stub resolver does support EDNS0. However, this workaround only works temporarily - as explained above, by default /etc/resolv.conf is a symlink to a file that systemd overwrites intermittently, which will remove the edns0 option.
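Because the option can silently disappear when the file is rewritten, it may help to check for it programmatically. A minimal sketch, assuming the resolv.conf(5) syntax where comment lines start with '#' or ';' in the first column; the function name and the sample file contents are illustrative only.

```python
def resolv_conf_options(text):
    """Collect the options named on 'options' lines of resolv.conf text."""
    options = set()
    for line in text.splitlines():
        if line.startswith(("#", ";")):
            continue  # comment line
        fields = line.split()
        if fields and fields[0] == "options":
            options.update(fields[1:])
    return options


# Illustrative contents, not a verbatim copy of any generated file.
sample = (
    "# example resolv.conf\n"
    "nameserver 127.0.0.53\n"
    "options edns0\n"
    "search example.com\n"
)
print("edns0" in resolv_conf_options(sample))
```

On a real system the text would come from reading /etc/resolv.conf instead of the sample string.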

The upstream patch that will be used to work around this bug in exactly that way (i.e. adding option edns0 to resolv.conf) is:
https://github.com/systemd/systemd/commit/93158c77bc69fde7cf5cff733617631c1e566fe8

That patch is already included in Debian, so no Debian bug is required for this bug (since the only fix for this specific bug will be SRUing the edns0 workaround).

Since Xenial and Trusty do not use the systemd stub resolver (by default) I marked this Invalid for those releases.

Dan Streetman (ddstreet)
Changed in systemd (Ubuntu Trusty):
status: New → In Progress
Changed in systemd (Ubuntu Xenial):
status: New → In Progress
Changed in systemd (Ubuntu Bionic):
status: New → In Progress
Changed in systemd (Ubuntu Cosmic):
status: New → In Progress
Changed in systemd (Ubuntu Disco):
status: New → In Progress
Changed in systemd (Ubuntu Trusty):
assignee: nobody → Dan Streetman (ddstreet)
Changed in systemd (Ubuntu Xenial):
assignee: nobody → Dan Streetman (ddstreet)
Changed in systemd (Ubuntu Bionic):
assignee: nobody → Dan Streetman (ddstreet)
Changed in systemd (Ubuntu Cosmic):
assignee: nobody → Dan Streetman (ddstreet)
Changed in systemd (Ubuntu Disco):
assignee: nobody → Dan Streetman (ddstreet)
importance: Undecided → High
Changed in systemd (Ubuntu Cosmic):
importance: Undecided → High
Changed in systemd (Ubuntu Bionic):
importance: Undecided → High
Changed in systemd (Ubuntu Xenial):
importance: Undecided → High
Changed in systemd (Ubuntu Trusty):
importance: Undecided → High
Changed in systemd:
status: Unknown → New
Brian Murray (brian-murray) wrote :

Adding "options edns0" to /etc/resolv.conf ended up resolving bug 1805027 for me.

Dimitri John Ledkov (xnox) wrote :

I am happy to add "options edns0" in the generated file by resolved.

But we also need to file this case upstream, and start implementing pipelined requests handling in resolved too.

Dimitri John Ledkov (xnox) wrote :
Dan Streetman (ddstreet) wrote :

> Also is this then just not a simple cherrypick of:
>
> https://github.com/systemd/systemd/commit/93158c77bc69fde7cf5cff733617631c1e566fe8

that's one way to work around it, although glibc is not necessarily the only thing that might do pipelined TCP dns lookups to the local stub resolver (though I have no examples of anything else that does). It certainly should fix/workaround this for Ubuntu installs using the default systemd-resolved setup and only having issues with getaddrinfo() failures.

I still plan to fix systemd's stub resolver to correctly respond to pipelined TCP dns queries.

Dan Streetman (ddstreet)
description: updated
Dan Streetman (ddstreet)
Changed in systemd (Ubuntu Xenial):
status: In Progress → Invalid
Changed in systemd (Ubuntu Trusty):
status: In Progress → Invalid
Changed in systemd (Ubuntu Xenial):
importance: High → Undecided
Changed in systemd (Ubuntu Trusty):
importance: High → Undecided
assignee: Dan Streetman (ddstreet) → nobody
Changed in systemd (Ubuntu Xenial):
assignee: Dan Streetman (ddstreet) → nobody
description: updated
description: updated
Changed in systemd (Ubuntu Disco):
status: In Progress → Fix Committed
assignee: Dan Streetman (ddstreet) → Dimitri John Ledkov (xnox)
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Dan, or anyone else affected,

Accepted systemd into cosmic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/239-7ubuntu10.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-cosmic to verification-done-cosmic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-cosmic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in systemd (Ubuntu Cosmic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-cosmic
Łukasz Zemczak (sil2100) wrote :

Hello Dan, or anyone else affected,

Accepted systemd into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/237-3ubuntu10.12 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in systemd (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Bryan Quigley (bryanquigley) wrote :

Cosmic Verified
ii systemd 239-7ubuntu10.7 amd64 system and service manager
ping testing.irongiantdesign.com
PING testing.irongiantdesign.com (253.0.0.15) 56(84) bytes of data.

tags: added: verification-done-cosmic
removed: verification-needed-cosmic
Bryan Quigley (bryanquigley) wrote :

Bionic verified
ii systemd 237-3ubuntu10.12 amd64 system and service manager

$ ping testing.irongiantdesign.com
PING testing.irongiantdesign.com (253.0.0.6) 56(84) bytes of data.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Dan Streetman (ddstreet) wrote :

autopkgtest regression failure analysis/justifications for bionic:

systemd/s390x - failing since last november.

gvfs/s390x - failing since 2017.

snapd/s390x - flaky test that fails more than 1/2 the time since forever.
snapd/ppc64el - same as s390x

for cosmic:

gvfs/s390x - almost always failed since forever.

other bionic and cosmic autopkgtest regressions look like flaky tests, or autopkgtest system failures (e.g. can't reach apt repository). i have retried them all - will analyze again if the retest fails.

Dan Streetman (ddstreet) wrote :

bionic regressions:

systemd on all archs have failed for months. tests should be ignored.

snapd on all archs have failed intermittently for very long time. tests are flaky and should be ignored.

remaining tests being retried:

linux-gcp-edge (system problem - oom while testing)

linux (flaky tests - intermittently fails for a long time)

linux-oracle (system problem - out of disk space while testing)

cosmic regressions:

hddemux on all archs started failing recently; the version in -proposed appears to be fixed, so the failure of this pkg can be ignored as it's not caused by this sru.

remaining tests being retried:

apt (flaky test - fails intermittently in the same way for a while)

linux (flaky tests - intermittently fails for a long time)

snapd/amd64 (flaky test, test watchdog has 1 second timeout, and timed out)

systemd (test output hard to read - seems to be timeout, likely overloaded test system)

Dan Streetman (ddstreet) wrote :

hddemux failure should be ignored; its autopkgtests are fixed in -proposed with bug 1814062

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 240-5ubuntu3

---------------
systemd (240-5ubuntu3) disco; urgency=medium

  * debian/tests: blacklist upstream test-24-unit-tests on ppc64le.
    Fails, not a regression as it's a new test case, which was never before
    executed on ppc64le.
    File: debian/tests/upstream
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=8062b9a2712c390010d2948eaf764a1b52e68715

 -- Dimitri John Ledkov <email address hidden> Sat, 02 Feb 2019 11:05:12 +0100

Changed in systemd (Ubuntu Disco):
status: Fix Committed → Fix Released
Dan Streetman (ddstreet) wrote :

All remaining bionic and cosmic autopkgtest regression failures should be ignored.

bionic regressions:

systemd on all archs have failed for months, ignore

linux-gcp-edge fails due to timeout in test while rebuilding; ignore

linux has flaky tests - intermittently fails for a long time, ignore

linux-oracle fails due to out of disk space while rebuilding; ignore

gvfs/s390x has always failed, ignore

cosmic regressions:

gvfs/s390x has always failed, ignore

systemd has failed intermittently for months; ignore

hddemux fails due to bug 1814062, ignore

linux has flaky tests - intermittently fails for a long time, ignore

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 237-3ubuntu10.12

---------------
systemd (237-3ubuntu10.12) bionic; urgency=medium

  * d/p/resolve-enable-EDNS0-towards-the-127.0.0.53-stub-res.patch
    getaddrinfo() failures when fallback to dns tcp queries, so enable
    edns0 in resolv.conf (LP: #1811471)

  [ Victor Tapia ]
  * d/p/resolved-Increase-size-of-TCP-stub-replies.patch
    dns failures with edns0 disabled and truncated response (LP: #1804487)

 -- Dan Streetman <email address hidden> Tue, 29 Jan 2019 14:26:48 -0500

Changed in systemd (Ubuntu Bionic):
status: Fix Committed → Fix Released
Adam Conrad (adconrad) wrote : Update Released

The verification of the Stable Release Update for systemd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 239-7ubuntu10.7

---------------
systemd (239-7ubuntu10.7) cosmic; urgency=medium

  * d/p/resolve-enable-EDNS0-towards-the-127.0.0.53-stub-res.patch
    getaddrinfo() failures when fallback to dns tcp queries, so enable
    edns0 in resolv.conf (LP: #1811471)

  [ Victor Tapia ]
  * d/p/resolved-Increase-size-of-TCP-stub-replies.patch
    dns failures with edns0 disabled and truncated response (LP: #1804487)

 -- Dan Streetman <email address hidden> Tue, 29 Jan 2019 14:19:39 -0500

Changed in systemd (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Steve Roberts (drgrumpy) wrote :

It seems this breaks DNS lookups on some systems, see #1817903

Changed in systemd:
status: New → Fix Released
Dan Streetman (ddstreet) wrote :

The fix (workaround) for this bug in bionic and cosmic was to add 'options edns0' to the /etc/resolv.conf file via the systemd stub-resolv.conf file.

However, when the resolvconf package is installed, due to bug 1817903, the 'options edns0' is stripped out of the /etc/resolv.conf file.

This means anyone on bionic or cosmic that has the resolvconf package installed will not have 'options edns0' in their /etc/resolv.conf file, and will again experience this bug.

In disco, systemd-resolved has DNS TCP pipelining correctly implemented, so this bug will not affect disco, regardless of whether edns0 is specified in /etc/resolv.conf.

Mathew Hodson (mhodson)
no longer affects: systemd (Ubuntu Trusty)
no longer affects: systemd (Ubuntu Xenial)
Steve Langasek (vorlon) wrote :

The conclusion of a very long IRC discussion about how to fix this is that we should change the resolvconf package in the presence of resolved to emit only 127.0.0.53 into /etc/resolv.conf, and redirect all other servers to resolved.

Steve Langasek (vorlon)
Changed in resolvconf (Ubuntu Bionic):
status: New → Triaged
Changed in resolvconf (Ubuntu Cosmic):
status: New → Triaged
Changed in resolvconf (Ubuntu Disco):
status: New → Triaged
Changed in resolvconf (Ubuntu):
status: New → Triaged
tags: added: id-5cde5f8331588344774efccb
Marin Nedea (marin-n) wrote :

Before trying to handle this as a BUG (and I know the behavior points to a bug) please have a look at https://github.com/Azure/WALinuxAgent/issues/1673

Steve Langasek (vorlon)
Changed in resolvconf (Ubuntu Disco):
status: Triaged → Won't Fix