Reliable network connectivity for apt-daily

Bug #1699850 reported by Julian Andres Klode
This bug affects 1 person
Affects        Status         Importance   Assigned to           Milestone
systemd        New            Unknown
apt (Ubuntu)   Fix Released   High         Julian Andres Klode

Bug Description

[Impact]

apt-daily.service is launched by a timer that depends on network-online.target (once the fixes for bug 1686470 have landed everywhere).

At boot, that is mostly sufficient to ensure the network is online, but it does not seem to work every time, and we may disagree with network-manager and friends about what "online" means.

At resume time, network-online.target is still active, so the service is started as soon as possible when it tries to catch up. Depending on the timing, the network connectivity might not be there yet, and it will fail and only retry 12 hours later.

[Proposed solution]
Introduce a new apt-helper wait-online command that waits for the machine to be online, using both the network-manager and systemd-networkd helpers. If the respective service is active, we use its online-wait helper to wait for it to signal onlineness. Once all helpers have reported onlineness, we continue.
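
The waiting logic described above can be sketched roughly as follows. This is a minimal Python illustration, not apt's actual implementation: the unit names, the helper command lines, the 30-second timeouts, and the names wait_online and run_command are all assumptions made for the example.

```python
import subprocess

# Assumed mapping from service unit to its "wait until online" helper.
# The unit names, paths, and 30-second timeouts are illustrative guesses.
WAIT_HELPERS = {
    "NetworkManager.service": ["nm-online", "-q", "--timeout", "30"],
    "systemd-networkd.service": [
        "/lib/systemd/systemd-networkd-wait-online", "--timeout=30",
    ],
}


def run_command(argv):
    """Run argv and return its exit status (127 if it cannot be started)."""
    try:
        return subprocess.run(argv, check=False).returncode
    except OSError:
        return 127


def wait_online(run=run_command, helpers=WAIT_HELPERS):
    """Return True if every active service's helper reported online.

    The result is vacuously True when none of the services is active.
    """
    all_online = True
    for unit, helper_argv in helpers.items():
        # Skip services that are not running; their helper would only time out.
        if run(["systemctl", "-q", "is-active", unit]) != 0:
            continue
        if run(helper_argv) != 0:
            all_online = False
    return all_online
```

The run parameter exists only so the logic can be exercised without touching the real services, for example by passing a function that returns canned exit statuses.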

[Original proposal, to be done later]
The helper tries to connect() to the remote hosts specified in sources.list until one connection succeeds or a TIMEOUT is reached. The proposed algorithm looks something like this:

while (time elapsed < TIMEOUT):
  for each entry:
    host = getaddrinfo()
    if host failed:
      continue
    fd = connect to it
    if fd is invalid:
      continue

    all fds += fd

    if poll(all fds, 100 ms timeout) finds a connected one:
      exit(0)

exit(42) # timeout

There are two things to consider:
* getaddrinfo() and connect() may fail if the network is not up yet, so we need to retry (we might need to sleep somewhere).
* If poll() finds nothing, we have likely slept long enough already (up to 100 ms), so no extra sleep is needed.

I believe the timeout should be something like 30s.
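
For illustration, the loop above could look roughly like this in Python. It is a sketch under assumptions, not the proposed apt-helper code: the function name wait_for_connection, the (host, port) entry format, and the padding sleep at the end of each round are choices made for the example. Note that socket.getaddrinfo() blocks while resolving, so the overall timeout is only approximate.

```python
import errno
import select
import socket
import time


def wait_for_connection(entries, timeout=30.0, poll_ms=100):
    """Return True as soon as one (host, port) entry accepts a TCP connection."""
    deadline = time.monotonic() + timeout
    poller = select.poll()
    pending = {}  # fd -> (socket, entry) for connection attempts in flight
    try:
        while time.monotonic() < deadline:
            round_start = time.monotonic()
            in_flight = {entry for _sock, entry in pending.values()}
            for entry in entries:
                if entry in in_flight:
                    continue  # an attempt for this entry is still pending
                try:
                    info = socket.getaddrinfo(*entry, type=socket.SOCK_STREAM)[0]
                except socket.gaierror:
                    continue  # resolver failed, e.g. network not up yet
                family, socktype, proto, _canonname, address = info
                sock = socket.socket(family, socktype, proto)
                sock.setblocking(False)
                if sock.connect_ex(address) not in (0, errno.EINPROGRESS):
                    sock.close()  # immediate failure, retry next round
                    continue
                pending[sock.fileno()] = (sock, entry)
                poller.register(sock, select.POLLOUT)

            # Wait up to poll_ms for any attempt to finish, then check it.
            for fd, _events in poller.poll(poll_ms):
                sock, _entry = pending.pop(fd)
                poller.unregister(fd)
                error = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
                sock.close()
                if error == 0:
                    return True

            # Failed attempts return from poll() at once; pad the round so
            # the loop does not spin.
            remaining = poll_ms / 1000 - (time.monotonic() - round_start)
            if remaining > 0:
                time.sleep(remaining)
        return False
    finally:
        for sock, _entry in pending.values():
            sock.close()
```

A caller acting as the pre-start helper could map the result to the exit statuses used in the pseudocode: exit 0 when the function returns True, exit 42 when it returns False.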

On the systemd service side, we add the following to retry the service after 15 minutes:
  ExecStartPre=/usr/lib/apt/apt-helper wait-online
  RestartForceExitStatus=42
  RestartSec=15m

[Test case]
* Start apt-daily.service after turning off network -> It should wait (in ExecStartPre)
* Turn on network -> apt-daily.service should start

[Regression potential]
There might be increased I/O activity after resume, if that did not work before. The helper is launched via an ExecStartPre= line, and its failures are ignored because of the "-" prefix. systemd automatically kills all remaining ExecStartPre processes when the main ExecStartPre process exits, so there is no chance of ending up with a child process still running.

Changed in apt (Ubuntu):
assignee: nobody → Julian Andres Klode (juliank)
description: updated
Changed in apt (Ubuntu):
status: New → Triaged
importance: Undecided → High
description: updated
description: updated
Julian Andres Klode (juliank) wrote :

This sort of depends on https://github.com/systemd/systemd/issues/2582, as we apparently can't restart oneshot units. In the meantime, maybe we could pull the service in on resume and use a long timeout for the helper (like an hour or so), with it retrying until the timeout is reached.

Ideas welcome.

Julian Andres Klode (juliank) wrote :

Also note that getaddrinfo() is blocking, so that's not really going to work.

description: updated
description: updated
Julian Andres Klode (juliank) wrote :

For users running docker, nm-online will always report onlineness, but this still catches everyone else until we have time to work on the more complicated solution :)

Changed in apt (Ubuntu):
status: Triaged → In Progress
Changed in apt (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apt - 1.5~rc2

---------------
apt (1.5~rc2) unstable; urgency=medium

  [ Julian Andres Klode ]
  * Actually install apt_auth.conf manual page (Closes: #873934)
  * test: Workaround gpgv warning
  * apt-daily: Wait for network before daily updates.
    Introduce a new helper, apt-helper wait-online that uses
    NetworkManager and/or systemd-networkd to wait for them
    reporting online, with a time out of 30 seconds; and run
    that helper before running the daily update script. (LP: #1699850)
  * apt-daily: Pull in network-online.target in service, not timer
  * Do not warn about duplicate "legacy" targets (Closes: #839259)
    (LP: #1697120)
  * cdrom: Don't hardcode "Files" field for copying source files
  * ftparchive: Do not pass through disabled hashes in Sources (Closes: #872963)
  * Directly link against libudev on Linux systems - this does not affect
    public API and ABI, but protected pkgUdevCdromDevices function pointers
    were renamed and are now always NULL, even if Dlopen returns true.

  [ Christos Trochalakis ]
  * doc: correct '--allow-releaseinfo-change-*' typos (Closes: #873914)

  [ Frans Spiesschaert ]
  * Dutch program translation update (Closes: #874285)
  * Dutch manpage translation update (Closes: #874293)

  [ David Kalnischkies ]
  * don't write & chmod /dev/null log files
  * don't ask an uninit _system for supported archs (LP: #1613184)

 -- Julian Andres Klode <email address hidden> Sat, 09 Sep 2017 21:47:14 +0200

Changed in apt (Ubuntu):
status: Fix Committed → Fix Released
Changed in systemd:
status: Unknown → New