Nodeinfo returns wrong NUMA topology / bad virtualization performance

Bug #1446177 reported by FliTTi
This bug affects 3 people
Affects: libvirt (Ubuntu)
Status: Incomplete
Importance: Medium
Assigned to: Stefan Bader

Bug Description

1)
Description: Ubuntu 14.04.2 LTS
Release: 14.04

2)
libvirt-bin:
  Installed: 1.2.2-0ubuntu13.1.10
  Candidate: 1.2.2-0ubuntu13.1.10
  Version table:
 *** 1.2.2-0ubuntu13.1.10 0
        500 http://archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1.2.2-0ubuntu13.1.7 0
        500 http://security.ubuntu.com/ubuntu/ trusty-security/main amd64 Packages
     1.2.2-0ubuntu13 0
        500 http://archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

4+5)
"virsh nodeinfo" shows wrong NUMA topology:

:~$ virsh nodeinfo
CPU model: x86_64
CPU(s): 64
CPU frequency: 1400 MHz
CPU socket(s): 1
Core(s) per socket: 64
Thread(s) per core: 1
NUMA cell(s): 1
Memory size: 528376144 KiB

but it should be "NUMA cell(s): 8".

I'm running kernel 3.13.0-48-generic on a Supermicro H8QG6-F board equipped with _four_ physical AMD Opteron 6282 SE processors. libvirt-bin version: 1.2.2-0ubuntu13.1.10

numactl gives me eight cells:
:~$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64431 MB
node 0 free: 58505 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 64510 MB
node 1 free: 45068 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 64510 MB
node 2 free: 62910 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 64510 MB
node 3 free: 20460 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 64510 MB
node 4 free: 63512 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 64510 MB
node 5 free: 52723 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 64510 MB
node 6 free: 64084 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 64494 MB
node 7 free: 62013 MB
node distances:
node 0 1 2 3 4 5 6 7
  0: 10 16 16 22 16 22 16 22
  1: 16 10 22 16 22 16 22 16
  2: 16 22 10 16 16 22 16 22
  3: 22 16 16 10 22 16 22 16
  4: 16 22 16 22 10 16 16 22
  5: 22 16 22 16 16 10 22 16
  6: 16 22 16 22 16 22 10 16
  7: 22 16 22 16 22 16 16 10

lstopo likewise reports eight NUMA cells (see attached picture).

But "virsh nodeinfo" shows only one NUMA cell, one socket and 64 CPUs (single threaded). Performance of virtualized machines on this host is very bad.

I suspect it has something to do with the core_ids being identical across physical sockets and modules.

:~$ cat /proc/cpuinfo | egrep "processor|physical id|siblings|core id|cpu
see appendix
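
(For reference, the same pairing can be read straight from sysfs; a rough sketch, assuming the standard topology paths:)

for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "${c##*/}: pkg=$(cat $c/topology/physical_package_id) core=$(cat $c/topology/core_id)"
done | sort -V
# if core_ids repeat across packages/modules, that matches the suspicion above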

Many Thanks!

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: libvirt-bin 1.2.2-0ubuntu13.1.10
ProcVersionSignature: Ubuntu 3.13.0-48.80-generic 3.13.11-ckt16
Uname: Linux 3.13.0-48-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.10
Architecture: amd64
Date: Mon Apr 20 13:04:31 2015
InstallationDate: Installed on 2012-12-07 (864 days ago)
InstallationMedia: Ubuntu-Server 12.04.1 LTS "Precise Pangolin" - Release amd64 (20120817.3)

SourcePackage: libvirt

Revision history for this message
FliTTi (m-flittner) wrote :

Supplement: It seems that "virsh capabilities" gets it right (see attachment). Eight NUMA cells...

Virtual machine config created with virt-manager 0.9.5-1ubuntu3

Crazy. Thank you for your guidance.

Note: capabilities detects an Opteron_G4 socket - but that board has Opteron G34 sockets.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Assigning to Stefan only to reproduce+confirm.

Changed in libvirt (Ubuntu):
assignee: nobody → Stefan Bader (smb)
Revision history for this message
Stefan Bader (smb) wrote :

I cannot confirm, but then there are many differences:

Kernel: 3.13.0-52-generic #85-Ubuntu
libvirt: 1.2.2-0ubuntu13.1.10
CPU: AMD Opteron(tm) Processor 6128
Board: Supermicro H8SGL-F

virsh nodeinfo
CPU model: x86_64
CPU(s): 8
CPU frequency: 800 MHz
CPU socket(s): 1
Core(s) per socket: 4
Thread(s) per core: 1
NUMA cell(s): 2
Memory size: 32881044 KiB

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 15983 MB
node 0 free: 15727 MB
node 1 cpus: 4 5 6 7
node 1 size: 16126 MB
node 1 free: 15726 MB
node distances:
node 0 1
  0: 10 20
  1: 20 10

I wonder whether the wrong info persists across a restart of libvirtd (service libvirt-bin restart)...

Revision history for this message
FliTTi (m-flittner) wrote :

Same behavior after
sudo stop libvirt-bin
sudo start libvirt-bin

:~$ virsh nodeinfo
CPU model: x86_64
CPU(s): 64
CPU frequency: 2600 MHz
CPU socket(s): 1
Core(s) per socket: 64
Thread(s) per core: 1
NUMA cell(s): 1
Memory size: 528376144 KiB

:~$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64431 MB
node 0 free: 30682 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 64510 MB
node 1 free: 14047 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 64510 MB
node 2 free: 32492 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 64510 MB
node 3 free: 34 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 64510 MB
node 4 free: 29763 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 64510 MB
node 5 free: 13510 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 64510 MB
node 6 free: 29585 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 64494 MB
node 7 free: 33997 MB
node distances:
node 0 1 2 3 4 5 6 7
  0: 10 16 16 22 16 22 16 22
  1: 16 10 22 16 22 16 22 16
  2: 16 22 10 16 16 22 16 22
  3: 22 16 16 10 22 16 22 16
  4: 16 22 16 22 10 16 16 22
  5: 22 16 22 16 16 10 22 16
  6: 16 22 16 22 16 22 10 16
  7: 22 16 22 16 22 16 16 10

Any ideas?

Revision history for this message
Stefan Bader (smb) wrote :

If I read the libvirt code right, then nodeinfo gets its data from parsing /sys/devices/system/node/... (potentially falling back to /sys/devices/system/cpu/...). I do not see any debug output added right away, but there should be some error messages if libvirt thinks something went wrong. Probably related to the linuxNodeInfoCPUPopulate or virNodeParseNode functions.

So I would check whether sysfs looks consistent (I guess it will, as numactl finds the right values), then check /var/log/libvirt/libvirtd.log for hints, and possibly change log_level to 1 in /etc/libvirt/libvirtd.conf (though this produces a lot of log output).
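
(For illustration, those checks from a shell; just a sketch, using the paths named above:)

ls /sys/devices/system/node/                        # expect node0..node7 to show up
grep . /sys/devices/system/node/node*/cpulist       # which cpus each node claims
grep -i numa /var/log/libvirt/libvirtd.log | tail   # any parsing complaints?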

Revision history for this message
Laz Peterson (laz-v) wrote :

I have this issue as well. It has persisted for many Ubuntu versions and has always made it extremely difficult to deal with NUMA configuration. Oddly enough, after testing Ubuntu 15.04 last weekend, it seems the issue is gone.

All of the servers that we have affected by this have Supermicro motherboards. (We only have Supermicro, so I can't tell you otherwise.)

Disabling NUMA in the BIOS shows the right socket information; however, all of the cores are then listed in node 0 instead of in their separate nodes. (Of course.)

Re-enabling NUMA in the BIOS shows wrong socket information, but with the cores properly split between the right nodes.

Might be something directly related to the architecture of Supermicro motherboards. Or can we possibly get confirmation that this happens on another manufacturer's board?

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Revision history for this message
Laz Peterson (laz-v) wrote :

I might add, I am using Intel processors, not AMD.

Revision history for this message
Laz Peterson (laz-v) wrote :

Number of cells seems right, but number of sockets is definitely wrong.

OS: Ubuntu 14.04.2 LTS
Kernel: 3.16.0-38-generic
Latest versions of all related packages as of May 26, 2015.

root@vm0:/media/scripts/vm# virsh capabilities
<capabilities>

  <host>
    <uuid>00000000-0000-0000-0000-0cc47a4c5e42</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>SandyBridge</model>
      <vendor>Intel</vendor>
      <topology sockets='1' cores='12' threads='2'/>
      <feature name='invpcid'/>
      <feature name='erms'/>
      <feature name='bmi2'/>
      <feature name='smep'/>
      <feature name='avx2'/>
      <feature name='bmi1'/>
      <feature name='fsgsbase'/>
      <feature name='abm'/>
      <feature name='pdpe1gb'/>
      <feature name='rdrand'/>
      <feature name='f16c'/>
      <feature name='osxsave'/>
      <feature name='movbe'/>
      <feature name='dca'/>
      <feature name='pcid'/>
      <feature name='pdcm'/>
      <feature name='xtpr'/>
      <feature name='fma'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='smx'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='dtes64'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <feature name='vme'/>
    </cpu>
    <power_management>
      <suspend_disk/>
      <suspend_hybrid/>
    </power_management>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>131928440</memory>
          <cpus num='24'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0,24'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1,25'/>
            <cpu id='2' socket_id='0' core_id='2' siblings='2,26'/>
            <cpu id='3' socket_id='0' core_id='3' siblings='3,27'/>
            <cpu id='4' socket_id='0' core_id='4' siblings='4,28'/>
            <cpu id='5' socket_id='0' core_id='5' siblings='5,29'/>
            <cpu id='6' socket_id='0' core_id='8' siblings='6,30'/>
            <cpu id='7' socket_id='0' core_id='9' siblings='7,31'/>
            <cpu id='8' socket_id='0' core_id='10' siblings='8,32'/>
            <cpu id='9' socket_id='0' core_id='11' siblings='9,33'/>
            <cpu id='10' socket_id='0' core_id='12' siblings='10,34'/>
            <cpu id='11' socket_id='0' core_id='13' siblings='11,35'/>
            <cpu id='24' socket_id='0' core_id='0' siblings='0,24'/>
            <cpu id='25' socket_id='0' core_id='1' siblings='1,25'/>
            <cpu id='26' socket_id='0' core_id='2' siblings='2,26'/>
            <cpu id='27' socket_id='0' core_id='3' siblings='3,27'/>
            <cpu id='28' socket_id='0' core_id='4' siblings='4,28'/>
            <cpu id='29' socket_id='0' core_id='5' siblings='5,29'/>
            <cpu id='30' socket_id='0' core_id='8' siblings='6,30'/>
            <cpu id='31' socket_id='0' core_id='9' siblings='7,31'/>
            <cpu id='32' socket_...

Revision history for this message
Stefan Bader (smb) wrote :

Laz, could you post here the output of

(sudo grep -r . /sys/devices/system/node/; sudo grep -r . /sys/devices/system/cpu) >sysfscpu.txt

so I can compare it against the libvirt output. I suspect there is already a problem there (which may point back to some BIOS table problem) and libvirt just picks up the wrong data from those entries.

Revision history for this message
FliTTi (m-flittner) wrote :

Output from my side (without sudo)

Revision history for this message
Stefan Bader (smb) wrote :

Weird. The initial scan should be able to open /sys/devices/system/node and then iterate over node0..7. Inside /sys/devices/system/node/node0/, for example, there should be links for each cpu associated with that node (the output unfortunately does not show them, as those are symlinks). So in the case of node0 there should be cpu0..7 pointing to ../../cpu/cpu0..7. Can you confirm those are there?
The only cases that would fall back to hard-setting the node count to 1 seem to be:
- not being able to read /sys/devices/system/node
- failing to find /sys/devices/system/node/node<num>/cpu<num> entries
- failing to read from /sys/devices/system/node/node<num>/cpu<num>/topology/physical_package_id
  (sockets = max(physical_package_id)+1 -> should be 4 according to sysfs data)
- failing to allocate the socket bitmap which seems unlikely
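
(A rough shell equivalent of those checks, assuming the standard sysfs layout:)

ls -d /sys/devices/system/node/node[0-9]*              # node directories readable?
ls /sys/devices/system/node/node0 | grep '^cpu[0-9]'   # per-node cpu links present?
cat /sys/devices/system/cpu/cpu*/topology/physical_package_id | sort -un
# sockets = highest physical_package_id printed + 1 (here expected: 3 + 1 = 4)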

And there is really nothing related to sysfs parsing in /var/log/libvirt/libvirtd.log?

Revision history for this message
Laz Peterson (laz-v) wrote :

Hmm, I am having issues uploading files, as well as downloading FliTTi's to view. Seems to be an issue with launchpadlibrarian.net?

I can post as plain text if you like. Or I will try uploading sysfscpu.txt later on.

Revision history for this message
Laz Peterson (laz-v) wrote :

Here we go.

Revision history for this message
Laz Peterson (laz-v) wrote :

We are moving our equipment to the datacenter tomorrow, so I won't have much more input until after that. But all of the links etc. are where they are expected to be. Not sure if the libvirt user can read those, but I'm sure it can, since my standard user can.

Running 'find . -name physical_package_id -exec cat {} \;' from /sys/devices/system shows 0s and 1s. So according to your formula we should definitely be seeing a "2" for my sockets, yet it shows only 1. According to the other data, that 1 does not come from the "max" value; it is simply put in as a static number.

I am not sure what the socket bitmap is, or anything else about it :-) or I might have had something to comment on that.

Also, one thing to keep in mind: all of this has been fixed as of Ubuntu 15.04. So something in the pipeline between all previous versions and that release has allowed this to start working. This is possibly kernel-related, or even in another package that libvirt depends on.

Revision history for this message
Laz Peterson (laz-v) wrote :

Stefan if you would like to poke around, I have a server we are taking out of production (also Supermicro) that has this issue as well. I can provide you access at that time. Possibly 1 week from now.

Revision history for this message
Stefan Bader (smb) wrote :

Some of the comments I make serve more to help my own memory while jumping on and off the issue. ;) Basically the nodeinfo looks as if libvirt concluded that there is no node subtree in sysfs. Then it sets the number of nodes to 1 and directly scans the cpu subtree. I did a quick scan of the upstream git tree of libvirt and did not find anything obvious. I may have missed things, or maybe it is the kernel that changed for the better.
What I have in mind is to make a special version of libvirt with a lot of logging around the code that obtains the nodeinfo. If this can be tried on a non-production server, I think it will be less painful for all, if it is ok for everyone to wait that long. I would go and prepare the test package(s) (Trusty/14.04 version) and post a link to them here. The debugging would be at error level, so the log level of libvirtd can be kept at errors (avoiding filling up the log with other stuff).

Revision history for this message
Laz Peterson (laz-v) wrote :

I would be more than happy to test for you Stefan. As long as it is cookie cutter for a non-guru like myself, you just tell me what to do.

I will have a non-production server ready to rock in roughly a week from now. Thank you for all of your efforts!

Revision history for this message
Laz Peterson (laz-v) wrote :

Also, in the meantime (if my test server becomes available before the test packages), I can install upstream libvirt/qemu packages to see whether the fix came from there or from elsewhere.

Revision history for this message
Stefan Bader (smb) wrote :

Ok, the first round of packages is at http://people.canonical.com/~smb/lp1446177/. Download the two debs and install them with "dpkg -i *.deb". Then go through the following commands (to get a cleaner log):

sudo service libvirt-bin stop
sudo cp /dev/null /var/log/libvirt/libvirtd.log
sudo service libvirt-bin start

After that, please attach a copy of the /var/log/libvirt/libvirtd.log to the report. Thanks!

Revision history for this message
Stefan Bader (smb) wrote :

Marking this incomplete for the time being, until we get new info.

Changed in libvirt (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Laz Peterson (laz-v) wrote :

Yes, Stefan, I am in the datacenter right now getting the new servers online. In about a week or so, I will install your new binaries and post results.

Thank you!

Revision history for this message
Laz Peterson (laz-v) wrote :

Here we go Stefan. The log file is short and sweet -- hopefully it gets you the information you are looking for!

Revision history for this message
Laz Peterson (laz-v) wrote :

A much more comprehensive log to show the changes in the log as it goes. This might be better.

Revision history for this message
Stefan Bader (smb) wrote :

Hm, depending on the outcome of nodeinfo (if that is still wrong) this may be telling us "these are not the functions you are looking for"... :/ According to the log this is a 2-node machine (on two sockets) with 12 cores each. 12 sounds a bit like AMD. It may or may not be important, but I think the wrong info so far was on Supermicro Intel boards... So just to be sure: on the host for which you posted the logs, does nodeinfo show only 1 node?

Revision history for this message
Laz Peterson (laz-v) wrote :

Here is some more information attached.

Also, the CPU is actually an Intel 6-core with HT, so it appears as 12-core. I can disable HT and see what it reports. Also, I am running kernel 3.16.0-38-generic.

Back to another piece of interesting information: when I installed Ubuntu 15.04 (just for fun, to see if there was anything new that I might want to take advantage of), all of these functions worked just fine. We could possibly go down this route to find out which package or part of the system allows libvirt to find this information properly.

Without going fully to 15.04 (as that defeats the purpose of LTS), what would be a good package-by-package upstream path to try? I could start with the kernel and work my way up from there, maybe?

It is surprising to me that not many others have this type of issue. The only common factor I can tell (including other forum posts long since forgotten) is a Supermicro motherboard.

Revision history for this message
Laz Peterson (laz-v) wrote :

I have tried two things: upgrading to kernel 3.19.8, and disabling Hyper-Threading. Neither has any effect.

Attached is libvirtd.log; the initial part of the log is with kernel 3.19.8, which I tried first. The second part, which starts at 13:40:28, is with HT disabled.

Revision history for this message
Stefan Bader (smb) wrote :

I have to go back to the libvirt code tomorrow. If I remember correctly, it was using the physical id info (which is also seen in /proc/cpuinfo) as the number of sockets, at least in the 14.04 version of libvirt. Might be something to explicitly check in newer versions. Since you already tested a recent kernel, we can at least rule out that side. So it is very likely a change in libvirt.
Unfortunately I think it will not be as straightforward to try a newer libvirt, as that likely has more dependencies. When I look into more details tomorrow, I will try to build a 15.04 version of libvirt in a 15.04 build environment. The result should go only into a test environment, as it will never get updated (just a cautious warning).

Revision history for this message
Laz Peterson (laz-v) wrote :

Yes I have a dedicated test environment strictly for this issue. :-)

Would you like me to prepare a 15.04 default install ready for your updated images?

Much appreciate all of your help Stefan!

Revision history for this message
Stefan Bader (smb) wrote :

No, I would prefer if you stick to the 12.04 base (optionally remove the 3.19 kernel or just boot into the original 3.13) and install the special linbirt-1.2.12 from 15.04 for 14.04. I hope this works, since I had to drop systemd/cgroup-manager related changes that would not compile in the old environment. So it's only compile-tested right now. Fingers crossed, it should produce similar debug messages about parsing sysfs.

Revision history for this message
Stefan Bader (smb) wrote :

linbirt of course should read libvirt but those keys move silently around in the morning. :-P

Revision history for this message
Laz Peterson (laz-v) wrote :

I will do whatever you think is the best for diagnosing the problem.

So you would like me to go to 12.04 then, yes? Or keep at 14.04.

Revision history for this message
Stefan Bader (smb) wrote :

Darn, sorry. I meant 14.04/Trusty which we are at.

Revision history for this message
Laz Peterson (laz-v) wrote :

Ah, no prob. Then I will go back to kernel 3.13 and we will go from there. Thanks Stefan!

Revision history for this message
Laz Peterson (laz-v) wrote :

Ok back to 3.13.0-53-generic kernel, awaiting your command.

Revision history for this message
Stefan Bader (smb) wrote :

Download and try to install
http://people.canonical.com/~smb/lp1446177/libvirt-bin_1.2.12-0ubuntu0.14.04.13dbg1_amd64.deb and
http://people.canonical.com/~smb/lp1446177/libvirt0_1.2.12-0ubuntu0.14.04.13dbg1_amd64.deb

Hopefully these install fine, no longer show the issue, and produce a bit of the same info in /var/log/libvirt/libvirtd.log as the other debug packages did.

Revision history for this message
Laz Peterson (laz-v) wrote :

Here's the libvirtd.log. Has same issue with NUMA.

Revision history for this message
Stefan Bader (smb) wrote :

Sorry, I completely missed the response hitting my inbox. That this still shows the NUMA issues is a bit unexpected. So looking at the log I must have done something wrong with the RC printout of:

parse node returns (RC=-1) sockets=1 cores=6 threads=2

If that really was -1, it would not have iterated over the second node. The output is interesting: the content of sockets, cores, and threads seems to match the info you gave in comment #29. From that cpuinfo I assume the node0 directory contains links to cpu0-5 and cpu12-17, so the 6 cores with hyperthreading yield the 12 logical cpus on that node. Likewise node1 would have links for cpu6-11 and cpu18-23. Which all looks good and sensible. And from what is currently logged I cannot see how the nodeinfo would go wrong. With one exception, but that would cause a different wrong result of nodes=1, sockets=1, cores=24 and threads=1. This check is new in libvirt 1.2.12, so it would not explain the previous wrong count. It basically checks whether nodes*sockets*cores*threads is the same as cpus+offline (should be 2*1*6*2 == 24+0 from the log).
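
(Spelled out as a shell sketch, with the values taken from the log above:)

nodes=2 sockets=1 cores=6 threads=2 cpus=24 offline=0
if [ $((nodes * sockets * cores * threads)) -eq $((cpus + offline)) ]; then
  echo "consistent: 2*1*6*2 == 24+0"
else
  echo "mismatch -> nodeinfo falls back to nodes=1 sockets=1 cores=$cpus threads=1"
fi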

Can you let me know what nodeinfo actually printed?

Revision history for this message
Laz Peterson (laz-v) wrote :

Why hello Stefan, glad to know you are still with us! :-)

laz@dev-vm0:~$ virsh nodeinfo
CPU model: x86_64
CPU(s): 24
CPU frequency: 1500 MHz
CPU socket(s): 1
Core(s) per socket: 6
Thread(s) per core: 2
NUMA cell(s): 2
Memory size: 198069904 KiB

laz@dev-vm0:~$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1

root@dev-vm0:/proc/sys# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Stepping: 7
CPU MHz: 1200.000
BogoMIPS: 4002.28
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

Revision history for this message
Laz Peterson (laz-v) wrote :

As I mentioned before, we did buy new server hardware to host the VMs. This new hardware also has the same NUMA node issue.

I initially installed 14.04.2 on those new servers; then, when the NUMA issue appeared there too, I thought "what the hey" and installed 15.04 just for fun. The problem was gone -- NUMA nodes were automatically generated and maintained, though it put everything on node 0 and nothing on node 1 (not sure if that is by design?).

Anyhow, I noticed a lot of issues with the new migration engine that version uses, so I decided to downgrade back to 14.04 and just bite the bullet of setting my CPU configurations manually.

So whatever extra help you need from me, I am more than happy to spend time doing any tasks you like. Just let me know. Thanks again Stefan!

Revision history for this message
Stefan Bader (smb) wrote :

Oh I was not really gone. Just distracted... which is not helping either. :)

So I am a bit confused now, because in comment #40 you said it "has the same issues with NUMA". But the nodeinfo from comment #42 (if that is from the test machine running 14.04 + the test libvirt) looks ok. At least it has 2 NUMA nodes, as expected.

Revision history for this message
Laz Peterson (laz-v) wrote :

My apologies. I meant to say that only libvirt has the issue with detecting and properly using NUMA nodes. All other NUMA functions of the system work as expected.

Revision history for this message
Stefan Bader (smb) wrote :

Yes, but (and sorry for being pedantic) "virsh nodeinfo" is libvirt, and at least the start of this report was that nodeinfo would return only one node even when there was more than one. Whether it correctly makes use of that knowledge or not would be another issue.

Revision history for this message
Laz Peterson (laz-v) wrote :

Yes, so I guess that is the big confusing question: why does "virsh nodeinfo" show the right information, but libvirt doesn't/can't use it?

Revision history for this message
Stefan Bader (smb) wrote :

I did a bit more research on that topic, and the answer might be that this is not really supported. While one can add the following to the guest config

<numatune>
  <memory mode='strict' placement='auto'/>
</numatune>

this requires an external helper (numad) which is not available in Debian or Ubuntu; not sure about the background of that. Without it, the only way might be to spread guests manually by using nodeset="<node>" instead of the placement attribute (as "numatune <domain> --nodeset <node> --config" in virsh would do). And that is only memory. It appears one might also need to explicitly pin the VCPUs, e.g. <vcpu placement='static' cpuset='0-3'>4</vcpu>. A sketch of the combination follows below.
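
(A minimal sketch of that manual spreading; the domain name "guest1" and the node/cpu numbers are placeholders, not values from this report:)

virsh numatune guest1 --nodeset 1 --config   # restrict guest memory to node 1
# plus, in the guest XML, pin the VCPUs onto that node's cpus by hand, e.g.:
#   <vcpu placement='static' cpuset='8-15'>4</vcpu>
# both only take effect on the next start of the guest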

For the nodeinfo part, it does sound to me (from a comment before the additional sanity check in 1.2.12) that there is a chance it cannot represent all topologies and might be deliberately wrong. Not in the cases here, though, which show up correctly with the newer version of libvirt. Just fixing it might be pointless if that has no impact on automatic placement.

I have not finally made up my mind on how to proceed from here. Just wanted to give some quick feedback.

Revision history for this message
Laz Peterson (laz-v) wrote :

Thank you for the update Stefan. Yes, I tried to compile and run numad on Ubuntu but I had no luck there.

Also correct, I am explicitly pinning CPUs at this time. As far as I can tell, it is the only option.

As far as the big question mark in my head goes ... This function works as expected on Ubuntu 15.04. I have not tried 14.10. So something between then and now (something, somewhere !?) has changed to allow this.

Revision history for this message
Laz Peterson (laz-v) wrote :

Err, most importantly, my (selfish) opinion is that something of this magnitude should not just be fixed upstream, but also in a current Ubuntu LTS release. :-)

Revision history for this message
Stefan Bader (smb) wrote :

The problem now is that, because this went back and forth and sideways, at least I am now quite confused. :(

1. numa info returned by nodeinfo incorrect
   -> 15.04 ok, before that incorrect for some Supermicro boards

2. numa information not automatically used
   -> from my experiments and internet search this never was working, and is not
      working in the upcoming 15.10 either, as it would require numad to be
      available in the distro (at build time of libvirt and on the virt host).

For OpenStack nova [1] there was some work done for their Juno series. That possibly did not get into 14.04, but it also sounds independent of what standalone libvirt would or could do.

3. Manual tuning for numa appears to be possible and working even in 14.04. But
   since my nodeinfo is ok, I cannot say whether that really affects numa tuning.
   Some comments in the code sound like the info that matters is the capabilities
   one, which according to some comments here is ok even if nodeinfo is not.

So for memory: would numatune on a running but untuned guest return a range covering the correct available set of nodes? That could then be tuned (with --config) to be limited to a defined node. This only takes effect the next time the related qemu task is started (so it probably needs a shutdown + start), and can be verified with numastat.

For the VCPUs, a cpuset can be added to the vcpu xml element. Unfortunately there does not seem to be a command for doing so, which is inconvenient. One could use vcpupin, but that needs to be done for each vcpu individually. With vcpuinfo this can be verified (see the sketch below).
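
(Again a sketch, with placeholder names and ranges:)

for v in 0 1 2 3; do virsh vcpupin guest1 $v 8-15 --config; done
virsh vcpuinfo guest1       # verify the per-VCPU affinity
numastat -p qemu            # verify on which node(s) the guest memory sits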

So if manual memory or cpu pinning is broken in a supported release, that would be important to fix (though it should get a different bug report, to keep confusion low). For the nodeinfo part, it depends on whether it has influence on the actual tuning. If it has not, it would still be good to resolve, but less urgent.

I suspect the memory pinning is the part that is more likely the problem, as vcpu pinning does not really care about nodes for its function. And the way memory pinning is done only looks to work when the qemu process gets started within a constrained memory cgroup. So modifying a running guest would have no effect, but that would be consistent across all releases, as it seems to depend on numad.

[1] https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

Revision history for this message
Laz Peterson (laz-v) wrote :

Hello Stefan,

Yes, now that you mention it, it seems that "Generate from host NUMA configuration" in 15.04 simply puts everything in node 0. At least, that's what I'm seeing. While that's a little better than just spanning the entire guest across both nodes (as a "default"), leaving an entire second node available for sunshine and rainbows is not a desired function.

Manual pinning seems to be the only way to go. Unfortunately, this puts a heavy strain on managing those resources -- I have numerous scraps of paper lying all over my office with CPU and memory counts under node columns, with arrows pointing left and right. It's comical to think about, but ...

Very surprising that libvirt and Ubuntu are not able to recognize the available NUMA resources when starting guests and automatically placing them in the node that will be most appropriate for their requirements.

I do understand your comment about manual pinning being broken. And from what I can tell, it is working fine. So essentially the LTS release seems covered.

Now that you speak about vcpu and cpuset, I am very curious to try running VMs with a topology entirely unknown to libvirt: manually making physical cores available as "cores" and HT siblings available as "threads" (hoping that this will be a feature in the future). I don't really know much about any of this; maybe it already exists, or maybe I've just lost my marbles.

In all of the tuning aspects of libvirt, I am always concerned with the "quality" of an HT thread compared to a physical core. There are a few things here: for one, I do not want to waste a potentially usable thread by disabling HT. But second, under heavy load I would prefer that a guest with 2 cores gets a physical core with its respective HT sibling as part of that single core, and likewise the second physical core with its HT sibling as part of the other.

As I type this email, I have a database server that is happily pinned to only HT cores right now. I can't imagine that would be detrimental to its function, but I would prefer some sort of policy to ensure each guest is operating on at least one legitimate physical core, and not entirely on the core leftovers.

Maybe I am thinking too far down the road here. :-) You enlighten me greatly, Stefan; your wisdom is always appreciated.

Thanks again.
~Laz
