[Hyper-V] Mellanox VF driver does not support >16 vCPUs

Bug #1667007 reported by Joshua R. Poulson
This bug affects 1 person
Affects          Status       Importance  Assigned to       Milestone
linux (Ubuntu)   In Progress  Medium      Joseph Salisbury
Xenial           In Progress  Medium      Joseph Salisbury

Bug Description

While enabling SR-IOV on Azure, we discovered that the Mellanox VF driver fails on VMs with 16 or more vCPUs. Mellanox has submitted the following patch upstream to correct this problem.

Prerequisite: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1650058

I will post the upstream commit once it lands.

Revision history for this message
Joshua R. Poulson (jrp) wrote :

(lkml)

From: Jack Morgenstein <email address hidden>

When creating EQs to handle CQ completion events for the PF
or for VFs, we create enough EQE entries to handle completions
for the max number of CQs that can use that EQ.

When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource tracker).
Therefore, when creating an EQ, the number of EQE entries that the VF
should request for that EQ is the CQ quota value (and not the total
number of CQs available in the FW).

Under SRIOV, the PF must also use its CQ quota, because
the resource tracker also controls how many CQs the PF can obtain.

Using the FW total CQs instead of the CQ quota when creating EQs resulted
in wasting MTT entries, due to allocating more EQEs than were needed.

Fixes: 5a0d0a6161ae ("mlx4: Structures and init/teardown for VF resource quotas")
Signed-off-by: Jack Morgenstein <email address hidden>
Reported-by: Dexuan Cui <email address hidden>
Signed-off-by: Tariq Toukan <email address hidden>
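
The sizing change described in the commit message above can be sketched as follows. This is a simplified model, not the driver code: the struct, its field names, the helper functions and the SPARE_EQES value are all hypothetical and chosen only for illustration. It contrasts sizing an EQ from the firmware's total CQ count with sizing it from the function's CQ quota.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified model of one PCI function (PF or VF).
 * Field names are illustrative; these are not the real mlx4 structures. */
struct fn_model {
    unsigned int fw_total_cqs;  /* all CQs the firmware exposes */
    unsigned int reserved_cqs;  /* CQs reserved by the firmware */
    unsigned int cq_quota;      /* CQs granted by the resource tracker */
};

/* Illustrative headroom added to every EQ; the value is an assumption. */
enum { SPARE_EQES = 128 };

/* Old sizing: one EQE per CQ the firmware has, regardless of quota. */
static inline unsigned int eqes_from_fw_total(const struct fn_model *f)
{
    assert(f->fw_total_cqs >= f->reserved_cqs);
    return f->fw_total_cqs - f->reserved_cqs + SPARE_EQES;
}

/* New sizing: one EQE per CQ this function is allowed to create. */
static inline unsigned int eqes_from_quota(const struct fn_model *f)
{
    return f->cq_quota + SPARE_EQES;
}

/* Print both sizes for one function so the difference is visible. */
static inline void print_eq_sizing(const char *name, const struct fn_model *f)
{
    printf("%s: fw-total sizing = %u EQEs, quota sizing = %u EQEs\n",
           name, eqes_from_fw_total(f), eqes_from_quota(f));
}
```

In this model, a function whose quota is much smaller than the firmware total requests far fewer EQEs per EQ under the quota-based sizing. Each EQ is created with a number of EQEs, so the saving would repeat for every completion EQ the function creates.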

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1667007

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joshua R. Poulson (jrp)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
tags: added: kernel-da-key kernel-hyper-v xenial
tags: added: patch
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with the requested patch, which can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1667007/xenial/

Can you test this kernel and see if it resolves this bug?

Joseph Salisbury (jsalisbury) wrote :

I built a v2 of the test kernel. This kernel includes the patch for this bug and all the prerequisite patches from bug 1650058. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1667007/xenial/

Joshua R. Poulson (jrp) wrote :

Thanks! We'll give it a try.

Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with all the patches from the following bugs:

bug 1670518
  PCI: hv: Allocate physically contiguous hypercall params buffer
  PCI: hv: Make unnecessarily global IRQ masking functions static
  PCI: hv: Delete the device earlier from hbus->children for hot-remove
  PCI: hv: Fix hv_pci_remove() for hot-remove

bug 1672785
  net/mlx4_core: Avoid delays during VF driver device shutdown

bug 1667531
  tools: hv: Enable network manager for bonding scripts on RH
  [net-next] tools: hv: Add clean up function for Ubuntu config
  bcc5a76 tools: hv: Add a script to help bonding synthetic and VF NICs

bug 1667527
  4a9b0933bdfc PCI: hv: Use device serial number as PCI domain

bug 1667007
  d3de209 net/mlx4_core: Use cq quota in SRIOV when creating completion EQs

bug 1650058
  14c84da90b0d net/mlx4_en: Fix bad WQE issue
  c46100f413ca net/mlx4_core: Fix racy CQ (Completion Queue) free
  f4f73e2e6308 net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
  3c05ac20fe6e net/mlx4_core: Avoid command timeouts during VF driver device shutdown

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/HyperVCombined/
