udevadm parser picking wrong identifiers for devices

Bug #1211521 reported by Daniel Manrique
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Checkbox
Fix Released
High
Daniel Manrique

Bug Description

In the course of debugging bug 1211369 and bug 1210485, I added some more thorough testing to the checkbox udev parser.

https://code.launchpad.net/~roadmr/checkbox/more-parser-tries

This hasn't been proposed for merge yet, because the point of this bug is that the parser is picking incorrect data for some devices. For instance, the udev data from the Dell Vostro V131 says it has a wired and a wireless device (it has two wireless devices, I'm ignoring the wimax device for now, focusing on the core bug):

E: ID_MODEL_FROM_DATABASE=RTL8111/8168 PCI Express Gigabit Ethernet controller
E: ID_VENDOR_FROM_DATABASE=Realtek Semiconductor Co., Ltd.
E: PCI_ID=10EC:8168

However, the parser returns the PCI IDs from the *bus*, which is an Intel device:

[<UdevadmDevice: bus: pci id [8086:1c12] RTL8111/8168 PCI Express Gigabit Ethernet controller>]

Something similar happens with the wireless Atheros device:

E: ID_MODEL_FROM_DATABASE=AR9285 Wireless Network Adapter (PCI-Express)
E: ID_VENDOR_FROM_DATABASE=Atheros Communications Inc.
E: PCI_ID=168C:002B
<UdevadmDevice: bus: pci id [8086:1c16] AR9285 Wireless Network Adapter (PCI-Express)>

Also, for some systems it's unable to find the PCI IDs at all, on a Lenovo T420 and T430 it can't get the IDs for the wired network device.

The branch linked to at the beginning of the comment has the data files from the real systems, cd checkbox-old and run:

PYTHONPATH=. ./setup.py test --test-suite checkbox.parsers.tests.test_udevadm

this will run the test suite for the parser and show the errors.

Related branches

Revision history for this message
Daniel Manrique (roadmr) wrote :

OK, I seem to have found the guilty commit (man I could have done a bisect for this). Anyway, it's this one:

Revision:
2206.2.28
13-07-08 11:47 AM

checkbox-old:parsers:udevadm: Clean the network devices path

To stay compatible with checkbox-old local jobs using NETWORK path to
find the registered interface.
---------------------

If I revert that commit, all the tests (save for one which I'm still looking into) pass, so at least for the systems I declared per-device checks for, I'm getting the correct IDs and data. For instance, for the T420 which is our "control", I now have:

            {
                "bus": "pci", #Correct PCI bus, yay!
                "category": "WIRELESS",
                "driver": "iwlwifi",
                "path": "/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0", #Unclean path ;( need to find a way to fix this
                "product": "Centrino Advanced-N 6205 [Taylor Peak]",
                "product_id": 133, #Note this is the correct product_id per the PCI device database
                "subproduct_id": null,
                "subvendor_id": null,
                "vendor": "Intel Corporation",
                "vendor_id": 32902 #Also correct, it's 0x8086
            },

This other device also looks good:

            {
                "bus": "pci",
                "category": "NETWORK",
                "driver": "e1000e",
                "path": "/devices/pci0000:00/0000:00:19.0/net/eth0",
                "product": "82579LM Gigabit Network Connection",
                "product_id": 5378,
                "subproduct_id": null,
                "subvendor_id": null,
                "vendor": "Intel Corporation",
                "vendor_id": 32902
            },

Compare this with borked data with the faulty commit in place:

            {
                "bus": "pci",
                "category": "WIRELESS",
                "driver": "iwlwifi",
                "path": "/devices/pci0000:00/0000:00:1c.1/0000:03:00.0", #The path looks better in theory
                "product": "Centrino Advanced-N 6205 [Taylor Peak]",
                "product_id": 7186, #Wrong product ID, matches the PCI bridge
                "subproduct_id": null,
                "subvendor_id": null,
                "vendor": "Intel Corporation",
                "vendor_id": 32902
            },

And this:
            {
                "bus": "net", #Wrong bus, we want pci
                "category": "NETWORK",
                "driver": "e1000e",
                "path": "/devices/pci0000:00/0000:00:19.0", #Path also looks nicer but ...
                "product": "82579LM Gigabit Network Connection",
                "product_id": null, #This is bad :(
                "subproduct_id": null,
                "subvendor_id": null,
                "vendor": "Intel Corporation",
                "vendor_id": null #This is also bad.
            },

OK, my next step is to fix the one test problem with the removed commit, then go back and try to replicate the path cleaning behavior while keeping the rest of the correct data (or understanding the rationale behind that commit and deciding if we can revert it outright).

Changed in checkbox:
status: New → In Progress
assignee: nobody → Daniel Manrique (roadmr)
importance: Undecided → High
Revision history for this message
Daniel Manrique (roadmr) wrote :

OK, the tests are now working if commit 2206.2.28 is reverted. I'm now looking at the local jobs and how they use the udev_resource output (udev_resource also uses the udev parser) to generate test names.

With commit 2206.2.28 in place, running networking/info job, the generated jobs look like this (I'm just showing the relevant part, mainly name, requirements and command):

 plugin: manual
 name: networking/info_eth2
 requires: device.path == "/devices/pci0000:00/0000:00:19.0"
 command: network_info eth2

With the commit reverted, this happens:

 plugin: manual
 name: networking/info_
 requires: device.path == ""
 command: network_info

So clearly something is not working as it should with that commit reverted.

The command used to generate the list of entries for the substitution should output a path, a space, and an interface name. These values get substituted into the template by the run_templates command. So isolating the data to substitute, the command is this:

udev_resource | filter_templates -w "category=NETWORK" | awk "/path: / { print \$2 }" | xargs -n 1 sh -c "for i in \`ls /sys\$0/net 2>/dev/null\`; do echo \$0 \$i; done"

Again, if this is run with an udev_resource that links to the parser with the offending commit, it works fine:
/devices/pci0000:00/0000:00:19.0 eth2
but removing that commit makes things fail, it outputs nothing.

Revision history for this message
Daniel Manrique (roadmr) wrote :

The filtered udevadm entry with category=NETWORK is this with the offending commit:

category: NETWORK
driver: e1000e
bus: net
interface: eth2
product: 82577LM Gigabit Network Connection
vendor: Intel Corporation
path: /devices/pci0000:00/0000:00:19.0

With it removed:

interface: eth2
path: /devices/pci0000:00/0000:00:19.0/net/eth2
vendor_id: 32902
driver: e1000e
bus: pci
category: NETWORK
product: 82577LM Gigabit Network Connection
vendor: Intel Corporation
product_id: 4330

Differences: the path changes, the bus is different, and the second entry (bad commit removed) has vendor_id and product_id.

Revision history for this message
Daniel Manrique (roadmr) wrote :

OK, the command is looking for a net subdirectory in the given /sys/ path, under that it'd find the interface name(s). So the old path works:

/devices/pci0000:00/0000:00:19.0 <- we have net under here

but the new one fails:

/devices/pci0000:00/0000:00:19.0/net/eth2 <- we are already in net/whatever, so looking for $path/net fails (finds nothing).

Revision history for this message
Daniel Manrique (roadmr) wrote :

OK, so the way the parser determines the IDs for a device (network, wireless or disk) is by taking the device path, and chopping off parts of it until it arrives at the actual device that has the IDs.

This is because, for instance, the udev entry for /devices/pci0000:00/0000:00:19.0/net/eth2 doesn't have the vendor and product IDs; rather, they exist under the /devices/pci0000:00/0000:00:19.0/ entry.

The problem is that the lookup was being performed from the actual entry for the device (again, /devices/pci0000:00/0000:00:19.0/). So it would look up from that to find the parent device and take that as the IDs. That's why were seeing interfaces "inheriting" the IDs of their parent buses. This was confusing when both bus and interface were from the same vendor (Intel bus and Intel Centrino wireless). It became a bit more obvious when I looked at a Realtek wireless hooked into an Intel bus.

So the potentially easy solution is to just revert the commit that "cleans up" the paths, thus producing something like:
/devices/pci0000:00/0000:00:19.0/net/eth2.

The problem when I did that is that, as seen on comment #2, it resulted in jobs not being generated. This is because the jobs expect the "root" path of a device (I'm inventing terminology here); they expect the /devices/pci0000:00/0000:00:19.0/ path and look for net/* under that. So clearly, if I start that search in /devices/pci0000:00/0000:00:19.0/net/eth2, it will not find the expected subdirectories and will produce nothing.

So we wanted two different paths: one for external purposes, pointing to the "root" device, and another for matching and stack purposes (the parser uses a stack and it had data at a different level than expected). SInce all the matching behavior is internal to the udev parser, I implemented a _raw_path private property and changed all the references within the parser to use this instead of the cleaned-up path property. All external users of the parser still reference the path property, so they should still get the cleaned-up version.

A seemingly minor side-effect is that when using the raw path, some disk devices aren't being cleaned up (duplicate device filtering from commit 2206.2.32). I'll post some examples in a moment to see how this looks on a processed submission.

Revision history for this message
Daniel Manrique (roadmr) wrote :

OK so I have three versions of the udevadmparser code:

1- old parser as of checkbox rev 2234, this is the baseline behavior but it's prior to Sylvain's updates so the device information is somewhat quirky.

2- New parser as of checkbox rev 2235, this includes Sylvain's updates, it had the problem with the network devices we've been analyzing, but seems to produce the correct device paths for local jobs to work.

3- Fixed parser, to be proposed, we want this to produce the improved device output from 2235, but with the correct device paths. As explained, removing the commit that "cleans up" the network paths fixed network device identifiers and buses, but at the expense of messing up the paths and breaking checkbox jobs.

First I used the three versions of the parser to generate the local jobs as used by checkbox. The good news is that all three versions do generate the same jobs. So for job processing purposes, all three versions are equivalent, which is good. Here's an example of what it generates:

 plugin: manual
 name: networking/info_eth2
 requires: device.path == "/devices/pci0000:00/0000:00:19.0"
 command: network_info eth2
 _description:
  PURPOSE:
      This test will check the different NIC
  STEPS:
      1. Please verify the following information for NIC eth2
  INFO:
      $output
  VERIFICATION:
      Is this correct?

 plugin: shell
 name: disk/stats_sda
 requires:
  device.path == "/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0"
  block_device.sda_state != 'removable'
 user: root
 command: disk_stats_test sda
 description: This test checks disk stats, generates some activity and rechecks stats to verify they've changed. It also verifies that disks appear in the various files they're supposed to.

More to come in a following comment, to avoid overly long comments.

Revision history for this message
Daniel Manrique (roadmr) wrote :

Hi, now I'm comparing the device data as obtained with the checkbox_parser script and dumped to json format. All I'm changing is the parser version.

The differences from the old (2234) parser and the new (2235) parser are as follows:

- A bunch of devices gained product and vendor ids, which is good.
- A bunch of devices changed from category: null to category: OTHER. The criterion used for this is that any product with a valid product_id and no other category will be categorized as OTHER. This is not necessarily good, to be addressed in a following merge request.
- 82579 Gigabit Network device has wrong bus (net vs. pci), and no product or vendor id. This is bad.
- WIreless device (Centrino Advanced-N 6205) has wrong product_id (0x1C12, which matches the PCI controller, not the device itself).

Now, the differences from the 2235 parser (bad) and the one with my fixes are as follows:

- 82579 Gigabit Network device has correct bus (pci), as well as product id 0x1502 and vendor id 0x8086 (correct).
- WIreless device (Centrino Advanced-N 6205) has the correct product_id (0x0133)

This means that the fixed parser fixes everything but the category:OTHER problem (it's another bug so it'll be fixed separately).

I'll attach the dumps to provide evidence of this :)

Revision history for this message
Daniel Manrique (roadmr) wrote :
Revision history for this message
Daniel Manrique (roadmr) wrote :
Revision history for this message
Daniel Manrique (roadmr) wrote :
Daniel Manrique (roadmr)
Changed in checkbox:
status: In Progress → Fix Committed
Ara Pulido (ara)
Changed in checkbox:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.