Tools and processes for diagnosing issues with graphics drivers

Registered by Bryce Harrington on 2011-04-07

Technical discussion about procedures and tools for debugging graphics problems related to X, mesa, and kernel drm drivers.

 * Forwarding drm bug reports upstream
    + Web utility for doing this
    + How to better handle when upstream gives a kernel patch for user to test?
 * Intel GPU freeze apport hook
    + Switch from intel_gpu_dump to i915_error_state
    + Detect if hang did not produce valid debug info
    + Prompt user if it was a false positive, and if not, if it's a regression
    + Prompt user to turn on debugging and reproduce bug, and THEN send a report?
    + Freeze database?
 * Nouveau/Radeon GPU freeze apport hooks
    + For -ati and -nouveau, need a manually-run script to collect debug data
    + kernel event for nouveau gpu lockups is needed
    + kernel event for radeon gpu lockups is needed
 * xdiagnose - for setting kernel debug parameters for gfx issues
 * Handling hardware X can't properly autodetect
    + Kernel support for EDID write? (See http://www.spinics.net/lists/dri-devel/msg00802.html)
    + Adding custom randr modes / importing windows driver info (gnome?)
    + See https://wiki.ubuntu.com/X/Dev/DeviceTreeHWDetection
 * Kernel support for printing: possible outputs, connected outputs, detected modes, selected mode
    + (It may be enough to simply boot with drm.debug=0xe?)
 * xorg/compiz apport hook
    + attach_drm_info() - can this info be added as an attachment instead?
 * Graphics troubleshooting documentation
    + Needs major updating, particularly due to kernel modesetting changes
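
As an illustration of the xdiagnose and drm.debug=0xe items above, here is a minimal sketch of how a tool might add a kernel debug parameter to the boot command line. It assumes the parameter goes into the GRUB_CMDLINE_LINUX_DEFAULT line of a file in /etc/default/grub format; the function name add_kernel_param is hypothetical and this is not taken from xdiagnose itself.

```python
import re

DEBUG_PARAM = "drm.debug=0xe"  # value mentioned in the notes above


def add_kernel_param(grub_text, param=DEBUG_PARAM):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    grub_text is the content of a file in /etc/default/grub format
    (an assumption about where the parameter should be set). The text
    is returned unchanged if the parameter is already present or if
    there is no GRUB_CMDLINE_LINUX_DEFAULT line.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def repl(match):
        # Split the existing arguments so the check is per argument,
        # not a substring match.
        args = match.group(2).split()
        if param not in args:
            args.append(param)
        return match.group(1) + " ".join(args) + match.group(3)

    return pattern.sub(repl, grub_text, count=1)
```

The function only transforms text, so it can be tried without touching the real configuration. On a real system the result would still need to be written back and the boot configuration regenerated before the parameter takes effect.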

Blueprint information

Status:
Complete
Approver:
Martin Pitt
Priority:
Medium
Drafter:
Bryce Harrington
Direction:
Needs approval
Assignee:
Bryce Harrington
Definition:
Approved
Series goal:
Accepted for oneiric
Implementation:
Implemented
Milestone target:
ubuntu-11.10-beta-2
Started by
Bryce Harrington on 2011-11-15
Completed by
Bryce Harrington on 2011-11-15

Whiteboard

Work items:
[bryce] Verify apport hooks moved to xdiagnose package still fire off properly when reporting bugs: DONE
[bryce] Verify failsafex still launches from xdiagnose when mis-editing xorg.conf: POSTPONED
[raof] Audit X debugging documentation for continued relevance: POSTPONED
[raof] Write arsenal script to pull out suspicious package upgrades when a regression occurs in a development release: POSTPONED
[bryce] Get buy in from kernel team for a common set of bug tags: DONE
[bryce] Review feasibility of implementing https://wiki.ubuntu.com/X/Blueprints/RegressionRooter: DONE
[sarvatt] Update kernels in xorg-edgers for better bug triaging: POSTPONED
[apw] Investigate changing the mainline kernel shortlog generation to be from release tag to current rather than from tag to tag: DONE
[bryce] Investigate if we can put gpu lockup bugs into crash database (separate from launchpad) and generate statistical data to upstream about failures: DONE
[bryce] Switch apport hook from intel_gpu_dump to i915_error_state: DONE
[bryce] In gpu apport hook prompt user if it was a false positive, and if not, if it's a regression: DONE
[bryce] In apport hooks, segregate users willing to do additional testing and debugging: DONE
[cjwatson] Ensure that recordfail is cleared once the user logs in or explicitly shuts down, instead of during rc2: POSTPONED
[cjwatson] Add nomodeset to the recovery mode options: DONE
[cjwatson] default to recovery mode when recordfail is set: POSTPONED
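
The "false positive, and if not, regression" prompt in the work items above could be structured roughly as follows. This is only a sketch of the question flow, not the hook shipped in xdiagnose. It assumes a ui object with a yesno(text) method that returns True, False, or None on cancel, in the style of apport's hook UI. The tag names, the question wording and the ScriptedUI helper are hypothetical.

```python
def triage_gpu_hang(report, ui):
    """Ask the reporter two questions and tag the report accordingly.

    report is a dict-like object with an optional "Tags" entry.
    Returns False if the user cancelled and the report should be dropped.
    """
    really_hung = ui.yesno("Did your display actually freeze?")
    if really_hung is None:
        return False  # dialog cancelled, leave the report untouched

    tags = report.get("Tags", "").split()
    if not really_hung:
        # Hypothetical tag marking a spurious hang detection.
        tags.append("false-gpu-hang")
    elif ui.yesno("Did this work correctly in a previous release?"):
        # Only ask about regressions for genuine hangs.
        tags.append("regression")
    report["Tags"] = " ".join(tags)
    return True


class ScriptedUI:
    """Stand-in UI that replays canned answers, for trying the flow."""

    def __init__(self, answers):
        self.answers = list(answers)

    def yesno(self, text):
        return self.answers.pop(0)
```

For example, triage_gpu_hang({"Tags": "oneiric"}, ScriptedUI([True, True])) walks the "real hang, and a regression" path without needing a desktop session.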

bryce, 2011-05-17: Mostly this is a grab bag of random toolage improvement tasks, but there's a collection of work around redoing the failsafe-x stuff, so I'll draft a wiki spec for that particular bit.

pitti, 2011-05-18:
- "option from livecd to boot directly into failsafe-x" -> where will that go, into gfxboot's F6 menu?
- what does "Investigate if we can put gpu lockup bugs into crash database" mean? We already send them to LP bugs, do you mean a different DB?
- The "mechanism to move aside monitors.xml, .driconf, etc" touches configuration files, so this should be handled very carefully. The wiki page does not cover this, can you please elaborate on how and when this will happen? When/how do they get restored? Could we perhaps set a magic environment variable IGNORE_XORG_CONFIG or similar which xdiagnose sets in the session?
- "default to failsafe mode when recordfail is set" -> do we have some evidence that this would actually help in the majority of cases? In a lot of cases boot failures were due to interrupted package upgrades (initramfs not rebuilt, or rebuilt after kernel upgrades) or other problems which were unrelated to X.org.

bryce, 2011-05-18:
- Yeah, wherever makes the most sense to you guys. I've specified it as "gfxboot's F6 menu" for now. This is to address bug #747338.
- Daniel Vetter requested that crashes be collected into a crash database that permits statistical reports to be provided on gpu lockup frequencies. We do collect these automatic crash reports into launchpad, but admittedly most users filing these expect to "fire and forget" and are unlikely to follow up on questions we post to them in launchpad. The task is to evaluate whether it makes sense to switch to collecting these types of errors into a standalone crash database separate from (or in addition to) launchpad. If we did so, this would open the possibility of continuing to collect this data post-release, which could be beneficial.
- I've added some discussion about config files to the spec. This is a stretch goal and not core functionality. I can include functionality to restore the moved-aside files, but I figure most users will just regenerate them using the driconf and gnome-display-properties tools, as the failure could have been due to config file format changes or some such. Regarding IGNORE_XORG_CONFIG, that is not necessary: we can just call X directly and tell it not to use the system xorg.conf for that session, as is already done in failsafe-x. This function is intended as a repair for the case where a user has an xorg.conf from an old installation that they no longer need.
- I don't have data on what the majority cause is when recordfail is set, but this was cjwatson's suggestion for implementing this feature, and I'm open to second opinions if there are alternate ways to achieve the goal. In the case of boot failures such as failed initramfs when trying to build graphics drivers such as nvidia or fglrx, this failsafe-x mode should cover it since it will boot with the vesa or fbdev driver (which should always work). I intend to include functionality in xdiagnose to remove/rebuild broken nvidia package installations, which should address that case. If there are package failures not to do with the graphics drivers that can lead to this failure state, then you are perhaps right that some additional thought is needed here. Advice?

pitti, 2011-05-18: Thanks for the clarifications. Wrt. IGNORE_XORG_CONFIG I actually meant ~/.config/monitors.xml and similar files, not just xorg.conf. But Chris pointed out that this would never automatically remove files without at least asking the user, so this should be fine.
Wrt. the failsafe-x mode, for the "nvidia/fglrx failed to build" cases, wouldn't they just fall back to nouveau/ati? But even then failsafe-x would probably be appropriate as the user might want to re-do the upgrade or call apt-get -f install (like in friendly-recovery); this would be more obvious than silently running with nouveau, so I agree it would be the right thing here as well. Perhaps a good middle-ground would be to offer friendly-recovery and failsafe-x if the previous boot failed?

bryce, 2011-05-18: I don't think we'd want to provide functionality to delete ~/.config, but yeah for monitors.xml and .driconf my thinking is there'd be a clearly labeled button the user would click to deliberately do this action, and it would likely present a confirmation dialog as well. Also, rather than delete them I'm going to move them to $file.bak or similar.
There are cases where nouveau/ati are broken as well, a common case being late model cards that are supported only by nvidia/fglrx, so falling back to the open driver is not going to result in a working system.
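
The "move them to $file.bak" idea for monitors.xml and .driconf could look something like the following minimal sketch. The function name move_aside and the numbered-suffix fallback are assumptions for illustration; the confirmation dialog discussed above is left out and would have to come before this is called.

```python
import os


def move_aside(paths, suffix=".bak"):
    """Rename each existing file to <name><suffix>; return (old, new) pairs.

    Nothing is deleted. If the backup name is already taken, a numeric
    suffix is added so an earlier backup is never overwritten.
    """
    moved = []
    for path in paths:
        if not os.path.isfile(path):
            continue  # nothing to move for this entry
        backup = path + suffix
        n = 1
        while os.path.exists(backup):
            backup = "%s%s.%d" % (path, suffix, n)
            n += 1
        os.rename(path, backup)
        moved.append((path, backup))
    return moved
```

Returning the (old, new) pairs keeps a restore step simple: renaming each pair back in reverse order undoes the change.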

cjwatson, 2011-05-23: I think my suggestion was actually to boot into *recovery* mode when recordfail is set?

raof, 2011-05-24: It was. I was interpreting “failsafe” as “recovery”. I've updated the WI naming to reflect the established nomenclature.

bryce, 2011-06-23: I've inquired a bit with the LP crew about crash databases. It is certainly on their radar but not in a scheduled plan yet, nor have the implementation details been worked out. I think the right next step would be for us (me?) to make a proof of concept crash database web service, specifically for GPU lockups, but that's outside the scope of this blueprint, and possibly something for P/Q.

I've also talked with a few people about regression isolation tools, and drafted a preliminary spec for FriendlyGitBisect. I'll follow up with LP guys but it sounds like they're pretty thoroughly booked; it may be that these should be prototyped on our side first.

bryce, 2011-06-24: Reworked the apport hook for gpu lockups to use /sys/kernel/debug/dri/0/i915_error_state instead of the intel_gpu_dump tool, in xdiagnose #71.
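
For reference, reading that file and checking whether a hang actually left usable debug info might be sketched as below. This is not the code from the xdiagnose hook. The marker string used for "nothing recorded" is an assumption about what the kernel writes when no hang has been captured, and both function names are hypothetical.

```python
ERROR_STATE_PATH = "/sys/kernel/debug/dri/0/i915_error_state"  # path from the note above
EMPTY_MARKER = "no error state collected"  # assumed text for "nothing recorded"


def has_valid_error_state(text):
    """Return True if text looks like a real error dump."""
    stripped = text.strip()
    if not stripped:
        return False
    return not stripped.lower().startswith(EMPTY_MARKER)


def read_error_state(path=ERROR_STATE_PATH):
    """Return the dump text, or None if it is unreadable or holds no data."""
    try:
        with open(path, "r", errors="replace") as f:
            text = f.read()
    except (IOError, OSError):
        # debugfs not mounted, not an Intel GPU, or insufficient privileges
        return None
    return text if has_valid_error_state(text) else None
```

A hook could attach the returned text when it is not None and otherwise treat the report as lacking valid debug info. Reading from debugfs normally needs root, so the read would have to happen in a privileged context.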

bryce, 2011-06-25: failsafe-x doesn't launch on mis-edited xorg.conf's when lightdm is in use. Need to check whether it includes any hooking mechanisms for launching fallback X sessions on X failures.

bryce, 2011-06-27: Reviewed kernel team's tags, incorporated them on the X team's list at https://wiki.ubuntu.com/X/Tagging. There's basically just a couple that we overlap on; I've marked that we use the kernel team's tag syntax.

bryce, 2011-11-15: Remaining P/Q tasks have been moved to desktop-p-xorg blueprint. This blueprint is now considered "Completed".
