stress-ng vm stressor failing in version 0.07.28 on some systems

Bug #1681503 reported by Rod Smith
Affects: Stress-ng
Status: Fix Released
Importance: High
Assigned to: Colin Ian King

Bug Description

The 0.07.28 version of stress-ng is producing what may be spurious failures on the vm stressor on two systems: hogplum (a Dell PowerEdge T610 with 64 GiB of RAM) and wildorange (an IBM x3650 M2 with 56 GiB of RAM). An example run, on wildorange, looks like this:

$ sudo stress-ng -k --aggressive --verify --timeout 860 --vm 0
stress-ng: info: [7078] dispatching hogs: 8 vm
stress-ng: fail: [7093] flip: detected 24 memory errors
stress-ng: fail: [7093] rand-set: detected 24 memory errors
stress-ng: fail: [7093] ror: detected 24 memory errors
stress-ng: fail: [7093] swap bytes: detected 192 memory errors
stress-ng: fail: [7085] flip: detected 24 memory errors
stress-ng: fail: [7094] flip: detected 24 memory errors
stress-ng: fail: [7085] rand-set: detected 24 memory errors
stress-ng: fail: [7085] ror: detected 24 memory errors
stress-ng: fail: [7094] rand-set: detected 24 memory errors
stress-ng: fail: [7085] swap bytes: detected 192 memory errors
stress-ng: fail: [7094] ror: detected 24 memory errors
stress-ng: fail: [7094] swap bytes: detected 192 memory errors
stress-ng: fail: [7089] flip: detected 24 memory errors
stress-ng: fail: [7091] flip: detected 24 memory errors
stress-ng: fail: [7089] rand-set: detected 24 memory errors
stress-ng: fail: [7083] flip: detected 24 memory errors
stress-ng: fail: [7091] rand-set: detected 24 memory errors
stress-ng: fail: [7089] ror: detected 24 memory errors
stress-ng: fail: [7091] ror: detected 16 memory errors
stress-ng: fail: [7089] swap bytes: detected 192 memory errors
stress-ng: fail: [7091] swap bytes: detected 192 memory errors
stress-ng: fail: [7083] rand-set: detected 24 memory errors
stress-ng: fail: [7083] ror: detected 24 memory errors
stress-ng: fail: [7087] flip: detected 24 memory errors
stress-ng: fail: [7083] swap bytes: detected 192 memory errors
stress-ng: fail: [7081] flip: detected 24 memory errors
stress-ng: fail: [7087] rand-set: detected 24 memory errors
stress-ng: fail: [7087] ror: detected 24 memory errors
stress-ng: fail: [7081] rand-set: detected 24 memory errors
stress-ng: fail: [7081] ror: detected 24 memory errors
stress-ng: fail: [7087] swap bytes: detected 192 memory errors
stress-ng: fail: [7081] swap bytes: detected 192 memory errors
stress-ng: fail: [7079] stress-ng-vm: detected 264 bit errors while stressing memory
stress-ng: error: [7078] process 7079 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: fail: [7082] stress-ng-vm: detected 264 bit errors while stressing memory
stress-ng: fail: [7080] stress-ng-vm: detected 264 bit errors while stressing memory
stress-ng: error: [7078] process 7080 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: error: [7078] process 7082 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: fail: [7084] stress-ng-vm: detected 264 bit errors while stressing memory
stress-ng: error: [7078] process 7084 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: fail: [7090] stress-ng-vm: detected 264 bit errors while stressing memory
stress-ng: fail: [7088] stress-ng-vm: detected 256 bit errors while stressing memory
stress-ng: fail: [7086] stress-ng-vm: detected 264 bit errors while stressing memory
stress-ng: error: [7078] process 7086 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: error: [7078] process 7088 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: error: [7078] process 7090 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: fail: [7092] stress-ng-vm: detected 264 bit errors while stressing memory
stress-ng: error: [7078] process 7092 (stress-ng-vm) terminated with an error, exit status=1
stress-ng: info: [7078] unsuccessful run completed in 860.06s (14 mins, 20.06 secs)

The exact pattern of failures varies from one run to the next; some other examples are available at:

* https://certification.canonical.com/hardware/201006-5798/submission/117300/test/62536/result/8615756/
* https://certification.canonical.com/hardware/201003-5451/submission/117245/test/62536/result/8611888/
* https://certification.canonical.com/hardware/201003-5451/submission/117230/test/62536/result/8611357/
* https://certification.canonical.com/hardware/201003-5451/submission/117209/test/62536/result/8609590/

Hogplum passes with stress-ng 0.07.21 under both Ubuntu 16.04.2 and 17.04, but fails with stress-ng 0.07.28 under both Ubuntu versions. I've tested wildorange less extensively.

To date, I have NOT encountered this problem on other systems, but most of the other test systems have significantly less RAM -- usually 4-8 GiB. One notable exception is lalande, a Dell PowerEdge C6320p with 64 GiB that passes the vm stressor but fails the brk stressor. (I'm still investigating that failure and may file another bug report.)

I ran memtest86+ on hogplum over the weekend (about 70 hours). It has completed three passes and is most of the way through a fourth with no errors so far. Of course, it's possible that stress-ng's vm stressor is uncovering a legitimate problem that memtest86+ is missing; but the replication of the same failure on two systems and the failure of memtest86+ to uncover any problems make this look like it may be a stress-ng bug.

Rod Smith (rodsmith)
tags: added: hwcert-server
Changed in stress-ng:
status: New → Triaged
status: Triaged → Confirmed
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
Colin Ian King (colin-king) wrote :

Tracked this down to an optimization regression in the fast pseudo-random number generator. It seems that 8-bit and 16-bit random values were being cached, and the cache was not flushed on a re-seed.
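
For illustration only, here is a minimal C sketch of this class of bug. It is not the stress-ng code: the names (prng_t, prng32, prng8, prng_seed, count_verify_errors, self_check), the multiply-with-carry constants, the seed values and the 61-byte buffer are all assumptions made for the example. It shows how a byte cache that survives a re-seed makes a "write, re-seed, verify" pass report mismatches even though the buffer itself is intact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generator state; not the stress-ng data structure. */
typedef struct {
    uint32_t w, z;   /* multiply-with-carry state */
    uint32_t cache;  /* bytes left over from the last 32-bit draw */
    int cached;      /* number of bytes still held in cache */
} prng_t;

/* Draw 32 bits from a simple multiply-with-carry generator. */
static uint32_t prng32(prng_t *p)
{
    p->z = 36969u * (p->z & 0xffffu) + (p->z >> 16);
    p->w = 18000u * (p->w & 0xffffu) + (p->w >> 16);
    return (p->z << 16) + p->w;
}

/* The optimization: hand out one 32-bit draw a byte at a time. */
static uint8_t prng8(prng_t *p)
{
    uint8_t v;

    if (p->cached == 0) {
        p->cache = prng32(p);
        p->cached = 4;
    }
    v = (uint8_t)(p->cache & 0xffu);
    p->cache >>= 8;
    p->cached--;
    return v;
}

/* Re-seed; flush_cache == 0 reproduces the stale-cache behaviour. */
static void prng_seed(prng_t *p, uint32_t w, uint32_t z, int flush_cache)
{
    p->w = w;
    p->z = z;
    if (flush_cache)
        p->cached = 0;
}

/* Fill a buffer from a seed, re-seed, and count bytes that fail to match. */
int count_verify_errors(int flush_cache)
{
    prng_t p = { 0, 0, 0, 0 };
    uint8_t buf[61];  /* not a multiple of 4, so bytes remain cached */
    size_t i;
    int errors = 0;

    prng_seed(&p, 1, 2, flush_cache);
    for (i = 0; i < sizeof(buf); i++)
        buf[i] = prng8(&p);

    prng_seed(&p, 1, 2, flush_cache);
    for (i = 0; i < sizeof(buf); i++)
        if (buf[i] != prng8(&p))
            errors++;
    return errors;
}

/* Stale cache yields false mismatches; flushing on re-seed removes them. */
void self_check(void)
{
    assert(count_verify_errors(0) > 0);
    assert(count_verify_errors(1) == 0);
}
```

Calling self_check() from a main() exercises both paths. In the sketch, the three bytes left in the cache shift the verify sequence relative to the write sequence, so the comparison fails without any real memory fault. Whether the vm stressor's failures arise in exactly this way is an assumption; the committed fix is the authoritative reference.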

Fix committed: http://kernel.ubuntu.com/git/cking/stress-ng.git/commit/?id=79aa85597f2f7aa944d4d1d1c59fed49955c2ad7

Changed in stress-ng:
status: Confirmed → Fix Committed
Colin Ian King (colin-king) wrote :
Changed in stress-ng:
status: Fix Committed → Fix Released