Sparc t2000 hangs under stress load

Bug #91601 reported by TedGent
Affects                        Status    Importance   Assigned to   Milestone
linux (Ubuntu)                 Invalid   Undecided    Unassigned
linux-source-2.6.17 (Ubuntu)   Invalid   Undecided    Unassigned

Bug Description

Binary package hint: linux-source-2.6.17

Stress testing a t2000 generates reproducible system hangs.

1. uname -a

Linux wgs94-206 2.6.17-10-sparc64-smp #2 SMP Tue Dec 5 22:28:15 UTC 2006 sparc64 GNU/Linux

2. dmesg of soft lockup

[36674.762366] BUG: soft lockup detected on CPU#4!
[36674.900712] Call Trace:
[36674.900737] [000000000043411c] smp_percpu_timer_interrupt+0xbc/0x160
[36674.900785] [00000000004109d4] tl0_irq14+0x14/0x20
[36674.900815] [0000000000494590] anon_vma_link+0x10/0x60
[36674.900858] [000000000045150c] copy_process+0x52c/0x1040
[36674.900890] [0000000000452060] do_fork+0x40/0x220
[36674.900915] [0000000000406c94] linux_sparc_syscall32+0x34/0x40
[36674.900954] [00000000000029f4] 0x29f4
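
When several of these reports accumulate, it can help to pull them out of a saved dmesg log mechanically. The following is a minimal sketch, assuming the log lines have the same layout as the trace above; the function name `parse_soft_lockups` and the `SAMPLE` constant are illustrative and not part of any existing tool.

```python
import re

# Matches the "BUG: soft lockup" header and captures uptime and CPU number.
LOCKUP_RE = re.compile(r"\[\s*(\d+\.\d+)\] BUG: soft lockup detected on CPU#(\d+)!")
# Matches one call-trace frame: timestamp, 16-digit address, symbol text.
FRAME_RE = re.compile(r"\[\s*\d+\.\d+\] \[([0-9a-f]{16})\] (\S+)")

# A few lines copied from the trace above, used only as sample input.
SAMPLE = """\
[36674.762366] BUG: soft lockup detected on CPU#4!
[36674.900712] Call Trace:
[36674.900737] [000000000043411c] smp_percpu_timer_interrupt+0xbc/0x160
[36674.900785] [00000000004109d4] tl0_irq14+0x14/0x20
[36674.900815] [0000000000494590] anon_vma_link+0x10/0x60
"""


def parse_soft_lockups(dmesg_text):
    """Return one dict per soft-lockup report: uptime, CPU, and frame symbols."""
    reports = []
    current = None
    for line in dmesg_text.splitlines():
        header = LOCKUP_RE.search(line)
        if header:
            current = {
                "uptime_s": float(header.group(1)),
                "cpu": int(header.group(2)),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append(frame.group(2))
        elif current is not None and "Call Trace:" not in line:
            current = None  # any unrelated line ends the current trace
    return reports
```

Calling `parse_soft_lockups(open("dmesg.log").read())` on a saved log (the file name is hypothetical) gives a list that can be grouped by CPU or by the first non-interrupt frame.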

3. ps in an uninterruptible sleep state

S 1001 572 17225 0 78 0 1080 415 compat pts/2 00:00:15 WatchDog_foS
D 1001 10783 572 0 77 0 1320 408 access pts/2 00:00:00 ps
S 1001 10784 572 0 78 0 1072 409 pipe_w pts/2 00:00:00 grep
S 1001 10785 572 0 76 0 768 408 pipe_w pts/2 00:00:00 awk
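
A listing like this can be filtered for blocked tasks automatically. The sketch below assumes the 13-column layout that `ps -eyl` prints (S, UID, PID, PPID, C, PRI, NI, RSS, SZ, WCHAN, TTY, TIME, CMD); the function name `find_uninterruptible` is illustrative.

```python
def find_uninterruptible(ps_output):
    """Return the tasks whose state column is D (uninterruptible sleep)."""
    blocked = []
    for line in ps_output.splitlines():
        # Split into at most 13 fields so a command with spaces stays whole.
        fields = line.split(None, 12)
        if len(fields) < 13 or fields[0] != "D":
            continue  # skips the header, short lines, and non-D tasks
        blocked.append(
            {
                "pid": int(fields[2]),
                "ppid": int(fields[3]),
                "wchan": fields[9],  # kernel function the task sleeps in
                "cmd": fields[12],
            }
        )
    return blocked
```

The `wchan` value is the useful part for a report like this one, since it names where in the kernel the task is waiting.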

4. Steps to reproduce

Please contact submitter

5. Changes to system

Applied the available patches (apt-get update etc.) on 3/8/2007

Removed crashme from stress load.

6. Our planned next steps

a. build from Ubuntu sources and check whether the hang is reproducible
b. check the gettimeofday() source code, investigating the placement of memory barrier instructions
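
One user-space check that could accompany step b is to call gettimeofday() in a tight loop and count readings that go backwards. This is only a sketch: the report does not say how the gettimeofday() calls fail, so treating a backward step as the symptom is an assumption. It also assumes a Linux libc where both `struct timeval` fields fit a C `long`. The names `Timeval`, `gettimeofday`, and `count_backward_steps` are illustrative.

```python
import ctypes


class Timeval(ctypes.Structure):
    # Assumption: time_t and suseconds_t are both C longs on this platform.
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


# Load the symbols of the running process, which include libc.
_libc = ctypes.CDLL(None, use_errno=True)


def gettimeofday():
    """Call libc gettimeofday() and return (seconds, microseconds)."""
    tv = Timeval()
    if _libc.gettimeofday(ctypes.byref(tv), None) != 0:
        raise OSError(ctypes.get_errno(), "gettimeofday failed")
    return (tv.tv_sec, tv.tv_usec)


def count_backward_steps(read_clock, iterations):
    """Count how often consecutive readings of read_clock go backwards."""
    backward = 0
    previous = read_clock()
    for _ in range(iterations):
        current = read_clock()
        if current < previous:  # tuple comparison: seconds, then microseconds
            backward += 1
        previous = current
    return backward
```

Running `count_backward_steps(gettimeofday, 1_000_000)` in one process per CPU would be a starting point. A legitimate clock adjustment (for example by NTP) can also produce a backward step, so a nonzero count needs to be checked against that.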

thanks ted

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Hi Ted, thanks for the 2 bug reports (I saw the other one too). It would be interesting for us to know if you can reproduce these problems with dapper and feisty too.

Also.. what kind of stress test did you perform in this setup? the other was crashme, here?

Thanks
Fabio

Revision history for this message
TedGent (gents) wrote :

The stress load used is (unfortunately) proprietary in that the code was generated on company time.

It is a collection of tests designed to simulate several classes of time-sharing user. It is loosely based on UETP code used by VaxVMS and code generated for Tru64Unix.

The components include:

- A memory test - user mode read/write memory
- Bourne shell exerciser - uses sed and sort for test processing
- Compiler - unpacks c source code, compiles
- File test - create, write, read, delete
- Memory mapper - uses shared memory
- Signal handler - in place of crashme

There is a master script which invokes each of the sub tests in sequence until the number of requested test processes have been detached.
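
Since the real master script is proprietary, the following is only a rough sketch of the launch pattern described above: cycle through the sub-tests in order and start one process per step until the requested count is reached. The function name `launch_round_robin` and the idea of passing each sub-test as an argument list are assumptions for illustration.

```python
import itertools
import subprocess


def launch_round_robin(sub_tests, requested):
    """Start `requested` processes, cycling through sub_tests in order.

    sub_tests is a list of argument lists, one per sub-test command.
    The processes are started without waiting for them to finish.
    """
    started = []
    for argv in itertools.islice(itertools.cycle(sub_tests), requested):
        started.append(subprocess.Popen(argv))
    return started
```

With six sub-tests and a request for 66 processes, each sub-test would be started eleven times. The caller decides when to collect exit codes with `wait()`.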

From experience with other platforms, values are known at which the system should run stable and where problems start to be encountered.

Normally we are able to isolate a problem and then use a targeted test to isolate to a specific sequence, library etc. In the Ubuntu case we have a more fundamental problem where calls to gettimeofday(), and ps fail. We have worked around the crashme problem by not using that specific test.

---------

I will try feisty next. Thanks ted

Revision history for this message
TedGent (gents) wrote :

I built a kernel using the same .config file from Ubuntu sources. The system uname -r is 2.6.17.14-ubuntu1-tgspcl. The time to fail lengthened but has generated a similar failure symptom at approximately 25 Hrs TTF.

The failure symptom is BUG: soft lockup detected on CPU#28!

dmesg shows:

[98810.660884] BUG: soft lockup detected on CPU#28!
[98810.805054] Call Trace:
[98810.805074] [000000000043411c] smp_percpu_timer_interrupt+0xbc/0x160
[98810.805119] [00000000004109d4] tl0_irq14+0x14/0x20
[98810.805144] [0000000000494170] anon_vma_unlink+0x10/0x80
[98810.805182] [000000000048ec84] free_pgtables+0x44/0xe0
[98810.805211] [000000000048fb3c] unmap_region+0xbc/0x140
[98810.805237] [00000000004907f4] do_munmap+0x174/0x240
[98810.805263] [00000000004908dc] sys_munmap+0x1c/0x40
[98810.805288] [0000000000406c94] linux_sparc_syscall32+0x34/0x40
[98810.805325] [00000000f7dc7388] 0xf7dc7388

top shows the same problem: the date command is at the top of the list, accumulating run time.

top - 17:16:08 up 1 day, 4:09, 7 users, load average: 70.12, 70.09, 71.23
Tasks: 402 total, 71 running, 331 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 3.3%sy, 86.9%ni, 9.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 33257904k total, 23603152k used, 9654752k free, 206488k buffers
Swap: 45440440k total, 1419392k used, 44021048k free, 37664k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10991 greg 16 0 3488 1040 888 R 100 0.0 53:30.37 date
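
Both soft-lockup traces so far sit in the anon_vma code, one reached through fork (copy_process, anon_vma_link) and one through munmap (anon_vma_unlink). A targeted test for those two paths might look like the sketch below. It is not the proprietary load, and whether it reproduces the hang is unknown; the function names and the default sizes are arbitrary choices for illustration.

```python
import mmap
import os

PAGE = mmap.PAGESIZE


def churn_child(pages, rounds):
    """Repeatedly map, touch, and unmap a private anonymous region."""
    for _ in range(rounds):
        region = mmap.mmap(
            -1, pages * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
        )
        for offset in range(0, pages * PAGE, PAGE):
            region[offset] = 1  # touch each page so it is actually faulted in
        region.close()  # unmaps the region


def fork_mmap_churn(children, pages=16, rounds=100):
    """Fork `children` workers; return how many exited with status 0."""
    pids = []
    for _ in range(children):
        pid = os.fork()
        if pid == 0:
            # Child: do the work, then leave without running parent cleanup.
            try:
                churn_child(pages, rounds)
                os._exit(0)
            except BaseException:
                os._exit(1)
        pids.append(pid)
    clean = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
            clean += 1
    return clean
```

Running `fork_mmap_churn` in a loop from several shells at once would exercise fork and munmap concurrently across CPUs.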

I have downloaded feisty and will load that on the system next.

I will save the system disk which will enable me to compile any debug code needed.

thanks ted.

Revision history for this message
TedGent (gents) wrote :

Feisty is now installed on my t2000, just adding in the packages needed to re-run the tests. Installation was completely error free, no problems encountered.

Linux wgs94-206 2.6.20-8-sparc64-smp #2 SMP Tue Feb 13 09:36:03 UTC 2007 sparc64 GNU/Linux

The time to fail was approximately 24 hours; to prove stability we will need to run for 3x that. I will start the runs today.

thanks ted

Revision history for this message
TedGent (gents) wrote :

We started the tests running on Feisty at 9:00AM US Eastern Time 3/15. The normal time to fail is approximately 24 hours.

cheers ted

Revision history for this message
TedGent (gents) wrote :

Feisty reported the following at 2:50 into the test:

Event at 11:50 - 2Hrs50 Time to fail

1. Application output and data gathered:

The WatchDog_for_SSULoad process "hung" at 11:50. The 'ps -eyl'
command shows its channel as rwsem_. So, I'd say this kernel still
has a locking problem.

At Thu Mar 15 11:50:16 EDT 2007, the following non-stop graphics
images are running:

 for a total of 0 graphics test processes

0
DEBUG: number of active processes = 66
DEBUG: number of graphics processes = 0

DEBUG: Displaying contents of slay_crashme_images.sh at
Thu Mar 15 11:50:31 EDT 2007

 586 passes have been completed; waiting for
 66 active processes to finish.

At Thu Mar 15 11:50:31 EDT 2007, the following non-stop
graphics images are running:

greg@wgs94-206:~/systest$ ps -eyl | grep 10832
S 1001 10832 10713 0 75 0 1184 439 compat pts/2 00:00:09 SSULoad_master.
D 1001 15090 10832 0 78 0 1112 422 rwsem_ pts/2 00:00:15 WatchDog_for_SS

2. Console Output

[79845.293083] Unable to handle kernel NULL pointer dereference
[79845.473777] tsk->{mm,active_mm}->context = 0000000000000c6e
[79845.646488] tsk->{mm,active_mm}->pgd = fffff800c2afc000

3. Relevant dmesg data

[79845.293083] Unable to handle kernel NULL pointer dereference
[79845.473777] tsk->{mm,active_mm}->context = 0000000000000c6e
[79845.646488] tsk->{mm,active_mm}->pgd = fffff800c2afc000
[79845.812067] \|/ ____ \|/
[79845.812083] "@'/ .. \`@"
[79845.812090] /_| \__/ |_\
[79845.812098] \__U_/
[79845.812118] WatchDog_for_SS(15090): Oops [#1]
[79845.812138] TSTATE: 0000004411001606 TPC: 0000000000452c88 TNPC: 0000000000545d40 Y: 00000000 Not tainted
[79845.812179] TPC: <copy_process+0x570/0xfc0>
[79845.812195] g0: fffff803b24a7441 g1: 0000000000000000 g2: 0000000000000071 g3: 0000000000000000
[79845.812219] g4: fffff805e77fa8e0 g5: fffff8000e59a580 g6: fffff803b24a4000 g7: fffff802e091eac8
[79845.812241] o0: 0000000000000001 o1: fffff800893209e8 o2: 0000000000000000 o3: 000000021e550c50
[79845.812262] o4: 0000000000000000 o5: 0000000000000000 sp: fffff803b24a7491 ret_pc: 0000000000452c84
[79845.812287] RPC: <copy_process+0x56c/0xfc0>
[79845.812306] l0: fffff802e091eac8 l1: fffff803c8f74da0 l2: fffff800c23cde78 l3: 0000000001200014
[79845.812330] l4: fffff80481c2d480 l5: fffff800893209c0 l6: fffff807fa556de0 l7: 0000000000830400
[79845.812353] i0: fffff803c8f74da0 i1: 00000000fff3b970 i2: fffff803b24a7f60 i3: 0000000000000000
[79845.812377] i4: fffff80354dbe160 i5: 00000000f7f16718 i6: fffff803b24a75b1 i7: 0000000000453720
[79845.812402] I7: <do_fork+0x48/0x200>
[79845.812414] Caller[0000000000453720]: do_fork+0x48/0x200
[79845.812436] Caller[0000000000406c54]: linux_sparc_syscall32+0x3c/0x40
[79845.812471] Caller[0000000000003af2]: 0x3afa
[79845.812515] Instruction DUMP: c25d6018 92056028 4003cc2f <ec586010> c25c2028 82086800 0ac0421c 9205a21c d05d60e8
gent@wgs94-206:~$
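
To compare this oops with the earlier 2.6.17 traces, the `symbol+offset/size` entries can be split into their parts. A minimal sketch, with the illustrative name `parse_symbol`:

```python
import re

# symbol name, then "+0x<offset>/0x<size>" as printed in kernel traces
SYM_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_symbol(text):
    """Return (symbol, offset, size) from the first trace entry in text."""
    match = SYM_RE.search(text)
    if match is None:
        return None
    return (match.group(1), int(match.group(2), 16), int(match.group(3), 16))
```

The 2.6.17 trace shows `copy_process+0x52c/0x1040` and this one shows `copy_process+0x570/0xfc0`. The function sizes differ between the two kernels, so the offsets cannot be compared directly; each has to be resolved against its own kernel image.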

4. Other Information

top shows a 'top' process in a continuous run state

ps aux does not complete and hangs, not breakable with ^C
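
A watchdog can detect this kind of hang without getting stuck itself, as long as it never blocks waiting on the probe command. The sketch below polls the child instead of waiting on it and uses the monotonic clock rather than wall-clock time; the function name `completes_within` and the default poll interval are illustrative.

```python
import subprocess
import time


def completes_within(argv, timeout_s, poll_s=0.05):
    """Run argv and report whether it exits before timeout_s elapses."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(poll_s)
    if proc.poll() is not None:
        return True  # finished during the last sleep
    # Best effort only: a task in uninterruptible sleep does not act on
    # SIGKILL until it wakes, so do not wait for it here.
    proc.kill()
    return False
```

For example, `completes_within(["ps", "aux"], 30.0)` returning False would flag the condition described above while leaving the watchdog free to log it.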

5. Next Steps

Will reboot the system and rerun the test. Please let me know if there is ...


Revision history for this message
TedGent (gents) wrote :

The hang repeated, with a similar time to fail. I will attach the test, console and dmesg logs.

Looking for direction on next steps please

thanks ted

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Thanks for the tests. I am looking at the problem with upstream now.

I will let you know as soon as something is available.

Fabio

PS and thanks for such a complete bug report! it's rare to get one of those ;)

Revision history for this message
Matthew Woerly (nattgew) wrote :

Can you test this in the latest kernel with Hardy?

Revision history for this message
Sergio Zanchetta (primes2h) wrote :

The 18-month support period for Edgy Eft 6.10 has reached its end of life. As a result, we are closing the linux-source-2.6.17 Edgy Eft kernel task.

Changed in linux-source-2.6.17:
status: New → Invalid
Revision history for this message
Sergio Zanchetta (primes2h) wrote :

Hardy Heron 8.04 was recently released. It would be helpful if you could test the new release and verify if this is still an issue - http://www.ubuntu.com/getubuntu/download . You should be able to test your bug using the LiveCD. Please let us know your results. Thanks.

Changed in linux:
status: New → Incomplete
Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Revision history for this message
Michele Mangili (mangilimic) wrote :

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in linux:
status: Incomplete → Invalid