pulseaudio crashed with SIGABRT in drop_block(), pa_memblock_ref() or pa_memblock_free()

Bug #924416 reported by Eric Casteleijn
64
This bug affects 7 people
Affects Status Importance Assigned to Milestone
pulseaudio (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

Symptom: Crash at random time while playback; stack trace always includes fill_mix_info / pa_sink_input_peek

Maybe: Most cpu's here seem quadcore i7's or better, maybe a fast CPU increases likelyhood of crash

Maybe: VoIP applications (uses small packages / low latency) can also increase likelyhood of crash

===

This happens a few times a day, while I'm playing music from quodlibet, and does not seem to be triggered by any user interaction. I'm using a USB sound card, which may be pertinent.

ProblemType: Crash
DistroRelease: Ubuntu 12.04
Package: pulseaudio 1:1.1-0ubuntu5
ProcVersionSignature: Ubuntu 3.2.0-12.20-genUser Name 3.2.2
Uname: Linux 3.2.0-12-generic x86_64
NonfreeKernelModules: fglrx
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 1.91-0ubuntu1
Architecture: amd64
Card0.Amixer.info:
 Card hw:0 'Device'/'USB Sound Device at usb-0000:00:1a.7-5.4, full speed'
   Mixer name : 'USB Mixer'
   Components : 'USB145f:0143'
   Controls : 14
   Simple ctrls : 6
Card1.Amixer.info:
 Card hw:1 'AW1'/'BeAutiful Qing Audioengine AW1 at usb-0000:00:1a.7-5.2, full speed'
   Mixer name : 'USB Mixer'
   Components : 'USB08bb:2704'
   Controls : 2
   Simple ctrls : 1
Card2.Amixer.info:
 Card hw:2 'Intel'/'HDA Intel at 0xfbbf4000 irq 56'
   Mixer name : 'Realtek ALC892'
   Components : 'HDA:10ec0892,10280482,00100302'
   Controls : 45
   Simple ctrls : 20
Card3.Amixer.info:
 Card hw:3 'Generic'/'HD-Audio Generic at 0xfbefc000 irq 57'
   Mixer name : 'ATI R6xx HDMI'
   Components : 'HDA:1002aa01,00aa0100,00100200'
   Controls : 6
   Simple ctrls : 1
Card3.Amixer.values:
 Simple mixer control 'IEC958',0
   Capabilities: pswitch pswitch-joined penum
   Playback channels: Mono
   Mono: Playback [on]
CrashCounter: 1
Date: Tue Jan 31 13:04:21 2012
ExecutablePath: /usr/bin/pulseaudio
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64 (20111012)
ProcCmdline: /usr/bin/pulseaudio --start --log-target=syslog
Signal: 6
SourcePackage: pulseaudio
StacktraceTop:
 raise () from /lib/x86_64-linux-gnu/libc.so.6
 abort () from /lib/x86_64-linux-gnu/libc.so.6
 ?? () from /usr/lib/x86_64-linux-gnu/libpulsecommon-1.1.so
 pa_memblockq_push () from /usr/lib/x86_64-linux-gnu/libpulsecommon-1.1.so
 pa_memblockq_push_align () from /usr/lib/x86_64-linux-gnu/libpulsecommon-1.1.so
Title: pulseaudio crashed with SIGABRT in raise()
UpgradeStatus: Upgraded to precise on 2012-01-05 (25 days ago)
UserGroups: adm admin cdrom dialout lpadmin plugdev pulse pulse-access sambashare
dmi.bios.date: 05/26/2010
dmi.bios.vendor: Dell Computer Corporation
dmi.bios.version: A02
dmi.board.name: 05DN3X
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 3
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: Not Specified
dmi.modalias: dmi:bvnDellComputerCorporation:bvrA02:bd05/26/2010:svnDellInc.:pnStudioXPS9100:pvr00:rvnDellInc.:rn05DN3X:rvrA00:cvnDellInc.:ct3:cvrNotSpecified:
dmi.product.name: Studio XPS 9100
dmi.product.version: 00
dmi.sys.vendor: Dell Inc.

Revision history for this message
Eric Casteleijn (thisfred) wrote :
Revision history for this message
Apport retracing service (apport) wrote :

StacktraceTop:
 drop_block (bq=0xfd6170, q=0x7fede800f440) at pulsecore/memblockq.c:180
 pa_memblockq_push (bq=0xfd6170, uchunk=<optimized out>) at pulsecore/memblockq.c:328
 pa_memblockq_push_align (bq=0xfd6170, chunk=<optimized out>) at pulsecore/memblockq.c:786
 sink_input_process_msg (obj=0x10ac670, code=7, data=0x0, offset=0, chunk=0x7fedfd9b13a0) at modules/module-combine-sink.c:574
 pa_asyncmsgq_process_one (a=<optimized out>) at pulsecore/asyncmsgq.c:276

Revision history for this message
Apport retracing service (apport) wrote : Stacktrace.txt
Revision history for this message
Apport retracing service (apport) wrote : ThreadStacktrace.txt
Changed in pulseaudio (Ubuntu):
importance: Undecided → Medium
summary: - pulseaudio crashed with SIGABRT in raise()
+ pulseaudio crashed with SIGABRT in drop_block()
tags: removed: need-amd64-retrace
visibility: private → public
Revision history for this message
David Henningsson (diwic) wrote : Re: pulseaudio crashed with SIGABRT in drop_block()

In 3-4 hours of time, we send about 2GB of audio data, so maybe this could be an integer overflow problem? Does it usually happen after 3 hours and 10 - 20 minutes of continuous playback?

Revision history for this message
David Henningsson (diwic) wrote :

#3 0x00007fee07e8d8ed in pa_memblockq_push (bq=0xfd6170, uchunk=<optimized out>) at pulsecore/memblockq.c:328
        p = 0x7fede800f440
        q = 0x7fede8006d80
        n = <optimized out>
        chunk = {memblock = 0x7fedff572000, index = 0, length = 3676}
        old = 1878380804

1878380804 would be 177 minutes of audio data (44100 * 2 * 2 * 60 * 177).

Revision history for this message
Eric Casteleijn (thisfred) wrote :

That makes sense, it seems to happen 3 or 4 times a day, and I pretty much always have music on while I'm at my computer, which would be between 8 and 12 hours a day.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in pulseaudio (Ubuntu):
status: New → Confirmed
Revision history for this message
David Henningsson (diwic) wrote :

Hi Eric and Alex,

I'm still having troubles tracking this bug down. It was suggested to me that the raw apport crash might help:
the next time this happens, would you mind uploading a file that would look something like
 /var/crash/_usr_bin_pulseaudio.1000.crash
...and that should be created every time PulseAudio crashes?

and the output of
/var/log/syslog | grep pulseaudio

thanks!

Revision history for this message
Eric Casteleijn (thisfred) wrote :

Will do!

Revision history for this message
Alex Chiang (achiang) wrote :

Hi David,

Sounds good to me. I'm happy to close this bug out as invalid/unreproducible for now, then.

summary: - pulseaudio crashed with SIGABRT in drop_block()
+ pulseaudio crashed with SIGABRT in drop_block(), pa_memblock_ref() or
+ pa_memblock_free()
description: updated
Revision history for this message
Eric Casteleijn (thisfred) wrote :
Revision history for this message
Eric Casteleijn (thisfred) wrote :

added both requested files after a crash just happened again. Let me know if there's more information I can provide.

Revision history for this message
David Henningsson (diwic) wrote : And now, something is wrong with the flist implementation...

Hi,

I've had my eyes on a couple of bugs filed in Ubuntu, where PulseAudio
crashes with SIGABRT somewhere inside pa_memblock_* or pa_memblockq_*,
often after several hours of playback. I now got a core dump from the
bug reporter (thanks, Eric Casteleijn!) and I started analysing it.

It turns out that the same memblockq list_items is used as part of two
memblockq chains, in essence for one of the blocks q->next->prev != q.
How could this be? Well, they seem to be allocated from a static flist
pool. I've never looked at flist before, but it claims to be, "A
multiple-reader multipler-write lock-free free list implementation".

I had a look at tests/flist-test.c and decided to rewrite the test case
to make it a bit tougher and to see if I could verify that it never
handed out the same block twice. The test seemed to pass at first, but
the ~10th time I ran it, the code hung. All 20 threads accessing the
flist seemed to have stalled. I attached to the process with gdb, and
found that l->stored, l->empty, and l->stored->next [2] were all equal,
so clearly something is wrong.

Instead of trying to verify the algorithm, I went to Google to look for
a reference implementation to compare against, and quickly found [1].
And indeed our flist looks like the one under the section "Naive
lock-free stack which suffers from ABA problem." on that page. :-/
What's worse, there does not seem to be an easy fix.

At this point, it seems likely this is the root cause for the problem.
But I could use some input on how to proceed.

One trivial idea just to add a mutex around the flist as an immediate
workaround. It won't make the implementation lock-free, but futexes
should be pretty fast, and the lock won't be held for long anyway. So
this might be good enough?

I'm also noticing that in most cases where we use the flist, is as an
optimisation to avoid using malloc, essentially pointer recycling. If
this is the case, and the objects are not usually allocated by one
thread and freed by another, one could have per-thread flists instead of
one flist shared by all threads.

Any other ideas?

--
David Henningsson, Canonical Ltd.
http://launchpad.net/~diwic

[1] https://en.wikipedia.org/wiki/ABA_problem

[2] and l->empty->next, obviously

Revision history for this message
David Henningsson (diwic) wrote : Re: [pulseaudio-discuss] And now, something is wrong with the flist implementation...

On 03/02/2012 04:45 PM, Tanu Kaskinen wrote:
> On Fri, 2012-03-02 at 13:49 +0100, David Henningsson wrote:
>> Instead of trying to verify the algorithm, I went to Google to look for
>> a reference implementation to compare against, and quickly found [1].
>> And indeed our flist looks like the one under the section "Naive
>> lock-free stack which suffers from ABA problem." on that page. :-/
>> What's worse, there does not seem to be an easy fix.

Have you ever woken up in the middle of the night with an idea of how to
solve the problem almost clear in your mind? The wikipedia page said
"tag bits" was a common workaround, using the low bits of the pointer. I
initially rejected the idea as I didn't think it would be portable
enough, but now I realise that the pointers are just indices to a table.

So I've written a patch which I post separately, please review. I've
been running my own test case for a while and seems to have succeeded so
far.

@Eric Casteleijn, as you seem to be the only one that can reproduce this
problem somewhat reliably, would you mind trying my patch for a few days
and see if it resolves the problem? If so, I'll prepare a PPA for you to
use for testing.

--
David Henningsson, Canonical Ltd.
http://launchpad.net/~diwic

Revision history for this message
Jyri (oku-iki) wrote :

On Sun, 04 Mar 2012 06:05:46 +0100, David Henningsson
<email address hidden> wrote:
> On 03/02/2012 04:45 PM, Tanu Kaskinen wrote:
>> On Fri, 2012-03-02 at 13:49 +0100, David Henningsson wrote:
>>> Instead of trying to verify the algorithm, I went to Google to look
for
>>> a reference implementation to compare against, and quickly found [1].
>>> And indeed our flist looks like the one under the section "Naive
>>> lock-free stack which suffers from ABA problem." on that page. :-/
>>> What's worse, there does not seem to be an easy fix.
>
> Have you ever woken up in the middle of the night with an idea of how to

> solve the problem almost clear in your mind? The wikipedia page said
> "tag bits" was a common workaround, using the low bits of the pointer. I

> initially rejected the idea as I didn't think it would be portable
> enough, but now I realise that the pointers are just indices to a table.
>
> So I've written a patch which I post separately, please review. I've
> been running my own test case for a while and seems to have succeeded so

> far.
>
> @Eric Casteleijn, as you seem to be the only one that can reproduce this

> problem somewhat reliably, would you mind trying my patch for a few days

> and see if it resolves the problem? If so, I'll prepare a PPA for you to

> use for testing.

Hi David,
I wrote probably something similar (patch is attached). It would be
interesting to compare our patches, but I could not find your patch
from the mailing list yet.

It is good to keep in mind that at least my version of the tag
approach is not guaranteed to avoid the ABA-problem, it only makes it
very unlikely. In 64bit system one may argue that it is impossible for
one thread to remain asleep while the one element it was trying to
manipulate is poped and pushed back to linked list exactly 4294967296
times and then, while the same element is still on top of the list,
the thread is scheduled again and it fall for the ABA-problem. But what
about 32bit system, can we draw the same conclusion there? I would say
the above scenario is still virtually impossible, even if you change the
number 4294967296 to 65536.

Sorry for the mess I caused. At the time of making the ABA-suffering
flist-implementation I was desperately looking for a fix to a notorious
Nokia N900 "diesel engine"-bug* and when I found the fix I was just happy
that the flist-test passed went on to the next problem.

* The sound resembling a diesel engine came when a memblock in a free list
was swapped out. While PA was waiting for the memblock to be accessible
again the alsa-device kept on playing the same 20ms DMA buffer content
over
and over again.

Best regards,
Jyri

ps. I have not had time to test the patch on 32bit system yet.

Revision history for this message
Eric Casteleijn (thisfred) wrote :

I'd be happy to test any patches! From a PPA would be greatly appreciated. :)

Revision history for this message
David Henningsson (diwic) wrote :

Ok, I have just uploaded my fix to https://launchpad.net/~diwic/+archive/temp - when it has finished building, please install it and have it go for a few days and see if your issue is fixed. Thanks!

Changed in pulseaudio (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Eric Casteleijn (thisfred) wrote :

Ack. Installed and testing.

Revision history for this message
Eric Casteleijn (thisfred) wrote :

While it's hard to prove a negative, I've not had any crashes so far, where I'm pretty sure I would have had at least one before, so cause for cautious optimism! :)

Revision history for this message
David Henningsson (diwic) wrote :

Thanks for testing - the fix has now been committed to the 12.04 tree as well as upstream!

Changed in pulseaudio (Ubuntu):
status: Incomplete → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pulseaudio - 1:1.1-0ubuntu12

---------------
pulseaudio (1:1.1-0ubuntu12) precise; urgency=low

  [ Luke Yelavich ]
  * debian/control: Demote pulseaudio-esound-compat to suggests (LP: #930703)

  [ David Henningsson ]
  * 0110-flist-Avoid-the-ABA-problem.patch:
    Fix occasional crashes in pa_memblock_free, pa_memblock_ref and drop_block
    (LP: #924416)
 -- David Henningsson <email address hidden> Fri, 09 Mar 2012 00:08:42 +0100

Changed in pulseaudio (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Eric Casteleijn (thisfred) wrote :

Thanks for fixing it! It's greatly appreciated!

To post a comment you must log in.