Code generation bug with -O2 (-foptimize-sibling-calls)

Bug #897583 reported by David Kastrup
258
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Linaro GCC
Fix Released
High
Andrew Stubbs
gcc
Fix Released
High
gcc-4.6 (Ubuntu)
Fix Released
High
Unassigned

Bug Description

See <URL:http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51323>

Jakub Jelinek recently added a self-contained example where this bug triggered with normal C code.

The affected bug triggered about half a dozen errors in the Lilypond code base. These bugs are very hard to debug because they occur in connection with sibling call optimization, so at the time of the resulting segfault, the stack frame of the caller with the faulty code no longer exists.

The bug affects all architectures, though the probability for it to hit is quite higher on the 32bit i686 architecture, due to fewer available registers.

If Jakub's statement that he can't detect the bug in the 4.6 codebase but on current trunk (4.7) is correct, then the bug has been introduced into Oneiric via distribution specific patches.

It definitely is in the compiler of vanilla Ubuntu 11.10, and both the c++ test case I sent as well as the much simpler C test case Jakub constructed trigger.

An option workaround is to use -fno-optimize-sibling-calls which should have minimal impact on performance and memory usage unless the code relies to a significant degree on call chaining (P-code interpreters could be problematic).

Since potentially the reliability of all code compiled with gcc (the vast majority of the Ubuntu code base) is affected, I am marking this as critical. Since bad code can also have security implications, I would suggest not taking this all too lightly.

In my opinion, if the bug can be traced back to a distribution specific patch taken from the 4.7 branch, this patch needs to get backed out and all those parts of Ubuntu having gcc in their build dependencies need to get recompiled.

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: gcc-4.6-base 4.6.1-9ubuntu3
ProcVersionSignature: Ubuntu 3.0.0-13.22-generic 3.0.6
Uname: Linux 3.0.0-13-generic i686
ApportVersion: 1.23-0ubuntu4
Architecture: i386
Date: Tue Nov 29 09:55:27 2011
Dependencies:

InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release i386 (20111011)
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: gcc-4.6
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
In , David Kastrup (dak) wrote :

Created attachment 25921
Boiled down source code. Bad code for last function.

The following boiled down code produces a jmp to Grob::internal_set_property where the implicit first call argument (this) is equal to the explicit second call argument instead of the actual this pointer. The guilty code sequence is

.L4:
 movl %ebx, 40(%esp)
 movl %ebx, 32(%esp)
 movl %eax, 36(%esp)
 addl $24, %esp
 .cfi_remember_state
 .cfi_def_cfa_offset 8
 popl %ebx
 .cfi_def_cfa_offset 4
 .cfi_restore 3
 jmp _ZN4Grob21internal_set_propertyEPvS0_

Version is
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i686-linux-gnu/4.6.1/lto-wrapper
Target: i686-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.1-9ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++,go --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin --enable-objc-gc --enable-targets=all --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=i686-linux-gnu --host=i686-linux-gnu --target=i686-linux-gnu
Thread model: posix
gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)

Compilation options are -O2

This is from Lilypond source code and causes a segfault.

Revision history for this message
In , David Kastrup (dak) wrote :

-fno-optimize-sibling-calls avoids the problematic optimization. For no good reason at all, -fkeep-inline-functions, documented to do a completely unrelated non-optimization (namely emitting inline functions even when all uses of them had been inlined), will also switch off the problematic tail call optimization.

Revision history for this message
In , David Kastrup (dak) wrote :

This particular code generation bug is responsible for at least half a dozen problems in the code base of Lilypond and causes a number of regression test failures.

We will have to add respective compiler options based on the version number of gcc. If anybody knowing the responsible compiler internals can construct a self-contained test case that does not require manually inspecting the generated code for errors, we could at least add an autoconf test specifically tailored to the occurence of this bug instead of basing the workaround compiler options on the version number.

Revision history for this message
In , Jakub-gcc (jakub-gcc) wrote :

Can't reproduce this with 4.6, but can with the trunk.

/* PR middle-end/51323 */

extern void abort (void);
struct S { int a, b, c; };
int v;

__attribute__((noinline, noclone)) void
foo (int x, int y, int z)
{
  if (x != v || y != 0 || z != 9)
    abort ();
}

static inline int
baz (const struct S *p)
{
  return p->b;
}

__attribute__((noinline, noclone)) void
bar (int x, struct S y)
{
  foo (baz (&y), 0, x);
}

int
main ()
{
  struct S s;
  v = 3; s.a = v - 1; s.b = v; s.c = v + 1;
  bar (9, s);
  v = 17; s.a = v - 1; s.b = v; s.c = v + 1;
  bar (9, s);
  return 0;
}

at -O2 -m32 fails. My http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02413.html
seems to fix this.

Revision history for this message
In , David Kastrup (dak) wrote :

I can confirm that my version of gcc identifying itself as
gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
makes your test program abort under -O2. If you _cannot_ confirm this with your version 4.6.1 but with the trunk, it would appear that Ubuntu 11.10 (or its upstream Debian) has imprudently integrated unstable code from the 4.7 branch into the version of gcc they choose to distribute with the release.

If your test program can reasonably be considered as perfectly correlated with the occurence of the bug (I don't have the expertise), I'll be using it as an autoconf test in Lilypond for deciding whether to compile with -fno-optimize-sibling-calls instead of the current test just checking the version.

Thanks.

Revision history for this message
In , David Kastrup (dak) wrote :

Question: the proposed fix is in gcc/calls.c which looks somewhat architecture independent. Am I right in assuming that this means that the bug may manifest itself under architectures different from i686 given different conditions?

In that case, I would tend to just unconditionally do -fno-optimize-sibling-calls in our autoconf checks for all respective gcc versions independent from tests and architecture since I don't have the hardware for other platforms in order to figure out compiler bugs, and since the bug tends to hide its cause in the resulting segfault, as it occurs only with tail jumps, meaning that the responsible function is not even visible in the stack traceback.

Revision history for this message
In , Jakub-gcc (jakub-gcc) wrote :

That would be too big hammer approach. While the fix is in arch independent code, on most architectures you could hit it only with > 6 resp. > 8 arguments and with similar scenario earlier callee argument stack slot initialized from later caller's argument stack slot. Furthermore I don't think you could hit it before 4.6 because MEM_REF wasn't supported there, so the address would be expanded without a temporary pseudo and the checking routine would see it right away.
Using this testcase and perhaps a modified one additionally too which will have in both the caller and tail callee say 16 extra dummy integer arguments passed through should be more than enough. Not to mention that having workarounds for such compiler bugs in packages is just weird, if the compiler is buggy, the user should just upgrade it to a fixed version.

Revision history for this message
David Kastrup (dak) wrote :
Revision history for this message
In , David Kastrup (dak) wrote :

I agree that the real fix is to force an upgrade of the compiler to a fixed version. However, Ubuntu 11.10 has been released and is in circulation, so we can't reasonably implement that solution until the buggy compilers have had a reasonable chance to be replaced everywhere.

I have reported this bug to Ubuntu. If you are right that it can't be found in 4.6 proper, they will have acquired it via distribution specific patches. What that means for stability and security of the entire current Ubuntu code base, one can only guess.

Regarding Lilypond, we have chosen to use -fno-optimize-sibling-calls based on the gcc version number instead of an actual test, without consideration of the architecture. Tracking this bug down has cost us several weeks of developer time and brought down our build infrastructure for a while until the first workaround, -fkeep-inline-functions, has been discovered by chance. Lilypond is a C++ application with considerable parts written in Guile, so segfaults usually are a problem of forgetting garbage collection protection measures. As far as I know, I am the only active programmer with a system programming background. When the bug manifests itself in a segfault, the responsible function is no longer visible in the stack backtrace. This makes finding the culprit extremely unfunny. In our case, the problem was exacerbated because the last visible caller in the stack backtrace made its call via a function pointer table, this table was a C++ vector, and accessing the vector in gdb was not possible because operator[] had been inlined. Specifying -fkeep-inline-function, which is according to its documentation supposed to _only_ additionally emit (unused) inline function instantiations that could have been used for accessing that table in the debugger, made the bug disappear.

There is no sane reason that -fkeep-inline-functions turns off sibling call optimization, but while sabotaging the debugging of this problem, it at least gave us a workaround.

So we simply can't afford dealing with this kind of situation more than once. We don't have the skill sets. In contrast, the positive results of this optimization are negligible for us since we don't employ systematic call chaining (like a P code interpreter using function pointer tables likely would).

Revision history for this message
In , Jakub-gcc (jakub-gcc) wrote :

Actually, it fails on 4.6 vanilla branch too, just not in GCC 4.6-RH, because we don't enable -fipa-sra by default for -O2/-Os for debug info quality reasons in 4.6, only for -O3 (that is something that is solved in GCC 4.7).

The regression started with http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=164135 which started building MEM_REF in the IPA-SRA codepath.

As for your workaround, if you only do it for GCC 4.6.* and not also for 4.7 and later, it is fine, all I would like is that such workarounds don't stick around forever for future versions of GCC.

Revision history for this message
In , David Kastrup (dak) wrote :

Code review is at <URL:http://codereview.appspot.com/5431088>, the discussion of the bug is at <URL:http://code.google.com/p/lilypond/issues/detail?id=1997>.

As you can see, the proposed workaround is restricted to g++ versions of 4.6.x. I trust that it will be fixed by the time 4.7 gets released, and once we have conclusive evidence about versions of 4.6.x that are unaffected, those will likely not get the fix either.

As a suggestion: it might be sensible to have a meta option -fdebug that will disable all options significantly interfering with post mortem debugging. While -g by itself should not change code generation, having a supporting option that helps debugging might be nice.

The option set I currently think of is something like -fno-crossjumping -fkeep-inline-functions -fno-optimize-sibling-calls. Also optimization of noreturn functions, in particular of abort, would be disabled since clobbering the stack traceback is not really helpful for debugging.

But that's a different issue.

Revision history for this message
David Kastrup (dak) wrote :

As you can see on the upstream bug report, this bug has now been confirmed to be generally affecting gcc-4.6, not just on Ubuntu, so my guess about distribution-specific patches likely was wrong. RedHat was unaffected because of having disabled a particular optimization for debugging reasons.

The bug has been given the status of a P1 regression, meaning that gcc 4.7 won't be released without a fix for it. There is a target milestone of gcc-4.6.3 for the fix in the 4.6 series.

visibility: private → public
Changed in gcc:
importance: Unknown → High
status: Unknown → In Progress
Matthias Klose (doko)
Changed in gcc-4.6 (Ubuntu):
importance: Undecided → High
status: New → Confirmed
Revision history for this message
In , Jakub-gcc (jakub-gcc) wrote :

Author: jakub
Date: Mon Dec 5 08:15:23 2011
New Revision: 182000

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=182000
Log:
 PR middle-end/51323
 PR middle-end/50074
 * calls.c (internal_arg_pointer_exp_state): New variable.
 (internal_arg_pointer_based_exp_1,
 internal_arg_pointer_exp_scan): New functions.
 (internal_arg_pointer_based_exp): New function.
 (mem_overlaps_already_clobbered_arg_p): Use it.
 (expand_call): Free internal_arg_pointer_exp_state.cache vector
 and clear internal_arg_pointer_exp_state.scan_start.

 * gcc.c-torture/execute/pr51323.c: New test.

Added:
    trunk/gcc/testsuite/gcc.c-torture/execute/pr51323.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/calls.c
    trunk/gcc/testsuite/ChangeLog

Revision history for this message
In , Jakub-gcc (jakub-gcc) wrote :

Fixed on the trunk so far.

Revision history for this message
In , Jakub-gcc (jakub-gcc) wrote :

Author: jakub
Date: Thu Dec 8 13:36:40 2011
New Revision: 182112

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=182112
Log:
 Backport from mainline
 2011-12-05 Jakub Jelinek <email address hidden>
      Eric Botcazou <email address hidden>

 PR middle-end/51323
 PR middle-end/50074
 * calls.c (internal_arg_pointer_exp_state): New variable.
 (internal_arg_pointer_based_exp_1,
 internal_arg_pointer_exp_scan): New functions.
 (internal_arg_pointer_based_exp): New function.
 (mem_overlaps_already_clobbered_arg_p): Use it.
 (expand_call): Free internal_arg_pointer_exp_state.cache vector
 and clear internal_arg_pointer_exp_state.scan_start.

 2011-11-26 Joern Rennecke <email address hidden>

 PR middle-end/50074
 * calls.c (mem_overlaps_already_clobbered_arg_p):
 Return false if no outgoing arguments have been stored so far.

 2011-12-05 Jakub Jelinek <email address hidden>
      Eric Botcazou <email address hidden>

 PR middle-end/51323
 PR middle-end/50074
 * gcc.c-torture/execute/pr51323.c: New test.

Added:
    branches/gcc-4_6-branch/gcc/testsuite/gcc.c-torture/execute/pr51323.c
Modified:
    branches/gcc-4_6-branch/gcc/ChangeLog
    branches/gcc-4_6-branch/gcc/calls.c
    branches/gcc-4_6-branch/gcc/testsuite/ChangeLog

Revision history for this message
In , Jakub-gcc (jakub-gcc) wrote :

Fixed.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package gcc-4.6 - 4.6.2-6ubuntu1

---------------
gcc-4.6 (4.6.2-6ubuntu1) precise; urgency=low

  * Merge with Debian.
  * Linaro GCC 4.6 2011-12 changes not yet in Ubuntu:
    - Generic tuning support for Big-endian platforms.
    - SLP support for operations with arbitrary numbers of operands.
    - SLP support for conditions.
    - Pattern recognition support in basic-block SLP.
    - Enhancements to mixed-size condition pattern recognition.
    - Unaligned block-move support for ARMv7.
    - Added Cortex-A15 integer pipeline tuning.

gcc-4.6 (4.6.2-6) unstable; urgency=low

  * Update to SVN 20111208 (r182120) from the gcc-4_6-branch.
    - Fix PR c++/51265, PR bootstrap/50888, PR target/51393 (ix86),
      PR target/51002 (AVR), PR target/51345 (AVR), PR debug/48190,
      PR fortran/50684, PR fortran/51218, PR target/50906 (closes: #650318),
      PR tree-optimization/51315 (closes: #635126), PR tree-optimization/50622,
      PR fortran/51435, PR debug/51410, PR c/51339, PR rtl-optimization/48721,
      PR middle-end/51323 (LP: #897583), PR middle-end/50074,
      PR middle-end/50074.

  [ Matthias Klose ]
  * Run the libstdc++ testsuite on all architectures again. Closes: #622699.
  * Apply proposed patch for PR target/50906 (powerpcspe only). Closes: #650318.
  * Fix PR target/49030 (ARM), taken from Linaro. Closes: #633479.
  * Fix PR target/50193 (ARM), taken from Linaro. Closes: #642127.
  * Install the libstdc++.so-gdb.py file. LP: #883269.
  * Fix PR c++/50114, backport from trunk. LP: #827806.
  * Merge changes to allow gcc-snapshot cross builds, taken from Linaro.
  * Update the Linaro support to the 4.6 branch.

  [ Marcin Juszkiewicz ]
  * Fix issues with gcc-snapshot cross builds.
  * Allow building Linaro binary packages in a single package.
  * Apply hardening patches for cross builds when enabled for native builds.
 -- Matthias Klose <email address hidden> Thu, 08 Dec 2011 17:17:30 +0100

Changed in gcc-4.6 (Ubuntu):
status: Confirmed → Fix Released
Changed in gcc:
status: In Progress → Fix Released
Michael Hope (michaelh1)
Changed in gcc-linaro:
status: New → Confirmed
Revision history for this message
Michael Hope (michaelh1) wrote :

Will get picked up in the next 4.6 merge. Assigned to Andrew to track.

Changed in gcc-linaro:
status: Confirmed → In Progress
importance: Undecided → High
assignee: nobody → Andrew Stubbs (ams-codesourcery)
milestone: none → 4.6-2012.01
Michael Hope (michaelh1)
Changed in gcc-linaro:
status: In Progress → Fix Committed
Michael Hope (michaelh1)
Changed in gcc-linaro:
status: Fix Committed → In Progress
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public Security information  
Everyone can see this security related information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.