add in limit for size of core file accepted / retraced

Bug #1570937 reported by Brian Murray
This bug affects 1 person
Affects: Daisy
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

The retracers won't be able to handle core files greater than some size (due to disk or memory limitations), so we should not accept core files that are larger than that size.

This came about because we had retracers repeatedly failing to retrace a core file. From the logs we can see:

2016-04-15 12:22:31,732:8055:140170219415296:INFO:root:0518d850-0252-11e6-9725-fa163ed44aae:swift:Processing.
2016-04-15 12:22:32,653:8055:140170219415296:INFO:root:0518d850-0252-11e6-9725-fa163ed44aae:swift:Decompressing to /tmp/tmpbKYmwx-swift.0518d850-0252-11e6-9725-fa163ed44aae.oopsid.core
2016-04-15 12:26:32,690:24400:140395459364608:INFO:root:Running revision number: 697 with sandbox_dir /srv/daisy.ubuntu.com/production/cache, gdb 7.10.90.20160215-0ubuntu3~~0.IS.12.04.0.
2016-04-15 12:26:33,924:24400:140395459364608:INFO:root:Waiting for messages. ^C to exit.
2016-04-15 12:30:31,640:24400:140395459364608:INFO:root:0518d850-0252-11e6-9725-fa163ed44aae:swift:Processing.
2016-04-15 12:30:31,765:24400:140395459364608:INFO:urllib3.connectionpool:0518d850-0252-11e6-9725-fa163ed44aae:swift:Starting new HTTP connection (1): 10.34.0.136
2016-04-15 12:30:32,966:24400:140395459364608:INFO:root:0518d850-0252-11e6-9725-fa163ed44aae:swift:Decompressing to /tmp/tmp3OaOTw-swift.0518d850-0252-11e6-9725-fa163ed44aae.oopsid.core
2016-04-15 12:34:32,163:25279:140510235412224:INFO:root:Running revision number: 697 with sandbox_dir /srv/daisy.ubuntu.com/production/cache, gdb 7.10.90.20160215-0ubuntu3~~0.IS.12.04.0.
2016-04-15 12:34:33,440:25279:140510235412224:INFO:root:Waiting for messages. ^C to exit.

Brian Murray (brian-murray) wrote :

I don't believe we know the size of the core file while it is being sent or received. However, the retracing process might be able to determine the size before it fetches the core file from swift; it could then mark the retrace as failed and delete the core file from swift.
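
A minimal sketch of that check, assuming the retracer talks to swift through an object that behaves like python-swiftclient's `Connection`, whose `head_object` returns the object's headers (including `content-length`) without downloading the body. The function name, the `MAX_CORE_BYTES` limit, and its value are hypothetical. Note that the size reported is that of the object as stored in swift, which the logs above suggest is compressed, so it understates the size of the decompressed core.

```python
MAX_CORE_BYTES = 1024 ** 3  # hypothetical limit (1 GiB); not from the bug report


def reject_if_too_large(swift, bucket, oops_id, limit=MAX_CORE_BYTES):
    """Return True if the stored core exceeded `limit` and was deleted.

    `swift` is assumed to behave like swiftclient.client.Connection:
    head_object() returns a dict of response headers with lowercase keys.
    """
    headers = swift.head_object(bucket, oops_id)
    size = int(headers.get('content-length', 0))
    if size <= limit:
        return False
    # Too big to retrace: remove it so it is not retried forever.
    swift.delete_object(bucket, oops_id)
    return True
```

The caller would still need to record the crash as failed; how that is done depends on retracer code not shown in this report.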

summary: - add in limit for size of core file accepted
+ add in limit for size of core file accepted / retraced
Changed in daisy:
importance: Undecided → High
status: New → Triaged
Brian Murray (brian-murray) wrote :

I don't recall exactly what prompted this bug report, but I went looking for retrace failures involving large core files and didn't find many.

for day in $(seq -w 1 24); do for delog in $(find . -name daisy-error.log-201605$day.gz); do zgrep -E "[4-9][0-9]{8} byte core file" $delog; done; done
2016-04-30 08:29:02 [21487] [INFO] 8088a2f6-0ead-11e6-b5ae-fa163ed44aae has a 555472575 byte core file
2016-05-01 12:50:55 [7575] [INFO] 3db3e87a-0f9b-11e6-a577-fa163ef911dc has a 659132997 byte core file
2016-05-02 11:12:32 [23157] [INFO] ab97e628-1056-11e6-ba1d-fa163ed44aae has a 584719133 byte core file
2016-05-06 11:43:24 [12105] [INFO] a029cbae-137f-11e6-b4fe-fa163ebeb28a has a 451253946 byte core file
2016-05-08 05:50:24 [10666] [INFO] b5360a14-14e0-11e6-a9ae-fa163e54c21f has a 412296492 byte core file
2016-05-12 08:07:33 [6041] [INFO] 7caa2a82-1818-11e6-8039-fa163ef911dc has a 409175196 byte core file
2016-05-18 11:33:15 [27203] [INFO] 33472862-1cec-11e6-9069-fa163e171d9b has a 438441177 byte core file
2016-05-23 09:53:33 [32421] [INFO] 1ce516cc-20cc-11e6-83e5-fa163e30221b has a 410228098 byte core file

Of those crashes, only ab97e628-1056-11e6-ba1d-fa163ed44aae caused a problem, with the retracer unable to allocate memory. However, the crash was then sent to the failed-queue retracers, where it succeeded. So the data doesn't indicate that this is a high priority.
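
For repeating this kind of survey, the same search can be sketched in Python. This is only an illustration: the function names and the default threshold are assumptions, and the pattern is based solely on the log lines shown above. Unlike the `zgrep` pattern, it compares sizes numerically, so it would also report cores of 1 GB or more.

```python
import gzip
import re

# Matches lines such as:
# 2016-05-02 11:12:32 [23157] [INFO] <oops-id> has a 584719133 byte core file
CORE_SIZE_RE = re.compile(r'(\S+) has a (\d+) byte core file')


def large_cores(lines, threshold=400000000):
    """Yield (oops_id, size) for each log line reporting a core >= threshold bytes."""
    for line in lines:
        match = CORE_SIZE_RE.search(line)
        if match and int(match.group(2)) >= threshold:
            yield match.group(1), int(match.group(2))


def scan_log(path, threshold=400000000):
    """Scan one gzip-compressed daisy-error log file."""
    with gzip.open(path, 'rt') as log:
        return list(large_cores(log, threshold))
```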

Changed in daisy:
importance: High → Low
Brian Murray (brian-murray) wrote :

We could stop accepting core files for large crashes with something like this:

 $ bzr diff
=== modified file 'daisy/submit_core.py'
--- daisy/submit_core.py 2016-05-24 16:29:14 +0000
+++ daisy/submit_core.py 2016-05-24 16:44:33 +0000
@@ -117,8 +117,10 @@
             t_size = os.path.getsize(t.name)
             msg = '%s has a %i byte core file' % (oops_id, t_size)
             logger.info(msg)
-            # Don't set a content_length (that we don't have) to force a chunked
-            # transfer.
+            if t_size > 100000:
+                msg = 'Not writing extra large core file for %s.' % (oops_id)
+                logger.info(msg)
+                return False
             _cached_swift.put_object(bucket, oops_id, t, content_length=t_size)
     except IOError, e:
         swift_delete_ignoring_error(_cached_swift, bucket, oops_id)
@@ -233,6 +235,7 @@
     if written:
         return message
     else:
+        # don't want this to increment if it is a large core file
         metrics.meter('storage_write_error')
         return None
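
The comment added in the second hunk points at an open problem: returning False for an oversized core makes the caller count it as a storage write error. One possible way around that, sketched below, is to return a distinct outcome for the skipped case. Everything here is hypothetical (the outcome constants, `store_core`, the `put` callable standing in for the swift upload, and `should_count_as_write_error`); the real code returns True or False.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical outcome values; the real code returns True/False.
WRITTEN, SKIPPED_TOO_LARGE, WRITE_ERROR = 'written', 'too_large', 'error'

MAX_CORE_BYTES = 100000  # same threshold as the diff above


def store_core(put, oops_id, size, limit=MAX_CORE_BYTES):
    """Call put() unless the core is over the limit; report what happened."""
    if size > limit:
        logger.info('Not writing extra large core file for %s.', oops_id)
        return SKIPPED_TOO_LARGE
    try:
        put()
    except IOError:
        return WRITE_ERROR
    return WRITTEN


def should_count_as_write_error(outcome):
    """Only genuine storage failures should bump storage_write_error."""
    return outcome == WRITE_ERROR
```

With this shape the caller would meter `storage_write_error` only when `should_count_as_write_error(outcome)` is true, so deliberately skipped cores would not inflate the metric.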
