MadGraph Crashes While Waiting for Filesystem Update

Bug #1071765 reported by Brian Dorney
This bug affects 1 person

Affects            Status         Importance   Assigned to         Milestone
MadGraph5_aMC@NLO  Fix Released   Undecided    Olivier Mattelaer

Bug Description

Dear Experts,

I've been trying to run MadGraph from the command line via:

bin/generate_events test_bjets3 -f

And in debug mode via:
./bin/madevent
generate_events test_bjets5 -f

The jobs appear to submit correctly and run on the cluster, but crash on completion with the following error message:

Command "generate_events test_bjets5 -f" interrupted with error:
IOError : [Errno 2] No such file or directory: '/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/G1/results.dat'
Please report this bug on https://bugs.launchpad.net/madgraph5
More information is found in '/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/test_bjets5_tag_1_debug.log'.
Please attach this file to your report.

I have tried making the replacement suggested in

https://answers.launchpad.net/madgraph5/+question/208323

to my cluster.py file under ../bjets_yay/bin/internal

But this does not appear to solve the issue.

Additionally, I do have GX-style directories present in my ../bjets_yay/SubProcesses/P*/ directory.

I would appreciate any help/advice.

I have attached the log file mentioned above.

Best,
-Brian

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi Brian,

Could you try running with
cluster_temp_path set to None
and tell me whether this works? This will help me a lot to understand the problem.
Cheers,

Olivier

Brian Dorney (bdorney-physics) wrote :

Olivier,

Setting cluster_temp_path to None still triggers a crash.

I have attached the new Log file.

Best,
-Brian

Brian Dorney (bdorney-physics) wrote :

Attached are the contents of the /Cards directory associated with the "test_bjets6_tag_1_debug.log" run I performed with cluster_temp_path = None.

Brian Dorney (bdorney-physics) wrote :

And here is the command prompt output from madevent, also associated with the test_bjets6_tag_1_debug.log run.

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi Brian,

Thanks for the additional information.
I've run the exact same process on my laptop (in fact, you don't need a cluster in this case; it is very fast)
and it passes out of the box. So this proves that this is purely a cluster problem.

Could you check whether the file that MG5 reports as missing is present on your disk?
If yes, this means that MG5 didn't wait long enough for this file to appear.
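
The idea of waiting for a file to show up on a shared filesystem can be illustrated with a small polling helper. This is only a minimal sketch and not MG5's actual implementation; the function name `wait_for_file` and its `timeout` and `poll_interval` parameters are assumptions made for illustration.

```python
import os
import time


def wait_for_file(path, timeout=60.0, poll_interval=1.0):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    Hypothetical helper: returns True if the path appeared in time,
    False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between checks so the shared filesystem is not hammered.
        time.sleep(poll_interval)
```

A caller that gets False back knows the file never appeared within the timeout, which points to a job that did not produce it rather than to a slow filesystem.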

Cheers,

Olivier

Brian Dorney (bdorney-physics) wrote :

Olivier,

It looks like the file indeed does not exist. Trying the following generation, with everything else the same, gives:

Command "generate_events test_bjets8 -f" interrupted with error:
IOError : [Errno 2] No such file or directory: '/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/G1/results.dat'

Attempting to find the above file gives:
ls: /afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/G1/results.dat: No such file or directory

The actual contents of the directory are:

input_app.txt log.txt run1_app.log

Could MadGraph have placed this file in another directory without my knowing?

Ideally I'd like to go to a much larger generation (~4e7 events) using the multi-run feature in the future, which would simply take too long on my local machine. These errors occurred when I was trying out some test jobs.

Thanks again for your help and advice.

Best,
-Brian

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi Brian,

Looks like our UIUC cluster was subject to the same bug.

I fixed it in the following way:
in the file madgraph/interface/madevent_interface.py, around line 3510, change
                if not output_files:
                    Ire = re.compile("for i in ([\d\s]*) ; do")
                    data = Ire.findall(text)
to
                if not output_files:
                    Ire = re.compile("for i in ([\d\.\s]*) ; do")
                    data = Ire.findall(text)

In principle this should have an impact only if cluster_temp_path is defined. So this might not be enough in your case.
But could you try with this fix?
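
To see what the extra `\.` in the character class changes, the two patterns quoted above can be compared on sample input. This is a sketch: the two loop lines are made up for illustration and are not taken from a real job script, and the patterns are written as raw strings, which compile to the same regular expressions.

```python
import re

# Pattern before the fix: only digits and whitespace are allowed in the list.
OLD_PATTERN = re.compile(r"for i in ([\d\s]*) ; do")
# Pattern after the fix: a literal dot is accepted as well.
NEW_PATTERN = re.compile(r"for i in ([\d\.\s]*) ; do")


def channel_lists(pattern, text):
    """Return every captured list of loop values found in `text`."""
    return pattern.findall(text)


# Hypothetical job-script lines, invented for this example.
integer_loop = "for i in 1 2 3 ; do"
decimal_loop = "for i in 1.001 2 3 ; do"

print(channel_lists(OLD_PATTERN, integer_loop))  # ['1 2 3']
print(channel_lists(OLD_PATTERN, decimal_loop))  # []
print(channel_lists(NEW_PATTERN, decimal_loop))  # ['1.001 2 3']
```

With the old pattern, a loop list containing a value with a decimal point is not matched at all, so `data` comes back empty.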

Thanks,

Olivier

Brian Dorney (bdorney-physics) wrote :

Olivier,

As you suspected, the above change did not solve the problem.

After making the change you suggested, the error message was:

Start waiting for update on filesystem. (more info in debug mode)

Command "generate_events test_bjets9 -f" interrupted with error:
IOError : [Errno 2] No such file or directory: '/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/G1/results.dat'
Please report this bug on https://bugs.launchpad.net/madgraph5
More information is found in '/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/test_bjets9_tag_1_debug.log'.
Please attach this file to your report.

The new log file is attached.

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi Brian,

Thanks for the test.

This is now running on the UIUC cluster correctly.

I tend to think that your problem is not related to the way the cluster is supported but to the fact that
some jobs crash on your server. If this is the case, the reason should be recorded in the files present in the G1 directory.
Could you attach the three files present in the G1 directory
(i.e. input_app.txt log.txt run1_app.log)?

Thanks,

Olivier

Brian Dorney (bdorney-physics) wrote :

Olivier,

I am attaching the input_app.txt file.

log.txt and run1_app.log appear to be empty.

-Brian

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi Brian,

OK, so clearly the problem is not linked to the LSF implementation but to the fact that the executable is not running at all.

Could you check/do the following:
1) Check that /afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/madevent exists and is executable.

If it doesn't exist, run the following:
make gensym
./gensym < input_app.txt
make

2) Run the following:
cd /afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/
./ajob1

This one should crash. Could you report all messages printed on screen?
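
The "exists and is executable" check in step 1 can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `is_runnable` is an assumption introduced for illustration.

```python
import os


def is_runnable(path):
    """Return True if `path` is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

Calling `is_runnable` with the full path to the madevent binary on the node where the job runs would show whether that node can see and execute it.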

If it works locally, the only remaining possibility is that some nodes (all?) don't have enough RAM to run madevent.

By the way, if you run only
p p > b b~
does this work?

Cheers,

Olivier

Brian Dorney (bdorney-physics) wrote :

Olivier,

Okay, the following:

/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/madevent

does exist; the output of running it is shown in one of the attachments.

When running

/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg/ajob1

I get the following output:

[lxplus418] ~/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg $ ./ajob1
[lxplus418] ~/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/bjets_yay/SubProcesses/P3_gg_bbxgg $

So no error message occurs. How much RAM is generally required? I could try requesting only nodes with more than X MB/GB.

I copied the Template directory into a new directory. Using the same cards as before, but with only p p > b b~ QED @1 I started another generation with the following output:

[lxplus245] ~/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/BBbar_Simple $ ./bin/generate_events testSimpleRun2_1M -f
No module named madgraph.interface.extended_cmd
No module named madgraph
No module named madgraph
************************************************************
*                                                          *
*          W E L C O M E  to  M A D G R A P H  5           *
*                     M A D E V E N T                      *
*                                                          *
*                *                       *                 *
*                  *        * *        *                   *
*                    * * * * 5 * * * *                     *
*                  *        * *        *                   *
*                *                       *                 *
*                                                          *
*         VERSION 5.1.5.2                                  *
*                                                          *
*    The MadGraph Development Team - Please visit us at    *
*    https://server06.fynu.ucl.ac.be/projects/madgraph     *
*                                                          *
*              Type 'help' for in-line help.               *
*                                                          *
************************************************************
load configuration from /afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/BBbar_Simple/Cards/me5_configuration.txt
load configuration from /afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/input/mg5_configuration.txt
load configuration from /afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/BBbar_Simple/Cards/me5_configuration.txt
Using default eps viewer "gv". Set another one in ./input/mg5_configuration.txt
Using default web browser "firefox". Set another one in ./input/mg5_configuration.txt
generate_events testSimpleRun2_1M -f
Will run in mode parton
Generating 1000000 events with run name testSimpleRun2_1M
survey testSimpleRun2_1M
compile directory
Using random number seed offset = 48
Running Survey
Creating Jobs
Working on SubProcesses
    P1_gg_bbx
    P1_q...


Brian Dorney (bdorney-physics) wrote :

The log file:

/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/BBbar_Simple/testSimpleRun2_1M_tag_1_debug.log

Is also attached.

Best,
-Brian

Olivier Mattelaer (olivier-mattelaer) wrote :

Thanks for the tests,

Could you test two things:
1) Modify line 74 of madgraph/interface/cluster.py from
        if not hasattr(self, 'temp_dir'):
to
        if not hasattr(self, 'temp_dir') or not self.temp_dir:
and try submitting on your cluster again. This might solve your problem.
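
The difference between the two conditions shows up when the attribute exists but is set to None. Below is a minimal sketch; `ClusterSketch` and its method names are hypothetical stand-ins, not the real class in cluster.py, and what the guarded branch actually does in MG5 is not shown here.

```python
class ClusterSketch:
    """Hypothetical stand-in for a cluster class; not the real MG5 code."""

    def __init__(self, temp_dir=None):
        # The attribute always exists, even when no directory is configured.
        self.temp_dir = temp_dir

    def takes_branch_old(self):
        # Original test: true only when the attribute is missing entirely.
        return not hasattr(self, 'temp_dir')

    def takes_branch_new(self):
        # Patched test: also true when the attribute is None or empty.
        return not hasattr(self, 'temp_dir') or not self.temp_dir


unset = ClusterSketch(temp_dir=None)
print(unset.takes_branch_old())  # False
print(unset.takes_branch_new())  # True
```

So with the attribute set to None, the original condition skips the branch while the patched one enters it.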

2) If that doesn't work, try to submit your job manually on your cluster with something like

echo "cd /afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/BBbar_Simple/SubProcesses/P0_gg_bbx;/afs/cern.ch/user/d/dorney/scratch0/MadGraph/template_gridpack/work/MadGraph5_v1_5_2/BBbar_Simple/SubProcesses/P0_gg_bbx/ajob1" | bsub -o out.log -J a1b07041ce0e6d -e err.log

Thanks for your patience,

Olivier

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi Brian,

I suppose that you have given up on this, right?
I will then close this ticket.

Sorry for that,

Olivier

Brian Dorney (bdorney-physics) wrote : Re: [Bug 1071765] Re: MadGraph Crashes While Waiting for Filesystem Update

Olivier,

Sorry for the late response, it has been very busy here.

It was brought to my attention that someone here at CERN has modified
MadGraph to work on the clusters here. I've attached the tar file in case
you had an interest in this.

Best,
-Brian

On Thu, Nov 15, 2012 at 3:42 PM, Olivier Mattelaer <
<email address hidden>> wrote:

> Hi Brian,
>
> I suppose that you have give up on this right?
> I will then close this ticket.
>
> Sorry for that,
>
> Olivier

Olivier Mattelaer (olivier-mattelaer) wrote : Re: [Bug 1071765] MadGraph Crashes While Waiting for Filesystem Update

Hi Brian,

Thanks a lot for the tarball.
It looks like this is (or is at least very close to) the version that CMS is using for their official generation.

The way it submits to the cluster is very close to the standard version
(here is the diff):
[@Oliviers-MacBook-Pro ~]$ diff Downloads/MG5v1.4.8/madgraph/various/cluster.py Documents/eclipse/1.4.8.4/madgraph/various/cluster.py
562,564c562,564
< command = ['bsub',#'-o', stdout,
< '-J', me_dir] #,
< #'-e', stderr]
---
> command = ['bsub','-o', stdout,
> '-J', me_dir,
> '-e', stderr]

As you can see, the problem seems to be linked to the way the paths of the output file and error file are specified.

In most cases, stdout/stderr are redirected to /dev/null.
In all those cases (i.e. when no output/error file is specified), I have switched to the syntax used in your version.
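
The approach described here, passing `-o` and `-e` to bsub only when an output or error file was actually requested, could be sketched as follows. This is an illustration under assumptions, not the code that was merged; the function name `build_bsub_command` and its parameters are hypothetical, while the `-o`, `-J` and `-e` options are the ones that appear in the diff.

```python
def build_bsub_command(job_name, stdout=None, stderr=None):
    """Build an argument list for bsub, adding -o/-e only when requested."""
    command = ['bsub', '-J', job_name]
    if stdout is not None:
        # Only name an output file when the caller asked for one.
        command += ['-o', stdout]
    if stderr is not None:
        # Same for the error file.
        command += ['-e', stderr]
    return command


print(build_bsub_command('a1b07041ce0e6d'))
# ['bsub', '-J', 'a1b07041ce0e6d']
print(build_bsub_command('a1b07041ce0e6d', stdout='out.log', stderr='err.log'))
# ['bsub', '-J', 'a1b07041ce0e6d', '-o', 'out.log', '-e', 'err.log']
```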

So this should allow you to finish the survey.
I'm not 100% sure that my modification will be enough for submissions that require a specific output file,
but I hope so (if the problem is only caused by /dev/null, then it should be fine).

These changes will be merged in 1.5.5, which will be public in a couple of hours.

Thanks so much,

Olivier

On Nov 17, 2012, at 3:59 AM, Brian Dorney <email address hidden> wrote:

> Olivier,
>
> Sorry for the late response, it has been very busy here.
>
> It was brought to my attention that someone here at CERN has modified
> MadGraph to work on the clusters here. I've attached the tar file in case
> you had an interest in this.
>
> Best,
> -Brian

Changed in madgraph5:
status: New → Fix Committed
assignee: nobody → Olivier Mattelaer (olivier-mattelaer)
Changed in madgraph5:
status: Fix Committed → Fix Released