| World Community Grid Forums
Thread Status: Active Total posts in this thread: 12
XFox
Cruncher Italy Joined: Jun 12, 2007 Post Count: 5 Status: Offline Project Badges:
|
I'm on Mac OS X 10.7.5 (11G63) and I'm running the command-line-only version of BOINC client 7.0.65 on an Intel Core 2 Duo Mac mini.
Looking at my system logs, I noticed that the CEP2 application sometimes crashes, apparently just after it finishes crunching a workunit and just before it uploads the results of the computation. Nevertheless, my Results Status page shows that the result in question has been deemed valid. As an example, this is the result of the last workunit that caused the application worker to crash. When it happens, I see the following line in my system.log:
May 2 16:52:01 my-mac ReportCrash[5765]: Saved crash report for wcgrid_cep2_qchem_6.40_i686-apple-darwin[4770] version ??? (???) to /Library/Logs/DiagnosticReports/wcgrid_cep2_qchem_6.40_i686-apple-darwin_2013-05-02-165201_localhost.crash
If needed, I can provide the full crash reports. |
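For anyone who wants to count how often this happens, here is a minimal sketch that pulls the process name, PID, and report path out of a ReportCrash line like the one quoted above. The regular expression is only inferred from that single line, and the function name `parse_crash_line` is made up for illustration, so treat it as an assumption rather than a documented log format.

```python
import re

# Assumption: the pattern is inferred from the one system.log line quoted
# in this thread; it is not taken from any ReportCrash documentation.
CRASH_RE = re.compile(
    r"ReportCrash\[\d+\]: Saved crash report for "
    r"(?P<process>\S+)\[(?P<pid>\d+)\].* to (?P<path>\S+\.crash)"
)


def parse_crash_line(line):
    """Return (process, pid, path) for a matching line, or None otherwise."""
    match = CRASH_RE.search(line)
    if match is None:
        return None
    return match.group("process"), int(match.group("pid")), match.group("path")


# The line from the post above, used here as sample input.
SAMPLE_LINE = (
    "May 2 16:52:01 my-mac ReportCrash[5765]: Saved crash report for "
    "wcgrid_cep2_qchem_6.40_i686-apple-darwin[4770] version ??? (???) to "
    "/Library/Logs/DiagnosticReports/"
    "wcgrid_cep2_qchem_6.40_i686-apple-darwin_2013-05-02-165201_localhost.crash"
)
```

Calling `parse_crash_line` on every line of system.log and skipping the `None` results would give a list of crashes that can be matched by time against the Results Status page.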
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
XFox,
Can you post the result log for one of the results that has errored out? On the result status page, just click on the status and it should open the result log in a separate window. Thanks, armstrdj |
||
|
|
XFox
Cruncher Italy Joined: Jun 12, 2007 Post Count: 5 Status: Offline Project Badges:
|
I'm sorry armstrdj,
I didn't realize that result logs are not public. Unfortunately, at the moment I can no longer see any errored CEP2 results on my results status page. Right now I'm setting my preferences to get only CEP2 work units; as soon as I get another crash, I'll post the related result log. |
||
|
|
XFox
Cruncher Italy Joined: Jun 12, 2007 Post Count: 5 Status: Offline Project Badges:
|
This is the result log of a workunit that caused the application worker to crash:
Result Name: E213288_ 426_ C.33.C30H19NSSi.00966244.1.set1d06_ 1-- |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello XFox,
I do not see anything unusual. The work unit completed the first 12 jobs [0-11], then started on the second longest job, Job #12. After several hours, it hit the 12-hour limit and stopped running Job #12. It created 0 results for jobs #12-15, then returned all 16 job results. Later, the Results Status page validated the first 12 jobs returned. All is well. Here is my most recent explanation about CEP2: https://secure.worldcommunitygrid.org/forums/...ead,35108_offset,0#421066 Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Lawrence,
I beg to differ, this is not 12 hours:
[03:31:25] Starting job 13,CPU time has been restored to 22125.811551.
That is at 22125 seconds (6 hours and something). I can say that not a single CEP2 job in production [probably for 6 months, if not longer] gets past #11 (Job 12), which is the second long job in the task and ends with the "RC =..." message. I'd interpret this as "by design"; nothing useful comes after it. If the 12-hour limit is hit, it actually prints a more blunt line:
[04:49:20] Starting job 9,CPU time has been restored to 42687.115234.
Killing job because cpu time has been exceeded. Subjob
These come off my slow duo. On Linux, 99% of the time the task is cut off before the 12 hours are over, with the RC = message (the values after the equals sign vary). Logs for Linux/Mac are, BTW, much more detailed than for Windows. [Edit 1 times, last edit by Former Member at May 12, 2013 8:00:03 AM] |
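To avoid converting the figures by hand, the "CPU time has been restored to" values in the result log can be turned into hours and compared with the 12-hour limit. This is only a rough sketch: the pattern is taken from the two log lines quoted in this post, the 12-hour figure is the one mentioned in this thread, and the function names are made up for illustration.

```python
import re

# Pattern for the "CPU time has been restored to <seconds>." lines quoted
# in this thread (an assumption based on those two lines only).
RESTORED_RE = re.compile(r"CPU time has been restored to (\d+(?:\.\d+)?)")

# Assumption: the 12-hour CPU-time limit discussed in this thread, in seconds.
LIMIT_SECONDS = 12 * 3600


def restored_seconds(line):
    """Return the restored CPU time in seconds, or None if the line has none."""
    match = RESTORED_RE.search(line)
    return float(match.group(1)) if match else None


def restored_hours(line):
    """Return the restored CPU time in hours, or None if the line has none."""
    seconds = restored_seconds(line)
    return None if seconds is None else seconds / 3600.0


def seconds_to_limit(line):
    """Return how many CPU seconds remain before the assumed limit."""
    seconds = restored_seconds(line)
    return None if seconds is None else LIMIT_SECONDS - seconds


# The two log lines quoted in the post above.
LINE_JOB_13 = "[03:31:25] Starting job 13,CPU time has been restored to 22125.811551."
LINE_JOB_9 = "[04:49:20] Starting job 9,CPU time has been restored to 42687.115234."
```

For the two lines above this works out to roughly 6.15 hours and 11.86 hours, so the first task was nowhere near the limit, while the second had about 513 CPU seconds left when the job started.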
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I was too hasty in trying to convert the date time groups.
So an error condition occurred while trying to run job #12. As always, that ends the work unit and no more jobs are run. The first 12 job results validated. The error was probably an algorithmic error, meaning that the algorithm was unable to handle that particular molecule. Lawrence |
||
|
|
XFox
Cruncher Italy Joined: Jun 12, 2007 Post Count: 5 Status: Offline Project Badges:
|
Thanks Lawrence for the explanation.
I have no doubt that the workunit results are correct, but I think that if a process crashes, there must be a bug somewhere in the code. It may not affect the scientific results, but it's a bug nonetheless and it should be addressed. I've started to collect crash reports like these. Will I have to live with them forever? |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi XFox,
**!!* I lost my post before I posted it. Anyway, don't worry about a particular numerical technique used to implement an algorithm failing. It is expected. After all, Applied Numerical Analysis works with a large collection of techniques. Just call it craftsmanship and let the project scientists worry about it. Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi XFox,
As lawrencehardin already mentioned, the crashing is not due to a bug but generally due to misbehaving numerics. That happens all the time in the computational sciences. No worries. Best wishes, Your Harvard CEP team |
||
|
|
|