XFox
Cruncher
Italy
Joined: Jun 12, 2007
Post Count: 5
Mac OS X application worker crashes

I'm on Mac OS X 10.7.5 (11G63) and I'm running the command-line-only version of BOINC client 7.0.65 on an Intel Core 2 Duo Mac mini.
Looking at my system logs, I noticed that the CEP2 application sometimes crashes, apparently just after it finishes crunching a workunit and just before it uploads the results of the computation. Nevertheless, my Results Status page shows that the result in question has been deemed valid.
As an example, this is the result of the last workunit that caused the application worker to crash.
When it happens, I see the following line in my system.log:
May  2 16:52:01 my-mac ReportCrash[5765]: Saved crash report for wcgrid_cep2_qchem_6.40_i686-apple-darwin[4770] version ??? (???) to /Library/Logs/DiagnosticReports/wcgrid_cep2_qchem_6.40_i686-apple-darwin_2013-05-02-165201_localhost.crash
If needed, I can provide the full crash reports.
[May 2, 2013 11:23:15 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Re: Mac OS X application worker crashes

XFox,

Can you post the result log for one of the results that errored out? On the Results Status page, just click on the status and it should open the result log in a separate window.

Thanks,
armstrdj
[May 9, 2013 3:34:15 PM]
XFox
Cruncher
Italy
Joined: Jun 12, 2007
Post Count: 5
Re: Mac OS X application worker crashes

I'm sorry, armstrdj,
I didn't realize that result logs are not public.
Unfortunately, at the moment I can no longer see any errored CEP2 results on my Results Status page.
Right now I'm setting my preferences to get only CEP2 work units. As soon as I get another crash, I'll post the related result log.
[May 9, 2013 5:28:48 PM]
XFox
Cruncher
Italy
Joined: Jun 12, 2007
Post Count: 5
Re: Mac OS X application worker crashes

This is the result log of a workunit that caused the application worker to crash:
Result Name: E213288_ 426_ C.33.C30H19NSSi.00966244.1.set1d06_ 1-- 

<core_client_version>7.0.65</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[14:55:24] Number of jobs = 16
[14:55:24] Starting job 0,CPU time has been restored to 0.000000.
[14:55:25] Starting new Job
[14:55:27] Qink name = fldman
[14:55:27] Qink name = gesman
[14:55:27] Qink name = scfman
[15:05:25] Qink name = anlman
[15:05:31] End of Job
[15:05:36] Finished Job #0
[15:05:36] Starting job 1,CPU time has been restored to 179.612568.
[15:05:37] Starting new Job
[15:05:37] Qink name = fldman
[15:05:40] Qink name = gesman
[15:05:41] Qink name = scfman
[15:26:11] Qink name = anlman
[15:28:39] End of Job
[15:28:44] Finished Job #1
[15:28:44] Starting job 2,CPU time has been restored to 717.427004.
[15:28:45] Starting new Job
[15:28:45] Qink name = fldman
[15:28:47] Qink name = gesman
[15:28:47] Qink name = scfman
[15:40:53] Qink name = anlman
[15:40:53] Qink name = drvman
[15:44:21] Qink name = optman
[15:44:22] Qink name = fldman
[15:44:22] Qink name = gesman
[15:44:24] Qink name = scfman
[16:00:55] Qink name = anlman
[16:00:55] Qink name = drvman
[16:04:46] Qink name = optman
[16:04:46] Qink name = fldman
[16:04:46] Qink name = gesman
[16:04:48] Qink name = scfman
[16:19:42] Qink name = anlman
[16:19:42] Qink name = drvman
[16:22:41] Qink name = optman
[16:22:42] Qink name = fldman
[16:22:42] Qink name = gesman
[16:22:43] Qink name = scfman
[16:37:06] Qink name = anlman
[16:37:06] Qink name = drvman
[16:39:59] Qink name = optman
[16:40:00] Qink name = fldman
[16:40:00] Qink name = gesman
[16:40:02] Qink name = scfman
[16:54:46] Qink name = anlman
[16:54:46] Qink name = drvman
[16:57:50] Qink name = optman
[16:57:50] Qink name = fldman
[16:57:50] Qink name = gesman
[16:57:52] Qink name = scfman
[17:12:43] Qink name = anlman
[17:12:43] Qink name = drvman
[17:15:31] Qink name = optman
[17:15:31] Qink name = fldman
[17:15:31] Qink name = gesman
[17:15:33] Qink name = scfman
[17:29:37] Qink name = anlman
[17:29:37] Qink name = drvman
[17:32:12] Qink name = optman
[17:32:12] Qink name = fldman
[17:32:12] Qink name = gesman
[17:32:13] Qink name = scfman
[17:45:07] Qink name = anlman
[17:45:08] Qink name = drvman
[17:47:51] Qink name = optman
[17:47:51] Qink name = fldman
[17:47:51] Qink name = gesman
[17:47:53] Qink name = scfman
[18:01:51] Qink name = anlman
[18:01:51] Qink name = drvman
[18:04:48] Qink name = optman
[18:04:48] Qink name = fldman
[18:04:48] Qink name = gesman
[18:04:51] Qink name = scfman
[18:19:27] Qink name = anlman
[18:19:27] Qink name = drvman
[18:22:10] Qink name = optman
[18:22:10] Qink name = fldman
[18:22:10] Qink name = gesman
[18:22:12] Qink name = scfman
[18:34:26] Qink name = anlman
[18:34:27] Qink name = drvman
[18:37:04] Qink name = optman
[18:37:05] Qink name = fldman
[18:37:05] Qink name = gesman
[18:37:06] Qink name = scfman
[18:48:07] Qink name = anlman
[18:48:07] Qink name = drvman
[18:50:56] Qink name = optman
[18:50:57] Qink name = fldman
[18:50:57] Qink name = gesman
[18:50:58] Qink name = scfman
[19:01:28] Qink name = anlman
[19:01:29] Qink name = drvman
[19:04:11] Qink name = optman
[19:04:12] Qink name = fldman
[19:04:12] Qink name = gesman
[19:04:14] Qink name = scfman
[19:14:17] Qink name = anlman
[19:14:17] Qink name = drvman
[19:16:51] Qink name = optman
[19:16:51] Qink name = fldman
[19:16:51] Qink name = gesman
[19:16:54] Qink name = scfman
[19:27:33] Qink name = anlman
[19:27:34] Qink name = drvman
[19:30:38] Qink name = optman
[19:30:39] Qink name = fldman
[19:30:39] Qink name = gesman
[19:30:40] Qink name = scfman
[19:39:39] Qink name = anlman
[19:39:39] Qink name = drvman
[19:42:44] Qink name = optman
[19:42:44] Qink name = fldman
[19:42:44] Qink name = gesman
[19:42:46] Qink name = scfman
[19:56:58] Qink name = anlman
[19:56:59] Qink name = drvman
[19:59:52] Qink name = optman
[19:59:52] Qink name = fldman
[19:59:52] Qink name = gesman
[19:59:53] Qink name = scfman
[20:10:03] Qink name = anlman
[20:10:03] Qink name = drvman
[20:12:39] Qink name = optman
[20:12:39] Qink name = fldman
[20:12:39] Qink name = gesman
[20:12:41] Qink name = scfman
[20:21:27] Qink name = anlman
[20:21:27] Qink name = drvman
[20:24:08] Qink name = optman
[20:24:08] Qink name = fldman
[20:24:08] Qink name = gesman
[20:24:09] Qink name = scfman
[20:32:20] Qink name = anlman
[20:32:20] Qink name = drvman
[20:34:53] Qink name = optman
[20:34:53] Qink name = anlman
[20:36:50] End of Job
[20:36:54] Finished Job #2
[20:36:54] Starting job 3,CPU time has been restored to 12220.652958.
[20:36:55] Starting new Job
[20:36:55] Qink name = fldman
[20:36:56] Qink name = gesman
[20:36:56] Qink name = scfman
[20:49:05] Qink name = anlman
[20:51:16] End of Job
[20:51:20] Finished Job #3
[20:51:20] Starting job 4,CPU time has been restored to 12790.009726.
[20:51:22] Starting new Job
[20:51:22] Qink name = fldman
[20:51:24] Qink name = gesman
[20:51:24] Qink name = scfman
[21:02:13] Qink name = anlman
[21:04:26] End of Job
[21:04:30] Finished Job #4
[21:04:30] Starting job 5,CPU time has been restored to 13269.926576.
[21:04:31] Starting new Job
[21:04:32] Qink name = fldman
[21:04:33] Qink name = gesman
[21:04:34] Qink name = scfman
[21:15:17] Qink name = anlman
[21:17:32] End of Job
[21:17:36] Finished Job #5
[21:17:36] Starting job 6,CPU time has been restored to 13765.559638.
[21:17:37] Starting new Job
[21:17:37] Qink name = fldman
[21:17:38] Qink name = gesman
[21:17:39] Qink name = scfman
[21:28:07] Qink name = anlman
[21:30:10] End of Job
[21:30:15] Finished Job #6
[21:30:15] Starting job 7,CPU time has been restored to 14239.445508.
[21:30:16] Starting new Job
[21:30:16] Qink name = fldman
[21:30:17] Qink name = gesman
[21:30:18] Qink name = scfman
[21:44:48] Qink name = anlman
[21:46:56] End of Job
[21:47:00] Finished Job #7
[21:47:00] Starting job 8,CPU time has been restored to 14875.006281.
[21:47:02] Starting new Job
[21:47:02] Qink name = fldman
[21:47:03] Qink name = gesman
[21:47:03] Qink name = scfman
[21:57:06] Qink name = anlman
[21:59:52] End of Job
[21:59:56] Finished Job #8
[21:59:56] Starting job 9,CPU time has been restored to 15327.571312.
[21:59:57] Starting new Job
[21:59:57] Qink name = fldman
[21:59:59] Qink name = gesman
[22:00:00] Qink name = scfman
[22:12:01] Qink name = anlman
[22:15:01] End of Job
[22:15:06] Finished Job #9
[22:15:06] Starting job 10,CPU time has been restored to 15867.265160.
[22:15:07] Starting new Job
[22:15:07] Qink name = fldman
[22:15:08] Qink name = gesman
[22:15:08] Qink name = scfman
[01:13:20] Qink name = anlman
[01:16:21] End of Job
[01:16:25] Finished Job #10
[01:16:25] Starting job 11,CPU time has been restored to 17164.241456.
[01:16:27] Starting new Job
[01:16:27] Qink name = fldman
[01:16:28] Qink name = gesman
[01:16:28] Qink name = scfman
[01:30:21] Qink name = anlman
[01:33:25] End of Job
[01:33:29] Finished Job #11
[01:33:29] Starting job 12,CPU time has been restored to 17807.769896.
[01:33:31] Starting new Job
[01:33:31] Qink name = fldman
[01:33:38] Qink name = gesman
[01:33:40] Qink name = scfman
[03:08:56] Qink name = anlman
Application exited with RC = 0xb
[03:31:25] Finished Job #12
[03:31:25] Starting job 13,CPU time has been restored to 22125.811551.
[03:31:25] Skipping Job #13
[03:31:25] Starting job 14,CPU time has been restored to 22125.811551.
[03:31:25] Skipping Job #14
[03:31:25] Starting job 15,CPU time has been restored to 22125.811551.
[03:31:25] Skipping Job #15
called boinc_finish

</stderr_txt>
]]>
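
For anyone who wants to pick the relevant lines out of a long log like the one above, here is a minimal Python sketch. The function name `summarize_stderr` is made up for illustration and is not part of BOINC or the CEP2 application. The line patterns are taken only from this log, so other application versions may print something different. The value 0xb is decimal 11; whether that number is a signal number is an assumption the log itself does not confirm.

```python
import re

# Hypothetical helper for illustration; not part of BOINC or CEP2.
def summarize_stderr(text):
    """Return (finished, skipped, exit_codes) found in a CEP2-style stderr log."""
    finished = [int(n) for n in re.findall(r"Finished Job #(\d+)", text)]
    skipped = [int(n) for n in re.findall(r"Skipping Job #(\d+)", text)]
    # The RC value is printed in hex (e.g. 0xb); convert it to an integer.
    exit_codes = [
        int(h, 16)
        for h in re.findall(r"Application exited with RC = (0x[0-9a-fA-F]+)", text)
    ]
    return finished, skipped, exit_codes

# A few lines copied from the log above.
sample = """\
[01:33:29] Finished Job #11
[01:33:29] Starting job 12,CPU time has been restored to 17807.769896.
[03:08:56] Qink name = anlman
Application exited with RC = 0xb
[03:31:25] Finished Job #12
[03:31:25] Skipping Job #13
[03:31:25] Skipping Job #14
[03:31:25] Skipping Job #15
"""

finished, skipped, exit_codes = summarize_stderr(sample)
print(finished, skipped, exit_codes)
```

Note that the log still prints "Finished Job #12" after the RC line, so a "Finished" line alone does not mean the job ended cleanly.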

[May 12, 2013 12:43:59 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Mac OS X application worker crashes

Hello XFox,
I do not see anything unusual. The work unit completed the first 12 jobs [0-11], then started on the second longest job, Job #12. After several hours, it hit the 12-hour limit and stopped running Job #12. It created 0 results for jobs #12-15, then returned all 16 job results. Later, the Results Status page validated the first 12 jobs returned.

All is well.

Here is my most recent explanation about CEP2: https://secure.worldcommunitygrid.org/forums/...ead,35108_offset,0#421066

:D
Lawrence
[May 12, 2013 4:58:56 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Mac OS X application worker crashes

Lawrence,

Beg to differ, this is not 12 hours

[03:31:25] Starting job 13,CPU time has been restored to 22125.811551.


That is at 22125 seconds (a little over 6 hours).
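
The conversion is easy to check. This is a minimal Python sketch; the helper name `format_cpu_time` is made up for illustration, and the 12-hour figure is simply the limit mentioned in this thread expressed in seconds.

```python
TWELVE_HOURS = 12 * 3600  # the 12-hour limit discussed in this thread, in seconds

# Hypothetical helper for illustration only.
def format_cpu_time(seconds):
    """Format a CPU-time value in seconds as H:MM:SS (fraction truncated)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

cpu_time = 22125.811551  # from the "Starting job 13" log line
print(format_cpu_time(cpu_time))  # 6:08:45
print(cpu_time < TWELVE_HOURS)    # True, well short of 12 hours
```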

I can say that not a single CEP2 job in production [probably for 6 months, if not longer] gets past #11 (Job 12), which is the second long job in the task and ends with the "RC =..." line. I'd interpret this as "by design"; nothing useful comes after it.

If 12 hours is hit, it will actually print a more blunt line:

[04:49:20] Starting job 9,CPU time has been restored to 42687.115234.
Killing job because cpu time has been exceeded. Subjob


These come off my slow Duo. On Linux, 99% of the time the task is cut off before the 12 hours are over, with the RC = message (the value after the equals sign varies).

BTW, logs for Linux/Mac are much more detailed than for Windows.
----------------------------------------
[Edit 1 times, last edit by Former Member at May 12, 2013 8:00:03 AM]
[May 12, 2013 7:55:55 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Mac OS X application worker crashes

I was too hasty in trying to convert the date time groups.
:)
So an error condition occurred while trying to run job #12. As always, that ends the work unit and no more jobs are run. The first 12 job results validated. The error was probably an algorithmic error, meaning that the algorithm was unable to handle that particular molecule.

Lawrence
[May 12, 2013 10:00:44 AM]
XFox
Cruncher
Italy
Joined: Jun 12, 2007
Post Count: 5
Status: Offline
Re: Mac OS X application worker crashes

Thanks Lawrence for the explanation.
I have no doubt that the workunit results are correct, but I think that if a process crashes, there must be a bug somewhere in the code. It may not affect the scientific results, but it's a bug nonetheless and it should be addressed.
I've started to collect crash reports like these. Will I have to live with them forever?
[May 13, 2013 2:54:23 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Mac OS X application worker crashes

Hi XFox,

**!!* I lost my post before I posted it. Anyway, don't worry about a particular numerical technique used to implement an algorithm failing. It is expected. After all, Applied Numerical Analysis works with a large collection of techniques.

Just call it craftsmanship and let the project scientists worry about it.

Lawrence
[May 13, 2013 10:48:21 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Mac OS X application worker crashes

Hi XFox,
as lawrencehardin already mentioned, the crashing is not due to a bug but generally due to misbehaving numerics. That happens all the time in the computational sciences. No worries.
Best wishes
Your Harvard CEP team
[May 13, 2013 7:06:21 PM]