Thread Status: Active
Total posts in this thread: 15
This topic has been viewed 3370 times and has 14 replies
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Workunit errored out - No obvious errors?

I'm wondering why my copy of this workunit errored out despite running for almost 12 hours without anything obvious going wrong. This is from the workunit status page of workunit "E235158_918_S.298.C38H26N4O2.QGTIVOPIQVYRAQ-UHFFFAOYSA-N.11_s1_14"; my copy is the result ending in _2 (status Error, 11.65 hours).
Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit
E235158_918_S.298.C38H26N4O2.QGTIVOPIQVYRAQ-UHFFFAOYSA-N.11_s1_14_3-- | 700 | Valid | 9/12/15 15:11:00 | 10/12/15 04:36:16 | 13.21 | 399.4 / 445.6
E235158_918_S.298.C38H26N4O2.QGTIVOPIQVYRAQ-UHFFFAOYSA-N.11_s1_14_2-- | 700 | Error | 8/12/15 08:14:32 | 9/12/15 14:24:51 | 11.65 | 117.9 / 0.0
E235158_918_S.298.C38H26N4O2.QGTIVOPIQVYRAQ-UHFFFAOYSA-N.11_s1_14_1-- | 700 | Valid | 8/12/15 08:09:44 | 9/12/15 03:15:09 | 18.00 | 491.7 / 445.6
E235158_918_S.298.C38H26N4O2.QGTIVOPIQVYRAQ-UHFFFAOYSA-N.11_s1_14_0-- | 700 | Error | 8/12/15 08:05:54 | 8/12/15 08:08:33 | 0.00 | 383.2 / 0.0

And here is the actual log from the workunit itself:
<core_client_version>7.2.31</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[18:45:47] Number of jobs = 8
[18:45:47] Starting job 0,CPU time has been restored to 0.000000.
[07:38:44] Finished Job #0
[07:38:44] Starting job 1,CPU time has been restored to 21456.611541.
[17:47:14] Finished Job #1
[17:47:14] Starting job 2,CPU time has been restored to 22524.407186.
[18:09:30] Finished Job #2
[18:09:30] Starting job 3,CPU time has been restored to 23802.554580.
[18:32:52] Finished Job #3
[18:32:52] Starting job 4,CPU time has been restored to 25137.018334.
[18:48:58] Finished Job #4
[18:48:58] Starting job 5,CPU time has been restored to 26068.469104.
[19:03:44] Finished Job #5
[19:03:44] Starting job 6,CPU time has been restored to 26940.998298.
Application exited with RC = 0x1
[23:16:18] Finished Job #6
[23:16:18] Starting job 7,CPU time has been restored to 41927.780366.
[23:16:18] Skipping Job #7
23:16:23 (13736): called boinc_finish

</stderr_txt>
]]>

There are no obvious errors in the log, yet it still errored out with no credit, while other copies with similar results were considered valid (one timed out at the 18-hour limit and was still valid??). Can anyone explain where my copy went out of whack?
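
One line in the log above may be worth a closer look: "Application exited with RC = 0x1" appears while job 6 is running, and job 7 is then skipped. The thread does not say what that return code means, so treating any non-zero RC as a failure is an assumption here. A minimal Python sketch (the function name find_problems is made up for illustration) that scans a pasted stderr log for such lines:

```python
import re

# Patterns for the three kinds of log lines this sketch cares about.
RC_LINE = re.compile(r"Application exited with RC = (0x[0-9A-Fa-f]+)")
JOB_START = re.compile(r"Starting job (\d+),")
JOB_SKIP = re.compile(r"Skipping Job #(\d+)")


def find_problems(stderr_text):
    """Return (job, message) pairs for non-zero exit codes and skipped jobs."""
    problems = []
    current_job = None
    for line in stderr_text.splitlines():
        start = JOB_START.search(line)
        if start:
            current_job = int(start.group(1))
            continue
        rc = RC_LINE.search(line)
        # Assumption: a non-zero RC means the job did not finish cleanly.
        if rc and int(rc.group(1), 16) != 0:
            problems.append((current_job, "exit code " + rc.group(1)))
            continue
        skip = JOB_SKIP.search(line)
        if skip:
            problems.append((int(skip.group(1)), "skipped"))
    return problems


# Excerpt copied from the log in this post.
SAMPLE = """\
[19:03:44] Starting job 6,CPU time has been restored to 26940.998298.
Application exited with RC = 0x1
[23:16:18] Finished Job #6
[23:16:18] Starting job 7,CPU time has been restored to 41927.780366.
[23:16:18] Skipping Job #7
"""

for job, message in find_problems(SAMPLE):
    print(f"job {job}: {message}")
```

On the excerpt this reports the exit code under job 6 and the skip of job 7. It only points at where to look; it does not explain why the application exited.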
[Dec 11, 2015 3:14:27 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Workunit errored out - No obvious errors?

Me too!
The volunteer who received it after my failure is still working on it, so maybe he and the others will fail too instead of succeeding, in which case mine isn't the same scenario as yours. Let's see.

Project Name: The Clean Energy Project - Phase 2
Created: 12/07/2015 10:40:06
Name: E235171_471_S.302.C36H24N2S3.XICVLFGYQUMYSL-UHFFFAOYSA-N.1_s1_14
Minimum Quorum: 1
Replication: 1

Result Name: E235171_471_S.302.C36H24N2S3.XICVLFGYQUMYSL-UHFFFAOYSA-N.1_s1_14_0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[13:57:28] Number of jobs = 8
[13:57:28] Starting job 0,CPU time has been restored to 0.000000.
[13:57:28] Starting new Job
[13:57:28] Qink name = fldman
[13:57:29] Qink name = gesman
[13:57:30] Qink name = scfman
[14:48:32] Qink name = anlman
[14:48:32] Qink name = drvman
[14:51:09] Qink name = optman
[14:51:10] Qink name = fldman
[14:51:10] Qink name = gesman
[14:51:11] Qink name = scfman
[15:08:26] Qink name = anlman
[15:08:26] Qink name = drvman
[15:10:58] Qink name = optman
[15:10:58] Qink name = fldman
[15:10:58] Qink name = gesman
[15:11:00] Qink name = scfman
[15:27:51] Qink name = anlman
[15:27:51] Qink name = drvman
[15:30:23] Qink name = optman
[15:30:23] Qink name = fldman
[15:30:23] Qink name = gesman
[15:30:25] Qink name = scfman
[15:47:24] Qink name = anlman
[15:47:24] Qink name = drvman
[15:49:57] Qink name = optman
[15:49:57] Qink name = fldman
[15:49:57] Qink name = gesman
[15:49:59] Qink name = scfman
[16:06:44] Qink name = anlman
[16:06:44] Qink name = drvman
[16:09:15] Qink name = optman
[16:09:15] Qink name = fldman
[16:09:15] Qink name = gesman
[16:09:17] Qink name = scfman
[16:25:13] Qink name = anlman
[16:25:13] Qink name = drvman
[16:27:42] Qink name = optman
[16:27:42] Qink name = fldman
[16:27:42] Qink name = gesman
[16:27:44] Qink name = scfman
[16:44:37] Qink name = anlman
[16:44:37] Qink name = drvman
[16:47:10] Qink name = optman
[16:47:10] Qink name = fldman
[16:47:10] Qink name = gesman
[16:47:12] Qink name = scfman
[17:04:22] Qink name = anlman
[17:04:22] Qink name = drvman
[17:07:11] Qink name = optman
[17:07:12] Qink name = fldman
[17:07:12] Qink name = gesman
[17:07:13] Qink name = scfman
[17:22:31] Qink name = anlman
[17:22:31] Qink name = drvman
[17:25:05] Qink name = optman
[17:25:05] Qink name = fldman
[17:25:05] Qink name = gesman
[17:25:07] Qink name = scfman
[17:38:43] Qink name = anlman
[17:38:43] Qink name = drvman
[17:41:15] Qink name = optman
[17:41:16] Qink name = fldman
[17:41:16] Qink name = gesman
[17:41:17] Qink name = scfman
[17:56:21] Qink name = anlman
[17:56:21] Qink name = drvman
[17:59:07] Qink name = optman
[17:59:07] Qink name = fldman
[17:59:07] Qink name = gesman
[17:59:09] Qink name = scfman
[18:13:25] Qink name = anlman
[18:13:25] Qink name = drvman
[18:16:23] Qink name = optman
[18:16:23] Qink name = fldman
[18:16:23] Qink name = gesman
[18:16:25] Qink name = scfman
[18:28:54] Qink name = anlman
[18:28:54] Qink name = drvman
[18:31:24] Qink name = optman
[18:31:24] Qink name = fldman
[18:31:24] Qink name = gesman
[18:31:26] Qink name = scfman
[18:43:01] Qink name = anlman
[18:43:01] Qink name = drvman
[18:45:33] Qink name = optman
[18:45:33] Qink name = fldman
[18:45:33] Qink name = gesman
[18:45:35] Qink name = scfman
[18:57:30] Qink name = anlman
[18:57:30] Qink name = drvman
[19:00:03] Qink name = optman
[19:00:03] Qink name = fldman
[19:00:03] Qink name = gesman
[19:00:04] Qink name = scfman
[19:11:13] Qink name = anlman
[19:11:13] Qink name = drvman
[19:13:45] Qink name = optman
[19:13:45] Qink name = anlman
[19:15:48] End of Job
[19:15:49] Finished Job #0
[19:15:49] Starting job 1,CPU time has been restored to 17577.720000.
[19:15:50] Starting new Job
[19:15:50] Qink name = fldman
[19:15:51] Qink name = gesman
[19:15:51] Qink name = scfman
[19:31:15] Qink name = anlman
[19:33:17] End of Job
[19:33:19] Finished Job #1
[19:33:19] Starting job 2,CPU time has been restored to 18572.128000.
[19:33:19] Starting new Job
[19:33:19] Qink name = fldman
[19:33:20] Qink name = gesman
[19:33:21] Qink name = scfman
[19:46:47] Qink name = anlman
[19:48:41] End of Job
[19:48:42] Finished Job #2
[19:48:42] Starting job 3,CPU time has been restored to 19443.780000.
[19:48:42] Starting new Job
[19:48:42] Qink name = fldman
[19:48:43] Qink name = gesman
[19:48:44] Qink name = scfman
[20:07:01] Qink name = anlman
[20:08:54] End of Job
[20:08:56] Finished Job #3
[20:08:56] Starting job 4,CPU time has been restored to 20625.900000.
[20:08:56] Starting new Job
[20:08:56] Qink name = fldman
[20:08:58] Qink name = gesman
[20:08:58] Qink name = scfman
[20:20:56] Qink name = anlman
[20:22:51] End of Job
[20:22:53] Finished Job #4
[20:22:53] Starting job 5,CPU time has been restored to 21435.456000.
[20:22:53] Starting new Job
[20:22:53] Qink name = fldman
[20:22:54] Qink name = gesman
[20:22:55] Qink name = scfman
[20:30:28] Qink name = anlman
[20:34:19] End of Job
[20:34:20] Finished Job #5
[20:34:20] Starting job 6,CPU time has been restored to 22107.544000.
[20:34:20] Starting new Job
[20:34:20] Qink name = fldman
[20:34:28] Qink name = gesman
[20:34:30] Qink name = scfman
Application exited with RC = 0x100
[23:45:27] Finished Job #6
[23:45:27] Starting job 7,CPU time has been restored to 33302.508000.
[23:45:27] Skipping Job #7
23:45:31 (3118): called boinc_finish

</stderr_txt>
]]>
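
The "CPU time has been restored to" values look like a running CPU-seconds counter, so the difference between consecutive values should give the CPU time each job used. That reading is an assumption, though in tombell12's log the last value, 41927.780366 s, works out to about 11.65 hours, which matches the 11.65 on the status page. A rough Python sketch (cpu_seconds_per_job is a hypothetical helper name):

```python
import re

# Captures the job number and the cumulative CPU-seconds value on each start line.
RESTORED = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)"
)


def cpu_seconds_per_job(stderr_text):
    """Map job number -> CPU seconds, from differences of consecutive counters."""
    marks = [(int(job), float(sec)) for job, sec in RESTORED.findall(stderr_text)]
    # The last job has no following start line, so it gets no entry.
    return {
        job: round(nxt - cur, 3)
        for (job, cur), (_, nxt) in zip(marks, marks[1:])
    }


# Start lines copied from the log above.
SAMPLE = """\
[20:22:53] Starting job 5,CPU time has been restored to 21435.456000.
[20:34:20] Starting job 6,CPU time has been restored to 22107.544000.
[23:45:27] Starting job 7,CPU time has been restored to 33302.508000.
"""

for job, seconds in cpu_seconds_per_job(SAMPLE).items():
    print(f"job {job}: {seconds:.3f} CPU s ({seconds / 3600:.2f} h)")
```

By this reading, job 6 used roughly 3.1 CPU hours before the "Application exited with RC = 0x100" line, far more than jobs 1 to 5.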

[Dec 11, 2015 7:28:18 AM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Re: Workunit errored out - No obvious errors?

Me too!
The volunteer who received it after my failure is still working on it, so maybe he and the others will fail too instead of succeeding, in which case mine isn't the same scenario as yours. Let's see.
It's interesting, hey? It kind of comes off as a "phantom" error of sorts, like something goes wrong but there's no obvious indicator as to where or why.
----------------------------------------
[Edit 1 times, last edit by tombell12 at Dec 11, 2015 9:28:04 AM]
[Dec 11, 2015 9:27:20 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Workunit errored out - No obvious errors?

Yep! Let's see if someone (probably sekerob) knows something about it.
[Dec 11, 2015 12:01:58 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: Workunit errored out - No obvious errors?

tombell12, although you got the 'Error' status, I'm wondering what the Minimum Quorum and Replication are for said WU:
E235158_918_S.298.C38H26N4O2.QGTIVOPIQVYRAQ-UHFFFAOYSA-N.11_s1_14_2-- 700 Error 8/12/15 08:14:32 9/12/15 14:24:51 11.65 117.9 / 0.0
[Dec 11, 2015 12:13:36 PM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Re: Workunit errored out - No obvious errors?

tombell12, although you got the 'Error' status, I'm wondering what your Minimum Quorum and Replication is for said WU:
E235158_ 918_ S.298.C38H26N4O2.QGTIVOPIQVYRAQ-UHFFFAOYSA-N.11_ s1_ 14_ 2-- 700 Error 8/12/15 08:14:32 9/12/15 14:24:51 11.65 117.9 / 0.0

The workunit status data has disappeared now, but I do recall it saying 2 for both of those values.
----------------------------------------
[Edit 1 times, last edit by tombell12 at Dec 11, 2015 8:41:09 PM]
[Dec 11, 2015 8:40:41 PM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Re: Workunit errored out - No obvious errors?

Yep! Let's see if someone (probably sekerob) knows something about it.

Your log is interesting though: it has all these "Qink name" values.
[Dec 11, 2015 8:42:12 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Workunit errored out - No obvious errors?

Yep! Let's see if someone (probably sekerob) knows something about it.

Your log is interesting though: it has all these "Qink name" values.

I don't know what it means, do you? It appears in the log files of all my WUs crunched under Linux; I don't recall ever having seen it in the ones crunched under Windows.
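
For anyone who wants to compare a Linux log with a Windows one, a small Python sketch that tallies the "Qink name" lines per job might help. Nothing here says what the names mean (the thread does not explain them); it only counts them. The helper name qink_counts_per_job is made up for illustration.

```python
import re
from collections import Counter

QINK = re.compile(r"Qink name = (\w+)")
JOB_START = re.compile(r"Starting job (\d+),")


def qink_counts_per_job(stderr_text):
    """Map job number -> Counter of Qink names logged while that job ran."""
    counts = {}
    current_job = None
    for line in stderr_text.splitlines():
        start = JOB_START.search(line)
        if start:
            current_job = int(start.group(1))
            counts[current_job] = Counter()
            continue
        name = QINK.search(line)
        if name and current_job is not None:
            counts[current_job][name.group(1)] += 1
    return counts


# Lines copied from jobs 1 and 6 of the Linux log earlier in the thread.
SAMPLE = """\
[19:15:49] Starting job 1,CPU time has been restored to 17577.720000.
[19:15:50] Starting new Job
[19:15:50] Qink name = fldman
[19:15:51] Qink name = gesman
[19:15:51] Qink name = scfman
[19:31:15] Qink name = anlman
[19:33:17] End of Job
[19:33:19] Finished Job #1
[20:34:20] Starting job 6,CPU time has been restored to 22107.544000.
[20:34:20] Starting new Job
[20:34:20] Qink name = fldman
[20:34:28] Qink name = gesman
[20:34:30] Qink name = scfman
Application exited with RC = 0x100
"""

for job, names in qink_counts_per_job(SAMPLE).items():
    print(f"job {job}: {dict(names)}")
```

In that log, job 1 runs through fldman, gesman, scfman and anlman, while job 6 gets as far as scfman and then hits the exit line with no anlman entry. A log without any Qink lines would simply give empty counters.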

Meanwhile the WU has been sent to four other volunteers: three of them have already reported it and all got an error, but none of them shows an error in the result log. Every once in a while such a WU appears and someone on this forum calls it a toxic one.
[Dec 14, 2015 4:29:10 AM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Status: Offline
Project Badges:
Reply to this Post  Reply with Quote 
Re: Workunit errored out - No obvious errors?

I don't know what it means, do you? It appears in the log files of all my WUs crunched under Linux; I don't recall ever having seen it in the ones crunched under Windows.

Meanwhile the WU has been sent to four other volunteers: three of them have already reported it and all got an error, but none of them shows an error in the result log. Every once in a while such a WU appears and someone on this forum calls it a toxic one.

The only obvious thing is that it would pertain to your Linux configuration. I've been stuck on such a "toxic" WU, which just cuts out after 5 copies. I read that those WUs apparently get put on some "investigation" list.
[Dec 14, 2015 9:10:07 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Re: Workunit errored out - No obvious errors?

TMEs (Too Many Errors) give credit for time, but in a highly inconsistent fashion. Testing against a very big cruncher account, I see 9 listed at this time at the 18:00:01 hour mark, none with credit, 6 of which are from the previous stats period, i.e. they were returned before 00:06 last night. The program takes a snapshot of all results at each period end, which then allows tracking back which ones got moved off [without credit for time]. Crunched they were, even if in futility, and they inform of that fact! Maybe they get re-crunched on a powerhouse cluster of the Harvard team, but that's likely only when statistics tell them the molecule is of greater interest.
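
The snapshot idea described above can be sketched roughly as follows. This is not the actual program, which isn't shown in the thread; the snapshot layout (a dict from result name to granted credit) and the result names are invented for illustration.

```python
def moved_off_without_credit(previous, current):
    """Names that were in the earlier snapshot with zero granted credit
    and are no longer present in the later snapshot."""
    return sorted(
        name
        for name, granted in previous.items()
        if name not in current and granted == 0.0
    )


# Hypothetical snapshots taken at two consecutive period ends.
previous_period = {"wu_a_0": 0.0, "wu_b_1": 445.6, "wu_c_0": 0.0}
current_period = {"wu_c_0": 0.0}

print(moved_off_without_credit(previous_period, current_period))
```

Here wu_a_0 is reported because it disappeared while still showing no credit, wu_b_1 is ignored because it had credit, and wu_c_0 is ignored because it is still listed.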
----------------------------------------
[Edit 1 times, last edit by SekeRob* at Dec 14, 2015 9:34:42 AM]
[Dec 14, 2015 9:26:10 AM]