World Community Grid Forums
Thread Status: Active Total posts in this thread: 23
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The bottom two results of this distribution so far are both 'Error' results ending in 'RC = 0x100', and there seems little hope that the next two repair copies will complete any differently. It is doubtful they would.
restored to 26133.491513.
[05:22:26] Starting new Job
[05:22:26] Qink name = fldman
[05:22:38] Qink name = gesman
[05:22:40] Qink name = scfman
Application exited with RC = 0x100
[09:16:59] Finished Job #6
[09:16:59] Starting job 7,CPU time has been restored to 39822.467590.
[09:16:59] Skipping Job #7
09:17:06 (4698): called boinc_finish

E225650_ 557_ S.332.C44H33N5.BRTJOOAAAPRXBR-UHFFFAOYSA-N.15_ s1_ 14_ 3-- - In Progress 10/5/14 14:40:33 10/9/14 02:40:33 0.00 0.0 / 0.0
E225650_ 557_ S.332.C44H33N5.BRTJOOAAAPRXBR-UHFFFAOYSA-N.15_ s1_ 14_ 2-- - In Progress 10/5/14 14:40:31 10/9/14 02:40:31 0.00 0.0 / 0.0
E225650_ 557_ S.332.C44H33N5.BRTJOOAAAPRXBR-UHFFFAOYSA-N.15_ s1_ 14_ 1-- 700 Error 10/4/14 19:41:41 10/5/14 08:52:09 11.17 279.5 / 0.0
E225650_ 557_ S.332.C44H33N5.BRTJOOAAAPRXBR-UHFFFAOYSA-N.15_ s1_ 14_ 0-- 700 Error 10/4/14 19:39:16 10/5/14 14:39:02 11.04 288.7 / 0.0

According to the XML export API, the outcome was 3 (error) and the validation state was 2 (invalid). Which is it?

Outcome: Return results based on the outcome of their processing. 1 means success, 3 means error, 4 means no reply, 6 means validation error, 7 means abandoned.
ValidateState: Return results based on the validation status. 0 means pending validation, 1 means valid, 2 means invalid, 4 means pending verification, 5 means results failed to validate within given deadline.

The exit code is zero. The API does not document that value, but to the agent it means no error was recorded on the host and everything was normal, so the result would more appropriately be rated 'invalid'. This conundrum has been brought up before, and yes, it is understood that some tasks will not do what the program is supposed to do, but should this not be a WCG/Clean Energy problem rather than an issue for the volunteer who gets the 'error' slap in the face? That is 22 hours down the hole, with more to go, on this one alone. The validator rules can surely be set not to spit out 'error' and to let you park these results aside internally, yes?
From past reading, 'rc = nnn' was treated as 'not the volunteer's problem', so why is it now? I will monitor this one and see whether an earlier take-out is justified when rc = 0x100 occurs. On the search history front, I found threads going back to 2010 that include replies like 'we'll give credit', which is the least of my interests. The issue is that these results are recorded as a node error, so the host is forced out of reliability and wastes the next 20 computing with a wingman when it could genuinely do it alone. Compute those wasted hours per day. FYI, this was a technician's reply in 2013: http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=431871 saying it is a validator margin issue to be corrected. Please fix the public-facing treatment. It is disconcerting, disturbing, and drives volunteers away. |
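The code tables quoted in the post above can be written down as a small lookup, which makes the mixed signal for this result easier to see. This is only a sketch: the dictionary contents are copied from the API description quoted in the post, while the names `describe_result` and `is_mixed_signal` and the "undocumented" fallback label are made up for illustration and are not part of any WCG or BOINC API.

```python
# Code meanings as quoted in the post from the XML export API description.
OUTCOME = {
    1: "success",
    3: "error",
    4: "no reply",
    6: "validation error",
    7: "abandoned",
}
VALIDATE_STATE = {
    0: "pending validation",
    1: "valid",
    2: "invalid",
    4: "pending verification",
    5: "failed to validate within given deadline",
}


def describe_result(outcome, validate_state, exit_code):
    """Return a one-line, human-readable summary (hypothetical helper)."""
    outcome_text = OUTCOME.get(outcome, "undocumented")
    state_text = VALIDATE_STATE.get(validate_state, "undocumented")
    return (
        f"outcome {outcome} ({outcome_text}), "
        f"validate state {validate_state} ({state_text}), "
        f"exit code {exit_code}"
    )


def is_mixed_signal(outcome, exit_code):
    """True when the server records an error although the host exited with 0."""
    return outcome == 3 and exit_code == 0


if __name__ == "__main__":
    # Values reported in the post: outcome 3, validation state 2, exit code 0.
    print(describe_result(3, 2, 0))
    print("mixed signal:", is_mixed_signal(3, 0))
```

With the values from the post, the summary shows an 'error' outcome next to an 'invalid' validation state and a zero exit code, which is exactly the combination being questioned.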
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I need an expansion pack for the brain-pan: going through the valids on the Result Status page, I am seeing this:
E225627_ 28_ S.304.C34H14N10O2.MQSULQBIBCOMGA-UHFFFAOYSA-N.11_ s1_ 14_ 1-- 700 Valid 10/3/14 14:08:42 10/4/14 19:55:53 18.00 482.8 / 436.3
E225627_ 28_ S.304.C34H14N10O2.MQSULQBIBCOMGA-UHFFFAOYSA-N.11_ s1_ 14_ 0-- 700 Valid 10/3/14 14:04:44 10/4/14 11:19:38 14.99 389.8 / 436.3

The top one was killed for exceeding the maximum time and did not make it through job #6; mine went on and got, drum roll, RC = 0x100, yet was declared valid. The only thing I can think of is that in this instance the two tasks were compared only on the jobs _1 managed to complete, through #5, and everything beyond that was taken for granted. Which one was declared canonical and is being sent to Harvard is the 65,001 USD question.

Result _1, end piece of log:
[19:58:23] Finished Job #5
[19:58:23] Starting job 6,CPU time has been restored to 46882.071564.
[19:58:23] Starting new Job
[19:58:23] Qink name = fldman
[19:58:32] Qink name = gesman
[19:58:35] Qink name = scfman
Killing job because cpu time limit has been exceeded. 46882.071564||17917.965591||0.000000
[02:07:16] Finished Job #6
02:07:22 (6298): called boinc_finish
</stderr_txt> ]]>

Result _0, end piece of log:
[08:27:34] Finished Job #5
[08:27:34] Starting job 6,CPU time has been restored to 36752.327027.
[08:27:34] Starting new Job
[08:27:34] Qink name = fldman
[08:27:43] Qink name = gesman
[08:27:45] Qink name = scfman
Application exited with RC = 0x100
[13:14:40] Finished Job #6
[13:14:40] Starting job 7,CPU time has been restored to 53626.065955.
[13:14:40] Skipping Job #7
13:14:45 (1374): called boinc_finish
</stderr_txt> ]]>

In summary: if the wingman hits the 18-hour boundary before it gets to finish job #6 with RC = 0x100, and the other result does finish it that way, things are fine. Feels like rigged dice at Ocean 16. |
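To compare two log tails like the ones above without reading them by eye, the way each job ended can be pulled out with a regular expression. This is a rough sketch that assumes the stderr wording is exactly as pasted in this thread ("Application exited with RC = 0x...", "Killing job because cpu time limit has been exceeded", "Finished Job #N", "Skipping Job #N"); the function name `job_endings` and the labels it returns are invented for illustration.

```python
import re

# One alternative per event of interest, matched in order of appearance.
EVENT = re.compile(
    r"Application exited with RC = (0x[0-9A-Fa-f]+)"
    r"|(Killing job because cpu time limit has been exceeded)"
    r"|Finished Job #(\d+)"
    r"|Skipping Job #(\d+)"
)


def job_endings(log):
    """Map job number -> how that job ended, based on a stderr tail.

    Works on the original multi-line log as well as on the flattened
    single-line form that the forum software produces.
    """
    endings = {}
    pending = None  # abnormal event seen since the last "Finished Job"
    for match in EVENT.finditer(log):
        rc, killed, finished, skipped = match.groups()
        if rc is not None:
            pending = f"rc={int(rc, 16)}"  # 0x100 is 256 in decimal
        elif killed is not None:
            pending = "cpu time limit"
        elif finished is not None:
            endings[int(finished)] = pending or "ok"
            pending = None
        else:
            endings[int(skipped)] = "skipped"
    return endings


if __name__ == "__main__":
    result_0 = (
        "[08:27:34] Finished Job #5 [08:27:34] Starting job 6,CPU time has "
        "been restored to 36752.327027. [08:27:45] Qink name = scfman "
        "Application exited with RC = 0x100 [13:14:40] Finished Job #6 "
        "[13:14:40] Skipping Job #7"
    )
    print(job_endings(result_0))
```

Run on the two tails in this post, the sketch reports job #6 as ended by the CPU time limit for result _1 and by RC = 0x100 (decimal 256) for result _0, so neither result actually completed that job normally.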
||
|
WinterGuard1944
Cruncher Czech Republic Joined: Apr 23, 2013 Post Count: 5 Status: Offline |
Hello,
I noticed this funny thing:
E225634_ 883_ S.322.C42H28N6.SLLJKJQOKISNAC-UHFFFAOYSA-N.13_ s1_14_1-- 700 Valid 3.10.14 23:48:39 5.10.14 11:22:07 18.00 313.8 / 177.3
E225634_ 883_ S.322.C42H28N6.SLLJKJQOKISNAC-UHFFFAOYSA-N.13_ s1_ 14_0--700 Valid 3.10.14 23:45:36 4.10.14 06:25:39 5.90 285.4 / 1,240.8
My work unit is the lucky one. My computer finished all jobs and ended with RC=0x100 in job #6, while the other computer exceeded the maximum time while still doing job #0. Maybe it is useful for you to know what happened in this case.
[Edit 1 times, last edit by WinterGuard1944 at Oct 5, 2014 7:32:55 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
How the 177.3 was arrived at, I am not going to try to work out, but one job got 177.3, so the other that did 7 gets 1240.80. This thread, however, is -not- about the borked points methodology; there are many active threads touching on the CEP2 credits. This is about the -validation- itself. In that respect your case is the same as the one in my previous post: just because one of the two in a quorum does not manage to get to the fail-over point in job #6, the result is suddenly rated valid. Baffling.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
As I posted in another thread somewhere, this is the reason I don't run this project anymore. Inconsistency in the errors and a total lack of concern about it. As I stated earlier, no explanation, no crunching.
|
||
|
Eric_Kaiser
Veteran Cruncher Germany (Hessen) Joined: May 7, 2013 Post Count: 1047 Status: Offline |
I started again with CEP2 two weeks ago and didn't have any errors until today. Today two WUs errored out with RC=0x100.
My AMD crunching box got many WUs with quorum=1 and replication=1. I think I will switch to MCM1 until one of the techs looks into this issue... |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi,
Thanks for bringing this up. I will make sure it is passed to our friends at IBM. I am sorry I cannot be of too much assistance here, since I am not an expert in distributed computing, or how the grid is put together internally. Your Harvard CEP Team |
||
|
Eric_Kaiser
Veteran Cruncher Germany (Hessen) Joined: May 7, 2013 Post Count: 1047 Status: Offline |
As long as someone is taking care of this issue, I'm fine. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi.
I've got this error as well, lost just over 12hrs.
[16:12:16] Starting job 6,CPU time has been restored to 25428.768000.
[16:12:16] Starting new Job
[16:12:17] Qink name = fldman
[16:12:28] Qink name = gesman
[16:12:30] Qink name = scfman
Application exited with RC = 0x100
[21:15:02] Finished Job #6
[21:15:02] Starting job 7,CPU time has been restored to 43008.136000.
[21:15:02] Skipping Job #7
21:15:07 (6922): called boinc_finish
Not happy. |
||
|
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline |
I just started my weekend crunch of CEP2 WUs... I hope I will avoid this issue, since I can only crunch this project on weekends. Good luck to all.
----------------------------------------
CJSL
Crunching for a better world... |
||
|
|