Thread Status: Active
Total posts in this thread: 9
This topic has been viewed 2146 times and has 8 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
CEP WU repeatedly failing quorum 6 and no success

I just wanted to let you know that one of the WUs seems to be broken. I noticed it because it ran 24.15 hours on my machine (CEP usually runs ~6 hrs). Mine was the 4th result in the quorum, and the 5th, after mine, also failed. All the other results are Inconclusive after 5-8 hrs; mine ended in Error after 24 hours.

E000666_306C_00670460j
workunitId=74817140

Result Log from my 24 hour run:

<core_client_version>6.6.29</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Calling gridPlatform.init()
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Calling gridPlatform.init()
Calling initGraphics()
Calling gridPlatform.init()
Calling initGraphics()
Calling gridPlatform.init()
Calling initGraphics()
Calling gridPlatform.init()
Calling initGraphics()
SIGSEGV: segmentation violation

Crashed executable name: wcgrid_cep1_6.32_i686-apple-darwin
built using BOINC library version 6.3.4
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.5.7 build 9J61
Sun Jun 7 18:39:01 2009

atos cannot load symbols for the file wcgrid_cep1_6.32_i686-apple-darwin.
0 0x007db309 SIGPIPE: write on a pipe with no reader
1 0x007d1c2e SIGPIPE: write on a pipe with no reader
2 0x902822bb SIGPIPE: write on a pipe with no reader
3 0xffffffff SIGPIPE: write on a pipe with no reader
4 0x0017ff4f SIGPIPE: write on a pipe with no reader
5 0x0042fd4d SIGPIPE: write on a pipe with no reader
6 0x001c03d6 SIGPIPE: write on a pipe with no reader
7 0x0019913f SIGPIPE: write on a pipe with no reader
8 0x0019f0c9 SIGPIPE: write on a pipe with no reader
9 0x0012aaeb SIGPIPE: write on a pipe with no reader
10 0x0012e83b SIGPIPE: write on a pipe with no reader
11 0x0000534d SIGPIPE: write on a pipe with no reader
12 0x00001da2 SIGPIPE: write on a pipe with no reader
13 0x00001cc9
Thread 0 crashed with X86 Thread State (32-bit):
eax: 0xffffffe1 ebx: 0x9024a8c2 ecx: 0xbfffb97c edx: 0x90216286
edi: 0x00000000 esi: 0x00000000 ebp: 0xbfffb9b8 esp: 0xbfffb97c
ss: 0x0000001f efl: 0x00000206 eip: 0x90216286 cs: 0x00000007
ds: 0x0000001f es: 0x0000001f fs: 0x00000000 gs: 0x00000037

Binary Images Description:
0x1000 - 0x874fff /Library/Application Support/BOINC Data/slots/5/../../projects/www.worldcommunitygrid.org/wcgrid_cep1_6.32_i686-apple-darwin
0x90215000 - 0x9037cfff /usr/lib/libSystem.B.dylib
0x911c4000 - 0x91221fff /usr/lib/libstdc++.6.dylib
0x92931000 - 0x92938fff /usr/lib/libgcc_s.1.dylib
0x92e64000 - 0x92e68fff /usr/lib/system/libmathCommon.A.dylib


Exiting...

</stderr_txt>
]]>


----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 10, 2009 2:11:14 AM]
[Jun 10, 2009 2:09:37 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Thanks. The distribution should end automatically after the 7th error. That seems high, but with short jobs it would not look so bad, and tasks do sometimes miraculously need a 5th, 6th, or 7th result to finally achieve a valid quorum. The automation is necessary because there are simply not enough hands to monitor every single result (428,000 on Monday). These failures will flow into reports that determine whether further investigation is needed, but in a structured manner.
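
As a rough illustration of the rule described above, here is a minimal sketch in Python. It is not the actual BOINC or WCG server logic: the function name, the use of a string "fingerprint" to stand for a result's output, and the return labels are all assumptions made for the example. Only the error limit of 7 and the minimum quorum of 2 come from this thread.

```python
from collections import Counter
from typing import Optional, Sequence


def work_unit_decision(
    results: Sequence[Optional[str]],
    min_quorum: int = 2,   # "Minimum Quorum: 2" as shown on the work unit page
    max_errors: int = 7,   # distribution ends after the 7th error
) -> str:
    """Decide what happens to a work unit (hypothetical helper).

    Each entry in `results` is one returned copy: None means the copy
    ended in error, otherwise it is a fingerprint of the output.
    Copies with equal fingerprints are treated as agreeing.
    """
    errors = sum(1 for r in results if r is None)
    agreeing = Counter(r for r in results if r is not None)

    # Enough agreeing copies: the quorum is met.
    if agreeing and max(agreeing.values()) >= min_quorum:
        return "valid"
    # Too many failed copies: stop sending this work unit out.
    if errors >= max_errors:
        return "stop"
    # Otherwise the results are inconclusive and another copy goes out.
    return "resend"
```

For example, `work_unit_decision([None, "a", "b"])` gives `"resend"` because the two successful copies disagree, which corresponds to the Inconclusive status mentioned in the first post.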
----------------------------------------
[Edit 1 times, last edit by Sekerob at Jun 10, 2009 8:18:55 AM]
[Jun 10, 2009 8:16:06 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Hello Sekerob,

I have no clue what is going on here, but this WU and/or its siblings seem to have cycled through my machine several times, and it is now running again with High Priority, for the last 2 days that I've noticed. I think it said the WU would run 35 hours, but now we are at 38 hrs and less than 50% done. Is this a hopeless case? Should I abort, or just let it run and see what happens? It is already past the due date, so I assume I wouldn't even get credit for it, but maybe the results would help you figure out what went wrong? ...but what's 38 hours of wasted machine time relative to the life of the universe. The CEP WUs usually take 5-6 hours and run to successful completion.

My machine is: <p_model>Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz [Intel64 Family 6 Model 23 Stepping 7]</p_model>
and using BOINC v6.2.2? (fell off the page, but think it's .28)
Jeanne

CurrTime   cpu        Progress   To completion   Report deadline
12:41 pm   11:52:15   0.425      26:15:48        6/18/2009 2:46 pm
12:43 p    11:54:17              26.17:03
16:43      13:26:56   0.475      26:41:33

Workunit Status


Project Name: The Clean Energy Project
Created: 6/10/09
Name: E000596_957B_002e2h00b
Minimum Quorum: 2
Replication: 2



Result Name                  App Version Number   Status        Sent Time          Time Due / Return Time   CPU Time (hours)   Claimed / Granted BOINC Credit
E000596_957B_002e2h00b_5--   -                    In Progress   6/18/09 21:55:54   6/22/09 21:55:54         0.00               0.0 / 0.0
E000596_957B_002e2h00b_4--   -                    In Progress   6/18/09 21:55:27   6/22/09 21:55:27         0.00               0.0 / 0.0
E000596_957B_002e2h00b_3--   -                    No Reply      6/14/09 21:46:19   6/18/09 21:46:19         0.00               0.0 / 0.0
E000596_957B_002e2h00b_2--   -                    No Reply      6/14/09 21:46:16   6/18/09 21:46:16         0.00               0.0 / 0.0
E000596_957B_002e2h00b_1--   632                  Aborted       6/10/09 21:43:06   6/15/09 12:39:08         1.29               10.0 / 0.0
E000596_957B_002e2h00b_0--   -                    No Reply      6/10/09 21:42:58   6/14/09 21:42:58         0.00               0.0 / 0.0
[Jun 18, 2009 10:53:56 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Hello jeanne95132,
Ouch! You can choose to abort if you like, but it would be nice to return at least one completed result for this unit - - assuming that it can complete. Your Core 2 has a better chance of finishing it than an old Athlon64 would.

Lawrence
[Jun 19, 2009 12:14:45 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Hello Lawrence,

If it will serve some use, I'll let it run. It is making some progress, albeit VERY slowly.

But let me ask you, does disk space have anything to do with how fast a WU is run? This is a relatively new machine, so have lots of free disk space, and I've allotted a big space for WCG, but it's only using 225 MB. If it will run faster/more efficiently, please tell me how I can tweak it. I know it only runs on 1 of the 4 cores and nothing can be done about that, but does a larger disk space matter? (Obviously I'm just a driver, not a mechanic.)

Thanks,
Jeanne
[Jun 19, 2009 3:15:25 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Hi Lawrence,
I let it run, but it did something weird: the completion time was going up instead of down (47:40 more hours after already running 34:41 hours, when originally it was supposed to run 35 hours???? Retrograde.) I aborted it, as it might not finish in our lifetime.

Do you want the error report when I get it?
Jeanne

currTime   cpu        Progress   To completion   Report deadline
12:41 pm   11:52:15   0.425      26:15:48        6/18/2009 2:46 pm
12:43 p    11:54:17              26.17:03
16:05      13:26:56   0.475      26:41:33
16:35      14:09:16   0.50       27:23:01
19:47      16:03:44   0.575      27:37:50
00:42      18:51:57   0.67       39.35
04:45      21:12:23   0.75       37:47
13:06      26.06:57   0.925      39.21:12
03:27      34:41:04   1.250      47:40:52
[Jun 20, 2009 10:37:46 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Hello jeanne95132,
First, disk space is not important for most projects. BOINC includes a limit in order to catch runaway bugs rather than for normal use.

Second, if a project is unable to make progress, the time to completion will start increasing just as fast as the run time. This is never a good sign. It happens in HCMD2 when the next position requires several hours of computation rather than 5 or 6 seconds. The latest HCMD2 code includes special logic to stop automatically if it runs too long. Running into this problem on a CEP unit is reason enough to abort. It might be an endless loop.
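
The warning sign described above can be sketched as a simple check over readings like the ones Jeanne logged. This is only a heuristic written for illustration, not anything BOINC itself provides: the function names and the threshold of three consecutive rises are assumptions, and a rising estimate does not prove the task is stuck.

```python
def to_hours(hms: str) -> float:
    """Convert an 'H:MM:SS' string to fractional hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


def looks_stalled(samples, min_rising=3):
    """Return True if 'to completion' rose on min_rising consecutive readings.

    `samples` is a chronological list of (cpu_time, to_completion)
    pairs, both as 'H:MM:SS' strings. Hypothetical helper; the
    threshold is an arbitrary choice for this sketch.
    """
    rising = 0
    for (_, prev_left), (_, cur_left) in zip(samples, samples[1:]):
        if to_hours(cur_left) > to_hours(prev_left):
            rising += 1
            if rising >= min_rising:
                return True
        else:
            rising = 0  # the estimate dropped, so start counting again
    return False


# Readings taken from the table earlier in this thread
# (only rows written in an unambiguous H:MM:SS form are used).
readings = [
    ("11:52:15", "26:15:48"),
    ("13:26:56", "26:41:33"),
    ("14:09:16", "27:23:01"),
    ("16:03:44", "27:37:50"),
    ("34:41:04", "47:40:52"),
]
print(looks_stalled(readings))
```

A healthy task would show the "to completion" column shrinking between readings, in which case the counter resets and the function returns False.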

Better luck with your new work unit!

:Lawrence
[Jun 20, 2009 5:43:11 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Hi Lawrence,
Thanks for the info. Next time I will just abort when I see the TimeRemaining start to creep up; however, on the previous monster WUs, the TimeRemaining also crept up and then later started moving down again, and the WU continued to completion without error (in some cases, but not in all). Any chance of my getting credit for the WUs that either ended in errors or were automatically aborted? The 2,278 points look good to me!!!
Jeanne

Project Name: The Clean Energy Project
Created: 6/10/09
Name: E000596_957B_002e2h00b
Minimum Quorum: 2
Replication: 2



Result Name                  App Version Number   Status        Sent Time          Time Due / Return Time   CPU Time (hours)   Claimed / Granted BOINC Credit
E000596_957B_002e2h00b_6--   -                    In Progress   6/19/09 09:23:16   6/24/09 05:58:48         0.00               0.0 / 0.0
E000596_957B_002e2h00b_5--   -                    In Progress   6/18/09 21:55:54   6/22/09 21:55:54         0.00               0.0 / 0.0
E000596_957B_002e2h00b_4--   632                  Error         6/18/09 21:55:27   6/19/09 08:22:01         0.00               0.0 / 0.0
E000596_957B_002e2h00b_3--   -                    No Reply      6/14/09 21:46:19   6/18/09 21:46:19         0.00               0.0 / 0.0
E000596_957B_002e2h00b_2--   632                  Aborted       6/14/09 21:46:16   6/20/09 10:30:06         34.69              578.4 / 0.0
E000596_957B_002e2h00b_1--   632                  Aborted       6/10/09 21:43:06   6/15/09 12:39:08         1.29               10.0 / 0.0
E000596_957B_002e2h00b_0--   632                  Error         6/10/09 21:42:58   6/20/09 19:02:15         113.87             2,287.8 / 0.0
[Jun 20, 2009 11:32:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP WU repeatedly failing quorum 6 and no success

Hello jeanne95132,
Invalid results get half credit, but errors and aborts normally get 0. knreed sometimes gives credit for Beta units, but it is time-consuming, so there is no real chance of that in a standard project.

Lawrence
[Jun 20, 2009 11:55:49 PM]