World Community Grid - View Thread - exited with zero status

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: exited with zero status

Thread Status: Active
Total posts in this thread: 26

This topic has been viewed 4530 times and has 25 replies

tfmagnetism
Cruncher
Joined: Jul 22, 2011
Post Count: 25
Status: Offline

Re: exited with zero status

I've been having the resetting problem. I get between 10-25% done, machine gets switched off for the night, and the next day it's back to zero. I've been aborting the tasks that reset, and so far about 50% that I've received had to be aborted. Maybe I've had 6-7 units, and aborted 3-4. Only doing one at a time, with only two work units running at once (dual core cpu). Hmm never thought to check results status yet. I'm going to use an exception on my antivirus to see if it helps. There must be plenty of people having problems if only 50% are working properly! Think of all the wasted cpu time! If I have further detailed info which may be of help I will post back, otherwise I'll be aborting half of these. I really hope someone sorts this out. I mean how many computers are going around in circles on CEP2 on WCG without anyone stepping in? It's a real shame to see it happening.

[Feb 23, 2012 8:54:14 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with zero status

You might want to try:

1. Shutting down the BOINC service before shutting down the computer, via BOINC Manager or by using the stop-service option in Task Manager when running it with "show all processes of all users" (admin, essentially).
2. Reading up on hibernating the computer [it does not use power], then resuming tasks without a second of computing loss, with no need even to return to the last good checkpoint save.

Yes, we recommend that an exception is set in security software so that it does not scan the *sandboxed* BOINC data directory.

For all sciences the failure rate is below 5%, else they would not run at all. Some sciences [at WCG] have a failure rate smaller than 0.2%.

--//--

[Feb 23, 2012 9:07:12 PM]

tfmagnetism
Cruncher
Joined: Jul 22, 2011
Post Count: 25
Status: Offline

Re: exited with zero status

Thanks for the reply, but unfortunately:

1. I can't find a "BOINC" service anywhere in task manager
2. I'd prefer to "shut down" my computer
3. That's still a good... 50% of CEP2 workunits failing on my computer?!

I'd much prefer it if... the problem didn't exist in the first place, because, let's face it, how many people aren't noticing the problem? I'm sure it will time out after 10 days, but even if it does, how much cpu time has been lost? From what I've seen, tasks keep running even after the 10 days is up. What I'm saying is, I was the one that noticed the problem, not BOINC. So no red flag was waving saying "oh I've failed". I noticed it failed, not the computer! So if there is no red flag waving here from BOINC, maybe a red flag should be waving here??!!

Anyway, all I'm doing is trying to help WCG out here. I've put an exception on my AV and I'll let you know how it goes. I think I'm right to be a little worried if it's ... 50%!

[Feb 23, 2012 9:33:57 PM]

KWSN - A Shrubbery
Master Cruncher
Joined: Jan 8, 2006
Post Count: 1585
Status: Offline


Re: exited with zero status

There is a reason CEP2 is an opt-in project. Some computers just aren't run in a manner that works well with this project. Sounds like yours might be one of them. Fortunately, there are many other valuable projects from which to choose within WCG.

----------------------------------------

Distributed computing volunteer since September 27, 2000

[Feb 23, 2012 9:44:13 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2175
Status: Offline

Re: exited with zero status

Dear TPCBF,
this is a strange problem and we are not quite sure what to make of it. If it persists, please post again and maybe the IBM-WCG team can chime in.
Best wishes from
Your Harvard CEP team

I have been pretty busy at work since, with no time to babysit that machine. Had changed it to a non-CEP2 device profile but will see that I try again this weekend...

Ralf

[Feb 24, 2012 6:18:06 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with zero status

Hi Sekerob,
yes, having a regular and an intense queue (or two corresponding projects CEP2 and CEP3) is a great idea and we actually brought it up with our friends at IBM in the very beginning of the project. Unfortunately, there seem to be technical problems on the WCG/BOINC side, so the idea could not be realized.

Hi tfmagnetism,
unfortunately, the checkpoints in CEP2 are - for technical reasons - spread quite far apart, so if you have to fully shut down your computer every night, then CEP2 might not be the best science application for you. But there are other great projects within WCG which you could consider. The checkpoints are not a problem if you can use hibernation or sleep mode. There have been many detailed discussions on this issue in this forum if you want to read more about it.

Hi Ralf,
sounds like a plan! Also, if you haven't already done so you can test the setting tips described in the footer link.
Best wishes
Your Harvard CEP team

[Feb 24, 2012 4:57:26 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

Re: exited with zero status

tfmagnetism,

Can you post the result log for one of the workunits you aborted that reset back to 0%? On the website click on "MY GRID" -> "Result Status", then filter by project cep2 and status user abort to narrow down the results. Click on the link "User Aborted" in the status column.

Thanks,
armstrdj

[Feb 24, 2012 6:11:54 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2175
Status: Offline

Re: exited with zero status

Hi Ralf,
sounds like a plan! Also, if you haven't already done so you can test the setting tips described in the footer link.

I haven't tried the settings yet, might give this a shot late today.
But in general, I don't see how this would apply in this case anyway. It terminates on this host within 3 minutes or less, so I doubt "leave WU in memory" comes into play here.
Otherwise, the machine just sits idle (from a user perspective), as it is a laptop that waits for a replacement screen, doing nothing but crunching, right now for C4CW and SN2S, which it does just fine...

Ralf

[Feb 26, 2012 6:27:30 PM]

tfmagnetism
Cruncher
Joined: Jul 22, 2011
Post Count: 25
Status: Offline

Re: exited with zero status

Hi Guys,

Had a quick read. Sorry about that; I was a bit tired when I wrote the post above. I just had to abort three of these in a row tonight. Looks like a fourth now too. Now that I've seen this problem, I test for it by restarting the machine after about 5% is done. It doesn't seem to be any different from what happens if I let it run to 25%. Still about 50% problematic. I agree about the checkpoints. Looks like it makes a checkpoint at about 11 mins (1.5%), and must not be making one after that? I scanned the data directory for helpful stuff, and that's all I can come up with atm.

OK - armstrdj (above) - just what I was thinking and I looked at this today...
========
Result Log

Result Name: E206601_543_C.25.C21H13N3S.02216491.0.set1d06_0--
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[15:45:36] Number of jobs = 16
[15:45:36] Starting job 0,CPU time has been restored to 0.000000.
[15:48:40] Finished Job #0
[15:48:40] Starting job 1,CPU time has been restored to 180.477557.
[15:57:57] Finished Job #1
[15:57:57] Starting job 2,CPU time has been restored to 679.259554.
[16:51:53] Number of jobs = 16
[16:51:53] Starting job 2,CPU time has been restored to 679.259554.
Abort requested: Exiting

</stderr_txt>
]]>
=======================
-Same as stderr in BOINC,slots,0 folder
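
For anyone who wants to check their own result logs for the same pattern, here is a minimal Python sketch. It assumes the stderr text uses exactly the "Starting job N,CPU time has been restored to X." wording shown above; the function name `find_restarts` and the `SAMPLE_STDERR` variable are made up for illustration and are not part of BOINC or CEP2. A job that is logged as starting more than once with the same restored CPU time suggests the task went back to the same checkpoint after a restart.

```python
import re

# Matches lines such as:
#   [15:57:57] Starting job 2,CPU time has been restored to 679.259554.
# (wording taken from the log above; other CEP2 versions may differ)
START_RE = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\] Starting job (\d+),\s*"
    r"CPU time has been restored to (\d+(?:\.\d+)?)\.?$"
)


def find_restarts(stderr_text):
    """Return (job, restored_cpu_seconds, count) for every job start
    that appears more than once with the same restored CPU time."""
    seen = {}
    for line in stderr_text.splitlines():
        match = START_RE.match(line.strip())
        if match:
            key = (int(match.group(1)), float(match.group(2)))
            seen[key] = seen.get(key, 0) + 1
    return [(job, cpu, n) for (job, cpu), n in seen.items() if n > 1]


# Hypothetical variable holding the stderr section pasted above.
SAMPLE_STDERR = """\
INFO: No state to restore. Start from the beginning.
[15:45:36] Number of jobs = 16
[15:45:36] Starting job 0,CPU time has been restored to 0.000000.
[15:48:40] Finished Job #0
[15:48:40] Starting job 1,CPU time has been restored to 180.477557.
[15:57:57] Finished Job #1
[15:57:57] Starting job 2,CPU time has been restored to 679.259554.
[16:51:53] Number of jobs = 16
[16:51:53] Starting job 2,CPU time has been restored to 679.259554.
Abort requested: Exiting
"""

print(find_restarts(SAMPLE_STDERR))  # [(2, 679.259554, 2)]
```

On the log above this reports job 2 starting twice from about 679 seconds of CPU time, which matches the single checkpoint at roughly 11 minutes.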

Also: boinc_task_state (current WU) in that folder gives:

<active_task>
<project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
<result_name>E206603_235_C.25.C21H13N3S.01875002.1.set1d06_1</result_name>
<checkpoint_cpu_time>665.749868</checkpoint_cpu_time>
<checkpoint_elapsed_time>727.459200</checkpoint_elapsed_time>
<fraction_done>0.015411</fraction_done>
</active_task>

Like I say, it looks like only one checkpoint at about 11 mins. So keep getting to 20-25% at shutdown and presumably there is nothing to restore from. OK, so the resetting here is not technically 0% but 1.5% (close enough for this thread!). What a strange thing? So 50% (out of about 15 WUs so far) are working OK, and 50% not. We just go around and around back to 1.5%, unless I abort. It would be sensible to have a checkpoint somewhere after 1.5% in case I had done 20-25% (max). I just wonder how many people are suffering from this problem and not knowing.
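
To read those numbers without opening the file by hand, a small Python sketch along these lines might help. It assumes the file content looks like the `<active_task>` block pasted above (a real file may hold more fields, which this ignores); the function name `checkpoint_summary` and the `SAMPLE_STATE` variable are only illustrative.

```python
import xml.etree.ElementTree as ET


def checkpoint_summary(xml_text):
    """Summarise the last checkpoint recorded in an <active_task> block."""
    root = ET.fromstring(xml_text)
    cpu_seconds = float(root.findtext("checkpoint_cpu_time"))
    elapsed_seconds = float(root.findtext("checkpoint_elapsed_time"))
    fraction_done = float(root.findtext("fraction_done"))
    return {
        "result_name": root.findtext("result_name"),
        "checkpoint_cpu_minutes": cpu_seconds / 60.0,
        "checkpoint_elapsed_minutes": elapsed_seconds / 60.0,
        "percent_done": fraction_done * 100.0,
    }


# Hypothetical variable holding the block pasted above.
SAMPLE_STATE = """\
<active_task>
<project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
<result_name>E206603_235_C.25.C21H13N3S.01875002.1.set1d06_1</result_name>
<checkpoint_cpu_time>665.749868</checkpoint_cpu_time>
<checkpoint_elapsed_time>727.459200</checkpoint_elapsed_time>
<fraction_done>0.015411</fraction_done>
</active_task>
"""

summary = checkpoint_summary(SAMPLE_STATE)
print(summary["result_name"])
print(round(summary["checkpoint_cpu_minutes"], 1), "CPU minutes at last checkpoint")
print(round(summary["percent_done"], 2), "% done")
```

For the block above this gives about 11.1 CPU minutes and about 1.54% done.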

So how far apart are these checkpoints I wonder? It would be good to know for reference. I'm quite baffled why 50% seem to be OK? Are you sure they are functioning correctly?

I'll have a look into hibernating. The antivirus trick didn't help anything - didn't think it would. Hmm this is so strange. I don't really want to opt-out. After all, 50% WUs are working without a problem.

Any more helpful info and I'll post back.

[Mar 11, 2012 10:10:59 PM]

tfmagnetism
Cruncher
Joined: Jul 22, 2011
Post Count: 25
Status: Offline

Re: exited with zero status

From stdoutdae:

11-Mar-2012 20:29:51 [World Community Grid] Task X0930059120882200511080633_1 exited with zero status but no 'finished' file
11-Mar-2012 20:29:51 [World Community Grid] If this happens repeatedly you may need to reset the project.
11-Mar-2012 20:29:51 [World Community Grid] Task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 exited with zero status but no 'finished' file
11-Mar-2012 20:29:51 [World Community Grid] If this happens repeatedly you may need to reset the project.

^^ Usually get this just before shutdown, but this example was a restart so I also got the following immediately after

11-Mar-2012 20:29:52 [---] Resuming computation
11-Mar-2012 20:29:52 [---] Resuming network activity
11-Mar-2012 20:29:54 [World Community Grid] Task X0930059120882200511080633_1 exited with a DLL initialization error.
11-Mar-2012 20:29:54 [World Community Grid] If this happens repeatedly you may need to reboot your computer.
11-Mar-2012 20:29:54 [World Community Grid] Task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 exited with a DLL initialization error.
11-Mar-2012 20:29:54 [World Community Grid] If this happens repeatedly you may need to reboot your computer.
11-Mar-2012 20:29:54 [World Community Grid] Restarting task X0930059120882200511080633_1 using hcc1 version 642
11-Mar-2012 20:29:54 [World Community Grid] Restarting task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 using cep2 version 640

And now the computer restarted..

11-Mar-2012 20:31:23 [---] Starting BOINC client version 6.12.34 for windows_x86_64
11-Mar-2012 20:31:23 [---] log flags: file_xfer, sched_ops, task
11-Mar-2012 20:31:23 [---] Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11-Mar-2012 20:31:23 [---] Data directory: C:\ProgramData\BOINC
11-Mar-2012 20:31:23 [---] Running under account S
11-Mar-2012 20:31:23 [---] Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 5000+ [Family 15 Model 107 Stepping 2]
11-Mar-2012 20:31:23 [---] Processor: 512.00 KB cache
11-Mar-2012 20:31:23 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni cx16 syscall nx lm svm rdtscp 3dnowext 3dnow
11-Mar-2012 20:31:23 [---] OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
11-Mar-2012 20:31:23 [---] Memory: 1.75 GB physical, 3.23 GB virtual
11-Mar-2012 20:31:23 [---] Disk: 60.00 GB total, 33.11 GB free
11-Mar-2012 20:31:23 [---] Local time is UTC +0 hours
11-Mar-2012 20:31:23 [---] No usable GPUs found
11-Mar-2012 20:31:23 [World Community Grid] URL http://www.worldcommunitygrid.org/; Computer ID 1800297; resource share 100
11-Mar-2012 20:31:23 [World Community Grid] General prefs: from World Community Grid (last modified 05-Feb-2012 23:10:36)
11-Mar-2012 20:31:23 [World Community Grid] Host location: none
11-Mar-2012 20:31:23 [World Community Grid] General prefs: using your defaults
11-Mar-2012 20:31:23 [---] Preferences:
11-Mar-2012 20:31:23 [---] max memory usage when active: 895.25MB
11-Mar-2012 20:31:23 [---] max memory usage when idle: 1342.87MB
11-Mar-2012 20:31:23 [---] max disk usage: 10.00GB
11-Mar-2012 20:31:23 [---] don't compute while active
11-Mar-2012 20:31:23 [---] don't use GPU while active
11-Mar-2012 20:31:23 [---] (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
11-Mar-2012 20:31:23 [---] Not using a proxy
Initialization completed
11-Mar-2012 20:31:28 [World Community Grid] Restarting task X0930059120882200511080633_1 using hcc1 version 642
11-Mar-2012 20:31:28 [World Community Grid] Restarting task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 using cep2 version 640

Not sure if any of that's useful but for completeness I added it.
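
In case it helps others check how often this happens on their own machines, here is a rough Python sketch that counts the "exited with zero status" messages per task. It assumes the message wording is exactly as in the stdoutdae lines above; the function name `count_zero_status_exits` and the `SAMPLE_LOG` variable are invented for this example.

```python
import re
from collections import Counter

# Message wording copied from the client log lines above.
ZERO_STATUS_RE = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_zero_status_exits(log_text):
    """Count, per task name, how often the client logged a zero-status exit."""
    return Counter(m.group(1) for m in ZERO_STATUS_RE.finditer(log_text))


# Hypothetical variable holding the first four log lines pasted above.
SAMPLE_LOG = """\
11-Mar-2012 20:29:51 [World Community Grid] Task X0930059120882200511080633_1 exited with zero status but no 'finished' file
11-Mar-2012 20:29:51 [World Community Grid] If this happens repeatedly you may need to reset the project.
11-Mar-2012 20:29:51 [World Community Grid] Task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 exited with zero status but no 'finished' file
11-Mar-2012 20:29:51 [World Community Grid] If this happens repeatedly you may need to reset the project.
"""

for task, count in count_zero_status_exits(SAMPLE_LOG).items():
    print(task, count)
```

Running it over a whole stdoutdae file would show whether the same task keeps exiting this way across several shutdowns.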

[Mar 11, 2012 10:24:24 PM]
