| World Community Grid Forums
Thread Status: Active Total posts in this thread: 26
|
|
| Author |
|
|
tfmagnetism
Cruncher Joined: Jul 22, 2011 Post Count: 25 Status: Offline Project Badges:
|
I've been having the resetting problem. I get between 10-25% done, machine gets switched off for the night, and the next day it's back to zero. I've been aborting the tasks that reset, and so far about 50% that I've received had to be aborted. Maybe I've had 6-7 units, and aborted 3-4. Only doing one at a time, with only two work units running at once (dual core CPU). Hmm, never thought to check results status yet. I'm going to use an exception on my antivirus to see if it helps.

There must be plenty of people having problems if only 50% are working properly! Think of all the wasted CPU time! If I have further detailed info which may be of help I will post back, otherwise I'll be aborting half of these. I really hope someone sorts this out. I mean, how many computers are going around in circles on CEP2 on WCG without anyone stepping in? It's a real shame to see it happening.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
You might want to try:
1. Shutting down the BOINC service before shutting down the computer, via BOINC Manager or by stopping the service in Task Manager with "show all processes of all users" enabled (essentially as admin).
2. Reading up on hibernating the computer [does not use power] and then resuming tasks without a second of computing loss, with no need even to return to the last good checkpoint save.

Yes, we recommend setting an exception in security software so that it does not scan the *sandboxed* BOINC data directory. For all sciences the failure rate is below 5%, else they would not run at all. Some sciences [at WCG] have a failure rate smaller than 0.2%.
--//-- |
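The first suggestion above (stopping the BOINC client cleanly before the machine is shut down) could be scripted roughly as follows. This is only a sketch: it assumes the `boinccmd` command-line tool that ships with BOINC is on the PATH, and the function name `stop_boinc_client` and its `dry_run` parameter are made up for illustration. Depending on the installation, `boinccmd` may also need to be run from the BOINC data directory or be given the GUI RPC password.

```python
import shutil
import subprocess


def stop_boinc_client(boinccmd="boinccmd", dry_run=False):
    """Ask the running BOINC client to exit; return the command that was used.

    Hypothetical helper: only the `boinccmd --quit` call itself comes from BOINC.
    """
    command = [boinccmd, "--quit"]
    if dry_run:
        # Only report what would be run, without touching the client.
        return command
    if shutil.which(boinccmd) is None:
        raise FileNotFoundError(f"{boinccmd} not found on PATH")
    # check=True raises CalledProcessError if boinccmd reports a failure.
    subprocess.run(command, check=True, timeout=60)
    return command


if __name__ == "__main__":
    print(stop_boinc_client(dry_run=True))
```

Running this just before the normal shutdown gives the client a chance to exit on its own instead of being killed by the operating system. It does not create extra checkpoints, so work done since the last checkpoint can still be lost.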
||
|
|
tfmagnetism
Cruncher Joined: Jul 22, 2011 Post Count: 25 Status: Offline Project Badges:
|
Thanks for the reply, but unfortunately:
1. I can't find a "BOINC" service anywhere in Task Manager.
2. I'd prefer to "shut down" my computer.
3. That's still a good... 50% of CEP2 workunits failing on my computer?!

I'd much prefer it if... the problem didn't exist in the first place, because, let's face it, how many people aren't noticing the problem? I'm sure it will time out after 10 days, but even if it does, then how much CPU time has been lost? From what I've seen, tasks keep running even after the 10 days is up. I mean, what I'm saying is, I was the one that noticed the problem, not BOINC. So no red flag was waving saying "oh I've failed". I noticed it failed, not the computer! So if there is no red flag waving here from BOINC, maybe a red flag should be waving here??!!

Anyway, all I'm doing is trying to help WCG out here. I've put an exception on my AV and I'll let you know how it goes. I think I'm right to be a little worried if it's ... 50%! |
||
|
|
KWSN - A Shrubbery
Master Cruncher Joined: Jan 8, 2006 Post Count: 1585 Status: Offline |
There is a reason CEP2 is an opt-in project. Some computers just aren't run in a manner that works well with this project. Sounds like yours might be one of them. Fortunately, there are many other valuable projects from which to choose within WCG.
----------------------------------------
Distributed computing volunteer since September 27, 2000 |
||
|
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 2175 Status: Offline Project Badges:
|
Quote (Harvard CEP team): "Dear TPCBF, this is a strange problem and we are not quite sure what to make of it. If it persists, please post again and maybe the IBM-WCG team can chime in. Best wishes from Your Harvard CEP team"

I have been pretty busy at work since, with no time to babysit that machine. I had changed it to a non-CEP2 device profile but will see that I try again this weekend...
Ralf |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi Sekerob,
yes, having a regular and an intense queue (or two corresponding projects CEP2 and CEP3) is a great idea and we actually brought it up with our friends at IBM in the very beginning of the project. Unfortunately, there seem to be technical problems on the WCG/BOINC side, so the idea could not be realized.

Hi tfmagnetism,
unfortunately, the checkpoints in CEP2 are - for technical reasons - spread quite far apart, so if you have to fully shut down your computer every night, then CEP2 might not be the best science application for you. But there are other great projects within WCG which you could consider. The checkpoints are not a problem if you can use hibernation or sleep mode. There have been many detailed discussions on this issue in this forum if you want to read more about it.

Hi Ralf,
sounds like a plan! Also, if you haven't already done so you can test the setting tips described in the footer link.

Best wishes
Your Harvard CEP team |
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
tfmagnetism,
Can you post the result log for one of the workunits you aborted that reset back to 0%? On the website, click on "MY GRID" -> "Result Status"; then you can filter by project CEP2 and status "User Aborted" to narrow down the results. Click on the link "User Aborted" in the status column.
Thanks,
armstrdj |
||
|
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 2175 Status: Offline Project Badges:
|
Quote: "Hi Ralf, sounds like a plan! Also, if you haven't already done so you can test the setting tips described in the footer link."

I haven't tried the settings yet, might give this a shot late today. But in general, I don't see how this would apply in this case anyway: it terminates on this host within 3 minutes or less, so I doubt "leave WU in memory" comes into play here. Otherwise, the machine just sits idle (from a user perspective), as it is a laptop that is waiting for a replacement screen, doing nothing but crunching, right now for C4CW and SN2S, which it does just fine...
Ralf |
||
|
|
tfmagnetism
Cruncher Joined: Jul 22, 2011 Post Count: 25 Status: Offline Project Badges:
|
Hi Guys,
Had a quick read. Sorry about that, I was a bit tired when I wrote that above. I just had to abort three of these in a row tonight. Looks like a fourth now too. I test by restarting the machine after about 5% done for this problem, now that I've seen it. It doesn't seem to be any different from what happens if I let it run to 25%. Still about 50% problematic.

I agree about the checkpoints. Looks like it makes a checkpoint at about 11 mins (1.5%), and must not be making one after that? I scanned the data directory for helpful stuff, and that's all I can come up with atm.

OK - armstrdj (above) - just what I was thinking and I looked at this today...

======== Result Log
Result Name: E206601_ 543_ C.25.C21H13N3S.02216491.0.set1d06_ 0--
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[15:45:36] Number of jobs = 16
[15:45:36] Starting job 0,CPU time has been restored to 0.000000.
[15:48:40] Finished Job #0
[15:48:40] Starting job 1,CPU time has been restored to 180.477557.
[15:57:57] Finished Job #1
[15:57:57] Starting job 2,CPU time has been restored to 679.259554.
[16:51:53] Number of jobs = 16
[16:51:53] Starting job 2,CPU time has been restored to 679.259554.
Abort requested: Exiting
</stderr_txt>
]]>
=======================
-Same as stderr in BOINC,slots,0 folder

Also: boinc_task_state (current WU) in that folder gives:
<active_task>
<project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
<result_name>E206603_235_C.25.C21H13N3S.01875002.1.set1d06_1</result_name>
<checkpoint_cpu_time>665.749868</checkpoint_cpu_time>
<checkpoint_elapsed_time>727.459200</checkpoint_elapsed_time>
<fraction_done>0.015411</fraction_done>
</active_task>

Like I say, it looks like only one checkpoint at about 11 mins. So keep getting to 20-25% at shutdown and presumably there is nothing to restore from.
OK, so the resetting here is not technically to 0% but to 1.5% (close enough for this thread!). What a strange thing! So 50% (out of about 15 WUs so far) are working OK, and 50% are not. We just go around and around back to 1.5%, unless I abort. It would be sensible to have a checkpoint somewhere after 1.5% in case I had done 20-25% (max). I just wonder how many people are suffering from this problem and not knowing.

So how far apart are these checkpoints, I wonder? It would be good to know for reference. I'm quite baffled why 50% seem to be OK. Are you sure they are functioning correctly?

I'll have a look into hibernating. The antivirus trick didn't help anything - didn't think it would. Hmm, this is so strange. I don't really want to opt out. After all, 50% of WUs are working without a problem. Any more helpful info and I'll post back. |
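The pattern in the result log above (a job that is started twice with the same restored CPU time) can be picked out automatically. The following is a rough sketch, not a WCG or BOINC tool: the regular expression is inferred only from the `Starting job N,CPU time has been restored to T.` lines quoted in this post, and the names `JOB_START`, `find_restarts` and `SAMPLE_LOG` are made up for illustration.

```python
import re

# Matches lines such as:
# [15:57:57] Starting job 2,CPU time has been restored to 679.259554.
JOB_START = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] Starting job (\d+),\s*"
    r"CPU time has been restored to (\d+(?:\.\d+)?)"
)


def find_restarts(stderr_text):
    """Return (job_number, restored_cpu_seconds) for every job started again."""
    seen_jobs = set()
    restarts = []
    for line in stderr_text.splitlines():
        match = JOB_START.search(line)
        if match is None:
            continue
        job = int(match.group(2))
        cpu_seconds = float(match.group(3))
        if job in seen_jobs:
            # Same job number seen before: the task went back to a checkpoint.
            restarts.append((job, cpu_seconds))
        seen_jobs.add(job)
    return restarts


# Excerpt of the stderr text quoted in the post.
SAMPLE_LOG = """\
[15:45:36] Starting job 0,CPU time has been restored to 0.000000.
[15:48:40] Finished Job #0
[15:48:40] Starting job 1,CPU time has been restored to 180.477557.
[15:57:57] Finished Job #1
[15:57:57] Starting job 2,CPU time has been restored to 679.259554.
[16:51:53] Number of jobs = 16
[16:51:53] Starting job 2,CPU time has been restored to 679.259554.
"""

if __name__ == "__main__":
    for job, cpu_seconds in find_restarts(SAMPLE_LOG):
        print(f"job {job} restarted from {cpu_seconds / 60:.1f} min of CPU time")
```

On the excerpt this reports that job 2 was restarted from roughly 11 minutes of CPU time, which matches the "checkpoint at about 11 mins" observation: everything computed between 15:57 and the shutdown was not covered by a later checkpoint.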
||
|
|
tfmagnetism
Cruncher Joined: Jul 22, 2011 Post Count: 25 Status: Offline Project Badges:
|
From stdoutdae:
11-Mar-2012 20:29:51 [World Community Grid] Task X0930059120882200511080633_1 exited with zero status but no 'finished' file
11-Mar-2012 20:29:51 [World Community Grid] If this happens repeatedly you may need to reset the project.
11-Mar-2012 20:29:51 [World Community Grid] Task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 exited with zero status but no 'finished' file
11-Mar-2012 20:29:51 [World Community Grid] If this happens repeatedly you may need to reset the project.

^^ Usually get this just before shutdown, but this example was a restart so I also got the following immediately after:

11-Mar-2012 20:29:52 [---] Resuming computation
11-Mar-2012 20:29:52 [---] Resuming network activity
11-Mar-2012 20:29:54 [World Community Grid] Task X0930059120882200511080633_1 exited with a DLL initialization error.
11-Mar-2012 20:29:54 [World Community Grid] If this happens repeatedly you may need to reboot your computer.
11-Mar-2012 20:29:54 [World Community Grid] Task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 exited with a DLL initialization error.
11-Mar-2012 20:29:54 [World Community Grid] If this happens repeatedly you may need to reboot your computer.
11-Mar-2012 20:29:54 [World Community Grid] Restarting task X0930059120882200511080633_1 using hcc1 version 642
11-Mar-2012 20:29:54 [World Community Grid] Restarting task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 using cep2 version 640

And now the computer restarted..

11-Mar-2012 20:31:23 [---] Starting BOINC client version 6.12.34 for windows_x86_64
11-Mar-2012 20:31:23 [---] log flags: file_xfer, sched_ops, task
11-Mar-2012 20:31:23 [---] Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11-Mar-2012 20:31:23 [---] Data directory: C:\ProgramData\BOINC
11-Mar-2012 20:31:23 [---] Running under account S
11-Mar-2012 20:31:23 [---] Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 5000+ [Family 15 Model 107 Stepping 2]
11-Mar-2012 20:31:23 [---] Processor: 512.00 KB cache
11-Mar-2012 20:31:23 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni cx16 syscall nx lm svm rdtscp 3dnowext 3dnow
11-Mar-2012 20:31:23 [---] OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
11-Mar-2012 20:31:23 [---] Memory: 1.75 GB physical, 3.23 GB virtual
11-Mar-2012 20:31:23 [---] Disk: 60.00 GB total, 33.11 GB free
11-Mar-2012 20:31:23 [---] Local time is UTC +0 hours
11-Mar-2012 20:31:23 [---] No usable GPUs found
11-Mar-2012 20:31:23 [World Community Grid] URL http://www.worldcommunitygrid.org/; Computer ID 1800297; resource share 100
11-Mar-2012 20:31:23 [World Community Grid] General prefs: from World Community Grid (last modified 05-Feb-2012 23:10:36)
11-Mar-2012 20:31:23 [World Community Grid] Host location: none
11-Mar-2012 20:31:23 [World Community Grid] General prefs: using your defaults
11-Mar-2012 20:31:23 [---] Preferences:
11-Mar-2012 20:31:23 [---] max memory usage when active: 895.25MB
11-Mar-2012 20:31:23 [---] max memory usage when idle: 1342.87MB
11-Mar-2012 20:31:23 [---] max disk usage: 10.00GB
11-Mar-2012 20:31:23 [---] don't compute while active
11-Mar-2012 20:31:23 [---] don't use GPU while active
11-Mar-2012 20:31:23 [---] (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
11-Mar-2012 20:31:23 [---] Not using a proxy
Initialization completed
11-Mar-2012 20:31:28 [World Community Grid] Restarting task X0930059120882200511080633_1 using hcc1 version 642
11-Mar-2012 20:31:28 [World Community Grid] Restarting task E206433_528_C.25.C21H11NOS2.01460520.0.set1d06_1 using cep2 version 640

Not sure if any of that's useful but for completeness I added it. |
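For anyone who wants to check their own client log for the same warnings, the "exited with ..." lines can be grouped per task with a few lines of Python. This is a sketch under assumptions: the line layout (`date time [project] Task NAME exited with REASON`) is taken only from the messages quoted above, and `EXIT_LINE`, `unclean_exits` and `SAMPLE_CLIENT_LOG` are illustrative names, not part of BOINC.

```python
import re
from collections import defaultdict

# Layout inferred from the quoted messages, e.g.
# 11-Mar-2012 20:29:51 [World Community Grid] Task NAME exited with zero status ...
EXIT_LINE = re.compile(
    r"^(?P<stamp>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]*)\] "
    r"Task (?P<task>\S+) exited with (?P<reason>.+)$"
)


def unclean_exits(log_text):
    """Map each task name to the list of 'exited with ...' reasons logged for it."""
    exits = defaultdict(list)
    for line in log_text.splitlines():
        match = EXIT_LINE.match(line.strip())
        if match:
            exits[match["task"]].append(match["reason"].rstrip("."))
    return dict(exits)


# A few of the lines quoted in the post.
SAMPLE_CLIENT_LOG = """\
11-Mar-2012 20:29:51 [World Community Grid] Task X0930059120882200511080633_1 exited with zero status but no 'finished' file
11-Mar-2012 20:29:51 [World Community Grid] If this happens repeatedly you may need to reset the project.
11-Mar-2012 20:29:54 [World Community Grid] Task X0930059120882200511080633_1 exited with a DLL initialization error.
11-Mar-2012 20:29:54 [World Community Grid] Restarting task X0930059120882200511080633_1 using hcc1 version 642
"""

if __name__ == "__main__":
    for task, reasons in unclean_exits(SAMPLE_CLIENT_LOG).items():
        print(task, "->", "; ".join(reasons))
```

Both messages in the quoted log appear right around a shutdown or restart, so a task that collects many of them is a reasonable candidate for the "keeps going back to its last checkpoint" behaviour described in this thread.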
||
|
|
|