| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 20
|
|
| Author |
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
OK, problem found: all of the se work units that restart will have this issue. Basically, the DYNAMICS loop says @Nstep where it should say @restartnstep. As long as the work unit isn't restarted, it will run to completion. I will be discussing with the researchers tomorrow morning what they wish to do with these. The pe work units appear to be fine; only the B.se work units are affected.
-Uplinger |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Basically, if you can let the work units run to completion without stopping or suspending them, they will complete. At this time, checkpointing is not working on these work units. Having the wrong variable makes the work unit start the dynamics loop from scratch after a restart, while the counter for times through the loop keeps going. I will see what the researchers want, but in the meantime, if you can, just let them run without stopping them.
-Uplinger |
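To illustrate the failure mode described above, here is a minimal toy model in Python. It is not the CHARMM input script and does not reproduce the real semantics of @Nstep or @restartnstep, which the thread does not show. The function and parameter names are hypothetical, and the sketch assumes only what the post states: after a restart the loop runs the full step count again while the step counter carries on from the checkpoint.

```python
def run_dynamics(total_steps, checkpoint_step, use_restart_count):
    """Toy model of resuming a dynamics loop (hypothetical names throughout).

    total_steps       -- steps the work unit is meant to run in total
    checkpoint_step   -- steps already completed when the task was stopped
                         (0 means the task was never interrupted)
    use_restart_count -- True models the intended behaviour (run only the
                         remaining steps); False models the reported bug
                         (run the full step count again)
    Returns the value of the step counter when the loop ends.
    """
    remaining = total_steps - checkpoint_step
    steps_to_run = remaining if use_restart_count else total_steps

    counter = checkpoint_step  # the counter resumes from the checkpoint
    for _ in range(steps_to_run):
        counter += 1
    return counter


# Never interrupted: the wrong variable does no harm.
uninterrupted = run_dynamics(100, 0, use_restart_count=False)

# Interrupted at step 40: the buggy loop overshoots, the fixed loop does not.
buggy = run_dynamics(100, 40, use_restart_count=False)
fixed = run_dynamics(100, 40, use_restart_count=True)
```

In this model an uninterrupted run ends on the expected step count either way, which matches the advice to let the se work units run in one stretch.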
||
|
|
HutchNYC
Advanced Cruncher United States Joined: Nov 27, 2005 Post Count: 97 Status: Offline Project Badges:
|
OK. The few se units that BOINC switched away from mid-completion I have now manually suspended until you guys figure out what you want to do with them.
Thanks for looking into this, Uplinger. Hutch |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Hutch, can you edit the subject in the first post to say type-b.se? I have confirmed with my own test that it does not affect B.pe work units.
thanks, -Uplinger |
||
|
|
HutchNYC
Advanced Cruncher United States Joined: Nov 27, 2005 Post Count: 97 Status: Offline Project Badges:
|
Certainly.
Hutch |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Whilst eyes shut, 2 pe + 2 se came in though the device had 4 HPF2 crash-outs on the 9th... reliability must have come back up quite quickly :biggrin:. Manually pushed them to front just now (the suspended HCC tasks had no wingman waiting ;>)
This fetch happened when BOINCTasks showed a total of about 1.6 days per core CPU time in the cache. (at time of going into dormant state :o)
PS. v.v. the se stop/start issue of this set, will make sure they run in one stretch. Backup project suspended. First checkpoint log entries for result names:
1892 World Community Grid 11-03-2010 07:29:04 [checkpoint_debug] result erlc_d098_se0000_1 checkpointed
1893 World Community Grid 11-03-2010 07:29:12 [checkpoint_debug] result erlc_d099_se0000_1 checkpointed
1894 World Community Grid 11-03-2010 07:29:35 [checkpoint_debug] result erlc_d170_pe0000_1 checkpointed
1895 World Community Grid 11-03-2010 07:30:02 [checkpoint_debug] result erlc_d105_pe0000_1 checkpointed
WCG
----------------------------------------Please help to make the Forums an enjoyable experience for All! [Edit 1 times, last edit by Sekerob at Mar 11, 2010 7:30:17 AM] |
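For anyone who wants to pick the se results out of a message log like the one quoted above, here is a small Python sketch. The regular expression is inferred only from the four lines in the post, so it is an assumption about the line layout rather than a documented BOINC format. The day-month-year reading of the date is also an assumption, though it is consistent with the post's edit date of Mar 11, 2010.

```python
import re
from datetime import datetime

# Pattern inferred from the four quoted lines only (an assumption, not a spec):
# <sequence> <project name> <DD-MM-YYYY HH:MM:SS> [checkpoint_debug] result <name> checkpointed
LINE_RE = re.compile(
    r"^(?P<seq>\d+) (?P<project>.+?) "
    r"(?P<stamp>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[checkpoint_debug\] result (?P<result>\S+) checkpointed$"
)


def parse_checkpoint_line(line):
    """Return (sequence, timestamp, result_name), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Day-first date order is assumed here.
    when = datetime.strptime(match.group("stamp"), "%d-%m-%Y %H:%M:%S")
    return int(match.group("seq")), when, match.group("result")


LOG = """\
1892 World Community Grid 11-03-2010 07:29:04 [checkpoint_debug] result erlc_d098_se0000_1 checkpointed
1893 World Community Grid 11-03-2010 07:29:12 [checkpoint_debug] result erlc_d099_se0000_1 checkpointed
1894 World Community Grid 11-03-2010 07:29:35 [checkpoint_debug] result erlc_d170_pe0000_1 checkpointed
1895 World Community Grid 11-03-2010 07:30:02 [checkpoint_debug] result erlc_d105_pe0000_1 checkpointed
"""

entries = [parse_checkpoint_line(line) for line in LOG.splitlines()]
# Keep only the se results, the ones affected by the restart issue.
se_results = [entry[2] for entry in entries if entry and "_se" in entry[2]]
```

With the quoted log, `se_results` holds the two se result names, which are the tasks to keep running without interruption.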
||
|
|
X-Files 27
Senior Cruncher Canada Joined: May 21, 2007 Post Count: 391 Status: Offline Project Badges:
|
Exit code 29 on type-b.se
Result Name: erlc_ c123_ se0000_ 1--
<core_client_version>6.10.36</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
forrtl: severe (29): file not found, unit 1, file D:\BOINC Data\slots\3\fort.1
Image PC Routine Line Source
wcg_dddt2_charmm_ 00B1638E Unknown Unknown Unknown
Stack trace terminated abnormally.
</stderr_txt>
]]> |
||
|
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline Project Badges:
|
"Whilst eyes shut, 2 pe + 2 se came in though the device had 4 HPF2 crash-outs on the 9th... reliability must have come back up quite quickly"
Sekerob, although I could well be mistaken, I don't think having a reliable computer is a requirement for receiving B & C type units - just the A's. Thus, those 4 WUs you've got could well be helping you get back up to that lofty status. |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Whatever the priority they're circulated with... suits me, think the known return time for a device has to be < 2 days.
Meantime the 2 se completed in 1 run, 3:44 and 3:50 hours respectively, both PV, of which 1 is quorum complete.
edit: 1 valid
erlc_ d099_ se0000_ 1-- 1112084 Valid 10-3-10 23:11:11 11-3-10 10:26:44 3.66 67.5 / 69.9
The pe tasks are solidly on track to do the full 10 hours. RAM peak at 46% showing 24Mb and VM stuck on max of 540Mb. Kernel time 15 seconds, PF Soft: 31k, Delta zero. Very nice.
WCG
----------------------------------------Please help to make the Forums an enjoyable experience for All! [Edit 1 times, last edit by Sekerob at Mar 11, 2010 11:00:55 AM] |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
You will start seeing some new work units named erls_*se0000. These are to make sure that the results from the work units above are correct. Future erlc*se0000 work units will have the checkpoint fix in them and won't require being sent as erls*se0000.
If you can let the current se0000 work units run from start to finish without stopping, then please let them run. If you are not able to do this on your machine, then please abort the work unit to let another member complete it. -Uplinger |
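The name patterns mentioned above can be checked with a short Python sketch using the standard `fnmatch` module. The labels returned here are hypothetical, as is the erls example name, and the patterns are applied to work unit names without the trailing replica suffix. Since future erlc*se0000 work units will carry the fix, the name alone cannot tell an affected se work unit from a fixed one.

```python
from fnmatch import fnmatchcase


def classify(workunit_name):
    """Label a work unit name using the patterns from the post (labels are made up)."""
    if fnmatchcase(workunit_name, "erls_*se0000"):
        return "verification"  # sent to double-check earlier se results
    if fnmatchcase(workunit_name, "erlc*se0000"):
        return "se"  # may or may not include the checkpoint fix
    return "other"


labels = {
    name: classify(name)
    for name in (
        "erls_d098_se0000",  # hypothetical name following the erls_*se0000 pattern
        "erlc_d098_se0000",  # se work unit name taken from the log earlier in the thread
        "erlc_d170_pe0000",  # pe work unit, not affected
    )
}
```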
||
|
|
|