| World Community Grid Forums
Thread Status: Active Total posts in this thread: 9
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I read a response by cleanenergy. Is this what is happening with this work unit? If so, should I abort it? Or should I let it continue? Is a non-converging work unit still useful for the science?
----------------------------------------
Let me state first off that I have read: Project Checkpoint Saving - How to Minimize Progress Loss on Close/Restart. I also did a "checkpoint" search in the CEP forum and read the most relevant of the 113 posts, some of which I understood and some of which I did not. I understand that there are 16 jobs in each work unit and that a checkpoint is only made after each job. I also gather that these checkpoints can be hours apart.
Work unit number E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0 has been running for over 9 hours. The time without a checkpoint DOES NOT worry me. What worries me is that it is over 70% complete without a checkpoint. Is it possible that the first of the 16 jobs has not been completed by now?
I am going to try to provide any information/screenshots that I think you may ask for. I know, I know, I forgot something...
BOINC message tab: 10/21/2011 2:11:30 PM Starting BOINC client version 6.10.58 for windows_intelx86
BOINC cc_config.xml file: <cc_config>
WCG website device profile 1
WCG website device profile 2
BOINC local preferences – Processor
BOINC local preferences – Network
BOINC local preferences – Disk/Memory
BOINC disk graph
Antivirus: Norton 2011
Firewall: ZA Free 9.2.106.000
My computer is hyper-threaded, which is why I set the multi-processor usage to 50%.
------------------------------
[append] The work unit finished without ever doing a checkpoint. Here is the rest of the messages:
10/22/2011 5:44:30 AM World Community Grid Computation for task E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0 finished
So, was this a non-converging work unit? If I get another one like this, should I abort it, or should I let it complete?
Off Topic: I thought these work units timed out at 12 hours. This one went for 12 hours and 23 minutes. Two more work units in my queue are also estimated at over 12 hours.
[Edit 2 times, last edit by Former Member at Oct 23, 2011 4:23:40 AM] |
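For readers trying to picture why the progress bar can climb without a checkpoint appearing, here is a minimal Python sketch of the per-job checkpointing described in this post: state is saved only when a whole job finishes, so a very long first job leaves no checkpoint behind. The function names, the JSON file, and the job list are hypothetical illustrations, not CEP2 or BOINC code.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the number of jobs already completed (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["jobs_done"]


def save_checkpoint(path, jobs_done):
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"jobs_done": jobs_done}, f)
    os.replace(tmp, path)


def run_work_unit(jobs, path):
    """Run jobs in order, checkpointing only after each job completes.

    `jobs` is a list of zero-argument callables (hypothetical stand-ins
    for the 16 jobs in a work unit). Work done inside an unfinished job
    is lost on restart, because nothing is saved mid-job.
    """
    start = load_checkpoint(path)
    results = []
    for index in range(start, len(jobs)):
        results.append(jobs[index]())
        save_checkpoint(path, index + 1)
    return results
```

In this sketch, a restart resumes at the first job that has no checkpoint, which is why a first job that runs for hours shows progress but no saved state.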
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Dear debsgr8,
It is a bit complicated to explain what non-convergence means, but Wikipedia sums it up quite nicely: http://en.wikipedia.org/wiki/Iterative_method
Most numerical calculations in science (such as in quantum chemistry as employed in CEP2) use iterative algorithms, i.e., they generate a sequence of improving approximate solutions that get closer and closer to the actual result until they reach (=converge to) it, and no more change occurs. Sometimes, these algorithms can fail (for many reasons), in which case the intermediate solutions do not approach (=converge to) the desired result but, e.g., start to oscillate or shoot off to no man's land. (It can also sometimes converge to a wrong result, but that is a different story.)
Anyway, the code we use and the wus we designed try to reduce the chance of failure as much as possible, so it only happens very rarely. But if it happens in CEP2, there is not much one can do about it. The wus at some point give up on a job and move on to the next; no manual intervention is necessary. Please don't abort wus, they do that by themselves once it's hopeless.
Your example could well be a case which does not converge. In these cases we have to manually rerun certain parts of the calcs in-house and with specialized settings, but that's ok.
The progress setting is a very crude tool and cannot account for extraordinary circumstances. The 12h limit is for CPU time; you are looking at wall clock time, which is commonly higher.
Best wishes
Your Harvard CEP team
BTW: You use LAIM, correct? |
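To make the three behaviours above concrete (converging, oscillating, shooting off), here is a toy Python sketch of a fixed-point iteration with an iteration cap, so it gives up by itself instead of running forever. This is only an illustration of the general idea; the function name, tolerance, and iteration cap are assumptions and have nothing to do with the actual CEP2 quantum chemistry code.

```python
import math


def fixed_point(g, x0, tol=1e-10, max_iter=200):
    """Iterate x <- g(x) until successive values differ by less than tol.

    Returns (last_value, iterations_used, converged). The loop gives up
    after max_iter steps, or as soon as the value stops being finite.
    """
    x = x0
    for i in range(1, max_iter + 1):
        x_new = g(x)
        if not math.isfinite(x_new):
            return x_new, i, False      # shot off to infinity (or NaN)
        if abs(x_new - x) < tol:
            return x_new, i, True       # no more change: converged
        x = x_new
    return x, max_iter, False           # budget exhausted: give up


# Converges: repeated cosine settles on the solution of x = cos(x).
root, steps, ok = fixed_point(math.cos, 1.0)

# Oscillates: flips between +1 and -1 and never settles.
_, _, ok_osc = fixed_point(lambda x: -x, 1.0)

# Diverges: repeated squaring overflows to infinity within a few steps.
_, _, ok_div = fixed_point(lambda x: x * x, 2.0)
```

Only the first call reports convergence; the other two stop on their own with `converged` set to `False`, which is the same "give up and move on" behaviour described above.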
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
cleanenergy said:
Please don't abort wus, they do that by themselves once it's hopeless.
Thank you. That's what I wanted to know.
cleanenergy asked:
BTW: You use LAIM, correct?
Yes. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Work unit E203560_910_C.28.C22H15N3SSi2.00144301.2.set1d06_1 is doing the same thing. I'm at 80% without a checkpoint.
Am I just a lucky cruncher who happens to get 2 of these? Or is there something about my machine and/or settings that is causing this to happen? |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello debsgr8,
Checkpoints are set in the work unit; they are not affected by the computer as long as it computes correctly. I would just monitor the Results Status page and only get worried if something strange showed up there.
I admit, I just exaggerated. Personally, I would check Properties in Tasks for that work unit in BOINC Manager repeatedly, trying to figure it out. But I would just be puzzled as long as Results Status validated.
What does the checkpoint entry say in the Properties popup window?
Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi debsgr8,
While problematic molecules are rare, they tend to cluster: there may be a structural feature which causes problems, and since similar molecules and their wus are created at the same time, they may end up with the same host, if the host downloads a series of results. You should just monitor the situation and if the Results Status persistently shows problems over the next two weeks, then we should have a closer look. Best wishes from Your Harvard CEP team |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
lawrencehardin said:
... I would just monitor the Results Status page and only get worried if something strange showed up there.
cleanenergy said:
... You should just monitor the situation and if the Results Status persistently shows problems over the next two weeks, then we should have a closer look.
Neither of those two work units is even listed on my Results Page... even with the Result Status filter set to 'All'.
cleanenergy said:
...they tend to cluster...
I tend to believe that's it, as I've done other CEP2 work units without problems and have received valid results on them. I'm currently working on a work unit that is over 54% with 2 checkpoints already.
It's a puzzle. I love puzzles. I've just set my device profile for CEP2 only in an attempt to sort this out. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
No ideas, but do you really want to run a lot of CEP2 units simultaneously? Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Simultaneously? No. I have my HT setting at 50% so I'm only crunching on one at a time.
But if you mean exclusively, then yes... just until I can determine whether or not it is my computer and/or settings. |
||
|
|
|