World Community Grid Forums
Category: Beta Testing | Forum: Beta Test Support Forum | Thread: Clean Energy Project - Phase 2 Beta (Feb 24, 2016) [ Issues Thread ]
Thread Status: Active | Total posts in this thread: 114
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7581 Status: Recently Active
Quote:
Surely there must be a better way than to just throw 18 hours of work at a time away and give the client no credit? Can't we let the client run until it finishes the first job?

That would work if it is only a "long" WU. If the program is in an endless loop or is on a diverging path, it would never end. The 18 hour limit is there as a stopgap measure in those cases. I believe the 18 hour limit was also put in place because the scientists figured this would be sufficient for most of the machines to return a meaningful result most of the time. The fact remains there are some systems which are going to be too slow for this project, but I presume they are few and far between. There are also some molecules which are too big for even the fastest consumer-grade machines, and these would then get kicked over to the scientists' own workstation cluster.
Cheers
Sgt. Joe
*Minnesota Crunchers*
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
So the project was made explicitly opt-in for various technical reasons, yet once the opt-in is done and the comp is not hacking it, this stream of extension requests continues. Just why can't we accept that this is it, after half a decade of running?
----------------------------------------[Edit 2 times, last edit by SekeRob* at Mar 9, 2016 10:18:58 AM]
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Sek, I think you might (just might) be being a bit too simplistic. Some things have changed.

For a start, this is a beta, so presumably new molecules (or molecule types?) are being explored and the boundaries are being pushed. We don't know if it's the type of molecule that takes a lot of processing, or if the algorithm is not converging well (or whatever it does) with these new molecules. I also think the processing-power spread of user computers is changing, as clock speeds get turned down and CPU counts go up so as to consume less electricity.

As I said in an earlier post, a time limit is a very crude brake. I personally think it would be better to use the number of actual processing steps, but that might need too many changes, or too long a processing time on the slowest machines, to be practical.

At the end of the day we should all just let the scientists and the techies do their bit and decide the right way to go. But I don't see anything wrong with crunchers expressing opinions. What we say may influence decisions, and that's how it should be.
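A step-count brake of the kind suggested above could be sketched roughly like this (hypothetical, not WCG code; the step cap and the trivial workload are invented, only the 18-hour figure comes from this thread):

```python
# Hypothetical sketch: braking on processing steps instead of wall-clock
# time. A step count is deterministic across machines; elapsed time is not.
import time

MAX_STEPS = 1_000_000      # invented cap; not an actual CEP2 parameter
MAX_SECONDS = 18 * 3600    # the 18-hour limit discussed in this thread

def run_task(step_fn):
    start = time.monotonic()
    for step in range(MAX_STEPS):
        step_fn(step)
        if time.monotonic() - start > MAX_SECONDS:
            # Slow host: where it stops depends on hardware speed.
            return ("killed_on_time", step)
    # Step cap reached: every host stops at the same point in the work.
    return ("finished", MAX_STEPS)

status, steps_done = run_task(lambda s: None)  # a trivially fast workload
```

The trade-off mentioned above shows up directly: the slowest machines would simply take however long MAX_STEPS takes them, which may be impractical.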
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
You slam-dunk rephrased my "too simple" perfectly in just these few words: "let the scientists and techies do their bit and decide the right way." They have the statistics from thousands of results and the [new/adapted] goals set. Accept it.
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
Sgt. Joe, the endless-loop prevention is actually built into all project setup files via the <rsc_fpops_bound> parameter. Usually WCG has it set at 10x the estimated fpops of the current distribution. Given the high variability of task duration, this setting could actually lead to a premature kill / max_time_exceeded. Suppose the current mean were 1.5 hours... then the task could be killed at 15 hours on an average-performing device. I have no CEP2 production or beta task on hand to see what the limit is set at, but probably the app was hard-coded to have them die at 18 hours no matter the device speed.
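The arithmetic behind that premature-kill scenario, with illustrative numbers (the 10x margin and the 1.5-hour mean come from the post above; the device speed is invented):

```python
# Illustrative arithmetic for the 10x <rsc_fpops_bound> margin described
# above. The device speed is an assumption, not an actual CEP2 setting.
est_runtime_hours = 1.5               # the hypothetical current mean
device_flops = 4.0e9                  # invented average device speed (flop/s)

rsc_fpops_est = est_runtime_hours * 3600 * device_flops
rsc_fpops_bound = 10 * rsc_fpops_est  # WCG's usual 10x safety margin

# On an average device the bound trips at 10x the mean runtime:
kill_hours = rsc_fpops_bound / device_flops / 3600
# i.e. 15 hours, below the 18-hour hard cap, so a genuinely long task
# could be killed early by the fpops bound alone.
```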
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7581 Status: Recently Active
Quote:
Sgt. Joe, the endless-loop prevention is actually built into all project setup files via the <rsc_fpops_bound> parameter.

Thanks for the clarification. The scenario you give could certainly occur, but it would be hard to envision the current mean dropping to such a low level. I believe you are correct that the 18 hour limit is hard-coded into the program, thus a fail-safe mechanism against a runaway condition. That some WUs hit this limiter and do not finish even Job 0 is unfortunate, but that is the nature of basic research. Even the failures impart some knowledge to the researchers.
Cheers
Sgt. Joe
*Minnesota Crunchers*
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Quote:
the endless-loop prevention is actually built into all project setup files via the <rsc_fpops_bound> parameter

Quote:
probably the app was hard-coded to have them die at 18 hours no matter the device speed

Well, I can only assume they know something we don't, as this seems weird!
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
Not in the least, as apps can be configured to listen or not listen for a range of parameters. As said, I don't have a CEP2 task on a system, but I would not be surprised if it is set to the WCG standard 10x, with the logic working along the lines of a second safety: die at 18 hours, and otherwise die at max_fpops_bound. I can't remember ever having read about one that went past the 18 hours, or past 12 back when it was set to a 12 hour limit.
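That "second safety" logic could be sketched as follows (hypothetical; the 18-hour cap and the fpops-bound idea come from this thread, the function and the numbers in the usage lines are invented):

```python
# Hypothetical sketch of the dual-limit "second safety" described above:
# a task dies at whichever limit trips first, the hard-coded hour cap
# or the <rsc_fpops_bound> budget. Not actual WCG/CEP2 code.
def abort_reason(elapsed_hours, fpops_done, fpops_bound, hour_cap=18):
    if elapsed_hours >= hour_cap:
        return "max_time_exceeded (hard-coded hour cap)"
    if fpops_done >= fpops_bound:
        return "max_time_exceeded (fpops bound)"
    return None  # keep crunching

# A fast host can burn through its fpops budget before 18 hours;
# a slow host hits the hour cap first. Numbers are illustrative.
fast = abort_reason(elapsed_hours=6, fpops_done=2e16, fpops_bound=1e16)
slow = abort_reason(elapsed_hours=18, fpops_done=3e15, fpops_bound=1e16)
```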
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I say it's weird because I don't know of any algorithm which is time dependent. If it does a thousand cycles on a slow machine and is killed after a time limit, why should I let it do a million on a faster machine? What would that buy me?
But maybe they know something I don't.
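The hardware-dependence point above can be made with a bit of arithmetic (the host speeds are made up; only the 18-hour cap is from this thread):

```python
# Why a wall-clock limit gives hardware-dependent results: under the
# same 18-hour cap, hosts of different speeds complete very different
# numbers of cycles. The speeds below are invented for illustration.
CAP_SECONDS = 18 * 3600

def cycles_before_cap(cycles_per_second):
    return cycles_per_second * CAP_SECONDS

slow_host = cycles_before_cap(10)        # 648,000 cycles
fast_host = cycles_before_cap(100_000)   # 6,480,000,000 cycles
# A step-count limit would cut both hosts off at the same point in the
# computation; a time limit cuts them off at very different points.
```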
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
Actually you do know the answer, which is not technical.