| World Community Grid Forums
Thread Status: Active Total posts in this thread: 28
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
You're not wrong, Sek, but I would argue that we ought to be able to take part in a project if at all possible. My slow machine takes about 48 hours to process a FAH2 WU, so why not let CEP2 WUs run that long? I know that there are arguments about the lack of checkpointing, but it would then be up to me to decide to opt out if I lose too much time to uncheckpointed failures. I can, however, imagine a scenario where someone would keep processing CEP2 WUs in the hope of getting enough that finish step 0 in under 18 hours so as to at least gain a bronze badge. That could involve a lot of wasted processing time.
Also, any machine which runs for more than about 8 hours without a restart is likely to run 24/7, isn't it? Why is 18 hours so special? But finally, whatever the credit granting process for production, I thought beta was supposed to work differently. Not, apparently, in this case.
----------------------------------------
[Edit 1 times, last edit by Former Member at Apr 19, 2016 4:00:53 PM] |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
I have an open question to the researchers about whether we can make the first step quicker or split it to increase the number of checkpoints. Depending on their response, we will then determine what we can do from our standpoint to help prevent this issue going forward.
Thanks, -Uplinger |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Thanks for listening, Keith! I look forward to reading about the response and the subsequent decision(s), even if they go against my ideas. I recognise that only you guys have the full version of the bigger picture, but it really does help make us feel needed if we're kept aware of what's going on.
|
||
|
|
Composer
Cruncher Joined: May 28, 2014 Post Count: 29 Status: Offline Project Badges:
|
I'm not one of the techs, but from what I understand it is simply the nature of the computation being performed. It is basically performing a simulation of the compound in question at the quantum mechanical level. (Note to anyone: if I am incorrect in this, please feel free to correct me.)
From what I understand, there are two reasons why it does not checkpoint very logically: 1) doing so at what might be deemed "convenient" times would require an inordinate amount of disk space, and 2) because of the way the calculations work, they cannot actually be stopped partway through at certain points. Something about how the math works out makes it impossible in many places. Again, I'm not entirely sure I know what I'm talking about, and if anyone has a better understanding of it, please feel free to correct me. |
||
|
|
Composer
Cruncher Joined: May 28, 2014 Post Count: 29 Status: Offline Project Badges:
|
Yeah, I have several old machines that I would love to run CEP2 on, but they will never finish in 18 hours. I get why they have the time limit, so that if a task goes off the deep end it will eventually end, but many computers just can't crunch fast enough to meet the deadline. Anything running a Pentium 4 definitely falls under that category, and I have a 3rd gen i3 and a 1st gen i5 that sometimes have trouble finishing. Heck, even my brand new computer with a 6th gen i7 sometimes wouldn't finish in under 18 hours (this was before the WU availability issue). It would be awesome if there was an option to override the WU timeout or something. Either in the device manager on the WCG end or in the BOINC manager, some way to manually extend the time limit. By default it should be set to not override, but if you wanted to you could change it.
By the way, what happens to unfinished WUs that don't throw an error? Do the CEP2 servers just finish what's left, or do they have to send the task to a new computer and start over? |
||
|
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 823 Status: Offline Project Badges:
|
"By the way, what happens to unfinished WUs that don't throw an error? Do the CEP2 servers just finish what's left, or do they have to send the task to a new computer and start over?"
I believe that if it hits the 18 hours without reaching the first checkpoint, and X number of wingmen have the same issue, then it is run on their own cluster without that 18 hour time limit. My assumption is that if it reaches at least one checkpoint but then hits the 18 hour limit, they have enough info to know whether it is a good candidate for making solar panels: if not, they stop there, and if it looks like a good candidate, they run it again (continue?) on their cluster. If anyone has more info than I do, please chime in :) |
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Just to be sure you're on the same page as others regarding "but many computers just cant crunch fast enough to meet the deadline." Are you talking about the 18 hour cut-off or the actual deadline by which a result has to be reported? In the latter case, the problem is member-made, caused by setting higher buffer values **. Remember, if you have a 2 day deep work buffer, after a while of crunching a task is already 2 days old before it is started, but since CEP2 has a 10 day deadline, that would not be an issue. Those who do have a 2+ day buffer would eventually not get any repair work, since their results are not regularly returned in under 48 hours, i.e. they do not get those tasks with the 3.5 day deadline.
** Devices that don't run 24/7 (part-time machines) are best off with no buffer at all... just compute, complete, report, and at that time let BOINC fetch a new task to occupy the core that is about to go idle. Any buffered task 'ages'... the deadline clock starts ticking the moment it is received. |
||
|
|
supdood
Senior Cruncher USA Joined: Aug 6, 2015 Post Count: 333 Status: Offline Project Badges:
|
"I have an open question to the researchers about seeing if we can get the first step to be quicker or split to help increase checkpoints. Depending on what they respond with, we will then determine what we can do from our standpoint to help prevent this issue going forward. Thanks, -Uplinger"
Thanks for checking into this. CEP2 is the reason I first joined WCG, but I haven't been able to crunch it at all (I run old laptops and shut them down at the end of the day, so I can't reach checkpoint 1 to get any science completed). It would be great to have shorter checkpoints up front and an extended max runtime. |
||
|
|
|