World Community Grid Forums
Thread Status: Active. Total posts in this thread: 98
RTorpey
Advanced Cruncher | Joined: Aug 24, 2005 | Post Count: 67 | Status: Offline
But that doesn't acknowledge the impact this has on other projects. If FAAH2 is always running at high priority, it pushes every other project to the back of the line. It's great that FAAH2 runs well, but what about people who participate in more than one project? The other projects now suffer because the scheduler can't forecast FAAH2's work properly.
deltavee
Ace Cruncher | Texas Hill Country | Joined: Nov 17, 2004 | Post Count: 4852 | Status: Offline
RTorpey wrote:
"But that doesn't acknowledge the impact this has on other projects. If FAAH2 is always running at high priority, it pushes every other project to the back of the line. It's great that FAAH2 runs well, but what about people who participate in more than one project? The other projects now suffer because the scheduler can't forecast FAAH2's work properly."

Why should it run at high priority? I've been running this alongside OET1 and haven't gone to high priority yet. It's just a matter of not keeping the cache too large.
4720 Yrs
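To put deltavee's point in concrete terms, here is a minimal sketch of the kind of deadline test that pushes a BOINC-style client into high-priority (EDF) mode. This is not the actual client code; the function, the simple feasibility check, and the example task lengths are assumptions for illustration only.

```python
# Rough illustration: high-priority mode kicks in when the queued work can no
# longer all finish before its deadlines at normal pace. Not real BOINC logic.

def needs_high_priority(tasks, cores, on_fraction=1.0):
    """tasks: list of (remaining_hours, hours_until_deadline) tuples."""
    tasks = sorted(tasks, key=lambda t: t[1])   # earliest deadline first
    busy_hours = 0.0
    for remaining, until_deadline in tasks:
        busy_hours += remaining / (cores * on_fraction)
        if busy_hours > until_deadline:
            return True      # some task would miss its deadline -> go EDF
    return False

# A modest cache of ~7 h tasks on 4 cores comfortably meets a 4-day deadline:
print(needs_high_priority([(7, 96)] * 8, cores=4))    # False
# A bloated cache of 120 such tasks cannot, so the client would panic:
print(needs_high_priority([(7, 96)] * 120, cores=4))  # True
```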
pvh513
Senior Cruncher | Joined: Feb 26, 2011 | Post Count: 260 | Status: Offline
I just got this WU:

FAH2_avx17287-ls_000085_0014_001_1 -- In Progress 10/5/15 17:59:07 10/7/15 03:35:06 9.03 / 0.00 77.7 / 0.0

Note the very short gap between sent time and return time: less than 34 hours! With this kind of return window, the job goes into high-priority mode the moment it is received, regardless of your queue settings. OK, this kind of WU seems to be the exception, but it is jumping the queue...
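A quick check of the numbers in that status line; this is only the arithmetic spelled out, using the timestamps from the post:

```python
# Verify the "less than 34 hours" window between sent time and due time.
from datetime import datetime

sent = datetime(2015, 10, 5, 17, 59, 7)
due  = datetime(2015, 10, 7, 3, 35, 6)
window_hours = (due - sent).total_seconds() / 3600
print(f"{window_hours:.1f} h to return the task")   # ~33.6 h

# With several hours of crunching still needed and any queued work sitting
# ahead of it, even a one-day cache makes this deadline look tight, so the
# client promotes the task immediately.
```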
KLiK
Master Cruncher | Croatia | Joined: Nov 13, 2006 | Post Count: 3108 | Status: Offline
Quoting an earlier post:
"Fact: at WCG the v7 client does -NOT- learn estimated run-times, i.e. it will not adjust them based on the client's own actual runtimes. They are fully controlled and adjusted by the WCG scheduler with each new assignment, based on the server's current validated runtime averages. deltavee has his explanation right on the button. It's irrelevant whether you complete part or all of an assignment: if a percent or the whole is not completed by the deadline minus N hours, the uncompleted part is packaged into a follow-on task and the slow-boat machine gets a cut-off instruction. This keeps the pace of progression from step 1 to step 3 million [or however many the scientists decide on] on track. That track is currently a theoretical -maximum- of ~120 days to reach step 3 million. Practically/statistically it will likely be sooner: when my host receives and returns 100K steps within 24 hours [which it does], that gains 3 days on the timeline. If that is then followed by a straggler that does nothing until, say, day 4, the sequence is at that point still on schedule."

Handing out 8 days' worth of WUs with a 5-day limit will put all of us on cut-off! That is the main problem for me now...
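The timeline figures in that quote work out as follows; this is just the arithmetic made explicit, with the 100K-step batch size and the 120-day maximum taken from the post and nothing else assumed:

```python
# Minimum pace needed to reach step 3 million within the ~120-day maximum,
# and the schedule gained by a host returning a 100K-step chunk in 24 hours.

total_steps   = 3_000_000
max_days      = 120
steps_per_day = total_steps / max_days       # 25,000 steps/day minimum pace
print(steps_per_day)

gain_days = 100_000 / steps_per_day - 1      # 4 days of budget done in 1 day
print(gain_days)                             # 3.0 days gained on the timeline
```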
deltavee
Ace Cruncher | Texas Hill Country | Joined: Nov 17, 2004 | Post Count: 4852 | Status: Offline
pvh513 wrote:
"I just got this WU: FAH2_avx17287-ls_000085_0014_001_1 -- In Progress 10/5/15 17:59:07 10/7/15 03:35:06 9.03 / 0.00 77.7 / 0.0. Note the very short gap between sent time and return time: less than 34 hours! With this kind of return window, the job goes into high-priority mode the moment it is received, regardless of your queue settings. OK, this kind of WU seems to be the exception, but it is jumping the queue..."

And you should only get a workunit like this if your computer is reliable and has been returning its workunits on time.
4720 Yrs
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
RTorpey wrote:
"But that doesn't acknowledge the impact this has on other projects. If FAAH2 is always running at high priority, it pushes every other project to the back of the line. It's great that FAAH2 runs well, but what about people who participate in more than one project? The other projects now suffer because the scheduler can't forecast FAAH2's work properly."

PLUS!, eventually, if a project is overworked, the client stops fetching jobs from it and another project gets its chance to catch up. Anyone who runs a buffer of less than half the shortest deadline among their projects (FAHB's standard deadline is 4 days, so 4 / 2 = under 2 days) will hardly see any HP processing. Those with a 'reliable' host will by definition be running a buffer under 2 days, since otherwise they would not receive the repair jobs, and those fast returners hardly care about jobs jumping the queue anyway. The client manages this quite well, except when micro-managers keep interfering with the FIFO/EDF scheduling.
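SekeRob's rule of thumb, written out as a one-liner; the helper is purely illustrative (the client does not literally compute this), and the 4-day deadline comes from the thread:

```python
# Keep the total buffer (minimum + additional days of work) under half the
# shortest deadline among the projects you run to stay clear of HP mode.

def buffer_is_safe(min_days, extra_days, shortest_deadline_days):
    return (min_days + extra_days) < shortest_deadline_days / 2

# FAHB-style 4-day deadline: anything under 2 days of cache is fine.
print(buffer_is_safe(0.5, 0.5, 4))   # True  -> little or no HP processing
print(buffer_is_safe(1.0, 2.0, 4))   # False -> expect deadline jumping / EDF
```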
nanoprobe
Master Cruncher | Classified | Joined: Aug 29, 2008 | Post Count: 2998 | Status: Offline
SekeRob wrote:
"PLUS!, eventually, if a project is overworked, the client stops fetching jobs from it and another project gets its chance to catch up. Anyone who runs a buffer of less than half the shortest deadline among their projects (FAHB's standard deadline is 4 days, so 4 / 2 = under 2 days) will hardly see any HP processing. Those with a 'reliable' host will by definition be running a buffer under 2 days, since otherwise they would not receive the repair jobs, and those fast returners hardly care about jobs jumping the queue anyway. The client manages this quite well, except when micro-managers keep interfering with the FIFO/EDF scheduling."

This appears not to be the case in every instance. I have a fast machine that was set with a half-day cache. It downloaded 120 tasks two days ago (10/4) with a 4-day deadline. Every task has been running at high priority since 10/4. I'm guessing that is because the client realized there is no way that machine will finish 120 tasks in 4 days. FWIW, I had the same issue on a second machine with a half-day cache that downloaded over 100 tasks on 10/1 with a 4-day deadline. They also all ran at high priority until the deadline, at which point those that hadn't started went to "no reply" status. I'm sure the same thing will happen again with the tasks that are due 10/8. Why I was sent so many tasks at one time with such a small cache setting needs to be addressed. It makes those machines look unreliable.
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
[Edit 3 times, last edit by nanoprobe at Oct 6, 2015 12:43:53 PM]
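To see why those 120 tasks immediately overwhelm a host, here is a back-of-the-envelope check. The core count and per-task runtime are assumptions (the post does not give them); the conclusion holds for any plausible values.

```python
# Can 120 FAAH2-style tasks finish within a 4-day deadline on one machine?

tasks          = 120
deadline_hours = 4 * 24       # 96 h
hours_per_task = 7            # assumed average runtime per result
cores          = 8            # assumed

wall_hours_needed = tasks * hours_per_task / cores
print(wall_hours_needed, wall_hours_needed <= deadline_hours)   # 105.0 False

# Even 8 dedicated cores fall short, so the client runs everything at high
# priority, and whatever never starts ends up as "no reply" at the deadline.
```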
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I'm seeing the same thing as nanoprobe with my 1-day cache and have several FAAH2 WUs running at high priority. I'm running a mix of OET1 and FAAH2. The OET1 units have some variability to them, and the client doesn't respond quickly to the changes in run times. Some FAAH2 units sit in the queue for a day or more because the client downloads as much as 32 hours of work for a 24-hour cache. The FAAH2 units then run as long as 31 or 32 hours on the Linux machines. I haven't seen the extreme download numbers that nanoprobe has, but that may be due to my mix of work.
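The overshoot described above falls out of the fact that work is granted in whole tasks. The figures below are illustrative only (the per-task estimate and queued hours are assumptions), but they show how a "24-hour" cache can end up holding 32 hours of work:

```python
# The client requests enough seconds to top up its cache, but the scheduler
# can only send whole tasks, so one long FAAH2 result overshoots the target.

cache_hours   = 24.0
queued_hours  = 20.0          # work already on hand (assumed)
task_estimate = 12.0          # assumed estimate for one FAAH2 task

shortfall = cache_hours - queued_hours                        # asks for ~4 h
tasks_granted = -(-shortfall // task_estimate)                # ceiling: 1 task
print(queued_hours + tasks_granted * task_estimate)           # 32.0 h queued
```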
KLiK
Master Cruncher | Croatia | Joined: Nov 13, 2006 | Post Count: 3108 | Status: Offline
Quoting deltavee's reply to pvh513's short-deadline workunit (FAH2_avx17287-ls_000085_0014_001_1, sent 10/5/15 17:59:07 and due 10/7/15 03:35:06):
"And you should only get a workunit like this if your computer is reliable and has been returning its workunits on time."

That's the main problem... most of our "devices" aren't reliable anymore! Why? Too many WUs handed out with too-short completion times!
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
Replying to nanoprobe's report above (120 tasks downloaded on a half-day cache):

Getting 100/120+ tasks on a half-day cache is a server-scheduler screw-up **, and yes, if the buffer is over half the deadline of all tasks [the sum of the TTCs], then all tasks will run HP. With v7 the client could initially try different tasks to see whether the real time is less, but it should stop trying once the pre-empted count has reached the number of active cores.

** We've seen more of these reports, and I've seen myself how the TTC drops like crazy from one task to the next and then doubles/triples and more. A flaw in the server scheduler logic, and it has been there for a while. The coq who's been dabbling the beak in the vin.

I don't know if a fetch can be capped client-side (the standard 2:01-minute deferral gives the client time to recompute the total buffer), but with these kinds of run-times it is not advisable to send a boatload. Up to WCG to fix this, e.g. send no more than the number of active threads or idle devices, then back off to let the buffer wheels whir. TTC estimates are maintained per app, so it would be really screwy if one affected the other. I'm on 7.6.3 on one machine and 7.6.9 on the other that does FAHB, and have been spared so far [or get the "not available for..."].

[Edit 1 time, last edit by SekeRob* at Oct 6, 2015 2:28:50 PM]
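One way to read that "no more than active threads" suggestion, sketched out; this is only an illustration of the proposal, not anything WCG's server actually implements, and the function name is made up:

```python
# Cap any single work assignment at the number of idle threads and let the
# normal request deferral top the cache up over several scheduler contacts.

def capped_assignment(tasks_wanted, idle_threads):
    """Grant at most one task per idle thread per scheduler request."""
    return min(tasks_wanted, idle_threads)

# Instead of 120 tasks in one go, an 8-thread host gets 8, re-requests after
# the standard ~2-minute deferral, and stops once its buffer target is met.
print(capped_assignment(120, 8))   # 8
```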