World Community Grid Forums
Thread Status: Active. Total posts in this thread: 30
Mysteron347
Senior Cruncher | Australia | Joined: Apr 28, 2007 | Post Count: 179 | Status: Offline
Here's my experience from this afternoon, running HCMD2 on a dual-core under XP with 6.10.56.

Was running a grandchild task from batch 875 and a child from 866, both due 31/10. The child task (866) finished at 9hr+ CPU.
* Grandchild (875) went to Waiting.
* Two child tasks, from 866 and 867, with due date 2/11 started.
** These were the earliest-alphabetically-by-name in the cache.
* A POEM task started for no apparent reason (due 30/10, expected run-time ~2hr) and the 867 went to Waiting.
* After a few seconds, with the first stanza written, the POEM task was set to Waiting and a NEW HCMD2 task commenced, again a child task, this time the lowest-alphabetically-by-name-but-never-yet-started, also due 2/11.
** This was coincident with AVAST! squawking about a trojan on a web page.

OK - so I'm not using the latest version - I'm downloading that as I type, and I'll install it at the next cold-boot opportunity.
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Et Al,
----------------------------------------
There are various ways, if you wish to insist on having these 4-day-deadline "Repair" jobs skip ahead, to force that in an automated way. The below is tested and works for 6.10 and the currently developed 6.12:

a) "Connect about every" set to 2 days
b) "Additional work buffer" at 0.0 days (additive to the Connect About Every)
c) "Switch between applications every" set to 1440 minutes (1 day)

This combination makes the client think that these tasks will not be finished in time unless started immediately. Once started, the "switch between applications" setting makes sure they complete in one run. A different combo that works too, same options as above:

a) 0.00 days
b) 1.00 days
c) 4500 minutes (just over 3 days)

Again, including the client safeties that cannot be changed, the client is tricked into processing the short-deadline tasks first. Such tasks skipped ahead in the queue will process in High Priority and will stop all project work fetching if all cores are processing tasks in that status.

This is resource-share impacted, meaning that in the example above where POEM is mentioned, with POEM having a small weight and a short deadline, the client will compute that its share of processing time is not enough unless such a task is started immediately. This cycle will continue until the client scheduler determines that all will be processed in time, but particularly on multi-core devices, backfilling of the buffer by one or the other project will continue to push these short-deadline jobs ahead. To the uninitiated it will of course start to look chaotic if there are more short-deadline jobs in the task queue than there are cores (I last had 15 of them waiting on my quad). Then the client will at times flip the rush tasks to pure Earliest Deadline First and allocate processing time accordingly. Those who only contribute to WCG would rarely see many tasks in the "Waiting to run" state, though. Don't panic and let the client deal with this.
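For those who prefer a file over the web preferences: the three settings above correspond to tags in a local global_prefs_override.xml (tag names per BOINC's preferences schema; this is a sketch, so verify against your client version). The first combination would look roughly like:

```xml
<global_preferences>
    <!-- a) "Connect about every" (days) -->
    <work_buf_min_days>2.0</work_buf_min_days>
    <!-- b) "Additional work buffer" (days) -->
    <work_buf_additional_days>0.0</work_buf_additional_days>
    <!-- c) "Switch between applications every" (minutes) -->
    <cpu_scheduling_period_minutes>1440</cpu_scheduling_period_minutes>
</global_preferences>
```

Save it in the BOINC data directory and have the manager read the local prefs file (menu wording may differ by version).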
It does not like to be watched, and as long as it delivers results in time, all is fine.

As for oldest-by-name: it is not relevant, as work does not come purely sequentially out of the feeders. Receipt-time order and deadline are the controlling factors for which task gets processed when, in Round Robin or Earliest Deadline First. BOINC likes to learn and be allowed to learn, so when it gets a short-deadline task and has not had one recently, it starts to run it for a little while to find out what the REAL estimated run time is.

Oh, and if work regularly needs more than 2 days to complete, the rush jobs will stop coming from WCG... Beta work will still come, though - it could not care less about any settings, as long as the "connect about every" time is less than the deadline... usually 4 days, sometimes shorter, sometimes longer.

In a nutshell... really, in 99.9x% of cases the client knows better... and please, if you make a change to the scheduling/work-buffering prefs, it will again take longer for the client to find the right balance... let it chug away. Happy crunching.
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Oct 26, 2010 11:51:30 AM]
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Take heed of what one wishes for. The last 5 tasks were all CEP2 repair jobs, and of course the quad went into complete High Priority processing mode. Of course we (me) don't want 4 CEP2 tasks running concurrently, so I suspended 3 and will release them as and when the first 2 have finished. Note that suspending a task for a grid will stop all work fetching for that grid too (clients 5.8 and above).
WCG
Please help to make the Forums an enjoyable experience for All!
Mysteron347
Senior Cruncher | Australia | Joined: Apr 28, 2007 | Post Count: 179 | Status: Offline
> There are various ways, if you wish to insist on having these 4 day deadline "Repair" jobs skip ahead, to force that in an automated way.

No, I don't so insist. What has a "4 day deadline" got to do with it? Today is October 26th. The earliest-due WCG job has a deadline of Oct. 31st - that's 5 days away. The jobs currently being run are not due until Nov. 2nd - that's 7 days away. Where does "4 days" come into it?

Why are the jobs due in 7 days running before the FOUR jobs due on Oct. 31st? In fact, a job due on Oct. 31st has been stopped in favour of jobs due two days later, and the other three haven't been started. That's ignoring the FOURTEEN jobs due on Nov. 1st, and the jobs currently running also have a due date LATER than 5 jobs due on Nov. 2nd. That's a total of TWENTY-THREE jobs that I have with EARLIER deadlines than the ones currently running.

> Again, including the client safeties that cannot be changed, the client is tricked into processing the short deadline tasks first. Such tasks skipped ahead in the queue will process in High Priority and will stop all project work fetching if all cores are processing tasks in that status.

The tasks currently being run are not 'short deadline tasks'. They have a seven-day deadline, and have probably been in the queue for days as well. From their NAMEs I can tell that they're from an earlier batch - but they're only child tasks, from recently-tackled batches. There seems no reason for them to be skipped ahead.

[POEM]: I can't see that a 2-hour (estimated) POEM job should make the system panic in this way. It's at a very small resource share.

> (Last had 15 of them waiting on my quad). Then the client will at times flip the rush tasks to pure Earliest Deadline First and allocate processing time.

Again, this is NOT executing 'earliest deadline first'.

> It does not like to be watched and as long as it delivers Results in time, all is fine.

Yes - I know how it feels. I don't like being watched myself.
I've no doubt that all of the jobs will be completed on time - just that deliberately running jobs with a later due date ahead of those with an earlier deadline doesn't seem to be a particularly safe strategy...

> As for oldest by name... is not relevant as work does not come purely sequentially out of the feeders. Receipt time order and deadline are the controlling factors for which task gets processed when, in Round Robin or Earliest Deadline First. BOINC likes to learn and be allowed to learn, so when it gets a short deadline task and has not had one recently it starts to run it for a little to find out what the REAL estimated run time is.

Nice theory. All I can do is report what I observe - and the only common factor that I can see is that the job NAMES that are unexpectedly being run earlier are the earliest alphabetically. This is not receipt-time order; I've repeatedly said it is NOT earliest-deadline-first, nor do the running jobs have a short deadline. As for experimenting with unfamiliar tasks, it's had a steady diet of HCMD2 units for eighteen months now - 24/7, with the odd excursion to write the odd POEM etc. for light relief.

> Oh, and if work regularly needs more than 2 days to complete, the rush jobs will stop coming from WCG... Beta work will still come though... it could not care less about any settings, as long as the "connect about every" time is less than the deadline... usually 4 days, sometimes shorter, sometimes longer.

It often happens that work takes more than 2 days to complete. These are NOT rush jobs - they do NOT have a 4-day deadline. And I don't run Beta.

> In a nutshell... really, in 99.9x% of cases the client knows better... and please, if you make a change to the scheduling/work buffering prefs it will again take longer for the client to find the right balance... let it chug away. Happy crunching.

I haven't changed any preferences in months, so BOINC should have got used to the routine.
It doesn't really worry me - the work will get done, and I'll only intervene should BOINC decide to go on to Nov. 3rd jobs before it's completed the October batch.

I've updated to 6.10.58, which had no apparent curative effect - in fact, having run two due-Nov-2nd tasks for a further couple of hours, it's now decided to start a completely different due-Nov-2nd task, ignoring the 21 earlier-deadline tasks. This latest usurper is a child from batch 885. Not likely to be a rush job - and not even the earliest alphabetically, either...

I'm just reporting what it's doing, which seems to concur with the original commentary in this thread...
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
It was NOT a reply to you (I will clarify that with an insert of "Et Al" in that post), but rather a sharing of what CAN be done, with some samples. Only the third-from-last paragraph of my post relates to your post specifically ;>)
----------------------------------------
edit: One note for Mysteron347 and anybody else who's still confused over the sequence of task execution: if there is no High Priority processing, the client will principally process tasks in order of receipt, per active grid connected to the client. It's called "First In, First Out", aka FIFO. The order of receipt can be seen on the My Grid > Result Status page by selecting the Sent Time header, which then sorts in Newest-to-Oldest order.

If you think things are broken or not operating as desired, you may wish to take your observations to the Questions and Problems forum over at the BOINC developers'. There the collective knowledge and insight is much deeper. (Sorry if I wrote this before, which I probably did.)
WCG
Please help to make the Forums an enjoyable experience for All!
Mysteron347
Senior Cruncher | Australia | Joined: Apr 28, 2007 | Post Count: 179 | Status: Offline
I'll head off using the link as you suggest.
Both the currently-running tasks and the usurped tasks are on standard 10-day deadlines, so the sent times are exactly 10 days behind the due times. There is no apparent question of 'rush jobs', and there seems to be no logical reason why I should be observing this behaviour - it's counter-intuitive. But I also note you said 'principally'.
Ingleside
Veteran Cruncher | Norway | Joined: Nov 19, 2005 | Post Count: 974 | Status: Offline
> [POEM]: I can't see that a 2-hour (estimated) POEM job should make the system panic in this way. It's at a very small resource share.

This part at least is easily explained. Now, you didn't say your resource share for POEM, but with a 50% resource share, the BOINC client wants to run 1 hour of POEM and 1 hour of something else. This means a task with 2 hours estimated run-time has 4 hours estimated "real time" until finished. If POEM has a 25% resource share, you'll run 1 hour POEM and 3 hours other projects, meaning your 2 hours has an estimated "real time" of 8 hours. At 10% resource share, 1 hour POEM + 9 hours other projects, and your 2 hours has an estimated "real time" of 20 hours. At 1% resource share, 1 hour POEM + 99 hours other projects... and your 2 hours has an estimated "real time" of 200 hours...

But 200 hours is 8.33 days, and if it's less than 8.33 days until the deadline, then a POEM task running with a 1% resource share must run in High Priority to be finished early enough.

As for running tasks in a "strange" order: assume the client runs only a single project, and to make a very easy example, let's say you've got a single-core computer and each task has an expected run-time of 12 hours. The client checks the tasks in earliest-deadline-first order. Let's say it's like this:

#1: 12 hours left to run, 3.0 days until deadline. In this order, 0.5 days total run-time.
#2: 12 hours left to run, 3.1 days until deadline. In this order, 1.0 days total run-time.
#3: 12 hours left to run, 3.2 days until deadline. In this order, 1.5 days total run-time.
#4: 12 hours left to run, 3.3 days until deadline. In this order, 2.0 days total run-time.
#5: 12 hours left to run, 3.4 days until deadline. In this order, 2.5 days total run-time.
#6: 12 hours left to run, 4.0 days until deadline. In this order, 3.0 days total run-time.
#7: 12 hours left to run, 4.1 days until deadline. In this order, 3.5 days total run-time.
#8: 12 hours left to run, 4.2 days until deadline. In this order, 4.0 days total run-time.
#9: 12 hours left to run, 4.3 days until deadline. In this order, 4.5 days total run-time. Task #9 is in deadline trouble: 4.5 days > 4.3 days until deadline.
#10: 12 hours left to run, 7.0 days until deadline. ...

In this example, tasks #1 - #8 aren't currently in deadline trouble, but #9 is, so #9 will be started and run in High Priority. After #9 has run for some time in HP, task #8 will likely run HP, since by then it has too little time left until its deadline...

While this example ** is very easy, it should hopefully at least show why it's one of the last tasks that runs HP, instead of the client running in EDF order ***, or for that matter in download order.

** In practice this is a really bad example, since one of the tasks will always miss its deadline...

*** The reason not to run EDF, except when really necessary, but instead to select the first task that is in deadline trouble, is to guard against users having, for example, a dual-core with a single 1-month CPDN model at a low resource share (even 50% will be 'low' in this regard) with 2 months until deadline. If the client ran all tasks EDF, only the small tasks with shorter deadlines would run, until there was maybe only 1 week left until the deadline for the CPDN task - and of course doing 30 days of crunching in 1 week is impossible... By running the CPDN task in HP, chances are all work can be returned by its deadline.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
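The walk-through above can be sketched in a few lines of Python. This is an illustration of the check as described in this thread, not BOINC's actual source code:

```python
def first_in_deadline_trouble(tasks):
    """tasks: list of (hours_left, days_until_deadline), sorted by deadline.

    Returns the 1-based index of the first task that would miss its own
    deadline when the tasks run back to back on a single core, or None."""
    elapsed_days = 0.0
    for i, (hours_left, deadline_days) in enumerate(tasks, start=1):
        elapsed_days += hours_left / 24.0      # this task finishes here
        if elapsed_days > deadline_days:       # finishes after its deadline
            return i
    return None

# Tasks #1-#10 from the example above: 12 hours each, deadlines in days.
tasks = ([(12, 3.0 + 0.1 * k) for k in range(5)]    # #1-#5: 3.0 ... 3.4 days
         + [(12, 4.0 + 0.1 * k) for k in range(4)]  # #6-#9: 4.0 ... 4.3 days
         + [(12, 7.0)])                             # #10
print(first_in_deadline_trouble(tasks))  # -> 9 (task #9 runs High Priority)
```

Tasks #1 - #8 each finish before their deadlines in this sequence; task #9 is the first whose cumulative run-time (4.5 days) exceeds its deadline (4.3 days), so it is the one flagged.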
Mysteron347
Senior Cruncher | Australia | Joined: Apr 28, 2007 | Post Count: 179 | Status: Offline
That makes sense, Ingleside.
I have POEM at 1.55% (look, I said low - I mean low). The task took just over an hour and is now completed.

This policy means that low-share projects will push ahead of large-share ones, because BOINC will massively overestimate their time to completion. It also neatly explains why the odd order of execution has been invoked.

As further confirmation, the phenomenon was repeated a few minutes ago - a particularly 'chewy' unit (HCMD2 with nearly 10 hrs CPU / 10.5 elapsed) caused the companion unit to be set to Waiting and started two later-due units in HP, as if following your script.

It also seems that the estimated run-time for each queued unit has been unexpectedly recalculated from its original level of around 6 hours to 10hrs+ per unit.

Gold star to Ingleside, please!
Ingleside
Veteran Cruncher | Norway | Joined: Nov 19, 2005 | Post Count: 974 | Status: Offline
> It also seems that the estimated run-time for each queued unit has been unexpectedly recalculated from its original level of around 6 hours to 10hrs+ for each unit.

Each time a task finishes, the Duration Correction Factor (DCF) is updated. This is a measure of how "bad" the server estimates + benchmark are compared to the actual crunch times. You can see your current DCF if you select the WCG project on the Projects tab and hit "Properties". Especially if you run a mixture of WCG projects, the DCF for WCG can vary a lot, as some tasks take much longer than the initial estimate. Nearly a doubling of the DCF isn't uncommon, and it means all tasks are suddenly expected to take nearly double the time to run... DCF increases fast, but decreases slowly (max 10% down in one step).

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
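A rough sketch of that asymmetric update rule, matching the "increases fast, max 10% down per step" behaviour described above (a simplification for illustration, not BOINC's exact code):

```python
def update_dcf(dcf, server_estimate_hours, actual_hours):
    """Asymmetric DCF update: increases take effect at once, while
    decreases are capped at 10% per completed task."""
    ratio = actual_hours / server_estimate_hours
    if ratio > dcf:
        return ratio                 # estimate was too low: jump up fast
    return max(ratio, dcf * 0.9)     # estimate was too high: ease down slowly

dcf = 1.0
dcf = update_dcf(dcf, 6.0, 10.0)  # a 6h estimate ran 10h: DCF jumps to ~1.67
dcf = update_dcf(dcf, 6.0, 6.0)   # an accurate task: DCF only eases to 1.5
print(round(dcf, 2))  # -> 1.5
```

This is why one long-running HCMD2 result can push the estimates for every queued unit from ~6 hours to 10hrs+ at a stroke, while it takes many accurate results to bring them back down.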
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
How good or "bad" the FPOPS estimates are is a function of the variability of tasks whose nature is largely "non-deterministic"; plus, BOINC was never designed to track DCF per science within an umbrella project such as WCG. There's apparently a special build of 6.10.58 out somewhere that does it, though, and I'm hoping this code is good enough to be passed back into an official BOINC build. It would remove a major cache-sizing issue for any host that runs with more than a couple of days in the buffer.
----------------------------------------
POEM = 1.55%... yup, that does explain it. The smaller the share and the shorter the deadline, the more you see this (in appearance) "outlandish" behaviour; and to add, the client continues to pull work and process it in HP until the project is "overworked". The most hateful flaw in the scheduler, I think, is that it at times tries to keep work for all cores for a small-share project, so one observes silly requests such as "1 second, 4 CPUs", and suddenly there's so much work that the client cannot but determine that a race condition is on. Have 20 active projects attached to a client (which, if all are on the default share of 100, effectively get 5% each), increase the cache by a day, and you'll be seeing the real fun... don't watch.

Glad it's clarified, Mysteron347.
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Oct 27, 2010 7:36:59 AM]