World Community Grid Forums
Thread Status: Active | Total posts in this thread: 9
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
One of my Win10 Pro x64 clients on BOINC 7.14.2 x64 just suddenly wigged out, and ETAs for all ARP1 tasks went from pretty accurate (i.e. 15-20 hours per task) to almost 8 days per task. As a result, the client is no longer requesting new tasks, saying "job cache full."
----------------------------------------
I manually went to Tools > Run CPU benchmarks hoping it would then re-adjust, but that didn't work. I waited for one task to complete, upload and report, thinking it would re-adjust, but nope. Other than restarting either the BOINC client or the computer itself, some questions:

1. What caused the estimates to go WAY out of whack?
2. What's the solution?
3. Is there a workaround in lieu of a solution?

Thanks! I don't want to restart the BOINC client or the computer because for this host ARP1 checkpoints occur every 2-3 hours.
[Edit 1 times, last edit by hchc at Nov 2, 2019 7:36:00 AM]
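As a rough illustration of why the numbers can jump like this (a simplified sketch only; the real BOINC client also applies per-project and per-app corrections): the client's initial estimate is roughly the task's estimated floating-point operations divided by the speed it believes the host delivers, so either a much larger fpops estimate from the server or a collapsed speed estimate turns a 15-20 hour ETA into one of several days. The figures below are invented for illustration.

    # Simplified sketch of how a BOINC-style client derives a task's initial ETA.
    # All numbers are illustrative, not read from this host.

    def estimated_runtime_hours(rsc_fpops_est, effective_flops):
        """Estimated wall time = estimated operations / estimated speed."""
        return rsc_fpops_est / effective_flops / 3600.0

    host_flops = 3.0e9  # speed the client believes one core delivers (FLOPS)

    # With an fpops estimate sized for ~17 hours, the ETA looks right:
    print(estimated_runtime_hours(1.8e14, host_flops))   # ~16.7 hours

    # If the server's fpops estimate is bumped ~11x (or the speed estimate
    # drops by the same factor), the identical formula yields nearly 8 days:
    print(estimated_runtime_hours(2.0e15, host_flops))   # ~185 hours, about 7.7 days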
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I can't really help, and I don't recall having this happen so blatantly to me, but I have had estimates double. Every time it happened it sorted itself out over time, but ARP1 tasks run so long that you may have to wait a day or two, I'm afraid.
I also seem to recollect that WCG stopped using the local benchmark data because it could be manipulated by the user, so they use feedback from returned results instead. This is another reason that things do, albeit slowly, adjust over time. I think you'll find that, if you're patient, the mess will clear up of its own accord. [Though I do wonder if there's an underlying bug which is still breeding in the shadows ...]
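To make the "it sorts itself out over time" point concrete, here is a minimal sketch (an illustration of the idea only, not WCG's actual server logic): each returned result nudges a correction factor toward the observed ratio of actual to estimated runtime, so with roughly one 17-hour ARP1 result per day per thread the inflated estimate comes down only over weeks.

    # Minimal sketch of feedback-based estimate correction (illustrative only;
    # not the actual WCG/BOINC algorithm).

    def update_correction(correction, estimated_h, actual_h, weight=0.1):
        """Nudge the correction factor toward the observed actual/estimated ratio."""
        observed = actual_h / estimated_h
        return (1 - weight) * correction + weight * observed

    correction = 1.0
    estimate_h = 188.0   # roughly 7 days 20 hours, the inflated estimate
    actual_h = 17.0      # a typical real ARP1 runtime on this host

    for n in range(1, 31):
        correction = update_correction(correction, estimate_h, actual_h)
        print(n, round(estimate_h * correction, 1), "hours")

    # After ~30 returned results the corrected estimate is still around a day,
    # which matches the slow convergence reported later in this thread.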
JmBoullier
Former Community Advisor | Normandy - France | Joined: Jan 26, 2007 | Post Count: 3716 | Status: Offline
hchc, from reading all your posts, you seem to interact with the server very often, and I am afraid you have triggered its "give-me-a-break" routine.
----------------------------------------
The last time I observed what you describe was when I was trying to process as many very short SCC WUs as I could. My fastest machine had its queue at its 70-WUs-per-thread maximum, so every time it returned one result it requested a new one. In practice that was about 30 times per hour because of the 2-minute delay. As Apis said, it will fix itself after several returned results. Unfortunately for you, several returned ARP1 results will probably take a looooong while.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Whatever you have in app_config.xml for ARP1, replace it with the snippet below, where nn is the number of tasks you want to run concurrently for this science (0 = unlimited).
----------------------------------------
Whatever you call it (ETA, ETC, RTC), the estimated time left will then adjust, most of the time fairly accurately.
[Edit 1 times, last edit by Former Member at Nov 2, 2019 11:53:45 AM]
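The XML of the snippet did not survive in the post above; a minimal sketch of what such an app_config.xml could look like, assuming the standard BOINC app_config.xml format and that arp1 is the short application name used by the project (the exact name can be checked in client_state.xml), is:

    <app_config>
        <app>
            <name>arp1</name>
            <max_concurrent>nn</max_concurrent>
            <fraction_done_exact/>
        </app>
    </app_config>

With <fraction_done_exact/> set, the client bases the remaining-time estimate on the fraction done reported by the science application rather than on its own projection, which is what makes the time left adjust as described above.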
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
I'm at the one month mark for this issue. The ETA for one ARP1 work unit has gone from 7 days, 20 hours to 1 day, 3 hours. The actual runtime for an ARP1 work unit is about 15-18 hours on this device.
----------------------------------------
1. How long must I wait for this to normalize?
2. What's the root cause of why this happened in the first place?
3. Is there a way to fix this on either the client side or server side?

Thanks.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Wait till you're blue in the face, or start by forcing a CPU benchmark run on the client and applying the fraction_done_exact change to app_config.
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
hchc, originally the estimated times were way too short, then I bumped them up, which caused hosts that already had tasks on hand to go out of sync. Then, while looking into a points issue, I saw that the estimated fpops sent to the clients weren't getting adjusted automatically. This was fixed about 1-2 weeks ago.
Next are the workunits themselves. They vary in runtime based on a few things inside the workunits; a major one that can swing a runtime is whether the simulation is rainy or dry. The fpops estimates are based on all workunits returned from all members as well.

Thanks,
-Uplinger
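As a rough illustration of that last point (a sketch of the idea only, not the actual WCG estimator): if the fpops estimate for new workunits is derived from results returned by all members, each result contributes an estimate of operations as elapsed time multiplied by the reporting host's measured speed, and the genuine spread between rainy and dry regions shows up as spread in that average.

    # Rough sketch: deriving a per-workunit fpops estimate from returned results
    # (illustrative only; invented numbers, not the actual WCG server logic).

    results = [
        # (elapsed_seconds, host_speed_flops) for returned ARP1 results
        (15 * 3600, 3.0e9),   # drier region, faster host
        (18 * 3600, 2.8e9),
        (20 * 3600, 2.5e9),   # rainier region, slower host
    ]

    fpops_samples = [secs * flops for secs, flops in results]
    fpops_est = sum(fpops_samples) / len(fpops_samples)
    print("estimated fpops per workunit: %.2e" % fpops_est)   # ~1.7e14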
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
Thanks uplinger, but that still doesn't explain why new work units received on this device -- even today 12/3/19 -- are still showing an estimate of 1 day 1 hour to complete on this device. It's taken quite a few completed work units to bring it down from 7 days 20 hours to this. I'm wondering if there's something I can delete in client_state.xml or somewhere to fix this?
----------------------------------------
Edited to Add: Speak of the devil, as of 9:45 AM CST or so (within the last 15 minutes), the estimate for a new work unit is 14 hours 15 minutes, which seems normal. Earlier this morning it said it would take 1 day 3 hours. It took about a month, but I think this machine is finally accurate!
[Edit 1 times, last edit by hchc at Dec 3, 2019 3:58:40 PM]
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1407 | Status: Offline
"A major one that can swing a runtime is if it is rainy in the simulation or dry." -Uplinger

Interesting. Although checkpoints are set at every 12.5% of progress (a 6-hour period of data), I have already noticed different run times between the checkpoints. The difference between the shortest and longest run time between checkpoints is almost 50%.

The longest periods are during the processing of the data from 06 UTC - 12 UTC (twice in 48 hrs), i.e. the second and sixth pass; the shortest periods are between 18 UTC - 00 UTC, which are the periods between the 3rd and 4th checkpoint and between the 7th and last checkpoint.

This may vary in future when other areas and months are processed, but for now I would not be surprised, since the current batch is mainly grid data from the Kenya area.
[Edit 3 times, last edit by Crystal Pellet at Dec 3, 2019 5:07:16 PM]
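For anyone who wants to check the same thing on their own host, a small sketch (the checkpoint times below are invented; in practice they can be read from the Event Log, e.g. with the checkpoint_debug log flag enabled in cc_config.xml):

    # Sketch: spread between ARP1 checkpoint-to-checkpoint run times.
    # The values are invented: hours of run time at task start and at each of
    # the eight checkpoints (every 12.5% of progress).

    checkpoint_hours = [0.0, 2.0, 4.9, 7.0, 8.9, 11.1, 13.9, 16.2, 18.1]

    intervals = [b - a for a, b in zip(checkpoint_hours, checkpoint_hours[1:])]
    shortest, longest = min(intervals), max(intervals)
    print("intervals (h):", [round(i, 1) for i in intervals])
    print("longest is %.0f%% longer than shortest" % ((longest / shortest - 1) * 100))
    # With these example values the longest intervals (the 2nd and 6th passes)
    # are roughly 50% longer than the shortest, matching the observation above.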