| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 3593
|
|
|
|
Speedy51
Veteran Cruncher New Zealand Joined: Nov 4, 2005 Post Count: 1326 Status: Offline Project Badges:
|
"Update: The logging in the script is not helpful. (I wrote it...) I'm adding more logging in places that it appears to have failed. I should be able to get to the bottom of it quickly I hope. Thanks, -Uplinger"
Thanks for the update, Keith. Don't be hard on yourself about the script logging :) |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Ok, figured out the issue. It will require a code change to help prevent it in the future.
So, as many of you are aware, we are limiting the number of results per hour that are sent out. The check on how many workunits to load is showing -183 at the moment. That is the expected behavior of the application right now: it means "load -183 workunits", so the application is not loading anything new. It gets this number because it checks the number of workunits running on the grid and waiting to be sent. We have about 200 results waiting for reliable hosts, but don't have hosts pulling those off the feeder. This is tough, because it is running slow.
I'm going to release the work units in question to regular hosts to get us over this bump, and work towards a permanent fix in the morning. It will require extra thinking because of the complexity of the estimator.
Anyways, I know that probably was the long version of, "I know the issue, I'll fix it later. For now, I'll put a bandaid on it."
Thanks, -Uplinger |
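The arithmetic described in this post can be sketched roughly as below. This is only an illustration under assumptions: the function name, the cap of 1000, and the in-progress count of 983 are made up so that the result matches the -183 reported above. The real loader and estimator are not shown in this thread and are certainly more involved.

```python
def workunits_to_load(target_cap, in_progress, waiting_to_send):
    """Hypothetical formula: how many new workunits a loader would add."""
    return target_cap - (in_progress + waiting_to_send)


# Made-up inputs chosen to reproduce the -183 mentioned in the post.
# The 200 stands for the results stuck waiting for reliable hosts.
shortfall = workunits_to_load(target_cap=1000, in_progress=983, waiting_to_send=200)
print(shortfall)          # -183
print(max(0, shortfall))  # 0, so nothing new gets loaded
```

In this sketch the stuck results count against the cap without ever leaving the feeder, so the number stays negative until they are released or picked up.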
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Thanks, Keith
|
||
|
|
RTorpey
Advanced Cruncher Joined: Aug 24, 2005 Post Count: 67 Status: Offline Project Badges:
|
Just had one WU show up!
|
||
|
|
TonyEllis
Senior Cruncher Australia Joined: Jul 9, 2008 Post Count: 286 Status: Offline Project Badges:
|
Received a few re-sends (_2) and a few new-work (_0) WUs
----------------------------------------
Run Time Stats https://grassmere-productions.no-ip.biz/
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Keith,
Your band-aid seems to be holding well. A WU I had waiting for a wingman has been sent out, and I received a couple of new WUs. Thank you.
I've been thinking about your statement that: "We have about 200 results waiting for reliable hosts, but don't have hosts pulling those off the feeder."
I have to say that I don't understand this. To be honest, I can't remember the exact definition of "reliable", but it surprises me that there aren't enough machines which have acquired that status asking for WUs. Is it just too early in the project, or has something gone wrong with the process which decides on that status? Or maybe you just need a fall-back plan for this situation which forgoes the reliable status check when the queue backs up too much?
I'm sure you've got better things to do than to indulge me with an answer, but I'm still curious. Maybe someone with more knowledge than me would like to speculate? |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Apis,
"I've been thinking about your statement that: We have about 200 results waiting for reliable hosts, but don't have hosts pulling those off the feeder. I have to say that I don't understand this."
My speculation is that those 200 results are resends (_2, _3, _4).
"it surprises me that there aren't enough machines which have acquired that status asking for WUs."
Again my speculation: there are enough (reliable) hosts, but something went wrong with the script that distributes the tasks. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
It's that 48 hour return thingie, one of the criteria for a machine to be reliable for a project. If you buffer these, bound to be not in. Got app_config set at 1 and profile at 2. One waiting with 29 hours runtime keeps me out. Got a _2 this time though with the full 7 day deadline. Think that was forced this round... some have reported not making even the 35%, 2 days 10 hours even when started immediately.
----------------------------------------[Edit 1 times, last edit by Former Member at Nov 26, 2019 11:19:04 AM] |
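For anyone checking the numbers in the post above: 35% of the full 7-day deadline comes to about 2 days 10 hours, which matches the figure quoted. A quick check, where the helper name is my own invention and the 35% and 7 days are taken from the post:

```python
from datetime import timedelta


def shortened_deadline(full_deadline, percent):
    """Return the given percentage of a deadline (integer maths, no rounding)."""
    return full_deadline * percent / 100


reduced = shortened_deadline(timedelta(days=7), 35)
print(reduced)  # 2 days, 10:48:00
```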
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
The definition of reliable seems much too tight. I have abandoned use of my laptop for ARP because each unit was taking more than 48 hours of crunching, and the laptop was only on 50% of the time, so each unit took more than 4 days to return. That machine had only 1 error, which was due to too many restarts.
My PC is an i7-3770 and has been taking 27 hours per unit, restricted to a maximum of 4 ARP units running and 12 waiting. It has not had any errors, but the 4.5 day turnaround would classify it as unreliable. Mike |
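The 4.5 day figure can be reproduced with a rough estimate. This sketch assumes tasks are worked strictly in the order received, every task takes the same 27 hours, and the running batch has only just started when a new task arrives. The function name is hypothetical; the inputs are the ones given in the post.

```python
import math


def worst_case_turnaround_hours(hours_per_task, running_slots, waiting_tasks):
    """Hours until the last waiting task finishes, processing in equal batches."""
    batches = math.ceil((running_slots + waiting_tasks) / running_slots)
    return batches * hours_per_task


hours = worst_case_turnaround_hours(hours_per_task=27, running_slots=4, waiting_tasks=12)
print(hours, hours / 24)  # 108 4.5
print(hours > 48)         # True, well outside a 48 hour return window
```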
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I think what gets me is that stuck resends can completely barf the feeder. I think there should be two queues, one for new work and one for resends. If a machine isn't suitable for a resend, then it should get a WU from the 'new' queue. If the 'new' queue is empty, then the 'committed to other platforms' message would become 'no new work'. If the resend queue gets too full, then the situation should be flagged to the admins -- we don't need to know.
Just my cogs turning over again. They need oiling. |
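The two-queue idea above could look something like this. It is purely a sketch of the suggestion, not how the actual feeder works: the class, method names, threshold, and task names are all invented for illustration.

```python
from collections import deque


class TwoQueueFeeder:
    """Hypothetical feeder with separate queues for resends and new work."""

    def __init__(self, resend_alert_threshold):
        self.resends = deque()
        self.new_work = deque()
        self.resend_alert_threshold = resend_alert_threshold

    def next_task(self, host_is_reliable):
        # Reliable hosts take resends first; other hosts skip past them.
        if host_is_reliable and self.resends:
            return self.resends.popleft()
        if self.new_work:
            return self.new_work.popleft()
        return None  # the caller would report "no new work"

    def needs_admin_attention(self):
        # Flag a backed-up resend queue to the admins only.
        return len(self.resends) > self.resend_alert_threshold


feeder = TwoQueueFeeder(resend_alert_threshold=2)
feeder.resends.extend(["wu_a_2", "wu_b_3"])
feeder.new_work.append("wu_c_0")
print(feeder.next_task(host_is_reliable=False))  # wu_c_0
print(feeder.next_task(host_is_reliable=False))  # None
print(feeder.next_task(host_is_reliable=True))   # wu_a_2
```

With this layout, resends that no reliable host is asking for stay in their own queue and do not stop new work from going out to everyone else.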
||
|
|
|