World Community Grid Forums
Thread Status: Active | Total posts in this thread: 3593
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1407 | Status: Offline
It's that 48-hour return thingie, one of the criteria for a machine to be considered reliable for a project. Reporting a multi-day average of 15 consecutive good results is the other criterion, if that's still valid. With so few tasks, and such long-running ones, one can imagine that there are no reliable hosts at all.
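For readers unfamiliar with the two criteria mentioned above, here is a minimal sketch of how such a check could look. The field names and the exact thresholds are assumptions taken from this post, not the actual WCG/BOINC server code.

```python
# Minimal sketch of the two "reliable host" criteria described above.
# Field names and exact thresholds are assumptions, not actual WCG/BOINC server code.

def is_reliable(avg_turnaround_hours: float, consecutive_valid_results: int) -> bool:
    """True only if the host meets both criteria from the post above."""
    returns_fast_enough = avg_turnaround_hours <= 48          # 48-hour return criterion
    enough_good_results = consecutive_valid_results >= 15     # 15 consecutive good results
    return returns_fast_enough and enough_good_results

# A machine averaging 4.5 days (108 hours) per result never qualifies,
# no matter how many valid results it has returned in a row.
print(is_reliable(avg_turnaround_hours=108, consecutive_valid_results=40))  # False
```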
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The definition of reliable seems much too tight. I have abandoned using my laptop for ARP because each unit was taking more than 48 hours of crunching, and the machine is only on 50% of the time, so results were taking more than 4 days to return. It had only 1 error, which was due to too many restarts. My PC is an i7-3770, and that has been taking 27 hours per unit, restricted to a maximum of 4 ARP tasks running and 12 waiting. That has not had any errors, but the 4.5-day turnaround would classify it as unreliable.

Mike

I think they can make a general project-level exception like is/was done on HSTb, which has a repair deadline the same as the original for the _0 copy. A variation in the percentage does not seem to be possible; 35% now? I've read in the past about there having been 30, 35 and 40%. The tighter the number, the quicker a forced turnaround and batch completion.

[Edited 1 time, last edit by Former Member at Nov 26, 2019 1:13:52 PM]
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
The main problem is not the criteria. They may not have been sending enough work units out to find the reliable machines. A lot have gone to people who just push the Update button often enough instead.
5TEVE
Cruncher | Joined: Sep 4, 2006 | Post Count: 34 | Status: Offline
Been getting some resend ARP WUs this morning: 12 across 4 boxes so far...
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
Good Morning,
OK, so a few things to note: this is a multi-part process. We have the indexer, createWork and the feeder.

The indexer checks to see what we have that is not in the database yet, and places it in an indexed state when found. We usually like to keep about 10 days' worth indexed for most projects. This project is a bit different, but we have plenty indexed at the moment (even some from generation 001).

createWork is what loads the work into the BOINC databases. Since BOINC requires XML to be stored in the database, we usually keep this buffer at about 48 hours to help keep the database quick. But again, for this project we do something a bit different: we load a set number of results per half hour to artificially slow the project down. (Again, we hope to get this up to full speed in the future, which would be 30k+ workunits in the wild at one time, or 60k+ results given the redundancy.)

The feeder is the last part; it grabs what has been loaded by createWork. It pulls in work units to fill its slots based on weights set on our end. Say the feeder has 1000 slots and we give a weight of 50 to arp1, 25 to scc1, 15 to hstb and 10 to mcm1: the feeder would try to fill 500 slots with arp1 work. This happens every 5 seconds as it tries to fill the empty slots. When members do a scheduler request, they pull from this feeder. The feeder attempts to pull higher-priority results in first and then orders by timestamp, which means that reliable results get pushed to the top first.

The problem wasn't the feeder; the problem was that we had a backlog of reliable results needing to be sent. This caused createWork to believe it already had over 21 work units loaded on the grid. (If you haven't guessed, we are loading 21 workunits every 30 minutes.) If we had had only 10 reliable results waiting to send, it would have loaded only 16 work units for arp1.

I am still trying to work out what the best option is, because loading more work to keep the flow going is desirable, but I also need the resends to be sent out and returned successfully. I'm trying to find a happy medium between the two extremes that would still keep the system quick and provide new work units to members.

I hope my long-winded explanation helps; if it doesn't, please feel free to ask for clarification. :)

Thanks,
-Uplinger
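To make the slot and throttle arithmetic above concrete, here is a rough illustrative sketch. The function names, the two-results-per-workunit default (explained in a later post), and the exact throttle rule are assumptions based on this explanation, not the real WCG server code.

```python
# Illustrative sketch only; not the actual WCG/BOINC server implementation.

FEEDER_SLOTS = 1000
WEIGHTS = {"arp1": 50, "scc1": 25, "hstb": 15, "mcm1": 10}

def feeder_slot_targets(slots: int, weights: dict) -> dict:
    """Split the feeder's slots among projects in proportion to their weights."""
    total = sum(weights.values())
    return {project: slots * weight // total for project, weight in weights.items()}

def workunits_to_load(cap_per_half_hour: int, reliable_results_pending: int,
                      results_per_workunit: int = 2) -> int:
    """createWork throttle: reliable resends still waiting count against the cap."""
    pending_workunits = reliable_results_pending // results_per_workunit
    return max(0, cap_per_half_hour - pending_workunits)

print(feeder_slot_targets(FEEDER_SLOTS, WEIGHTS))  # {'arp1': 500, 'scc1': 250, 'hstb': 150, 'mcm1': 100}
print(workunits_to_load(21, 10))                   # 16 new arp1 workunits, as in the example above
```

With the backlog described in the post (more than 21 workunits' worth of reliable resends pending), the same throttle would load no new arp1 workunits at all.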
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
Keith
Thank you for the illuminating explanation. However, one of the problems relates to the definition of 'reliable'. Because of the length of time needed to complete each unit and the need to hold a cache, very few machines can be classified as 'reliable' even if they never have an error. The definition needs to be relaxed for this project so as to enable more machines to qualify.

Mike
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Thanks for the background.
"(If you haven't guessed we are loading 21 workuntis every 30 minutes). If we had only 10 reliable results waiting to send it would have only loaded 16 work units for arp1." Suppose that could be 28 minutes, 33 minutes, 38 minutes, i.e. on average, something that you cant tune a scheduler to which was the whole point of 'randomized distribution' I'm math challenged BTW... 10 + 16 = 21. Guess you added some fuzzy logic. |
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
I note the reference to generation 001. This seems to indicate that we are now about 0.5% through the project.
Mike
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Keith,
Many thanks for the detail. I guess writing it down gave you an opportunity to think while you did it.

My only take is not very useful: it sounds like a process that works OK for a single-project research effort, but for a multi-project effort like WCG I think I would have taken things in a different direction. But that sounds too much like "If you want to get there, I wouldn't start from here". I think enough people have made comments for someone of your expertise to weigh up the different requirements and have a reasonable chance of sticking something together that works.

But I do agree that an automatic, gently sloped fall-back to more relaxed 'reliable' constraints would seem to be necessary, even if not easily implemented in this environment. Good luck!
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
Fuzzy math answer: I used different terms like results and workunits. A single work unit has 2 results by default. So if the createWork sees 10 results waiting to be sent, then that equals 5 workunits. We load based on workunits, so 21 - 5 = 16 :)
I have thought about removing the reliable-hosts requirement for this project. However, there could be an issue where 5 copies are sent out and 4 fail due to unreliable hosts; that would mean the 1 valid result that was returned would not get used by the system until it was investigated why the workunit failed entirely. I am debating increasing the 40% of original time allowed for reliable hosts, to help keep those machines happy. Especially since we are running this slowly, having to get those back quickly isn't an issue at the moment.

On most of our projects we use what is called a batch status; a batch may have 1000 workunits in it. If 95% of the workunits return within 3 days, we could still be waiting 10+ days for the remaining 5% to return. This means we are waiting the extra days to get back that remaining 5% before packaging a batch and sending it back, which also means we are temporarily storing it on our infrastructure until that time is complete.

Some actual stats: a 10-day return period with zero redundancy and reliable hosts gets batches completed in 15-16 days. A batch with a 10-day return and single redundancy (2 copies needed) has a return time of 17-19 days. A batch with a 10-day return, zero redundancy and no reliable hosts averages 28-30 days.

As you can see, it's a balancing act. But as mentioned before, most projects use a batch concept; this project is using a workunit concept, which is different from our other projects, and some of the guides we used in the past can be relaxed/tweaked. Thus a learning experience for all :)

Thanks,
-Uplinger
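As a concrete illustration of the batch-status point above: a batch is only packaged once every workunit in it has returned, so the slowest few percent set the overall time. Here is a minimal sketch with invented numbers (only the 3-day and 10+-day figures come from the post).

```python
# Illustration only: batch completion is governed by the slowest workunit, not the average.

def batch_completion_days(return_days: list[float]) -> float:
    """Days until the whole batch can be packaged and sent back."""
    return max(return_days)

# Hypothetical 1000-workunit batch: 95% back within 3 days,
# the last 5% trickling in over 10+ days, as described above.
fast_returns = [3.0] * 950
slow_returns = [12.0] * 50
print(batch_completion_days(fast_returns + slow_returns))  # 12.0 - the 5% tail sets the pace
```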