World Community Grid Forums
Thread Status: Active | Total posts in this thread: 3593
cjslman
Master Cruncher | Mexico | Joined: Nov 23, 2004 | Post Count: 2082 | Status: Offline
If you consider an analogy with marketing, Delft would be the manufacturer, WCG would be the intermediary (or shopkeeper) and we would be the customer. Who's on first?
CJSL
Gotta keep crunching...
----------------------------------------
[Edit 1 times, last edit by cjslman at Nov 27, 2019 1:47:54 AM]
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
I can see how fast turnaround matters for repair work units (_2, _3, etc.) with much shorter deadlines, as well as for Beta work units and FAHB work units that are AsyncRE. But -- not to be pedantic over definitions -- "reliable" to me simply means 1) consistently turned in within the deadline, and 2) no errors (or no errors over many weeks/months, or an error ratio that is vanishingly small).

I generally keep a 1 day cache to account for my ISP outages and thieves stealing Comcast wiring, as well as WCG maintenance and unplanned outages, but for ARP1 I'm keeping a cache of 3-4 days of MIP1 work units so that I don't have to babysit my devices that often and so they have a chance to pick up a few ARP1 work units. Otherwise they'd quickly run dry and idle, or require more frequent restocking with fresh MIP1.

Thank you Uplinger for the detailed explanations! And for the config changes. I'd love to see the createWork cache get up to 30K or more work units (60K or more results) in the future.
----------------------------------------
[Edit 3 times, last edit by hchc at Nov 27, 2019 2:01:04 AM]
DrMason
Senior Cruncher | Joined: Mar 16, 2007 | Post Count: 153 | Status: Offline
First, to dispel a misconception: having an average runtime of "x" and a cache smaller than "x" DOES NOT mean that you get no units. I have a machine with cache settings of "store at least 0.01 days of work" and "store up to an additional 0.1 days of work" that is crunching an ARP unit right now. That machine alone has crunched at least 4 units in the past several days, taking between 15.89 and 18.3 hours to complete and return each one (which is obviously longer than 0.01 or 0.1 days). So, the idea that low cache numbers result in no units simply isn't correct.
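For anyone who wants to try the same settings on their own client, here is a minimal sketch of what they look like in BOINC's global_prefs_override.xml (kept in the BOINC data directory and picked up once the client re-reads its preferences or restarts). The two values simply mirror the numbers above; this is an illustration, not a recommendation:

```xml
<!-- global_prefs_override.xml (minimal sketch, not a complete preference set).
     "Store at least X days of work" maps to work_buf_min_days;
     "Store up to an additional X days of work" maps to work_buf_additional_days. -->
<global_preferences>
    <work_buf_min_days>0.01</work_buf_min_days>
    <work_buf_additional_days>0.1</work_buf_additional_days>
</global_preferences>
```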
Second, I feel like there's a difference in the value of "reliability" from the viewpoint of the project versus the viewpoint of some users, and I don't think there's anything wrong with that. It seems like a lot of people aren't looking at it from the researchers' perspective. I understand the need (in some limited cases) or desire to cache units to crunch later. If you have intermittent internet, or a bandwidth cap, caching lets you still contribute to WCG when it's convenient. But there's always a tension between convenience ("we don't need the result right away, so we can wait a couple extra days") and what will make the project unworkable ("we've been waiting on this unit for x days now and it's holding us back from creating the next batch of units").

It sounds like they use "reliable" computers primarily for the second kind of case: re-sends of units that errored out, got no response, were aborted, or came back too late. There, time would be of the essence. So what's wrong with WCG defining reliability as "this computer makes the project run smoothest"? Those re-sent units are already "late" and at risk of slowing down the project, so it benefits the project to send them to hosts that have proven a consistently fast (or, in another word, "reliable") turnaround time. If "reliable" is redefined to include hosts that do not crunch them immediately and quickly, that could slow down the project, and those slowdowns compound quickly over time.

We're not even a month into ARP, so there's no issue with testing things out to see what runs most smoothly, especially since they are not fully ramped up. But we should keep in mind that WCG is looking at the projects on a different level than the crunchers, and they may have different priorities and viewpoints. There needs to be some deference to those priorities.

And to address an analogy by Mike.Gibson above, we are not the customer. The researchers are the customer. We, the crunchers, are suppliers of a good (computer power) that we donate to WCG, which passes it on to the customer. It's always good to know where you are in the supply chain.
----------------------------------------
[Edit 1 times, last edit by DrMason at Nov 27, 2019 2:18:16 AM]
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7846 | Status: Offline
Well said DrMason.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
DCS1955
Veteran Cruncher | USA | Joined: May 24, 2016 | Post Count: 668 | Status: Offline
Are we seeing our first drop-off in WUs? HSTB Redux... down to 828 WUs yesterday. I am running on fumes to get to gold.
DrMason
Senior Cruncher | Joined: Mar 16, 2007 | Post Count: 153 | Status: Offline
Hey dcs1955
It seems there was a slight error in the number of work units being sent out for a couple of days. I think this has been fixed now, and units are being pushed back out. Because of the length of the units (and, I suppose, some people's caching routines), the effects of the error will lag. So, in the coming days, we should see those numbers rebound.
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
Well said DrMason. I don't understand the first paragraph though:
First, to dispel a misconception: having an average runtime of "x" and a cache smaller than "x" DOES NOT mean that you get no units. I have a machine right now that has cache settings of store at least 0.01 days of work, and store an additional 0.1 days of work, that is crunching an ARP unit right now. That machine alone has crunched at least 4 units in the past several days. That machine has taken between 15.89 and 18.3 hours to complete and return units (which is obviously longer than 0.01 and .1 days). So, the idea that low cache numbers results in no units simply isn't correct.

I re-read the last page or two of posts to make sure, but I haven't seen anyone assert that low buffers mean no ARP1 tasks, so it looks to me like you created and responded to a strawman. I know you didn't quote anybody, but can you please point me to somebody who made the argument that "low cache numbers results in no units" in case I missed someone taking that position?

My position, at least, is that small buffers require more frequent babysitting: if I fill a 1 day buffer with MIP1 work and then switch to ARP1, I would have to check every day to make sure the device stays 100% full of work instead of sitting idle. I prefer not to micromanage all this, which is why I set the period to 3 days: I fill up 3 days' worth of MIP1 work, and whatever ARP1 work the device receives is simply icing on the cake. At least that way I only have to babysit my devices every 3 days.
----------------------------------------
[Edit 7 times, last edit by hchc at Nov 27, 2019 4:41:53 AM]
DrMason
Senior Cruncher | Joined: Mar 16, 2007 | Post Count: 153 | Status: Offline
@hchc Eh, I'm newish to the forums so am still figuring out etiquette, but it was a page or two ago. Is that the standard practice - if it's a page or two ago, quote it? I'll try to remember in the future. Posted the quote below for reference.
Jim
If I was to implement your settings which would mean a minimum cache of 2.4 hours and a maximum cache of 14.4 hours, I would never get any WUs as they are taking 27 hours without counting any queuing time. Owing to the paucity of availability, the settings need to be at least 1.5 days + 1.5 days in order to get 1 and have another waiting. That would mean a turnaround of 3 days which is less than half the allowed time. I think a better definition of 'reliable' would be half the allowed time, which could be implemented as an across the board definition.
Mike

Not trying to construct a strawman, but I suppose it's possible that I misunderstood what Mike.Gibson was referring to in this post. If so, feel free to correct me; I never discount the possibility that I'm wrong haha.

I kinda like what you've done with your caches; if my system stops working or my internet craps out, I know what system to try out. It's a cool approach! I saw you said that thieves are stealing ISP wiring? Dang dude, that's next level...

The approach I use kind of assumes that the internet is not interrupted, and that work is constantly being done and reported, so it may not work for everyone. I mainly use the device manager. If the cache is set very low, the client will fetch new work whenever a unit finishes crunching. I then adjust the projects to limit some and encourage the others: I check the boxes of just the one or two projects I want to encourage, and since those units are scarce, I also check the box to have WCG fill the rest of the threads with whatever other projects if none of the units I want are available. To maximize efficiency, I limit the number of MIP units (since their level 3 cache requirements have a knock-on effect on other work units if too many MIP units are crunching at the same time), set the project I want to encourage to "unlimited" just in case, and then set the cancer units to a high enough number to fill the rest of the threads just in case.

If you have few enough machines (or groups of similar machines), you can create a profile for each, tailored to how much MIP to limit and how many threads to fill with cancer work units (if HSTB or ARP aren't available), and then they never need babysitting again. But it takes a fair amount of effort at the start haha.
----------------------------------------
[Edit 1 times, last edit by DrMason at Nov 27, 2019 5:15:37 AM]
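DrMason's limiting happens in WCG's web-based device manager, but a rough client-side analogue of the "limit MIP" step is BOINC's per-project app_config.xml with max_concurrent. This is only a sketch under assumptions: the app name below is a guess, so check the <app> entries in your client_state.xml for the short name WCG actually uses for MIP1, and pick a cap that fits your CPU's L3 cache.

```xml
<!-- app_config.xml, placed in the World Community Grid project folder under the
     BOINC data directory (e.g. projects/www.worldcommunitygrid.org/).
     Caps how many MIP tasks run at once so their L3 cache demands don't crowd
     out other work units. "mip1" is an assumed app name; verify it against
     client_state.xml before relying on this. -->
<app_config>
    <app>
        <name>mip1</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>
```

After saving the file, re-reading config files from BOINC Manager (or restarting the client) applies it.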
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
DrMason said:
@hchc Eh, I'm newish to the forums so am still figuring out etiquette, but it was a page or two ago. Is that the standard practice - if it's a page or two ago, quote it? I'll try to remember in the future. Posted the quote below for reference.

Jim
If I was to implement your settings which would mean a minimum cache of 2.4 hours and a maximum cache of 14.4 hours, I would never get any WUs as they are taking 27 hours without counting any queuing time. Owing to the paucity of availability, the settings need to be at least 1.5 days + 1.5 days in order to get 1 and have another waiting. That would mean a turnaround of 3 days which is less than half the allowed time. I think a better definition of 'reliable' would be half the allowed time, which could be implemented as an across the board definition.
Mike

Not trying to construct a strawman, but I suppose it's possible that I misunderstood what Mike.Gibson was referring to in this post. If so, feel free to correct me; I never discount the possibility that I'm wrong haha.

Gotcha dude! I didn't read back far enough. I guess that post confused me too, so sorry for dropping the strawman thing on you and getting all debatey. Yeah, I believe Mike's post focused more on meeting the requirement for "reliable" fast turnaround time, which is hard to achieve on a really old system. My oldest system takes 36 hours to do an ARP1 work unit. My fastest knocks them out in about 12 hours.

DrMason said:
I kinda like what you've done with your caches; if my system stops working or my internet craps out, I know what system to try out. It's a cool approach! I saw you said that thieves are stealing ISP wiring? Dang dude, that's next level...

Yeah, I started out preferring a 0.1 day cache -- that way I get fresh work and turn it around immediately. My computers were super reliable and got a ton of repair work. But with such a tiny cache, sometimes the WCG maintenance window is 4 hours, which a 2.4 hour cache wouldn't cover. And I've lost Internet for 1-2 days here, so I settled on 0.5 days and now a 1 day cache. 1 day is kinda my sweet spot. And yeah, even in a nice neighborhood here, every year we'll wake up and the neighborhood cable Internet node boxes are busted open and wiring or equipment is stolen. I mean, I hate Comcast with the fire of a thousand suns, but stealing Comcast gear knocks out Internet for a whole neighborhood.

DrMason said:
The approach I use kind of assumes that the internet is not interrupted, and that work is constantly being done and reported, so it may not work for everyone. But my approach is that I use mainly the device manager. If the cache is set very low, it will fetch new work whenever a project finishes crunching. I then adjust the projects to limit some and encourage the others. I check the boxes of just the one or two programs I want to encourage. Since those units are scarce, then I check the box to have WCG fill the rest of the threads with whatever other projects if none of the units I want are available. To maximize efficiency, I limit the number of MIP (since the level 3 cache requirements has a knock-on effect on other work units if too many MIP units are crunching at the same time), set the project I want to encourage to "unlimited" just in case, and then set the cancer units to a high enough number to fill the rest of the threads just in case.
If you have few enough machines (or groups of similar machines), you can create a profile for each tailored to how much MIP to limit and how many threads to fill with cancer workunits (if HSTB or ARP aren't available), and then they never need babysitting again. But, it takes a fair amount of effort at the start haha.

Oh wow, that's interesting. I'm aware of the L3 cache issues with Rosetta/MIP1 and maybe other projects, but I'm not optimizing MIP1 at this point. My current setup is mostly "ARP1 100% if possible": since supply is scarce as the project ramps up, I just fill up with other stuff, then untick the boxes so that only ARP1 is selected, which gives me a 100% chance that the clients ask for ARP1 work instead of getting spread thin. Sorry for misreading you man, and thanks for sharing your system for balancing your workload.
----------------------------------------
[Edit 2 times, last edit by hchc at Nov 27, 2019 5:31:59 AM]
floyd
Cruncher | Joined: May 28, 2016 | Post Count: 47 | Status: Offline
When I read through this thread I get the impression that the dilemma is (A) we need to return ARP results as fast as possible and (B) many of us, including me, don't want to run without a work cache. My idea is to make the deadline for ARP tasks shorter than for other tasks, say five days. That way I could still have a cache of one or two days for other work but selectively trigger panic mode for ARP tasks, making them bypass the queue.