World Community Grid Forums
Thread Status: Active | Total posts in this thread: 3593
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1407 | Status: Offline
It's that 48-hour return thingie, one of the criteria for a machine to be considered reliable for a project. Reporting a multi-day average of 15 consecutive good results is the other criterion, if that's still valid. With so few tasks, and such long-running ones, one can imagine that there are no reliable hosts at all.
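For readers unfamiliar with the two criteria mentioned above, here is a minimal sketch of how such a check could look. The field names and the exact thresholds are assumptions taken from this post, not the actual WCG/BOINC server code.

```python
# Minimal sketch of the two "reliable host" criteria described above.
# Field names and exact thresholds are assumptions, not actual WCG/BOINC server code.

def is_reliable(avg_turnaround_hours: float, consecutive_valid_results: int) -> bool:
    """True only if the host meets both criteria from the post above."""
    returns_fast_enough = avg_turnaround_hours <= 48          # 48-hour return criterion
    enough_good_results = consecutive_valid_results >= 15     # 15 consecutive good results
    return returns_fast_enough and enough_good_results

# A machine averaging 4.5 days (108 hours) per result never qualifies,
# no matter how many valid results it has returned in a row.
print(is_reliable(avg_turnaround_hours=108, consecutive_valid_results=40))  # False
```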
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The definition of reliable seems much too tight. I have abandoned using my laptop for ARP because each unit was taking more than 48 hours of crunching, and the machine is only on 50% of the time, so results were taking more than 4 days to return. It had only 1 error, which was due to too many restarts. My PC is an i7-3770, and that has been taking 27 hours per unit, restricted to a maximum of 4 ARP tasks running and 12 waiting. That has not had any errors, but the 4.5-day turnaround would classify it as unreliable.

Mike

I think they can make a general project-level exception like is/was done on HSTb, which has a repair deadline the same as the original for the _0 copy. A variation in the percentage does not seem to be possible; 35% now? I've read in the past about there having been 30, 35 and 40%. The tighter the number, the quicker a forced turnaround and batch completion.

[Edited 1 time, last edit by Former Member at Nov 26, 2019 1:13:52 PM]
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
The main problem is not the criteria. They may not have been sending enough work units out to find the reliable machines. A lot have gone to people who just push the Update button often enough instead.
5TEVE
Cruncher | Joined: Sep 4, 2006 | Post Count: 34 | Status: Offline
Been getting some resend ARP WUs this morning: 12 across 4 boxes so far...
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
Good Morning,
OK, so a few things to note: this is a multi-part process. We have the indexer, createWork and the feeder.

The indexer checks to see what we have that is not in the database yet, and places it in an indexed state when found. We usually like to keep about 10 days' worth indexed for most projects. This project is a bit different, but we have plenty indexed at the moment (even some from generation 001).

createWork is what loads the work into the BOINC databases. Since BOINC requires XML to be stored in the database, we usually keep this buffer at about 48 hours to help keep the database quick. But again, for this project we do something a bit different: we load a set number of results per half hour to artificially slow the project down. (Again, we hope to get this up to full speed in the future, which would be 30k+ workunits in the wild at one time, or 60k+ results given the redundancy.)

The feeder is the last part; it grabs what has been loaded by createWork. It pulls in work units to fill its slots based on weights set on our end. Say the feeder has 1000 slots and we give a weight of 50 to arp1, 25 to scc1, 15 to hstb and 10 to mcm1: the feeder would try to fill 500 slots with arp1 work. This happens every 5 seconds as it tries to fill the empty slots. When members do a scheduler request, they pull from this feeder. The feeder attempts to pull higher-priority results in first and then orders by timestamp, which means that reliable results get pushed to the top first.

The problem wasn't the feeder; the problem was that we had a backlog of reliable results needing to be sent. This caused createWork to believe it already had over 21 work units loaded on the grid. (If you haven't guessed, we are loading 21 workunits every 30 minutes.) If we had had only 10 reliable results waiting to send, it would have loaded only 16 work units for arp1.

I am still trying to work out what the best option is, because loading more work to keep the flow going is desirable, but I also need the resends to be sent out and returned successfully. I'm trying to find a happy medium between the two extremes that would still keep the system quick and provide new work units to members.

I hope my long-winded explanation helps; if it doesn't, please feel free to ask for clarification. :)

Thanks,
-Uplinger
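To make the slot and throttle arithmetic above concrete, here is a rough illustrative sketch. The function names, the two-results-per-workunit default (explained in a later post), and the exact throttle rule are assumptions based on this explanation, not the real WCG server code.

```python
# Illustrative sketch only; not the actual WCG/BOINC server implementation.

FEEDER_SLOTS = 1000
WEIGHTS = {"arp1": 50, "scc1": 25, "hstb": 15, "mcm1": 10}

def feeder_slot_targets(slots: int, weights: dict) -> dict:
    """Split the feeder's slots among projects in proportion to their weights."""
    total = sum(weights.values())
    return {project: slots * weight // total for project, weight in weights.items()}

def workunits_to_load(cap_per_half_hour: int, reliable_results_pending: int,
                      results_per_workunit: int = 2) -> int:
    """createWork throttle: reliable resends still waiting count against the cap."""
    pending_workunits = reliable_results_pending // results_per_workunit
    return max(0, cap_per_half_hour - pending_workunits)

print(feeder_slot_targets(FEEDER_SLOTS, WEIGHTS))  # {'arp1': 500, 'scc1': 250, 'hstb': 150, 'mcm1': 100}
print(workunits_to_load(21, 10))                   # 16 new arp1 workunits, as in the example above
```

With the backlog described in the post (more than 21 workunits' worth of reliable resends pending), the same throttle would load no new arp1 workunits at all.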
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
Keith
Thank you for the illuminating explanation. However, one of the problems relates to the definition of 'reliable'. Because of the length of time needed to complete each unit and the need to hold a cache, very few machines can be classified as 'reliable' even if they never have an error. The definition needs to be relaxed for this project so as to enable more machines to qualify.

Mike
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Thanks for the background.
"(If you haven't guessed we are loading 21 workuntis every 30 minutes). If we had only 10 reliable results waiting to send it would have only loaded 16 work units for arp1." Suppose that could be 28 minutes, 33 minutes, 38 minutes, i.e. on average, something that you cant tune a scheduler to which was the whole point of 'randomized distribution' I'm math challenged BTW... 10 + 16 = 21. Guess you added some fuzzy logic. |
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
I note the reference to generation 001. This seems to indicate that we are now about 0.5% through the project.
Mike
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Keith,
Many thanks for the detail. I guess writing it down gave you an opportunity to think while you did it.

My only take is not very useful: it sounds like a process that works OK for a single-project research effort, but for a multi-project effort like WCG I think I would have taken things in a different direction. But that sounds too much like "If you want to get there, I wouldn't start from here". I think enough people have made comments for someone of your expertise to weigh up the different requirements and have a reasonable chance of sticking something together that works.

But I do agree that an automatic, gently sloped fall-back to more relaxed 'reliable' constraints would seem to be necessary, even if not easily implemented in this environment. Good luck!
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
Fuzzy math answer: I used different terms like results and workunits. A single work unit has 2 results by default. So if the createWork sees 10 results waiting to be sent, then that equals 5 workunits. We load based on workunits, so 21 - 5 = 16 :)
I have thought about removing the reliable-hosts requirement for this project. However, there could be an issue where 5 copies are sent out and 4 fail due to unreliable hosts; that would mean the 1 valid result that was returned would not get used by the system until it was investigated why the workunit failed entirely. I am debating increasing the 40% of original time allowed for reliable hosts, to help keep those machines happy. Especially since we are running this slowly, having to get those back quickly isn't an issue at the moment.

On most of our projects we use what is called a batch status; a batch may have 1000 workunits in it. If 95% of the workunits return within 3 days, we could still be waiting 10+ days for the remaining 5% to return. This means we are waiting the extra days to get back that remaining 5% before packaging a batch and sending it back, which also means we are temporarily storing it on our infrastructure until that time is complete.

Some actual stats: a 10-day return period with zero redundancy and reliable hosts gets batches completed in 15-16 days. A batch with a 10-day return and single redundancy (2 copies needed) has a return time of 17-19 days. A batch with a 10-day return, zero redundancy and no reliable hosts averages 28-30 days.

As you can see, it's a balancing act. But as mentioned before, most projects use a batch concept; this project is using a workunit concept, which is different from our other projects, and some of the guides we used in the past can be relaxed/tweaked. Thus a learning experience for all :)

Thanks,
-Uplinger
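As a concrete illustration of the batch-status point above: a batch is only packaged once every workunit in it has returned, so the slowest few percent set the overall time. Here is a minimal sketch with invented numbers (only the 3-day and 10+-day figures come from the post).

```python
# Illustration only: batch completion is governed by the slowest workunit, not the average.

def batch_completion_days(return_days: list[float]) -> float:
    """Days until the whole batch can be packaged and sent back."""
    return max(return_days)

# Hypothetical 1000-workunit batch: 95% back within 3 days,
# the last 5% trickling in over 10+ days, as described above.
fast_returns = [3.0] * 950
slow_returns = [12.0] * 50
print(batch_completion_days(fast_returns + slow_returns))  # 12.0 - the 5% tail sets the pace
```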