World Community Grid Forums
Thread Status: Active. Total posts in this thread: 18
KSMooney
Cruncher Joined: Jun 10, 2007 Post Count: 1 Status: Offline Project Badges:
Not all cards finish in under 2 minutes. If the application checkpointed every 2 minutes, or even every minute, that would be wonderful.
thebestjaspreet
Cruncher Canada Joined: Jun 16, 2011 Post Count: 10 Status: Offline Project Badges:
Do you have LAIM (Leave Application In Memory) active? AFAIK the GPU app has no checkpoints, so it goes back to the beginning when stopped. I believe you can check "Computation allowed while computer is in use" as well as "Use GPU while computer is in use" in the BOINC client preferences, and use something like TThrottle (http://efmer.eu/boinc/) to control the temperatures. This will eliminate the problem with checkpointing and will also let you keep using the computer. Hope it helps.
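For background on what is being asked for: a BOINC science app decides for itself when to checkpoint, typically by testing elapsed time against an interval and then cooperating with the client (via calls like boinc_time_to_checkpoint() / boinc_checkpoint_completed() in the BOINC API). A minimal sketch of just the interval test, in plain C; this is illustrative only, not WCG's actual code:

```c
#include <stdbool.h>

/* Decide whether enough wall time has passed since the last checkpoint.
 * The 120-second interval matches the "every 2 minutes" requested in
 * this thread. A real BOINC app would combine this with
 * boinc_time_to_checkpoint() and, after writing its state file,
 * boinc_checkpoint_completed(). */
bool should_checkpoint(double now, double last_checkpoint,
                       double interval_secs)
{
    return (now - last_checkpoint) >= interval_secs;
}
```

A task using this would call it in its main compute loop and write its state file whenever it returns true.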
cristipurdel
Senior Cruncher Joined: Dec 13, 2008 Post Count: 158 Status: Offline Project Badges:
> Do you have LAIM (Leave Application In Memory) active? AFAIK the GPU app has no checkpoints, so it goes back to the beginning when stopped. [...] Hope it helps.

Tried it, and it is not working with TThrottle, since it does not have an "exclude GPU app" option. Tip of the hat to KSMooney: it would be nice if the application checkpointed every 2 or 4 minutes, so the faster cards would not be bothered by it while the slower cards are helped.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Given the moans about lag, I somehow perceive this wanted checkpointing would only add to that problem for the affected users. Probability of implementation, unless the Techs put multiple CPU WUs in a single GPU WU: small, more than small.
cristipurdel
Senior Cruncher Joined: Dec 13, 2008 Post Count: 158 Status: Offline Project Badges:
> With the moans on lag, somehow perceive this wanted checkpointing to only add to that problem for the affected users. [...]

I do not think it is hard to implement something that is already done in other projects; it doesn't require reinventing the wheel. If this is too cumbersome to do, then I suggest that future apps should run one WU on each Compute Unit, so that every card runs them in "almost" the same amount of time. If the fastest cards can run one task in under 2 minutes, then wrap 16 of them and do a checkpoint every 2 minutes, so that everybody is happy.

P.S. If memory serves me right, back when there were some hints about GPUs being used in WCG, you were saying that it could not be done so easily, or that the speedup would not be as significant.
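The "wrap 16 of them" idea above amounts to a loop that checkpoints at each sub-WU boundary, so an interrupted task resumes at the next sub-WU instead of starting from zero. A sketch of that shape; all names here are hypothetical stand-ins, not WCG code:

```c
/* Process SUB_WU_COUNT fast sub-WUs inside one wrapped task, recording
 * the index of the last completed sub-WU as the checkpoint. On restart,
 * the task resumes from that index rather than from the beginning. */
#define SUB_WU_COUNT 16

static int checkpointed = 0;        /* would be restored from a state file */

static void process_sub_wu(int i)   /* stand-in for the real GPU work */
{
    (void)i;
}

static void save_checkpoint(int i)  /* stand-in: persist i to disk */
{
    checkpointed = i;
}

int run_wrapped_task(int resume_from)
{
    for (int i = resume_from; i < SUB_WU_COUNT; i++) {
        process_sub_wu(i);
        save_checkpoint(i + 1);     /* natural checkpoint boundary */
    }
    return checkpointed;            /* SUB_WU_COUNT when fully done */
}
```

With ~2-minute sub-WUs this gives the fast cards a checkpoint they never need and the slow cards one every couple of minutes.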
cristipurdel
Senior Cruncher Joined: Dec 13, 2008 Post Count: 158 Status: Offline Project Badges:
I may have found a partial solution:
----------------------------------------
1. From BOINC: set "Use GPU always".
2. Install TThrottle:
- From Programs: set the temperatures for CPU & GPU very low, around 20-30 degrees.
- From Preferences: check "If the computer is not used for" (I put 120 seconds, so that crunching resumes if there is no activity) and set the temperatures as high as you would trust your hardware.
- From Expert: put 10 seconds for "Rebuild list after 10 seconds".

The disadvantages are:
1. If you are watching a movie with VLC, MOC or Flash, the computer will think that you are away and TThrottle is stopped.
2. There is a small lag of 3-10 seconds after you return from idle, but it is not that critical. Also, while using the computer there is a small lag from time to time, about 0.5 seconds, but it is not that annoying.

Possible complete solutions:
1. HCC GPU uses checkpoints, if the developers want this.
2. BOINC re-enables LAIM for GPUs (not wanted at the moment).
3. TThrottle gets an option like: continue crunching if the following processes are running.

What is nice about TThrottle is that it does not suspend the tasks like BOINC does, but "keeps them alive" until throttling is stopped.
----------------------------------------
[Edit 2 times, last edit by cristipurdel at Oct 20, 2012 12:24:28 PM]
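The idle rule described above ("if the computer is not used for 120 seconds, resume crunching") boils down to a simple grace-period test on the time since the last user input. A sketch of that behaviour, illustrative only and not TThrottle's implementation:

```c
#include <stdbool.h>

/* Allow full-speed crunching only once the machine has seen no
 * keyboard/mouse input for a grace period (120 seconds in the post).
 * Any fresh input resets seconds_since_input to 0, which throttles
 * the tasks again until the grace period elapses. */
bool allow_full_speed(double seconds_since_input, double grace_secs)
{
    return seconds_since_input >= grace_secs;
}
```

The first disadvantage listed above follows directly: a video player generates no input events, so seconds_since_input keeps growing and the rule treats the user as away.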
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
> I do not think it is hard to implement something that it is already done in other projects [...] you were saying that it cannot be done so easily or the speedup will not be as significant.

The checkpointing requested is on the single-WU-per-GPU-task setup, which is what I'm talking about [think my post was clear on that]. If the plan (yes, there was a plan) to package multiple WUs in a GPU job went ahead, then checkpointing becomes sensible [also noted in my post]. What I said a few years ago and what I'm saying now? I was talking, as said, about a single HCC WU per task, and about some members having a lag issue where checkpointing at arbitrary points is likely to worsen the situation. And YES, it depends on which WCG project is brought to the GPU. HCC is integer-intense [little to no floating point, dealt with I guess in the CPU phases], and integer is easy peasy for a GPU card, yet to get it to production took how long, and what all needed to be put in place? Research done by the scientists has shown that little to nothing would be gained by porting the other sciences to the GPU [search posts on the forums]. If there were a gain, and there were enough time to make a "bring it to the grid" effort viable, it would have been done.

For Rice, a GPU development was done as a post-processing project, and it was not deemed big enough, or was too involved, to bring to WCG, so the scientists are doing it in-house. As for the future: one expectation is that anything GPU-able in this type of research will largely be moved off-grid. It is much easier to have a homogeneous set of [latest] hardware and let it churn through the work in a few months than to try to cater for a zillion different volunteer configurations and afterwards have to deal with just enough statistical variance that your set of results is not optimal. This is the major concern with public, distributed computing.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just to share a pretty good answer to a question over at the Berkeley forum on why paused GPU tasks *can't* resume where they were suspended:

> It appears that Leave Applications in Memory is not working with GPU tasks. Any plans to introduce the feature also for GPU applications?

No, it was deliberately taken out. CPU (main system) memory is routinely swapped out to a pagefile by the operating system if things get tight, so LAIM has minimal effect on system performance. But GPU memory has no swapfile system, so anything left in memory stays in memory. That bit of BOINC was written when 512 MB GPUs were common, and many projects (including Einstein, and SETI with CUDA 2.3 and above) can only fit one task in that little VRAM. Leaving an app behind in VRAM would prevent any other GPU task from running.