World Community Grid - View Thread - New Beta Starting 2011/07/22

World Community Grid Forums

Category: Beta Testing

Forum: Beta Test Support Forum

Thread: New Beta Starting 2011/07/22

Thread Status: Locked
Total posts in this thread: 296

This topic has been viewed 527418 times and has 295 replies

wplachy
Senior Cruncher
Joined: Sep 4, 2007
Post Count: 423
Status: Offline


Re: New Beta Starting 2011/07/22

ADDITIONAL COMMENT:
On reflection, I would like to add that longer checkpoints running over several cores could result in there never being an optimal window of opportunity to allow a machine to be restarted for housekeeping purposes. I think that the trouble will come if the proposed lengthening of checkpoint times gets in the way of allowing vital updates to the machine to be carried out.

I trust that the proposed change to checkpoint times will not become too obtrusive.

My thought as well. My contribution to CEP2 is my lowest of all the sub-projects for exactly that reason. The long run times and length between checkpoints pose a very real problem for me, so I contribute very little time to that sub-project.

----------------------------------------

Bill P

[Jul 28, 2011 2:41:48 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: New Beta Starting 2011/07/22

The time between checkpoints will be about 10 times larger.

Sorry, but this is purely unacceptable if this mechanism goes to live production. Why so ambitious?

So we will have more work units to run but hopefully keep the runtime around 6 hours.

"Hopefully"? Does this mean these WUs will run much longer on slow machines (unlike the HCMD2 WUs, which are cut off at 6/12 hours)? If so, then it just won't work when combined with the above-mentioned checkpoint interval...

[Jul 28, 2011 5:29:03 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: New Beta Starting 2011/07/22

Since this is AutoDock Vina, the way work is cut *per compound* [ZINC is named in the beta tasks] will be much the same as for e.g. HFCC and FAAH. I don't think it will involve hard cut-offs at 6 or 12 or whatever hours as in the complicated HCMD2 scheme.

Re checkpoints: 10x 2 to 2.5 minutes makes them 20-25 minutes per checkpoint on my 64-bit Linux quad [the app I saw in this test was 32-bit]. Not attractive for rebooting, let alone on my duo, where, based on observations from the ampc test, they would be an hour apart. Pretty please: 2-, 4-, 8- and 12-threaded devices will lose lots of progress on system or client restarts. With electricity getting more and more expensive, long checkpoint intervals would also not be a very green decision in this day and age, not while still seeking unhindered participation by all. Possibly there is a misunderstanding about what 10x longer checkpointing means, so let's hear from the techs what their plan is.
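To put rough numbers on the restart cost: a minimal sketch, assuming a restart lands at a uniformly random point between two checkpoints, so each running task loses half an interval on average. The function name and the uniform-restart assumption are illustrative only, not anything stated by the WCG techs.

```python
def expected_loss_minutes(checkpoint_interval_min: float, running_tasks: int) -> float:
    """Expected minutes of work lost across all running tasks on one restart.

    Assumption: the restart hits each task at a uniformly random point between
    two checkpoints, so each task loses half an interval on average.
    """
    if checkpoint_interval_min < 0 or running_tasks < 0:
        raise ValueError("interval and task count must be non-negative")
    return running_tasks * checkpoint_interval_min / 2.0


if __name__ == "__main__":
    # 2.5 min is the upper current interval quoted above; 25 min is 10x that.
    for threads in (2, 4, 8, 12):
        now = expected_loss_minutes(2.5, threads)
        proposed = expected_loss_minutes(25.0, threads)
        print(f"{threads:>2} threads: ~{now:.1f} min lost now, ~{proposed:.1f} min at 10x")
```

Under that assumption a quad would lose roughly 50 minutes of work per restart at 25-minute checkpoints, versus about 5 minutes at 2.5-minute checkpoints.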

--//--

[Jul 28, 2011 6:07:31 AM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline


Re: New Beta Starting 2011/07/22

Currently the way the application is designed allows us to take checkpoints after each job. By increasing the time for each job, we would be exponentially decreasing the probability of not finding the minimum. Since this route is an option, we are already starting to examine checkpoints within a job.

Thanks,
-Uplinger

[Jul 28, 2011 2:26:16 PM]

pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Status: Offline


Re: New Beta Starting 2011/07/22

Those Betas I reported yesterday are still sucking wind.

Should I abort them or let them run?

----------------------------------------

[Jul 28, 2011 2:58:53 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: New Beta Starting 2011/07/22

Those Betas I reported yesterday are still sucking wind.

Should I abort them or let them run?

Hi pirogue,

Something's not kosher there... the Elapsed / CPU time ratio is what's filling the balloon. I don't know what platform you're on, but I suggest stopping the client / service and restarting to see what it does. On a many-core device, maybe switch LAIM off, then suspend the project for 30 seconds to let the sciences unload, then resume the project.

I suppose the checkpoint column in that view shows none, or one from ages ago. On my Linux machine the last-checkpoint time froze, but not under W7. The jobs did continue clocking CPU time, though.

--//--

[Jul 28, 2011 3:14:01 PM]

sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline


Re: New Beta Starting 2011/07/22

It's inevitable that WCG will not be able to run all projects on all systems, so trying to fit square pegs into round holes is counterproductive.
The fastest CPUs can do 40 times the work of the least productive CPUs. You simply can't facilitate the running of every project on every system. Project requirements are naturally different, so it's unreasonable to expect RAM usage to always be low, bandwidth to be low and run time to be even, and ditto for checkpointing. Just have a look at the trouble other projects have. If LAIM is presently fine for CEP2, why not this one? I don't like losing run time (during restarts) any more than anyone else, but if this does not suit some people, WCG has other good projects to choose from. Crunchers strive to be more productive, so if you have to modify things to the point where increased checkpointing slows the research down, you are losing as much as you gain, at best. Perhaps you will be able to find a workaround, rather than universally chopping into everybody's run time to do checkpoints.

Could it be possible to add a feature where the cruncher could select to suspend the task at the next checkpoint? That way you could click the option, allow the system to run each task to its next checkpoint, and run other tasks as the cores free up. Come back in 20 minutes and do the restart then. Perhaps even a "checkpoint, then restart/shut down the system" function. Alternatively, could you force a checkpoint only when BOINC is closing? That would deal with restarts, log-offs and shutdowns, and LAIM would deal with task suspensions while the system is in use.
Could one of these tasks not be divided and allowed to run on each core/thread? I'm not suggesting this would deal with checkpointing, just asking.

[Jul 28, 2011 3:31:19 PM]

nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline


Re: New Beta Starting 2011/07/22

Those Betas I reported yesterday are still sucking wind.

Should I abort them or let them run?

I had 2 of these last night. The properties tab said 3 hours of run time but only 57 minutes of CPU time. When looking at the tasks tab in BM, the progress bar said 47% complete after 3 hours, so that didn't jibe with what the properties said about those WUs. I didn't get a chance to look at them this morning before I left for work, but according to my grid they are validated. I'm confused.
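For reference, the gap between run time and CPU time can be expressed as a simple ratio. This is only a sketch using the figures quoted above (57 minutes of CPU time over 3 hours elapsed); the function name is hypothetical and nothing here reads values from the BOINC client.

```python
def cpu_efficiency(cpu_minutes: float, elapsed_minutes: float) -> float:
    """Fraction of elapsed (wall-clock) time that the task actually spent on the CPU."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_minutes / elapsed_minutes


if __name__ == "__main__":
    # Figures from the post: 57 minutes of CPU time in 3 hours (180 minutes) of run time.
    ratio = cpu_efficiency(57, 180)
    print(f"CPU efficiency: {ratio:.0%}")
```

A healthy CPU-bound task normally sits close to 100%, so a value this low suggests the task spent most of its elapsed time waiting rather than crunching.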

----------------------------------------

In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.

[Jul 28, 2011 3:36:15 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: New Beta Starting 2011/07/22

?. We're trying to establish what these BETA jobs do and don't in the here and now, what they withstand and what-not, in amongst what happens if the client / system is restarted to see if it survives [an old trick on frozen jobs]. Beta testing is not lossless.

Suggesting this or that change to the client really belongs over at the Developers Alpha mail list or in Trac. You're a member of the former.

--//--

[Jul 28, 2011 3:52:27 PM]

pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Status: Offline


Re: New Beta Starting 2011/07/22

Those Betas I reported yesterday are still sucking wind.

Should I abort them or let them run?

It's been around 2 hours since that screenshot. None of the times had changed, so I rebooted and they seem to be running more normally. The time left is in the 40 minute range instead of 5.5+ hours. Utilization is 50% on 2 of them and 88% on the 3rd.

It's interesting (odd) that of all the Betas that I received, only these 3 are still in progress and they are still in progress with the wingpeople.

I currently have 2 machines running Linux x64. One is running an Intel CPU + 4GB of RAM + HD and one (the problem) is running an AMD CPU + 8GB of RAM + SSD. This machine didn't show any signs of problems on any of the previous Betas and none of my Windows machines have shown any problems at all.

SekeRob: Was your problem on an AMD or Intel CPU?

----------------------------------------

[Jul 28, 2011 5:03:45 PM]
