World Community Grid - View Thread - Where do all the errored work units go?

Forum: FightAIDS@Home


Dayle Diamond
Senior Cruncher
Joined: Jan 31, 2013
Post Count: 452

Where do all the errored work units go?

The Android version of Vina, as many of us have experienced, is pretty unstable. I've seen forum recommendations where folks guess at how users can compensate, e.g. by lowering the number of active cores, but after a while it became clear that it was out of our hands.

It's disappointing, because I've got up to eleven Android cores crunching and getting nowhere.

I'm looking at my errored results, and each one seems to have been generated many times over, almost always ending in an error.

For example, this work unit ran for just over 24 hours, and then failed because it couldn't open the output file. It's on its ninth iteration and hasn't been crunched successfully.

FAHV_x1HVH-A-AS_0876796_0260_9

Or this one, where Vina was killed by signal 9 minutes after it began. Attempt #10 is waiting for validation.

FAHV_x1F7A-B-AS_0876456_0366_7

So my question: what happens to all these work units once the server stops sending them?

[Aug 19, 2014 2:48:26 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695

Re: Where do all the errored work units go?

If a workunit hits the limit on errors it will be attempted on another platform once. If it continues to return an error it will be marked and removed from the grid and sent back to the researchers to investigate.

Thanks,
armstrdj

[Aug 27, 2014 3:57:47 PM]
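
The handling described in the reply above can be pictured as a small decision rule. The following is only an illustrative sketch, not World Community Grid's actual server code: the function name, the `Action` labels, and the `error_limit` parameter are all hypothetical, and the real limit is not stated in this thread.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the three outcomes described in the reply.
    RESEND_SAME_PLATFORM = "resend on the same platform"
    TRY_OTHER_PLATFORM = "attempt once on another platform"
    RETURN_TO_RESEARCHERS = "mark, remove from the grid, return to researchers"


def next_action(errors: int, error_limit: int, tried_other_platform: bool) -> Action:
    """Decide what happens after a workunit returns an error.

    errors: errored results so far on the original platform.
    error_limit: assumed per-workunit error limit (value unknown here).
    tried_other_platform: True if the single attempt on another
        platform has already been made and also returned an error.
    """
    if errors < error_limit:
        # Limit not yet reached: keep resending.
        return Action.RESEND_SAME_PLATFORM
    if not tried_other_platform:
        # Limit reached: one attempt on a different platform.
        return Action.TRY_OTHER_PLATFORM
    # Still failing after the cross-platform attempt.
    return Action.RETURN_TO_RESEARCHERS


if __name__ == "__main__":
    print(next_action(9, 10, False).value)
    print(next_action(10, 10, False).value)
    print(next_action(10, 10, True).value)
```

The limit of 10 in the usage lines is an arbitrary placeholder chosen for the example.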

Dayle Diamond
Senior Cruncher
Joined: Jan 31, 2013
Post Count: 452

Re: Where do all the errored work units go?

Thanks for the feedback. It's reassuring to know that the workunits are getting done after all.

Quick question: if the only WUs returned to researchers are ones that (A) hit the error limit on Android resends and (B) fail again on another platform, is it possible that the researchers have no idea how many Android errors are occurring? If only successful work units are counted in the metrics, is it possible that demand for Android work units is underrepresented?

[Aug 28, 2014 5:45:44 PM]

Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823

Re: Where do all the errored work units go?

A similar case in point came up a few days ago on this very project. The latest experiment is officially 166, and it looked like all previous files had run. But then workunits started appearing from #104 and similarly numbered experiments (link!).

[Aug 28, 2014 11:20:18 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695

Re: Where do all the errored work units go?

We have a new Android build that is currently in alpha testing. There are a couple more things we need to test, but so far the results are good. As soon as it is ready we will promote it to beta for additional testing.

Thanks,
armstrdj

[Sep 2, 2014 2:58:56 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0


Re: Where do all the errored work units go?

armstrdj,

Whilst it all looks promising in alpha, can you please verify whether your Android science application is adhering to the 'write to disk at most every nn seconds' setting? I set the agent to 300, but it is still logging a checkpoint every couple of minutes. One task is now at 344 checkpoints in 10 hours, or one every 1:45 minutes. On the PC, at least, the interval is respected, i.e. checkpoints are logged 5 minutes or more apart, at the first internal completion on or after 5 minutes since the previous one. Running to completion in about 2 hours with 140 jobs packed implies there's an internal completion about every 1:15 minutes on the PC.

[Sep 2, 2014 4:13:34 PM]
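
The Android figure quoted above (344 checkpoints in 10 hours against a 300-second setting) can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the function names are made up for illustration and nothing here queries BOINC or the science application.

```python
def mean_interval_seconds(elapsed_seconds: float, checkpoints: int) -> float:
    """Average time between checkpoints over the elapsed run time."""
    return elapsed_seconds / checkpoints


def as_min_sec(seconds: float) -> str:
    """Format a duration as M:SS, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    configured_minimum = 300                        # 'write to disk at most every 300 seconds'
    observed = mean_interval_seconds(10 * 3600, 344)  # numbers taken from the post

    print(as_min_sec(observed))                     # prints 1:45
    print(observed < configured_minimum)            # prints True: more frequent than configured
```

An average of about 105 seconds between checkpoints is roughly a third of the configured 300-second minimum, which matches the report that the setting is not being honoured on Android.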

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695

Re: Where do all the errored work units go?

lavaflow,

We haven't touched any of the checkpointing code so that is likely a bug. I will investigate.

Thanks,
armstrdj

[Sep 3, 2014 3:49:39 PM]
