Thread Status: Active
Total posts in this thread: 7
This topic has been viewed 2489 times and has 6 replies
Dayle Diamond
Senior Cruncher
Joined: Jan 31, 2013
Post Count: 452
Status: Offline
Where do all the errored work units go?

The Android version of Vina, as many of us have experienced, is pretty unstable. I've seen forum recommendations where folks guess at how users can compensate, e.g. by lowering the number of active cores, but after a while it became clear that it was out of our hands.

It's disappointing, because I've got up to eleven Android cores crunching and getting nowhere.

I'm looking at my errors, and they all seem to have been generated scores of times, and almost always end in errors.

For example, this work unit ran for just over 24 hours, and then failed because it couldn't open the output file. It's on its ninth iteration and hasn't been crunched successfully.

FAHV_x1HVH-A-AS_0876796_0260_9

Or this one, where Vina was killed by signal 9, minutes after it began. Attempt #10 is waiting for validation.

FAHV_x1F7A-B-AS_0876456_0366_7

So my question: What's happening to all these work units once the server stops sending them?
[Aug 19, 2014 2:48:26 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Where do all the errored work units go?

If a workunit hits the limit on errors it will be attempted on another platform once. If it continues to return an error it will be marked and removed from the grid and sent back to the researchers to investigate.
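
As a rough illustration of the policy described above (and not the actual World Community Grid server code), here is a minimal Python sketch. The error limit of 10, the `WorkUnit` class, the `record_error` function, the status strings, and the fallback platform name are all hypothetical; the thread does not state the real values.

```python
from dataclasses import dataclass

# Hypothetical limit: the thread does not say what the real error limit is.
MAX_ERRORS_ON_PLATFORM = 10


@dataclass
class WorkUnit:
    name: str
    platform: str = "android"
    errors: int = 0
    tried_other_platform: bool = False
    status: str = "active"


def record_error(wu: WorkUnit, fallback_platform: str = "linux") -> WorkUnit:
    """Apply the described policy after one errored result comes back."""
    if wu.status != "active":
        return wu  # already removed from the grid, nothing more to do

    if wu.tried_other_platform:
        # The single attempt on the other platform also failed:
        # mark the workunit and hand it back to the researchers.
        wu.status = "returned_to_researchers"
        return wu

    wu.errors += 1
    if wu.errors >= MAX_ERRORS_ON_PLATFORM:
        # Error limit reached: try once on another platform.
        wu.platform = fallback_platform
        wu.tried_other_platform = True
    return wu
```

Under these assumptions, a workunit that errors ten times on Android is moved to the fallback platform, and one more error there takes it off the grid.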

Thanks,
armstrdj
[Aug 27, 2014 3:57:47 PM]
Dayle Diamond
Senior Cruncher
Joined: Jan 31, 2013
Post Count: 452
Status: Offline
Re: Where do all the errored work units go?

Thanks for the feedback. It's reassuring to know that the workunits are getting done after all.

Quick question: If the only WUs that are returned to researchers are ones that (A) fail after hitting the limit on Android resends and (B) fail again on other platforms, is it possible that the researchers have no idea how many Android errors are occurring? If only successful work units are counted in the metrics, is it possible that demand for Android work units is underrepresented?
[Aug 28, 2014 5:45:44 PM]
Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline
Re: Where do all the errored work units go?

If a workunit hits the limit on errors it will be attempted on another platform once. If it continues to return an error it will be marked and removed from the grid and sent back to the researchers to investigate.

A similar case happened a few days ago on this very project. The latest experiment is officially 166, and it looked like all previous files had run. But then workunits started appearing from #104 and similarly numbered experiments (link!).
[Aug 28, 2014 11:20:18 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Where do all the errored work units go?

We have a new Android build that is currently in alpha testing. There are a couple more things we need to test, but so far the results are good. As soon as it is ready we will promote it to beta for additional testing.

Thanks,
armstrdj
[Sep 2, 2014 2:58:56 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Where do all the errored work units go?

armstrdj,

Whilst it all looks promising in alpha, can you please verify whether your Android science application is adhering to the 'write to disk at most every nn seconds' setting? I set the agent to 300, but it is still logging a checkpoint every few minutes. One task is now at 344 checkpoints in 10 hours, or one every 1:45 minutes. On the PC, at least, the time between checkpoints does adhere, i.e. checkpoints are logged at an interval of 5 minutes or more, whenever one occurs on or after 5 minutes without a write. Running to completion in about 2 hours with 140 jobs packed implies there's an internal completion about every 1:15 minutes on the PC.
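
For anyone wanting to check the arithmetic above, here is a minimal Python sketch. It is only an illustration of the expected behaviour under stated assumptions, not the science application's checkpoint code: the names `mean_interval_seconds`, `format_mmss`, and `CheckpointThrottle` are hypothetical, and the throttle assumes the timer starts at task start.

```python
def mean_interval_seconds(elapsed_hours: float, checkpoints: int) -> float:
    """Average time between checkpoints, in seconds."""
    return elapsed_hours * 3600 / checkpoints


def format_mmss(seconds: float) -> str:
    """Render a duration as m:ss, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


class CheckpointThrottle:
    """Allow a disk write only if at least min_interval_s has passed
    since the last write (or since task start, an assumption here)."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self.last_write_s = 0.0

    def should_write(self, now_s: float) -> bool:
        if now_s - self.last_write_s >= self.min_interval_s:
            self.last_write_s = now_s
            return True
        return False


# 344 checkpoints in 10 hours, as reported for the Android task.
print(format_mmss(mean_interval_seconds(10, 344)))  # prints 1:45

# With a 300 s setting and an internal completion every 75 s (1:15),
# a throttle that honours the setting writes only every 300 s.
throttle = CheckpointThrottle(300)
writes = [t for t in range(75, 3601, 75) if throttle.should_write(t)]
print(writes[:3])  # prints [300, 600, 900]
```

If the setting were honoured, one hour of such a task would log 12 checkpoints rather than the roughly 34 implied by a 1:45 interval.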
[Sep 2, 2014 4:13:34 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Where do all the errored work units go?

lavaflow,

We haven't touched any of the checkpointing code, so that is likely a bug. I will investigate.

Thanks,
armstrdj
[Sep 3, 2014 3:49:39 PM]