| World Community Grid Forums
Thread Status: Active Total posts in this thread: 7
Dayle Diamond
Senior Cruncher Joined: Jan 31, 2013 Post Count: 452 Status: Offline Project Badges:
|
The Android version of Vina, as many of us have experienced, is pretty unstable. I've seen forum recommendations where folks guess at how users can compensate, e.g. by lowering the number of active cores, but after a while it became clear that it was out of our hands.
It's disappointing, because I've got up to eleven Android cores crunching and getting nowhere. I'm looking at my errored work units, and they all seem to have been reissued scores of times, almost always ending in errors. For example, this work unit ran for just over 24 hours, and then failed because it couldn't open the output file. It's on its ninth iteration and hasn't been crunched successfully: FAHV_x1HVH-A-AS_0876796_0260_9. Or this one, where Vina was killed by signal 9 minutes after it began. Attempt #10 is waiting for validation: FAHV_x1F7A-B-AS_0876456_0366_7. So my question: What's happening to all these work units once the server stops sending them? |
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
If a workunit hits the limit on errors it will be attempted on another platform once. If it continues to return an error it will be marked and removed from the grid and sent back to the researchers to investigate.
Thanks, armstrdj |
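The policy described in the post above can be sketched roughly as follows. This is only an illustration of the stated rule (hit the error limit, try once on another platform, then return to the researchers); the `WorkUnit` class, the status strings, and the `ERROR_LIMIT` value are all hypothetical and are not taken from the World Community Grid server code.

```python
from dataclasses import dataclass

# Hypothetical value: the thread does not state the actual error limit.
ERROR_LIMIT = 10


@dataclass
class WorkUnit:
    name: str
    error_count: int = 0
    # One of: "active", "retry_other_platform", "returned_to_researchers"
    status: str = "active"


def record_error(wu: WorkUnit, error_limit: int = ERROR_LIMIT) -> str:
    """Record one failed result and return the work unit's new status."""
    if wu.status == "returned_to_researchers":
        # Already removed from the grid; nothing more to do.
        return wu.status
    if wu.status == "retry_other_platform":
        # The single attempt on another platform also failed.
        wu.status = "returned_to_researchers"
        return wu.status
    wu.error_count += 1
    if wu.error_count >= error_limit:
        # Error limit reached: attempt once on another platform.
        wu.status = "retry_other_platform"
    return wu.status
```

Under these assumptions, a work unit stays in circulation until its error count reaches the limit, gets exactly one more attempt elsewhere, and only then goes back to the researchers.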
||
|
|
Dayle Diamond
Senior Cruncher Joined: Jan 31, 2013 Post Count: 452 Status: Offline Project Badges:
|
Thanks for the feedback. It's reassuring to know that the workunits are getting done after all.
Quick question: if the only WUs returned to the researchers are ones that (A) hit the error limit on Android resends and (B) fail again on another platform, is it possible that the researchers have no idea how many Android errors are occurring? If only successful work units are counted in the metrics, is it possible that demand for Android work units is underrepresented? |
||
|
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 823 Status: Offline Project Badges:
|
"If a workunit hits the limit on errors it will be attempted on another platform once. If it continues to return an error it will be marked and removed from the grid and sent back to the researchers to investigate." A similar case happened a few days ago on this very project. The latest experiment is officially 166, and it looked like all previous files had run. But then workunits started appearing from #104 and similarly numbered experiments (link!). |
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
We have a new Android build that is currently in Alpha testing. There are a couple more things we need to test, but so far the results are good. As soon as it is ready we will promote it to beta for additional testing.
Thanks, armstrdj |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
armstrdj,
Whilst it all looks promising in alpha, can you please verify whether your Android science application is adhering to the 'write to disk at most every nn seconds' setting? I set the agent to 300, but it is still logging a checkpoint every few minutes. One task is now at 344 checkpoints in 10 hours, or one every 1:45 minutes. On the PC, at least, the interval is respected, i.e. a checkpoint is logged only once 5 minutes or more have passed since the previous one. Running to completion in about 2 hours with 140 jobs packed implies there's an internal completion about every 1:15 minutes on the PC. |
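To make the expectation in the post above concrete, here is a minimal sketch of the arithmetic behind the 1:45 figure and of what a "write to disk at most every nn seconds" throttle is expected to do. The function and class names are hypothetical and this is not the BOINC or Vina checkpointing code; it only assumes that a checkpoint is permitted once the configured interval has elapsed since the previous write.

```python
def average_interval_seconds(elapsed_hours: float, checkpoints: int) -> float:
    """Average time between checkpoints, in seconds."""
    return elapsed_hours * 3600 / checkpoints


def format_mmss(seconds: float) -> str:
    """Format a duration as minutes:seconds, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


class CheckpointThrottle:
    """Allow a checkpoint only if min_interval_s has passed since the last one."""

    def __init__(self, min_interval_s: float, start_s: float = 0.0):
        self.min_interval_s = min_interval_s
        self.last_write_s = start_s  # treat task start as the last write

    def should_write(self, now_s: float) -> bool:
        if now_s - self.last_write_s >= self.min_interval_s:
            self.last_write_s = now_s
            return True
        return False


# 344 checkpoints in 10 hours, as reported for the Android task.
android_interval = format_mmss(average_interval_seconds(10, 344))

# Simulated jobs finishing every 75 s (the 1:15 figure) with a 300 s setting:
# only every fourth completion should produce a checkpoint.
throttle = CheckpointThrottle(min_interval_s=300)
writes = [t for t in range(75, 3001, 75) if throttle.should_write(t)]
```

With a 300-second setting, a throttle like this would log at most 12 checkpoints per hour, whereas 344 checkpoints in 10 hours is roughly 34 per hour, which is why the Android behaviour looks like the setting is being ignored.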
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
lavaflow,
We haven't touched any of the checkpointing code, so that is likely a bug. I will investigate. Thanks, armstrdj |
||
|
|
|