| World Community Grid Forums
Thread Status: Active Total posts in this thread: 9
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I've got a WU whose estimated completion time keeps increasing. There is a file called "fort.98" in its data area with a huge and ever-increasing number of lines, all of which say "**** OUT OF BOUNDS *********".
Should I abort it, or is there any information that would be useful to obtain first as to why it appears to be failing? |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I'm back from Thanksgiving at my sister's home. You have probably decided already, but if not - go ahead and abort it.
Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Yeah, I killed it. Someone might want to look at that WU. It's been about 3 days and only 4 of 10 clients have completed it. It could just be a coincidence, but a WU that takes 6-10 hours (what the completed ones have taken) would normally be done by all clients in under 2 days. With 60% incomplete, I suspect that others might be thrashing away making huge error logs like mine was. (ach1_ 8_30 is the WU)
(Even more so, as what sounds like the same problem was reported in this thread.) |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Another machine has errored out, after 67.89 hours and having blown a credit of 1,298.2 for nothing. My bet is that all the others will run out of time, having wasted several days each.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Ughh!
Well, this is the sort of problem that we are supposed to locate with our initial trial runs. The project scientists have some local computers to debug with once we locate the problems. Sigh...
Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
That's why I was trying to bring this to someone's attention. As I expected, the others have returned Error at 2-3 days CPU time or been marked No Reply, whereupon a new batch of copies has been sent out to more machines.
----------------------------------------
Edit: Now it's sending out copies with a 1-day execution requirement and getting mostly "No Reply", so it's sending out more.
[Edit 1 times, last edit by Former Member at Nov 30, 2007 8:51:13 AM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
After burning through around 1000 CPU hours to get a quorum, the result was inconclusive, so another 5 copies have been spawned. Isn't it about time someone did something to stop this mess?
|
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline
|
I downloaded the work unit in question and I'm investigating why it is causing so many copies to be sent out. Hopefully I can have something for you in the next few days.
-Uplinger |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
With over 1885 CPU hours consumed across 33 machines, was this the biggest rogue WU ever?
What ended up being the problem? |
||
|
|
|