| World Community Grid Forums
Thread Status: Active Total posts in this thread: 98
Psalm103
Cruncher Joined: Jan 6, 2007 Post Count: 25 Status: Offline Project Badges:
|
I've got two betas that have been going for 17 hours now (00011_1551 and 00011_1558). Not a single checkpoint yet. It looked at first like they'd finish in about 1.5 hrs each. The Remaining Time went to '---' after about an hour and they hit 100.000% at just over 15 hours of CPU time. Will they time out eventually? I'll keep them running for now and see what happens. This is a reasonably quick machine that usually comes in at just under the average run time.
(Win 7 x64, 4 cores @ 2.66 GHz, 8 GB RAM @ 1600 MHz) |
||
|
|
jonnieb-uk
Ace Cruncher England Joined: Nov 30, 2011 Post Count: 6105 Status: Offline Project Badges:
|
I see that Keith Uplinger has been browsing the thread in the last hour.
Hopefully words of wisdom will be forthcoming shortly! |
||
|
|
anhhai
Veteran Cruncher Joined: Mar 22, 2005 Post Count: 839 Status: Offline Project Badges:
|
Should we be aborting the ones stuck at 99.x%? Can the staff give us some direction? |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Sorry for the long delay in responding, but we have figured out the root cause of the work units hanging. It is a work unit build problem: some of the input files were improperly formed. These input files were manually changed outside of the build script to replace a special character. When the manual update encountered more than one special character, it changed the length of the line from what was expected, which caused the application to appear stalled. It was technically still working, just on data that was a lot longer (1000000x) than normal. We are going to set all the work units currently out there to report as being completed (server_abort).
I have disabled the assimilator and validator for the time being. This will allow the results to stay in the database longer than normal. On Monday I will review the data that members have returned for these batches and grant credit if someone hit the resource limit (CPU timeout). I will also look at those that were manually aborted, to see if some partial credit for time spent can be given. We changed the build script so that manual intervention to remove the special character is no longer needed. After we clean up from this current beta, we will be sending out proper work units; no timetable on that yet. Thanks, -Uplinger |
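To illustrate the kind of failure described above, here is a minimal sketch of how an edit that changes a line's length can shift the columns of a fixed-width record. The actual input file format is not given in this thread, so the record layout (an 8-character name followed by a 6-character count), the `#` character, and all function names here are hypothetical.

```python
# Hypothetical fixed-width record: columns 0-7 hold a name,
# columns 8-13 hold a count, and more data follows.
RECORD = "ab#c#   000012345678"


def parse_record(line: str) -> tuple[str, int]:
    """Read the name and count by column position only."""
    name = line[0:8].strip()
    count = int(line[8:14])
    return name, count


def replace_same_width(line: str) -> str:
    """Swap each '#' for one character, so the line length is unchanged."""
    return line.replace("#", "_")


def replace_changing_width(line: str) -> str:
    """Drop each '#', so the line gets shorter and later columns shift left."""
    return line.replace("#", "")


if __name__ == "__main__":
    print(parse_record(RECORD))
    print(parse_record(replace_same_width(RECORD)))
    print(parse_record(replace_changing_width(RECORD)))
```

In this sketch the same-width replacement still yields a count of 12, while the length-changing one makes the count field pick up digits from the following column and read 1234. A program given a size that is wrong by a large factor would keep working for a very long time and look stalled, which matches the behaviour reported in this thread.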
||
|
|
mikefinn
Cruncher USA Joined: Apr 27, 2007 Post Count: 43 Status: Offline Project Badges:
|
My two beta units do not have an estimated time remaining entry. <Snipped> I took a quick look at the stderr.txt file of one of them and it had only one line:
Unable to open checkpoint file starting from 0
I let the work units run all night. When I checked in the morning, my computer was unresponsive to keyboard and mouse and I wound up rebooting. Before the reboot, the two work units were at 99.x% with no checkpoint or time remaining entry. After the reboot, the two work units were running from the beginning with 45 minutes of remaining time. But shortly after, the remaining time entry vanished and was replaced by "--".
I looked at one of the stderr.txt files and all it had was:
Unable to open checkpoint file starting from 0
Unable to open checkpoint file starting from 0
Unable to open checkpoint file starting from 0 |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Users can manually abort them if they would like. Work units that had not started on a member's computer will abort after we trigger the server abort.
Please wait to manually abort them until we have updated the database, so that additional copies are not sent out. Thanks, -Uplinger |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My Windows machine aborted 5 Beta WUs after 13:42 hours CPU runtime due to max time being exceeded. They ran for 13:27/13:42 and are claiming 24.0 pts. Here is the error info on one of them:
Result Log
Result Name: BETA_ugm1_ugm1_00012_0950_0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Unable to open checkpoint file starting from 0
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x000007FEFDC73CA2
Engaging BOINC Windows Runtime Debugger...
[Edit 1 times, last edit by Former Member at Sep 19, 2014 2:48:38 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hit the update button and got the messages below, indicating it's now OK: tasks are server aborted whether running, not running, or suspended as in my case.
2691 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_0031_1 is no longer usable
2692 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_0036_1 is no longer usable
2693 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_0029_1 is no longer usable
2694 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_0045_0 is no longer usable
2695 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_0035_1 is no longer usable
2696 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_1146_1 is no longer usable
2697 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_1143_0 is no longer usable
2698 World Community Grid 9/19/2014 5:06:07 PM Result BETA_ugm1_ugm1_00012_1166_0 is no longer usable |
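For anyone wanting to check which of their tasks were server aborted, here is a minimal sketch that pulls the result names out of messages like the ones pasted above. It assumes only the wording seen in this post ("Result <name> is no longer usable"); the function name and the sample text are illustrative and not part of any BOINC tool.

```python
import re

# Matches the wording seen in the messages above; the result name is
# taken to be the run of non-space characters after "Result ".
NO_LONGER_USABLE = re.compile(r"Result (\S+) is no longer usable")


def aborted_results(log_text: str) -> list[str]:
    """Return the result names reported as no longer usable, in order."""
    return NO_LONGER_USABLE.findall(log_text)


# Two of the messages from this post, copied as pasted.
SAMPLE = (
    "2691 World Community Grid 9/19/2014 5:06:07 PM "
    "Result BETA_ugm1_ugm1_00012_0031_1 is no longer usable\n"
    "2692 World Community Grid 9/19/2014 5:06:07 PM "
    "Result BETA_ugm1_ugm1_00012_0036_1 is no longer usable\n"
)

if __name__ == "__main__":
    for name in aborted_results(SAMPLE):
        print(name)
```

Because the pattern does not depend on line breaks, it works the same whether the messages are one per line or run together as copied from the client.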
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Yes, we have server aborted any that were still in a state of running. If the work units validated they were not touched, but any that had at least one wingman in progress got server aborted. You can manually abort work units now if you'd like. I will be working on Monday to grant credit where it is due. Monday should be when the initial deadline is, so most results should be in by that point.
Again, we apologize for the issues. Thanks, -Uplinger |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
And those units that completed successfully but were in PVal state now say "Too Late" - in this case, don't try to read anything into that phrase, it's a known get-out route, see the FAQ.
|
||
|
|
|