Thread Status: Active
Total posts in this thread: 211
Posts: 211   Pages: 22   [ Previous Page | 13 14 15 16 17 18 19 20 21 22 | Next Page ]
This topic has been viewed 29075 times and has 210 replies
verheyde
Cruncher
Belgium
Joined: Dec 7, 2004
Post Count: 25
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

My machine (running Win 7 64-bit on an I7-720, 8 threads) received a re-run last night. It ran for many hours, into the time when I needed to work on the PC. That caused a suspend of the BOINC tasks, due to the CPU parameter setting (Suspend if more than xx% used).
That suspend triggered a full reset of the beta WU. The other WUs (AIDS and Clean Energy) went to sleep and restarted afterwards without a problem. The beta unit's stats were reset to 0 min run time... after a short while it said 0.5% done and remained there for quite some time. Later in the day it completed, and errored out as the output file was too big :-(
It looks like it ran for many, many hours, and still registered 0.23h CPU / 2.72h Elapsed. Neither of those figures is correct.
The last few lines of stderr.txt when the reset happened were:

[10:15:58]: Computing pass 11845
[10:15:58]: Computing pass 11846
Commandline = projects/www.worldcommunitygrid.org/wcgrid_beta17_7.19_windows_x86_64 -SettingsFile BETA_9999987_0728.txt -DatabaseFile dataset-GDS2771-v1.txt
Initializing
wcg_learn_limit = 250000
Running


After this stderr.txt did not update anymore for quite some time, and I had other things to do than babysit my PC ;-)
[Nov 2, 2013 4:45:16 PM]
ccandido
Senior Cruncher
Joined: Jun 22, 2011
Post Count: 182
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

Got 1 WU today.
Here are the problems with this WU:

02/11/2013 17:07:14 | World Community Grid | Task BETA_BETA_9999987_0989_3 exited with zero status but no 'finished' file
02/11/2013 17:07:14 | World Community Grid | If this happens repeatedly you may need to reset the project.
02/11/2013 17:07:14 | World Community Grid | Restarting task BETA_BETA_9999987_0989_3 using beta17 version 719 in slot 11
02/11/2013 17:10:56 | World Community Grid | Task BETA_BETA_9999987_0989_3 exited with zero status but no 'finished' file
02/11/2013 17:10:56 | World Community Grid | If this happens repeatedly you may need to reset the project.
02/11/2013 17:10:56 | World Community Grid | Restarting task BETA_BETA_9999987_0989_3 using beta17 version 719 in slot 11
02/11/2013 17:14:36 | World Community Grid | Task BETA_BETA_9999987_0989_3 exited with zero status but no 'finished' file
02/11/2013 17:14:36 | World Community Grid | If this happens repeatedly you may need to reset the project.
02/11/2013 17:14:36 | World Community Grid | Restarting task BETA_BETA_9999987_0989_3 using beta17 version 719 in slot 11
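For anyone who wants to tally how often a task is being restarted, here is a minimal sketch, assuming the event log has been copied out as plain text in the format shown above. The function name `count_restarts` is made up for illustration; it is not part of BOINC.

```python
import re
from collections import Counter

# Matches event-log lines such as:
#   ... | World Community Grid | Restarting task BETA_BETA_9999987_0989_3 using beta17 version 719 in slot 11
RESTART_RE = re.compile(r"Restarting task (\S+) using")


def count_restarts(log_text):
    """Return a Counter mapping task name -> number of 'Restarting task' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESTART_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the log above.
sample = """\
02/11/2013 17:07:14 | World Community Grid | Task BETA_BETA_9999987_0989_3 exited with zero status but no 'finished' file
02/11/2013 17:07:14 | World Community Grid | Restarting task BETA_BETA_9999987_0989_3 using beta17 version 719 in slot 11
02/11/2013 17:10:56 | World Community Grid | Restarting task BETA_BETA_9999987_0989_3 using beta17 version 719 in slot 11
02/11/2013 17:14:36 | World Community Grid | Restarting task BETA_BETA_9999987_0989_3 using beta17 version 719 in slot 11
"""

restarts = count_restarts(sample)
for task, n in restarts.items():
    print(f"{task}: {n} restarts")
```

A count that keeps climbing for the same task every few minutes is the pattern described in this post.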
----------------------------------------


[Nov 2, 2013 5:20:23 PM]
USAFA 82
Veteran Cruncher
Colorado Springs, Colorado
Joined: Jan 20, 2005
Post Count: 1001
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

I've received nine WUs so far. Five have errored out with the 131 code, two are in progress, and two are in PV.

Win 7 Home w/SP1. This is a customer's computer that I'm updating and running WCG while I work on it. There is no AV because he doesn't connect to the internet at home:
</stderr_txt>
<message>
<file_xfer_error>
<file_name>BETA_BETA_9999984_0138_2_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>

Win 7 Pro w/SP1, Norton Security Suite from Comcast:
</stderr_txt>
<message>
<file_xfer_error>
<file_name>BETA_BETA_9999987_0555_3_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>

Same machine as above:
</stderr_txt>
<message>
<file_xfer_error>
<file_name>BETA_BETA_9999985_0207_4_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>

Same machine as above:
</stderr_txt>
<message>
<file_xfer_error>
<file_name>BETA_BETA_9999984_0352_1_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>

Win 7 Pro w/SP1, Norton Security Suite from Comcast:
</stderr_txt>
<message>
<file_xfer_error>
<file_name>BETA_BETA_9999988_0241_3_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>
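To summarize fragments like the ones above across several machines, a small script can pull out the file names and error codes. This is only a sketch: the pasted pieces are not complete XML (they start with a closing `</stderr_txt>` tag), so it uses a regular expression rather than an XML parser, and the name `extract_xfer_errors` is hypothetical.

```python
import re
from collections import Counter

# One <file_xfer_error> element with its file name and numeric error code.
XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def extract_xfer_errors(text):
    """Return a list of (file_name, error_code) tuples found in the text."""
    return [(m.group("name"), int(m.group("code"))) for m in XFER_RE.finditer(text)]


# Two of the fragments from this post, pasted as-is.
sample = """\
</stderr_txt>
<message>
<file_xfer_error>
<file_name>BETA_BETA_9999984_0138_2_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>

</stderr_txt>
<message>
<file_xfer_error>
<file_name>BETA_BETA_9999987_0555_3_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>
"""

errors = extract_xfer_errors(sample)
by_code = Counter(code for _, code in errors)
for code, n in by_code.items():
    print(f"error {code}: {n} result file(s)")
```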
----------------------------------------


Cancer Survivor
Play Star Citizen!
[Nov 2, 2013 6:32:13 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7578
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

After 3 days of trying, I finally completed the download of this WU, which then completed without a problem. Core 2 Duo, Linux Mint 64-bit.
BETA_BETA_9999988_0952_0 -- joe-E6610-3 Valid 10/31/13 07:23:40 11/2/13 13:09:47 4.23 / 4.29 71.8 / 78.4
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 2, 2013 6:37:21 PM]
keithhenry
Ace Cruncher
Senile old farts of the world ....uh.....uh..... nevermind
Joined: Nov 18, 2004
Post Count: 18665
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

Let me take the lists Rob compiled and Keith's responses in a few posts and combine them here and add a couple of items.....

1) Output file too large (Error -131)
2) Maximum Disk Use Exceeded (disk_bound overstepped)
3) Memory model exceeded (memory_bound overstepped)
These 3 are associated pretty much with the same problem. The result output is growing larger than it needs to. The researchers are working to fine tune the filtering that is needed within the work units. Also as a fail safe we are working on a way to detect this mid work unit and exit gracefully so that work done is returned to the researchers and they can evaluate how to proceed with these monster result files.

4) Loss of -large- portions of CPU time at time of reporting, which looks to happen at end.
We are looking into the checkpointing issue. We believe we have a fix, but it will need to be tested in the next round.

5) Progress % erratic (e.g. it can jump from 0.5% to 50% only at the end of the 1st pass when there are only 2 passes).
6) Related to 5), checkpoints are at times multiple hours apart... not good for part-time crunchers.
5 and 6 are similar to 4 in that we have a potential fix and will be testing it next round.

7) Jobs seem stuck in memory at times [when seemingly no more progress is made]... they won't unload, even when "Leave application in memory when suspended" is off. A full client restart is required to get them to unload.
This is something I need more information on. Is this an issue with Windows only? What flavor? (ex. Windows 8 32bit or Windows Vista 64bit) I have not been able to recreate on my machines, but that could be I'm looking at the wrong OS.
This was on Windows 7, 64-bit and 32-bit, with the newer test clients 7.2.23 and 7.2.18 respectively. Both have a private partition > 10GB. I only know this for Windows, since for the whole test cycle I did not visit the Linux box, which had 4 errors, all with -131 [which, had it been understood how to restrain output, would probably have been valid results]. The 5th on that Ubu 131.0 box did get a valid, BUT it went all the way to 5774 passes, and when there are too many, the log essentially only prints the end part... starting at about pass 3761. The other good news is that the wingman's log showed exactly the same, meaning a 100% reproducible result.

BETA_BETA_9999984_0176_1 -- 719 Valid 10/31/13 06:04:29 11/1/13 14:16:37 3.78 80.4 / 66.3
BETA_BETA_9999984_0176_0 -- 719 Valid 10/31/13 06:04:22 10/31/13 11:31:28 3.91 52.2 / 66.3

Don't know which platform the original development took place, but 50:50 card is on Linux.

So, as noted, with the LAIM option turned off, suspending the stuck tasks did not unload them [confirmed in Task Manager]. I had CEP2 and FAHV running on the side, and when I suspended those to get the Beta to start again, they did unload... so not a client issue. Doing a BOINC service stop (a service install, not a user-level install, if it matters) effected the unloads. I think the latest clients have mechanisms to kill all BOINC-related processes, even if zombied, but since the tasks did not accumulate Elapsed or CPU time after the suspend, I conclude neither was orphaned.


8) Some tasks freeze on CPU time use when running [is it only the display, or does the CPU time in Task Manager also indicate no CPU use?], while elapsed time keeps accumulating and progress % goes backward. Users of BOINC Manager won't see this easily; to users of BOINCTasks it's obvious, since both Elapsed and CPU time are shown.
I believe some of this might be due to the large results and checkpoints, we are still investigating it.

9) Running 4 concurrently (i.e., using all available cores), appears to be very inefficient.
I believe this is due to really large files needing to be checkpointed, which brings it back to the solution we are working on at this time: detect the memory usage and exit based on that.

10) Tasks restart every few minutes

If you are encountering a restart issue of your result, please let us know a few things.

1. What OS are you running (ex. Windows 8 64 bit)
2. Do you have security software on your computer? One report from Gil was McAfee.
3. Check your security software to see if you can exclude either this application or the BOINC data directory. On Windows this is usually C:/ProgramData/BOINC/.

All of the errors up to this point on the restart issue are Windows machines. My assumption at this time is since it's a new application the security software on your machine is killing the process and boinc is trying to restart. This kill and restart happens too many times and an error is reported to the server.

11) Wanted to add this item which Rob also mentioned in his response to 7: Content of result log is truncated with beginning lost when LARGE number of passes occurs.

12) The "wcg_learn_limit reached" message occurs multiple times on individual passes, as noted in this post. The expectation is that processing would end for that pass when this limit is reached. Probably related to this, CPU time is not recorded in Results Status but elapsed time is.
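The response to items 1) to 3) above mentions a fail-safe that detects an oversized result mid work unit and exits gracefully so the work done can still be returned. Purely as an illustration of that idea, and not the actual beta17 code, here is a minimal sketch. Every name in it (`run_passes`, `compute_pass`, `max_output_bytes`) is hypothetical, and the sizes are toy values.

```python
def run_passes(compute_pass, total_passes, max_output_bytes):
    """Run passes until done or until the accumulated output would exceed the limit.

    Returns (results, completed_passes, stopped_early). Stopping early keeps
    everything produced so far instead of failing the whole run.
    """
    results = []
    used = 0
    for index in range(total_passes):
        chunk = compute_pass(index)
        if used + len(chunk) > max_output_bytes:
            # Output bound would be overstepped: stop here and keep what we have.
            return results, index, True
        results.append(chunk)
        used += len(chunk)
    return results, total_passes, False


def fake_pass(index):
    """Hypothetical stand-in for one computing pass; always yields 100 bytes."""
    return b"x" * 100


results, completed, stopped_early = run_passes(fake_pass, total_passes=10, max_output_bytes=350)
print(f"completed {completed} passes, stopped early: {stopped_early}")
```

The point of the design is that hitting the bound becomes a normal exit with partial results rather than an error at upload time.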

Wish list: Printing of OS and CPU details in Result Log.
Yes, we are thinking of adding this information to the result status page, not in the result log as that would not require us to recompile the older applications on WCG to support this change.
On the wish-listed log info additions... apart from the recompile consideration, new sciences going forward would have allowed the feature to be added to the coming CEP2 and Beta17 [i.e. with new, to-be-launched applications; no urgency seen for the solid established projects]. Based on various timetables, you're saying... maybe 2014. The Monkees song has long changed its title to 'I'm [not] a belieber', but if the OS/CPU info is added to the quorum detail sub page of the Result Page, that adds a convenience [single page overview], and probably adds a little fetch load prior to opening the logs. Pasting the log for helpers then does not naturally present the system info... The original thought I had, in an 'exceed the customer's expectation' spirit [not an IBM credo?]: do both.


Given the point of a beta - to find and identify problems before releasing in production - this small beta would seem to have been very successful. We're all anxious to get the new project running but this, and the next beta, will make for a smooth roll out.
----------------------------------------
Join/Website/IMODB



[Nov 2, 2013 7:00:52 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2977
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

Okay, I seem to have been very fortunate and managed to pick up a resend. As I managed to catch it, I'm running it on its own, so in theory this will test how efficient my machine is when running just one at a time (as opposed to the 4 concurrently, earlier in the week).
----------------------------------------

[Nov 2, 2013 9:12:40 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

OK, I think I may have a slightly different problem. I asked my son to check out any messages he has received to see if they are similar to the ones I received:

10/31/2013 4:33:49 PM World Community Grid Starting task BETA_BETA_9999987_0629_1 using beta17 version 719
10/31/2013 4:37:12 PM World Community Grid Restarting task BETA_BETA_9999987_0629_1 using beta17 version 719

The restart is continuing forever. The BETA task has been Aborted to free up the machine for additional work. As you can see, the task originally started on Thursday at 4:33 PM and the last message I received was

11/2/2013 5:46:19 PM World Community Grid Restarting task BETA_BETA_9999987_0629_1 using beta17 version 719

I will see how the BETAs are running on the other machines...
[Nov 3, 2013 1:13:50 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

The same thing was occurring on the other two machines that had BETA tasks in progress. All of these tasks have now been aborted.
[Nov 3, 2013 1:37:49 AM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2977
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

Well, what a difference running just 1 WU made as opposed to running 4 concurrently - the WU that I was fortunate to receive as a repair WU had the following efficiency:
BETA_BETA_9999984_0212_4 03:03:25 (03:01:09)

For the record (I've already posted the original times further up, but for completeness), here are the other 4:
BETA_BETA_9999986_0580_0 07:34:10 (03:41:10)
BETA_BETA_9999985_0836_0 08:22:24 (04:31:04)
BETA_BETA_9999985_0828_1 07:38:47 (04:12:15)
BETA_BETA_9999986_0697_1 05:46:34 (03:10:51)
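To put a number on "efficiency", here is a quick sketch that turns the times above into a CPU/elapsed ratio. It assumes the first figure is elapsed time and the figure in parentheses is CPU time; the post does not say which is which, so treat that as my reading. The helper names are made up for illustration.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def efficiency(elapsed, cpu):
    """CPU time as a fraction of elapsed (wall-clock) time."""
    return to_seconds(cpu) / to_seconds(elapsed)


# (task, elapsed, cpu, tasks running concurrently), values copied from this post.
# ASSUMPTION: first time is elapsed, parenthesized time is CPU.
runs = [
    ("BETA_BETA_9999984_0212_4", "03:03:25", "03:01:09", 1),
    ("BETA_BETA_9999986_0580_0", "07:34:10", "03:41:10", 4),
    ("BETA_BETA_9999985_0836_0", "08:22:24", "04:31:04", 4),
    ("BETA_BETA_9999985_0828_1", "07:38:47", "04:12:15", 4),
    ("BETA_BETA_9999986_0697_1", "05:46:34", "03:10:51", 4),
]

for name, elapsed, cpu, concurrent in runs:
    print(f"{name}: {efficiency(elapsed, cpu):.1%} efficient ({concurrent} concurrent)")
```

Under that reading, the single run comes out close to 99% while the four concurrent runs land roughly in the 49% to 55% range.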
----------------------------------------

[Nov 3, 2013 4:49:01 AM]
johncmacalister2010@gmail.com
Veteran Cruncher
Canada
Joined: Nov 16, 2010
Post Count: 799
Status: Offline
Re: New Beta Test starting Oct 31, 2013 [Issues Thread]

Nothing received here, yet.
----------------------------------------


crunching, crunching, crunching.

AMD Ryzen 5 2600 6-core Processor with Windows 11 64 Pro.

AMD Ryzen 7 3700X 8-Core Processor with Windows 11 64 Pro (part time)


[Nov 3, 2013 3:34:36 PM]