World Community Grid Forums
Category: Beta Testing Forum: Beta Test Support Forum Thread: New Beta Test starting Oct 31, 2013 [Issues Thread] |
Thread Status: Active Total posts in this thread: 211
branjo
Master Cruncher Slovakia Joined: Jun 29, 2012 Post Count: 1892 Status: Offline Project Badges: |
Since I am at work, I can't check the progress, so just remote observations for now:
- Full (down)load: 8 for i7 + 4 for i5.
- 2 already errored: 1 on my i7-3770 Win7 64b 7.2.26 after 2.63 hours, 1 on my i5-2500S Mac OS X 10.9 (Mavericks) 7.0.65 after 2.68 hours. Wingmen still In progress.
- The other 10 In progress.
...
Good luck and cheers

ETA1: 1 on Win Valid (CPU time 1.30 h), 2 on Mac PVal (CPU time 3.57 and 3.48 h)
ETA2: methinks a 10-day deadline for Betas is a bit long
ETA3: the last 2 WUs I caught unfinished when I came back from work were resends (one on Win, the second one on Mac), so both of them errored out. But checkpoints worked fine and the RAM usage was around 250 MB.

Crunching@Home since January 13 2000. Shrubbing@Home since January 5 2006 |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
I've had one Beta WU for nearly a day and I just noticed in the log that it has been restarting itself every ~10 mins or so. Each time it restarts, the "estimated completion time" resets to about 10 hrs. Absolutely no progress has been made. Do I abort or just let it go?

Rich,
What OS are you running, and do you have any security software installed on your computer? If so, can you check whether the BOINC data directory is excluded? It sounds like an outside source is killing the process, which forces it to restart.
Thanks,
-Uplinger |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
Unfortunately there is nothing you can do to fix this. This is a setting from the server. We have limited the result files to 10MB, but as some have reported the result file has grown to 100MB in some cases. The researchers have a list of results that have this issue and will be looking into getting this file size down.
Thanks,
-Uplinger

I don't know if this limit is for the benefit of the users, or WCG. But I have a fast (2 Mbps) upload speed, and could easily do the 100 MB if that will help the science. You could make it user-selectable, like the number of tasks downloaded for CEP2.

It's kind of a double-edged sword. First, not everyone has a good connection, and some members actually pay for transfers. This means some members are limited to, say, 10GB per month. But as you say, there may be a way to let members who are willing send back larger result files. The problem on our end is that we don't have infinite storage. Also, large file transfers would tie up ports that others need to request and send back results. The servers can only handle so many connections at one time.

Another issue with the large result files is that they generally use up LOTS of memory, and the application then writes very large checkpoint files about every 10 minutes.

We are working on a solution that would limit this AND provide the results back to the researchers without putting too much stress on the uploads or on memory usage on the members' machines.

Hope this helps.
Thanks,
-Uplinger |
||
|
Gil II
Senior Cruncher Canada Joined: Dec 6, 2006 Post Count: 368 Status: Offline Project Badges: |
SekeRob
One more for your list. I've seen this issue in other comments. Tasks restart every few minutes, as in:
01/11/2013 11:28:28 AM World Community Grid Restarting task BETA_BETA_9999986_0563_4 using beta17 version 719
01/11/2013 11:28:38 AM World Community Grid Restarting task BETA_BETA_9999987_0516_4 using beta17 version 719
01/11/2013 11:31:35 AM World Community Grid Restarting task BETA_BETA_9999986_0563_4 using beta17 version 719
01/11/2013 11:31:45 AM World Community Grid Restarting task BETA_BETA_9999987_0516_4 using beta17 version 719
I have aborted 8 jobs in total with this problem.

1) Output file too large (Error -131)
2) Maximum Disk Use Exceeded (disk_bound overstepped)
3) Memory model exceeded (memory_bound overstepped)
4) Loss of -large- portions of CPU time at time of reporting, which looks to happen at end.
5) Progress % erratic (e.g. it can jump from 0.5% to 50% only at the end of the 1st pass when there are only 2 passes)
6) Related to 5), checkpoints at times multiple hours apart... not good for part-time crunchers.
7) Jobs seem stuck in memory at times [when seemingly no more progress is made]... won't unload, even when "Leave application in memory when suspended" is off. Full client restart required to get them to unload.
8) Some tasks freeze on the CPU time use when running [is it the display, or does the CPU time in Task Manager indicate no CPU time use?], while elapsed time keeps accumulating and progress % goes backward. Users of BOINC Manager won't see this easily; to users of BOINCTasks it's obvious since both Elapsed and CPU time are shown.
9) Running 4 concurrently (i.e., using all available cores) appears to be very inefficient.
10) Tasks restart every few minutes |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
1) Output file too large (Error -131)
2) Maximum Disk Use Exceeded (disk_bound overstepped)
3) Memory model exceeded (memory_bound overstepped)

These 3 are pretty much associated with the same problem. The result output is growing larger than it needs to. The researchers are working to fine-tune the filtering that is needed within the work units. Also, as a fail-safe, we are working on a way to detect this mid work unit and exit gracefully so that the work done is returned to the researchers and they can evaluate how to proceed with these monster result files.

4) Loss of -large- portions of CPU time at time of reporting, which looks to happen at end.

We are looking into the checkpointing issue. We believe we have a fix, but it will need to be tested in the next round.

5) Progress % erratic (e.g. it can jump from 0.5% to 50% only at the end of the 1st pass when there are only 2 passes)
6) Related to 5), checkpoints at times multiple hours apart... not good for part-time crunchers.

5 and 6 are similar to 4 in that we have a potential fix and will be testing it next round.

7) Jobs seem stuck in memory at times [when seemingly no more progress is made]... won't unload, even when "Leave application in memory when suspended" is off. Full client restart required to get them to unload.

This is something I need more information on. Is this an issue with Windows only? What flavor? (ex. Windows 8 32bit or Windows Vista 64bit) I have not been able to recreate it on my machines, but that could be because I'm looking at the wrong OS.

8) Some tasks freeze on the CPU time use when running [is it the display, or does the CPU time in Task Manager indicate no CPU time use?], while elapsed time keeps accumulating and progress % goes backward. Users of BOINC Manager won't see this easily; to users of BOINCTasks it's obvious since both Elapsed and CPU time are shown.

I believe some of this might be due to the large results and checkpoints; we are still investigating it.

Wish list: Printing of OS and CPU details in Result Log.

Yes, we are thinking of adding this information to the result status page rather than to the result log, since that would not require us to recompile the older applications on WCG to support this change.
Thanks,
-Uplinger |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
If you are encountering a restart issue with your result, please let us know a few things.
1. What OS are you running? (ex. Windows 8 64 bit)
2. Do you have security software on your computer? One report from Gil was McAfee.
3. Check your security software to see if you can exclude either this application or the BOINC data directory. On Windows this is usually C:/ProgramData/BOINC/.

All of the errors up to this point on the restart issue are on Windows machines. My assumption at this time is that, since it's a new application, the security software on your machine is killing the process and BOINC is trying to restart it. Once this kill and restart happens too many times, an error is reported to the server.
Thanks,
-Uplinger |
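For anyone who wants to check their own event log for this kill-and-restart pattern before reporting, here is a minimal sketch in Python. It assumes log lines shaped like the "Restarting task ... using ..." messages pasted in this thread; the function names and the `min_restarts` threshold are made up for illustration and are not part of BOINC.

```python
import re
from collections import Counter

# Matches messages such as:
#   "... World Community Grid Restarting task BETA_BETA_9999986_0563_4 using beta17 version 719"
RESTART_RE = re.compile(r"Restarting task (\S+) using")


def count_restarts(log_lines):
    """Return a Counter mapping task name -> number of 'Restarting task' messages."""
    counts = Counter()
    for line in log_lines:
        match = RESTART_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_tasks(log_lines, min_restarts=2):
    """List tasks restarted at least min_restarts times (the threshold is an assumption)."""
    counts = count_restarts(log_lines)
    return sorted(task for task, n in counts.items() if n >= min_restarts)


# Sample lines copied from posts in this thread.
SAMPLE_LOG = [
    "01/11/2013 11:28:28 AM World Community Grid Restarting task BETA_BETA_9999986_0563_4 using beta17 version 719",
    "01/11/2013 11:28:38 AM World Community Grid Restarting task BETA_BETA_9999987_0516_4 using beta17 version 719",
    "01/11/2013 11:31:35 AM World Community Grid Restarting task BETA_BETA_9999986_0563_4 using beta17 version 719",
    "01/11/2013 11:31:45 AM World Community Grid Restarting task BETA_BETA_9999987_0516_4 using beta17 version 719",
    "10/31/2013 4:24:55 PM World Community Grid Computation for task BETA_BETA_9999985_0055_1 finished",
]

if __name__ == "__main__":
    for task in suspicious_tasks(SAMPLE_LOG):
        print(task, count_restarts(SAMPLE_LOG)[task])
```

A task that shows up here many times with no progress in between is a candidate for the security-software exclusion check described above; a single restart after a suspend or reboot is normal.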
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
9) Running 4 concurrently (i.e., using all available cores) appears to be very inefficient.

I believe this is due to really large files needing to be checkpointed, which brings it back to the solution we are working on to detect the memory usage and exit based on that.

10) Tasks restart every few minutes

Please see my post above, requesting more information.
Thanks,
-Uplinger |
||
|
Gil II
Senior Cruncher Canada Joined: Dec 6, 2006 Post Count: 368 Status: Offline Project Badges: |
Restart issue additional info: I am running Windows 7 |
||
|
pramo
Veteran Cruncher USA Joined: Dec 14, 2005 Post Count: 703 Status: Offline Project Badges: |
If you are encountering a restart issue with your result, please let us know a few things.
1. What OS are you running? (ex. Windows 8 64 bit)
2. Do you have security software on your computer? One report from Gil was McAfee.
3. Check your security software to see if you can exclude either this application or the BOINC data directory. On Windows this is usually C:/ProgramData/BOINC/.
All of the errors up to this point on the restart issue are on Windows machines. My assumption at this time is that, since it's a new application, the security software on your machine is killing the process and BOINC is trying to restart it. Once this kill and restart happens too many times, an error is reported to the server.
Thanks,
-Uplinger

These few results are from various OSes, all with Symantec Endpoint Protection V12.1. The same package is pushed to each machine (well, one for 32 bit and one for 64 bit). I can't modify the settings, but the logs show no issues. This was the only w/u I had with restart issues.

My aborted task that was restarting (XP 32 bit):
11/1/2013 4:54:12 AM World Community Grid Restarting task BETA_BETA_9999984_0541_0 using beta17 version 719
11/1/2013 4:57:18 AM World Community Grid Restarting task BETA_BETA_9999984_0541_0 using beta17 version 719

Another XP 32 bit machine didn't have the restart problem:
10/31/2013 4:24:55 PM World Community Grid Computation for task BETA_BETA_9999985_0055_1 finished
10/31/2013 4:24:55 PM World Community Grid Output file BETA_BETA_9999985_0055_1_0 for task BETA_BETA_9999985_0055_1 exceeds size limit.

There were a few on Win7 64bit and Server 2008 R2, no restarts.

A few valid:
10/31/2013 8:09 World Community Grid Computation for task BETA_BETA_9999987_0131_1 finished
10/31/2013 8:11 World Community Grid Started upload of BETA_BETA_9999987_0131_1_0
10/31/2013 8:11 World Community Grid Finished upload of BETA_BETA_9999987_0131_1_0

Some errors:
10/31/2013 8:56 World Community Grid Computation for task BETA_BETA_9999984_0548_1 finished
10/31/2013 8:56 World Community Grid Output file BETA_BETA_9999984_0548_1_0 for task BETA_BETA_9999984_0548_1 exceeds size limit.
10/31/2013 8:56 World Community Grid File size: 12270504.000000 bytes. Limit: 10485760.000000 bytes |
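To see how far over the limit a result like the one above went, a small Python sketch can pull the two numbers out of the "File size: ... Limit: ..." message. It assumes the message format is exactly as pasted in this post; the helper names are hypothetical and not part of BOINC.

```python
import re

# Matches the client message quoted above, e.g.
#   "File size: 12270504.000000 bytes. Limit: 10485760.000000 bytes"
SIZE_RE = re.compile(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes")


def parse_size_limit(line):
    """Return (size_bytes, limit_bytes) as ints, or None if the line does not match."""
    match = SIZE_RE.search(line)
    if match is None:
        return None
    return int(float(match.group(1))), int(float(match.group(2)))


def overage_bytes(line):
    """Return how many bytes the output file exceeds the limit by (0 if within it)."""
    parsed = parse_size_limit(line)
    if parsed is None:
        return None
    size, limit = parsed
    return max(0, size - limit)


# Line copied from the log excerpt in this post.
SAMPLE_LINE = (
    "10/31/2013 8:56 World Community Grid File size: 12270504.000000 bytes. "
    "Limit: 10485760.000000 bytes"
)

if __name__ == "__main__":
    size, limit = parse_size_limit(SAMPLE_LINE)
    print(f"size={size} limit={limit} over by {overage_bytes(SAMPLE_LINE)} bytes")
```

The limit in that message, 10485760 bytes, is 10 × 1024 × 1024, which matches the 10MB server-side cap mentioned earlier in the thread.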
||
|
Thargor
Veteran Cruncher UK Joined: Feb 3, 2012 Post Count: 1291 Status: Offline Project Badges: |
If you are encountering a restart issue with your result, please let us know a few things.
1. What OS are you running? (ex. Windows 8 64 bit)
2. Do you have security software on your computer? One report from Gil was McAfee.
3. Check your security software to see if you can exclude either this application or the BOINC data directory. On Windows this is usually C:/ProgramData/BOINC/.
All of the errors up to this point on the restart issue are on Windows machines. My assumption at this time is that, since it's a new application, the security software on your machine is killing the process and BOINC is trying to restart it. Once this kill and restart happens too many times, an error is reported to the server.
Thanks,
-Uplinger

1. Windows 7 Home Premium 64-bit
2. Yes, Spybot S'n'D & ESET NOD32 (64-bit) A/V
3. Not at home, but can give this a try when I get in...

In the meantime, here's an excerpt from the HUGE error log attached to the WU which finally failed on my Windows box at home:
--- Running---
I'm assuming you can view the full log from the WU name listed below? If not, let me know where I can c&p the full log... |
||
|
|