World Community Grid Forums
Category: Beta Testing | Forum: Beta Test Support Forum | Thread: New Beta Test starting Oct 31, 2013 [Issues Thread]
Thread Status: Active Total posts in this thread: 211
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
Quote: I had 2:10 units that errored out because the output file was too big (error code -131); I believe both should have completed successfully. I understand betas are meant to find problems, but my concern is whether this is really indicative of a WCG procedural problem (i.e. nothing to do with the research units themselves). We had similar issues with the HCC project timing units out at the end, as I believe WCG got its calculations wrong (from memory, all were close (< 5%?) to completing OK). The HCC units (and these) all errored out very close to completing normally (or completed OK and then reported errors because they didn't comply with the estimate). I would personally prefer less reliance on accurate estimates of file size and run time (i.e. a larger safety margin), rather than risk wasting our efforts.

Have you read this thread? I don't think it is a small error in their size estimates.
||
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3715 Status: Offline Project Badges: |
Quote: I don't think it is a small error in their size estimates.

Sure. Since it is usually difficult to see the size of good result files, we don't know exactly by how much the current 10-MB limit is supposed to exceed the expected size of correct result files. If we except CEP2, result files are usually about 0.1 MB or less and, if the same holds for this beta, 10 MB is already far too much.
----------------------------------------
[Edit 1 times, last edit by JmBoullier at Nov 2, 2013 12:07:39 AM]
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quote: Another issue with the large result files is that they generally use up LOTS of memory, and the application then writes very large checkpoint files about every 10 minutes. We are working on a solution that would limit this AND provide the results back to the researchers without putting too much stress on the uploads or on memory usage on members' machines.

Out of curiosity, what about the project causes it to behave this way? I've had 2 WUs that performed over 2000 passes. One finished fine and the other triggered the file size error at over 10x the maximum. It's interesting that some output files of the same application are more than an order of magnitude larger than others, depending on the data set.
||
|
RichSavarie
Cruncher Canada Joined: Aug 9, 2005 Post Count: 49 Status: Offline Project Badges: |
I'm aborting the Beta WU. Been restarting itself for almost two days. No point in letting it go on. Hopefully I get another unit to try. It's not often I get to do any beta processing. Oh well.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I got 6 of the Betas on 2 different machines:
Machine that got 2 - both error 131
Machine that got 4 - 2 x error 131, 1 x valid and 1 x PV
So 4 out of 6 with 131 - xfer errors like this...
Run complete, CPU time: 8703.625000
22:01:18 (3624): called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>BETA_BETA_9999987_0770_1_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>
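For anyone collecting these failures across several machines, here is a minimal sketch of pulling the file name and error code out of a `<file_xfer_error>` element like the one quoted above. It assumes the element has been copied out on its own (the surrounding `</stderr_txt>` and `<message>` tags are not balanced in the pasted text), and the helper name `read_xfer_error` is made up for illustration, not part of BOINC.

```python
import xml.etree.ElementTree as ET

# The <file_xfer_error> element from the post, isolated so it is well-formed XML.
FRAGMENT = """<file_xfer_error>
  <file_name>BETA_BETA_9999987_0770_1_0</file_name>
  <error_code>-131</error_code>
</file_xfer_error>"""


def read_xfer_error(xml_text):
    """Return (file_name, error_code) from one <file_xfer_error> element."""
    root = ET.fromstring(xml_text)
    file_name = root.findtext("file_name")
    error_code = int(root.findtext("error_code"))
    return file_name, error_code


file_name, error_code = read_xfer_error(FRAGMENT)
print(f"{file_name}: upload failed with error code {error_code}")
```

Counting how often each error code appears per machine would then be a short loop over the saved fragments.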
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1316 Status: Offline Project Badges: |
Memory usage above the limit with a task I had not touched:
http://www.worldcommunitygrid.org/ms/device/v...og.do?resultId=1665805900
It crashed after the full 100% run, having exceeded the size limit, with an elapsed time of 6h20m and 6h of CPU time used. Just before the end I noticed memory usage of 550MB, and checkpointo.bin had grown to 210.452kB.
02 Nov 07:53:34 Output file BETA_BETA_9999986_0997_4_0 for task BETA_BETA_9999986_0997_4 exceeds size limit.
02 Nov 07:53:34 File size: 113241326.000000 bytes. Limit: 10485760.000000 bytes
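To put the two numbers in that log message side by side, here is a small sketch that parses the "File size ... Limit ..." line and reports how far over the limit the output file was. The regular expression is an assumption based only on the line pasted above, not on a documented BOINC log format, and `parse_size_and_limit` is a hypothetical helper name.

```python
import re

# Log text copied from the post; the pattern below is an assumption about
# its shape, not a documented BOINC message format.
LOG_LINE = "File size: 113241326.000000 bytes. Limit: 10485760.000000 bytes"

MIB = 1024 * 1024


def parse_size_and_limit(line):
    """Return (size_bytes, limit_bytes) from a 'File size ... Limit ...' line."""
    match = re.search(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes", line)
    if match is None:
        raise ValueError("line does not look like a size-limit message")
    return float(match.group(1)), float(match.group(2))


size_bytes, limit_bytes = parse_size_and_limit(LOG_LINE)
ratio = size_bytes / limit_bytes
print(f"output: {size_bytes / MIB:.1f} MiB, limit: {limit_bytes / MIB:.1f} MiB, ratio: {ratio:.1f}x")
```

The limit works out to exactly 10 MiB and the file to a bit under 11 times that, which fits the "over 10x the maximum" reported earlier in the thread.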
||
|
mali vuk
Advanced Cruncher Slovenia Joined: Apr 27, 2007 Post Count: 138 Status: Offline Project Badges: |
Looks like betas are over?
---------------------------------------- |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Yes and no. Yes, the initial batch was distributed in 1.5 hours on 31 October. No, some resends are still appearing. Also no, there will have to be further batches of beta tests on this new application before it can be placed in production.
|
||
|
mali vuk
Advanced Cruncher Slovenia Joined: Apr 27, 2007 Post Count: 138 Status: Offline Project Badges: |
Tony, thx.
---------------------------------------- |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
1) Output file too large (Error -131)
2) Maximum Disk Use Exceeded (disk_bound overstepped)
3) Memory model exceeded (memory_bound overstepped)

These 3 are associated pretty much with the same problem. The result output is growing larger than it needs to. The researchers are working to fine tune the filtering that is needed within the work units. Also, as a fail safe, we are working on a way to detect this mid work unit and exit gracefully, so that the work done is returned to the researchers and they can evaluate how to proceed with these monster result files.

4) Loss of -large- portions of CPU time at time of reporting, which looks to happen at end.

We are looking into the checkpointing issue. We believe we have a fix, but it'll need to be tested in the next round.

5) Progress % erratic (e.g. it can happen that it goes from 0.5% to 50% only at the end of the 1st pass when there are only 2 passes)
6) Related to 5), checkpoints at times multiple hours apart... not good for part time crunchers.

5 and 6 are similar to 4 in that we have a potential fix and will be testing it next round.

7) Jobs seem stuck in memory at times [when seemingly no more progress is made]... won't unload, even when "Leave application in memory when suspended" is off. Full client restart required to get them to unload.

This is something I need more information on. Is this an issue with Windows only? What flavor? (e.g. Windows 8 32bit or Windows Vista 64bit) I have not been able to recreate it on my machines, but that could be because I'm looking at the wrong OS.

8) Some tasks freeze on the CPU time use when running [is it the display, or is it that the CPU time in Task Manager indicates no CPU time use?], while elapsed time keeps accumulating and progress % goes backward. Users of BOINC manager won't see this easily; to users of BOINCTasks it's obvious, since both Elapsed and CPU time are shown.

I believe some of this might be due to the large results and checkpoints; we are still investigating it.

Wish list: Printing of OS and CPU details in Result Log.

Yes, we are thinking of adding this information to the result status page, not in the result log, as that would not require us to recompile the older applications on WCG to support this change.

Thanks,
-Uplinger

On 7), Was Windows7 64+32 with a newer test client 7.2.23+18 respectively. Both have a private partition > 10GB. Only know this since for the whole test cycle I did not visit the Linux box, which had 4 errors, all with -131 [which, if it were understood how to restrain output, probably would have been valid results], noting that the 5th on that Ubu 131.0 box did get a valid, BUT it went all the way to 5774 passes, where, if there are too many, it essentially only prints the end part of the log... starting at about pass 3761. The other good news on the wingman is that the log showed exactly the same, meaning a 100% reproducing result.

BETA_ BETA_ 9999984_ 0176_ 1-- 719 Valid 10/31/13 06:04:29 11/1/13 14:16:37 3.78 80.4 / 66.3
BETA_ BETA_ 9999984_ 0176_ 0-- 719 Valid 10/31/13 06:04:22 10/31/13 11:31:28 3.91 52.2 / 66.3

Don't know on which platform the original development took place, but 50:50 card is on Linux. So, as noted, taking off the LAIM option, suspending the stuck tasks did not unload them [confirmed in Task Manager]. Since I had CEP2 and FAHV running on the side and suspending them, to get the Beta to start again, they did unload... not a client issue. Doing a BOINC service stop [if it matters, not a user level install] effectuated the unloads. Think the latest clients have mechanisms to kill all BOINC related processes, even if zombied, but since the tasks after suspend did not use/count Elapsed or CPU time, conclude neither were orphaned.

On the wish listed log info additions... apart from the recompile consideration, new sciences going forward would have allowed the feature to be added to the coming CEP2 and Beta17 [i.e. with New -to be launched- applications, no urgency seeing for the solid established projects]. Based on various timetables, you're saying... maybe 2014. The Monkees song has long changed its title to 'I'm [not] a belieber', but if the OS/CPU info is added to the quorum detail sub page to the Result Page, that adds a convenience [single page overview], and probably adds a little fetch load prior to opening the logs. Pasting the log for helpers then does not in a natural way present the system info... The original thought I had, in an 'exceed the customers expectation' [Not an IBM credo?]: do both.

Et Al, for the envious, got 3 from 'Detached' clients... 1 PV since wingman had not reported. 2 went error -131. Efficiency... eat your heart out: 99,75% and greater, on the side with 5 CEP2 only who do > 99%+, all told 8 concurrent tasks on that device. I'd definitely say... low system load, even with oversized checkpoint log files. Avast AV, private partition scan exclusion plus a special mask that lets new science app versions through without a need of manual approval [otherwise you have to be there when first arriving to catch them when trapped]. MS Defender as secondary filter, and Linux-NAS firewall [in the router].
edit: Before I forget, the last dozen entries from the BOINCTasks history, which logs the total job efficiency [the value just before "Reported:"], with zero complaints on my part:

WCG 6.40 cep2 E216672_049_J.32.C24H14N2S4Si2.00002553.4.set1d06_0 07:41:22 (07:36:36) 02-11-2013 09:51 02-11-2013 10:01 98.967 Reported: OK W7-64 374.32 MB 223.85 MB
WCG 7.19 beta17 BETA_BETA_9999985_0574_4 05:58:46 (05:58:13) 02-11-2013 09:16 02-11-2013 09:18 99.847 Reported: Computation error (0,) W7-64 294.15 MB 275.42 MB
WCG 6.40 cep2 E216666_202_J.32.C20H8N4O2S5Se.00050097.1.set1d06_1 07:10:21 (07:06:20) 02-11-2013 09:02 02-11-2013 09:11 99.067 Reported: OK W7-64 374.27 MB 212.16 MB
WCG 7.19 beta17 BETA_BETA_9999988_0239_3 05:44:04 (05:43:12) 02-11-2013 09:01 02-11-2013 09:08 99.748 Reported: Computation error (0,) W7-64 240.99 MB 202.60 MB
WCG 7.19 beta17 BETA_BETA_9999988_0238_2 05:20:34 (05:20:00) 02-11-2013 08:38 02-11-2013 08:38 99.823 Reported: OK W7-64 108.67 MB 70.56 MB
WCG 6.40 cep2 E216665_336_J.33.C22H9N5O2S4.00031245.1.set1d06_0 06:37:44 (06:34:03) 02-11-2013 07:30 02-11-2013 07:39 99.074 Reported: OK W7-64 384.66 MB 219.50 MB
WCG 6.40 cep2 E216665_738_J.33.C22H10N6O2S3.00084644.0.set1d06_0 06:33:17 (06:29:42) 02-11-2013 07:04 02-11-2013 07:12 99.089 Reported: OK * W7-64 371.93 MB 216.15 MB
WCG 6.40 cep2 E216432_404_I.39.C35H20N2O2.00030250.2.set1d06_2 12:04:25 (12:00:00) 02-11-2013 02:27 02-11-2013 02:38 99.390 Reported: OK * W7-64 465.44 MB 295.22 MB
WCG 6.40 cep2 E216663_574_J.32.C20H7N7S4Se.00001096.0.set1d06_0 05:46:25 (05:43:16) 02-11-2013 02:10 02-11-2013 02:19 99.091 Reported: OK W7-64 364.44 MB 211.13 MB
WCG 6.40 cep2 E216653_751_J.27.C20H10N2S4Se.00023967.3.set1d06_0 10:25:42 (10:21:50) 02-11-2013 01:52 02-11-2013 01:59 99.382 Reported: OK * W7-64 361.05 MB 250.04 MB
WCG 6.40 cep2 E216648_238_I.46.C28F9H11N4O5.00185614.0.set1d06_0 12:04:58 (12:00:00) 02-11-2013 00:52 02-11-2013 00:58 99.315 Reported: OK * W7-64 264.64 MB 132.80 MB
WCG 6.40 cep2 E216662_162_J.32.C20H8N4O2S6.00086154.4.set1d06_1 04:57:35 (04:54:55) 02-11-2013 00:30 02-11-2013 00:38 99.104 Reported: OK * W7-64 369.95 MB 205.00 MB

[Edit 2 times, last edit by Former Member at Nov 2, 2013 1:53:29 PM]
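The post does not say how the efficiency figure is computed. Assuming it is simply CPU time (the value in parentheses) divided by elapsed time, the sketch below reproduces the logged percentages for two of the history entries. The function names `to_seconds` and `efficiency_percent` are made up for this illustration.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def efficiency_percent(elapsed, cpu):
    """CPU time as a percentage of elapsed (wall-clock) time.

    Assumption: this is the formula behind the logged efficiency value.
    """
    return 100.0 * to_seconds(cpu) / to_seconds(elapsed)


# (task name, elapsed, CPU time) copied from two of the history entries.
rows = [
    ("E216672_049_J.32.C24H14N2S4Si2.00002553.4.set1d06_0", "07:41:22", "07:36:36"),
    ("BETA_BETA_9999985_0574_4", "05:58:46", "05:58:13"),
]

for name, elapsed, cpu in rows:
    print(f"{name}: {efficiency_percent(elapsed, cpu):.3f}%")
```

Both results agree to three decimals with the 98.967 and 99.847 shown in the history, so the assumed formula at least fits those entries.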
||
|
|