| World Community Grid Forums
Thread Status: Active Total posts in this thread: 12
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Restarting task flu10101b0090_100121_0 using flu1 version 604
Task flu10101b0090_100121_0 exited with zero status but no 'finished' file
If this happens repeatedly you may need to reset the project.

flu10101b0090_100121_0 showed status "running" but zero CPU time, and the above message repeated every 2 minutes. Eventually I had to abort it manually after giving it 2 hours to either start properly or exit with an error. The whole time, my second core was at 0% CPU. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
And now another one:
11/05/2009 21:14:04|World Community Grid|Restarting task flu10101h0423_100152_0 using flu1 version 604
11/05/2009 21:14:45|World Community Grid|Task flu10101h0423_100152_0 exited with zero status but no 'finished' file
11/05/2009 21:14:45|World Community Grid|If this happens repeatedly you may need to reset the project.

Same machine:
03/05/2009 11:25:45||Starting BOINC client version 6.2.28 for windows_intelx86
03/05/2009 11:25:45||log flags: task, file_xfer, sched_ops
03/05/2009 11:25:45||Libraries: libcurl/7.19.0 OpenSSL/0.9.8i zlib/1.2.3
03/05/2009 11:25:45||Running as a daemon
03/05/2009 11:25:45||Data directory: C:\Documents and Settings\All Users\Application Data\BOINC
03/05/2009 11:25:45||Running under account boinc_master
03/05/2009 11:25:47||Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz [x86 Family 6 Model 15 Stepping 10]
03/05/2009 11:25:47||Processor features: fpu tsc sse sse2 mmx
03/05/2009 11:25:47||OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 3, (05.01.2600.00)
03/05/2009 11:25:47||Memory: 1.96 GB physical, 3.81 GB virtual
03/05/2009 11:25:47||Disk: 111.78 GB total, 64.40 GB free
03/05/2009 11:25:47||Local time is UTC +1 hours
03/05/2009 11:25:47|World Community Grid|URL: http://www.worldcommunitygrid.org/; Computer ID: 900331; location: (none); project prefs: default
03/05/2009 11:25:47||General prefs: from World Community Grid (last modified 26-Apr-2009 03:13:29)
03/05/2009 11:25:47||Host location: none
03/05/2009 11:25:47||General prefs: using your defaults
03/05/2009 11:25:47||Reading preferences override file
03/05/2009 11:25:47||Preferences limit memory usage when active to 1003.11MB
03/05/2009 11:25:47||Preferences limit memory usage when idle to 1504.67MB
03/05/2009 11:25:47||Preferences limit disk usage to 9.31GB

Anyone else getting this? Any idea what's causing this?
I'll continue to run FLU on this ThinkPad for testing purposes, but until things are more stable, or a cause is identified that rules out recurrence on certain computer models, I'll be removing this project from my production machines, as I cannot afford ANY instability on those company machines. KR, Willem |
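For anyone wanting to check their own message log for the same symptom, here is a minimal Python sketch that counts how often each task exits without a 'finished' file. It assumes the `date time|project|message` line layout seen in the excerpts above; the function name `count_unfinished_exits` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter

# Assumed line layout, copied from the log excerpts in this thread:
#   DD/MM/YYYY HH:MM:SS|project|message
FAILURE = re.compile(
    r"\|Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def count_unfinished_exits(lines):
    """Return a Counter mapping task name -> number of exits without a 'finished' file."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


# Sample lines taken verbatim from the post above.
sample = [
    "11/05/2009 21:14:04|World Community Grid|Restarting task flu10101h0423_100152_0 using flu1 version 604",
    "11/05/2009 21:14:45|World Community Grid|Task flu10101h0423_100152_0 exited with zero status but no 'finished' file",
    "11/05/2009 21:14:45|World Community Grid|If this happens repeatedly you may need to reset the project.",
]

for task, n in count_unfinished_exits(sample).items():
    print(f"{task}: {n} exit(s) without a 'finished' file")
```

A task that shows up here many times in a row is probably stuck in the restart loop described above.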
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
My lips are sealed, but if you restrict BOINC networking in the local prefs to, say, 15-30 minutes a day and set the cache / additional buffer to about 1.5 days, your client will most probably upload and download tasks and results without finishing results and starting new jobs at the same time. My quad has trouble networking and starting jobs simultaneously. All jobs are affected.
You could of course start by resetting the WCG project in the BOINC client, as the warning message suggests, and see if the warnings disappear. I'm curious to hear whether that works. Also, please check your Result Logs for heartbeat warning lines. All that said, make sure your AV software excludes the BOINC data_dir from scanning. The path can be found in the BOINC startup message log. Let us know.
WCG
Please help to make the Forums an enjoyable experience for All! [Edit 1 times, last edit by Sekerob at May 12, 2009 7:07:34 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Thanks for the reply Sekerob.
Plan of attack:
1 - I left my ThinkPad as it is to see if it happens again. No changes made at all.
2 - I removed FLU from all 99 production machines.
3 - I'll use 1 production machine to crunch FLU, and will closely monitor this IBM\Lenovo C2D 6300 over the next 2 weeks to see if it occurs there too.
If it happens again I'll start by resetting the project and checking the results log. So far I'm the only one who has posted this, so I hope it's just my setup causing it. I'll post any updates here. |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
It started on my quad with Flu, which includes a beginning and end benchmark routine (I think all the AutoDock-based science projects have one). Seemingly BOINC transmission impacts this benchmarking detrimentally.
As said, with my mix of WCG projects, it affects all projects, including the new HCMD2, which is why I set it up as described above. On larger farms it might also be easier to manage this way, as you know which time segment to monitor... but 99 devices all cramming their UL/DL into a small time segment of 15-30 minutes is bound to cause bottlenecks unless you have oodles of bandwidth. Anyway, I've notified the techs of the observations. Me alone might be something host-specific; 99 devices affected points at something in the software. Thanks for helping with the testing.
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
With IBM paying for the bandwidth here I can assure you it's not that problem :) Even with 700+ computers connected and in use by employees, speed is still crazy :)
However, I limited the network usage for all clients to the following:
- 50 kb/s down
- 15 kb/s up
This is to prevent any bottlenecks and any impact on performance. To keep natural randomisation I'm not using any scheduling, and connect every 0.1 day with a 1-day buffer. |
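As a rough worked example of what those limits add up to, the Python sketch below does the arithmetic. It assumes "kb/s" means kilobytes per second and takes the client count from the 99 production machines mentioned earlier in the thread; both are assumptions, and the helper names are made up for illustration.

```python
# Figures quoted in the thread. "kb/s" is assumed to mean kilobytes per second,
# and the client count is assumed to be the 99 production machines.
CLIENTS = 99
DOWN_KB_S = 50
UP_KB_S = 15
CONNECT_INTERVAL_DAYS = 0.1


def worst_case_kb_s(clients, per_client_kb_s):
    """Aggregate rate if every client transfers at its cap at the same moment."""
    return clients * per_client_kb_s


def interval_minutes(days):
    """Convert a connect interval given in days to minutes, rounded to 0.1 min."""
    return round(days * 24 * 60, 1)


print("worst-case download:", worst_case_kb_s(CLIENTS, DOWN_KB_S), "KB/s")
print("worst-case upload:  ", worst_case_kb_s(CLIENTS, UP_KB_S), "KB/s")
print("connect interval:   ", interval_minutes(CONNECT_INTERVAL_DAYS), "minutes")
```

Under these assumptions, the worst case is about 4950 KB/s down and 1485 KB/s up, with each client contacting the server roughly every 144 minutes. In practice the unscheduled connects are spread out, so the real aggregate should sit well below that ceiling.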
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
15/05/2009 11:20:27|World Community Grid|Started download of flu10201a0014_100323_wcgrid.00033.gpf.gzb
15/05/2009 11:20:27|World Community Grid|Started download of flu10201a0014_100323_wcgrid.00033.dpf.gzb
15/05/2009 11:20:28|World Community Grid|Finished download of flu10201a0014_100323_wcgrid.00033.gpf.gzb
15/05/2009 11:20:28|World Community Grid|Finished download of flu10201a0014_100323_wcgrid.00033.dpf.gzb
15/05/2009 15:41:47|World Community Grid|Computation for task flu10101m0722_100205_0 finished
15/05/2009 15:41:47|World Community Grid|Starting flu10101o0793_100325_0
15/05/2009 15:41:47|World Community Grid|Starting task flu10101o0793_100325_0 using flu1 version 604
15/05/2009 15:41:49|World Community Grid|Started upload of flu10101m0722_100205_0_0
15/05/2009 15:41:49|World Community Grid|Started upload of flu10101m0722_100205_0_1
15/05/2009 15:41:53|World Community Grid|Finished upload of flu10101m0722_100205_0_0
15/05/2009 15:41:53|World Community Grid|Started upload of flu10101m0722_100205_0_2
15/05/2009 15:41:55|World Community Grid|Finished upload of flu10101m0722_100205_0_2
15/05/2009 15:41:55|World Community Grid|Started upload of flu10101m0722_100205_0_3
15/05/2009 15:41:57|World Community Grid|Finished upload of flu10101m0722_100205_0_3
15/05/2009 15:42:04|World Community Grid|Finished upload of flu10101m0722_100205_0_1
15/05/2009 15:45:38|World Community Grid|Task flu10101o0793_100325_0 exited with zero status but no 'finished' file
15/05/2009 15:45:38|World Community Grid|If this happens repeatedly you may need to reset the project.
15/05/2009 15:45:38|World Community Grid|Restarting task flu10101o0793_100325_0 using flu1 version 604
15/05/2009 15:46:19|World Community Grid|Task flu10101o0793_100325_0 exited with zero status but no 'finished' file
15/05/2009 15:46:19|World Community Grid|If this happens repeatedly you may need to reset the project.

And another one, again on the ThinkPad T61. So far the Lenovo desktop test machine runs fine.
So I'll reset the project on the ThinkPad and see if that solves the problem. I'll keep you updated. |
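One way to check a log like the one above against the suggestion made earlier in the thread (that network transfers overlapping a task start trigger the problem) is sketched below in Python. It flags tasks whose start or restart falls within a few seconds of an upload or download event. The function names and the 30-second window are arbitrary choices for illustration, not anything defined by BOINC.

```python
from datetime import datetime, timedelta

TRANSFER_PREFIXES = ("Started upload", "Finished upload",
                     "Started download", "Finished download")
START_PREFIXES = ("Starting task ", "Restarting task ")


def parse(line):
    """Split 'DD/MM/YYYY HH:MM:SS|project|message' into (timestamp, message)."""
    stamp, _project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), message


def starts_near_transfers(lines, window_seconds=30):
    """Return names of tasks whose (re)start lies within window_seconds of a transfer event."""
    events = [parse(line) for line in lines]
    transfers = [t for t, msg in events if msg.startswith(TRANSFER_PREFIXES)]
    window = timedelta(seconds=window_seconds)
    flagged = []
    for t, msg in events:
        if msg.startswith(START_PREFIXES):
            task = msg.split()[2]  # third word is the task name
            near = any(abs(t - x) <= window for x in transfers)
            if near and task not in flagged:
                flagged.append(task)
    return flagged


# A few lines taken verbatim from the log above.
sample = [
    "15/05/2009 15:41:47|World Community Grid|Starting task flu10101o0793_100325_0 using flu1 version 604",
    "15/05/2009 15:41:49|World Community Grid|Started upload of flu10101m0722_100205_0_0",
    "15/05/2009 15:42:04|World Community Grid|Finished upload of flu10101m0722_100205_0_1",
    "15/05/2009 15:45:38|World Community Grid|Restarting task flu10101o0793_100325_0 using flu1 version 604",
]

print(starts_near_transfers(sample))
```

On the sample, the task start at 15:41:47 sits two seconds before the first upload, so the task is flagged. That is consistent with the suspected overlap but does not prove it is the cause.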
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Willem, it's an identified issue that occurs when a FLU task is starting. Seemingly it is only FLU, combined with a network component in BOINC that is not robust, which delays the science application's progress enough to incur a > 30 second comms interruption, and thus a heartbeat issue. I can replicate it and prevent it (take the client offline and only do scheduled networking, as I described previously).
But a thought has occurred, which I'll touch base on with the techs. ttyl
[Edit 1 times, last edit by Sekerob at May 15, 2009 4:52:22 PM] |
||
|
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3716 Status: Offline |
Willem,
Are these machines using WLAN to access your local network? I have already seen a (not very powerful) PC stall when using a USB WiFi adapter to connect to an ADSL2 box. In that particular case we finally had to use a LAN connection to be able to use the PC. Just wondering... Jean. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"(take the client off-line and only do scheduled networking as what I described previously)."
Yeah, it is possible, since the ThinkPads are laptops that move around a lot from network to network on WLAN. So I'm currently testing different network settings as you described. I saw that "connect every XX day" was set to 0, causing exactly this issue. Scheduling is now in place, so let's see how that works out :) THX |
||
|
|
|