World Community Grid Forums
Thread Status: Active Total posts in this thread: 20
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Another suggestion to speed up the test and improve your management experience: the benchmark runs automatically on the clock every 5 days (120 hours). With a management tool like BOINCView you can force the benchmarks to run simultaneously on all machines, so that if the hang is going to happen, it will happen on all your affected machines at the same time.
The beauty of BOINCView (BV) is that you can effectively monitor all your computers from one spot. Set it to refresh the view, say, every 5 minutes, and its colour coding (green, yellow, red) tells you which tasks are in what state. The colour coding is adjustable. BV lets you select multiple devices in a single view and send an action to all of them, and it shows per-process CPU efficiency at your fingertips. The only prerequisite is one Windows machine to set it up on; from there you get complete remote control, which the BOINCmgr GUI can only do for one machine at a time. And thanks for donating your farm's spare time to WCG. ciao
WCG
Please help to make the Forums an enjoyable experience for All! |
marky1124
Cruncher Joined: Jan 10, 2005 Post Count: 29 Status: Offline
Hi Sekerob,
Thanks for directing me to BoincView. It's taken a while to set it up across the majority of my servers, but now that's done it's an excellent central view. I've taken your advice and performed a benchmark to synchronise them all, which is great. One work unit did hang when I triggered the benchmark. I've noticed that when a work unit is processing properly there are three processes, but when the benchmark causes a failure there are only two. E.g.

$ ps -ef | grep 181155
service 4992 3349 94 01:59 pts/1 08:28:33 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2 - strace shows no activity
service 4993 4992 0 01:59 pts/1 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2 - strace shows 2 second poll timeouts

After killing the hung work unit processes:

$ ps -ef | grep 181155
service 6329 3349 0 11:08 pts/1 00:00:51 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2
service 6330 6329 0 11:08 pts/1 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2
service 6331 6330 0 11:08 pts/1 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2

It's the grand-child that died.

The machine this happened on was running at 95%. Yesterday one of the machines that I switched to 100% had a hang. BoincView highlighted it in yellow immediately and the CPU efficiency dropped to 0, so that was really easy to manage. I'd restarted boinc on that machine at 11:04, the work unit hung at 14:04 and a benchmark occurred at 15:00. So I think that demonstrates that the issue isn't caused by a clash of the benchmark and a work unit running at less than 100%.

Cheers, Mark
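The process-count symptom in this post could be checked automatically. Below is a minimal Python sketch, not a tested tool: it assumes that three `wcg_` processes per work unit is the healthy state (taken only from the observation above), that the work unit file name is the first argument of the command, and that the `ps -ef` columns are laid out as in the listing above. The function names are hypothetical.

```python
import os
from collections import defaultdict

# Assumption from the post above: a healthy HCC work unit shows three processes.
EXPECTED_PROCESSES = 3


def count_processes_by_work_unit(ps_output, app_prefix="wcg_"):
    """Count `ps -ef` lines per work unit file for science-app processes.

    Expected columns: UID PID PPID C STIME TTY TIME CMD [ARGS...]
    """
    counts = defaultdict(int)
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # header line, or a command without arguments
        command = os.path.basename(fields[7])
        if not command.startswith(app_prefix):
            continue  # not a WCG science application
        counts[fields[8]] += 1  # first argument is taken to be the work unit file
    return dict(counts)


def find_suspect_work_units(ps_output, expected=EXPECTED_PROCESSES):
    """Return work unit names that have fewer processes than expected."""
    counts = count_processes_by_work_unit(ps_output)
    return sorted(unit for unit, n in counts.items() if n < expected)
```

To try it, pass in the text of `ps -ef`, for example `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`. A task that has only just started may briefly show fewer processes, so a unit is better treated as suspect only if it is flagged on two consecutive checks.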
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Glad you like it, Mark.
A final thought: with BV you can, as you've done, select all or any devices to perform a joint action. Next time you force a benchmark, first run an Activity Computing Suspend (Hosts, The Running Man icon drop-down, Suspend), then run the benchmark, and then take them out of suspend again. Are you using the Leave in Memory option? I'd be interested to know, as from my early days with BOINC I remember an issue that forced me to use that option so as not to lose job progress. Is it a Linux issue or does it occur on Windows too, and was it a specific science? If it can be nailed down to happening only with HCC, that would suggest it's not BOINC itself. It's curious that you mention 94-95% for the second or third time when the job hangs. I wonder if that is a specific moment in the finishing cycle of HCC.
marky1124
Cruncher Joined: Jan 10, 2005 Post Count: 29 Status: Offline
Hi Sekerob,
I'll test a few benchmarks across the farm and see if I can reproduce the work unit hang. I'll try both whilst running and whilst suspended. I've already done a few benchmarks, but don't have the time now to keep testing. I'm not sure about the "Leave in Memory" option. Where do I check that? It's not something I'm aware of having configured. I've only experienced the hangs on the Linux clients, and due to the wide variety of Linux platforms and BOINC builds there may be problems there. I've only witnessed it happen with the HCC work units. I try to do only HCC work, although some of our machines have consistently produced invalid HCC units, e.g. an IA64 platform, which therefore does RICE units instead (successfully). I have misled you with my mention of 95%. That's the CPU usage limit that I've got set, i.e. "Use at most 95% CPU time". I did a fresh 6.2.14 Windows client install yesterday and that value was in the GUI client. I've found the work unit hangs to occur at stages between 37% and 93%, and possibly others. I've not observed any pattern there. Loving BoincView, thank you. Cheers, Mark
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Okay, thanks. Referring back to armstrdj's tech note, I'd be very interested in a test at 100% throttle.
"Leave In Memory" (officially "Leave applications in memory while preempted") is set to "No" by default. It helps clients restart a science quicker, since it does not have to be fetched from disk. I understood that the behaviour changed between BOINC 5.4 and 5.8, and that benchmarking became one of the activities during which the science is always kept in memory, to prevent reverting to the last checkpoint. I much appreciate the time you're investing in researching this pain.
marky1124
Cruncher Joined: Jan 10, 2005 Post Count: 29 Status: Offline
I can confirm that the work unit hangs still occur when the CPU throttle is set to 100%. It has even occurred when I suspended processing, ran the benchmark, and then resumed processing.
I have one particular machine that seems to be more susceptible than some of the others. It's running Red Hat RHEL AS v3 update 4 and has two 2.00 GHz Intel Xeon processors. The tasks are HCC v603. It has an older version of the BOINC manager software (5.10.21). I'd put a newer version on, but I've had problems in the past getting newer BOINC managers to build on older Linux platforms. I've had tasks hang at 43%, 94% and 97% in recent testing. When it happens, only two of the three wcg_ processes are left running. I have to kill those two to make the BOINC manager restart the task from the latest checkpoint. If this is a BOINC manager software problem then I'll happily upgrade if someone can help me. Cheers, Mark
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Mark,
This has always been happening with HPF2, both under the old, discontinued UD agent and on BOINC, whatever the version. If you are not particularly "preferentially" attached to HPF2, I suggest creating an additional device profile and deselecting this project. Then attach the devices that have shown the hangs to that new profile. Let us know. Rob
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I have had a number of jobs hang on my Linux box. I have found that stopping and restarting boinc nearly always gets the job to continue. I think I will make a cron job to do that every day and see if that fixes the problem.
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Somewhere on this forum there was someone doing that. The trouble is that during a hang the CPU use looks normal: the time keeps counting in BOINCmgr and CPU use stays constant at 100%, even when throttled, while the % progress is frozen.
[Edit 1 times, last edit by Sekerob at Aug 15, 2008 2:14:57 PM]
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
In most of the hangs I have had the CPU utilization dropped to 50% and the job just sat there. I did have one that sucked a full CPU and the time to completion didn't decline.
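Both hang variants reported in this thread (progress frozen while the CPU stays busy, and progress frozen while CPU use drops) could be told apart from periodic samples of a task. The following Python sketch is only an illustration under stated assumptions: the sample tuples, the `diagnose` function and the 30-minute threshold are hypothetical, and collecting the samples (for instance by polling the BOINC client) is left out.

```python
from typing import List, Tuple

# Hypothetical sample layout: (wall_clock_s, cpu_time_s, fraction_done)
Sample = Tuple[float, float, float]


def diagnose(samples: List[Sample], min_frozen_s: float = 1800.0) -> str:
    """Classify a task from samples ordered oldest first.

    Returns "ok", "frozen-busy" (progress stuck, CPU time still growing)
    or "frozen-idle" (progress stuck, CPU time not growing).
    The 30-minute default threshold is an assumption, not a BOINC setting.
    """
    if len(samples) < 2:
        return "ok"  # not enough history to judge

    # Walk back to the earliest sample of the trailing run with unchanged progress.
    last_fraction = samples[-1][2]
    start = len(samples) - 1
    while start > 0 and samples[start - 1][2] == last_fraction:
        start -= 1

    frozen_wall = samples[-1][0] - samples[start][0]
    if frozen_wall < min_frozen_s:
        return "ok"  # progress moved recently enough

    cpu_growth = samples[-1][1] - samples[start][1]
    return "frozen-busy" if cpu_growth > 0 else "frozen-idle"
```

A scheduled restart could then be limited to machines where a task is actually reported as frozen, rather than restarting boinc every day regardless of state.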