Thread Status: Active
Total posts in this thread: 20
This topic has been viewed 3629 times and has 19 replies
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: work unit hangs

Another suggestion to speed up the test and improve your management experience: the benchmark runs automatically on a timer every 5 days (120 hours). With a management tool like BOINCView you can force the benchmarks to run simultaneously on all machines, so that if a hang is going to happen, it happens on all your affected machines at the same time.

The beauty of BOINCView (BV) is that you can effectively monitor all your computers from one spot. Set it to refresh the view, say, every 5 minutes, and its colour coding (green, yellow, red) tells you which tasks are in what state. The colour coding is adjustable. BV lets you select multiple devices in a single view and send an action to all of them, and it shows per-process CPU efficiency at your fingertips. The only prerequisite is one Windows machine to set it up on; from there you get complete remote control of all hosts, whereas the BOINCmgr GUI can handle only one at a time.

And thanks for donating your farm's spare time to WCG.

ciao
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jul 31, 2008 5:01:10 PM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Status: Offline
Re: work unit hangs

Hi Sekerob,

Thanks for directing me to BoincView. It's taken a while to set it up across the majority of my servers but now that's done it's an excellent central view. I've taken your advice and performed a benchmark to synchronise them all, that's great. One work unit did hang when I triggered the benchmark.

I've noticed that when a work unit is processing properly there are three processes; however, when the benchmark causes a failure there are only two. E.g.

$ ps -ef | grep 181155
service 4992 3349 94 01:59 pts/1 08:28:33 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2
- strace shows no activity
service 4993 4992 0 01:59 pts/1 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2
- strace shows 2 second poll timeouts


After killing the hung work unit processes:-

$ ps -ef | grep 181155
service 6329 3349 0 11:08 pts/1 00:00:51 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2
service 6330 6329 0 11:08 pts/1 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2
service 6331 6330 0 11:08 pts/1 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049560829200505181155.jp2


It's the grand-child that died.
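
The two-versus-three process observation above can be turned into a quick check. This is only a sketch: the function names `count_wu_procs` and `check_wu` are made up for illustration, and the threshold of three processes is taken from the listings in this post, so it is an assumption for any other application or version.

```shell
#!/bin/sh
# Count the processes belonging to one work unit.
# Reads `ps -ef` output on stdin; $1 is a unique fragment of the work unit name.
count_wu_procs() {
    # Keep only wcg_ science processes, then count the lines naming this work unit.
    grep 'wcg_' | grep -c -- "$1"
}

# Print "ok N" when at least $2 processes (default 3) exist, else "suspect N".
# The default of 3 is an assumption based on the listings in this thread.
check_wu() {
    expected=${2:-3}
    n=$(count_wu_procs "$1")
    if [ "$n" -ge "$expected" ]; then
        echo "ok $n"
    else
        echo "suspect $n"
    fi
}
```

Usage would be along the lines of `ps -ef | check_wu 181155`. A "suspect" result is only a hint; a work unit that has just started may not have spawned all of its processes yet.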

The machine this happened on was running at 95%. Yesterday one of the machines that I switched to 100% had a hang. BoincView highlighted it in yellow immediately and the CPU efficiency dropped to 0. So that was really easy to manage. I'd restarted boinc on that machine at 11:04, the work unit hung at 14:04 and a benchmark occurred at 15:00. So I think that demonstrates that the issue isn't caused by a clash of the benchmark and a work unit running at less than 100%.

Cheers,
Mark
[Aug 5, 2008 12:15:58 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: work unit hangs

Glad you like it Mark,

A final thought... with BV you can, as you've done, select all or any devices to perform a joint action. Next time you force a benchmark, first run an Activity Computing Suspend (Hosts, the Running Man icon drop-down, Suspend), then run the benchmark, and then take them out of suspend again.

Are you using the Leave in Memory option? I'd be interested to know, as from my early days with BOINC I remember an issue which forced me to use that option so as not to lose job progress.

Is it a Linux issue or does it occur on Windows too and was it a specific science? If it can be nailed to only happen to HCC that would give an idea it's not BOINC itself. It's curious that you mention 94-95% for the second or 3rd time when the job hangs. Wonder if that is a specific moment in the finishing cycle of HCC.
[Aug 5, 2008 12:37:38 PM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Status: Offline
Re: work unit hangs

Hi Sekerob,

I'll test a few benchmarks across the farm and see if I can reproduce the work unit hang. I'll try both whilst running and whilst suspended. I've already done a few benchmarks, but don't have the time now to keep testing.

I'm not sure about the "Leave in Memory" option. Where do I check that? It's not something I'm aware of having configured.

I've only experienced the hangs on the Linux clients, and due to the wide variety of Linux platforms and BOINC builds there may be problems there. I've only witnessed it happen with the HCC work units. I try to only do HCC work, although some of our machines have consistently produced invalid HCC units, e.g. an IA64 platform, and thus it does RICE units instead (successfully).

I have misled you with my mention of 95%. That's the CPU usage ratio that I've got set, i.e. "Use at most 95% CPU time". I did a fresh 6.2.14 Windows client install yesterday and that value was in the GUI client. I've found the work unit hangs to occur at stages between 37% and 93%, and possibly others. I've not observed any pattern there.

Loving BoincView, thank you.

Cheers,
Mark
[Aug 6, 2008 10:12:10 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: work unit hangs

Okay, thanks. Referring back to armstrdj's tech note, testing at 100% throttle would be of great interest.

"Leave In Memory" (officially "Leave applications in memory while preempted") is by default set to "No". It helps clients to restart a science quicker, not having to fetch it from disk. I understood that the behaviour had changed between BOINC 5.4 and 5.8 and that benchmarking was one of the activities to always keep the science in memory to prevent checkpoint reverting.
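
On the "where do I check that?" question, one way to look on a Linux client is to read the preference files in the BOINC data directory. The sketch below rests on assumptions that should be verified against your own install: that the setting is stored as a `leave_apps_in_memory` element, and that it lives in `global_prefs.xml` or `global_prefs_override.xml`. The function name `leave_in_memory` is made up for illustration.

```shell
#!/bin/sh
# Print the "leave applications in memory" setting found in a BOINC
# preferences file given as $1. Prints nothing if the element is absent.
# Assumption: the element is called leave_apps_in_memory.
leave_in_memory() {
    if grep -q '<leave_apps_in_memory/>' "$1"; then
        # Presence-flag form, in case the client writes the setting that way.
        echo 1
    else
        # Value form: <leave_apps_in_memory>0</leave_apps_in_memory>
        sed -n 's:.*<leave_apps_in_memory>\([^<]*\)</leave_apps_in_memory>.*:\1:p' "$1"
    fi
}
```

For example, `leave_in_memory global_prefs.xml` run from the BOINC data directory. No output would suggest the client is using the default, which as noted above is "No".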

Much appreciate the time you invest in researching this pain.
[Aug 6, 2008 10:37:05 AM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Status: Offline
Re: work unit hangs

I can confirm that the work unit hangs still occur when the CPU throttle is set to 100%. It has even occurred when I suspended processing, ran the benchmark, and then resumed processing.

I have one particular machine that seems to be more susceptible than some of the others. It's running Red Hat RHEL AS v3 update 4. It has two 2.00GHz Intel Xeon processors. The tasks are HCC v603. It has an older version of the boinc manager software (5.10.21). I'd put a newer version on, but I've had build problems in the past getting newer BOINC managers built on older Linux platforms.

I've had tasks hang at 43%, 94% and 97% on recent testing. When it happens only two of the three wcg_ processes are left running. I have to kill those two to cause the boinc manager to restart the task from the latest checkpoint.
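
The manual recovery described above (kill the leftover processes so the task restarts from its latest checkpoint) could be scripted roughly as follows. This is a sketch, not a tested tool: `wu_pids`, `kill_wu` and the `DO_KILL` switch are names invented here, and it runs as a dry run unless told otherwise.

```shell
#!/bin/sh
# Print the PIDs (second column of `ps -ef`) of the wcg_ processes for one
# work unit. Reads `ps -ef` output on stdin; $1 is a fragment of the name.
wu_pids() {
    awk -v wu="$1" '/wcg_/ && index($0, wu) { print $2 }'
}

# Dry run by default: prints the kill command that would be issued.
# Set DO_KILL=1 (hypothetical switch) to actually send the signal.
kill_wu() {
    pids=$(wu_pids "$1" | xargs)
    if [ -z "$pids" ]; then
        echo "no processes found for $1"
        return 1
    fi
    if [ "${DO_KILL:-0}" = "1" ]; then
        kill $pids
    else
        echo "kill $pids"
    fi
}
```

Running `ps -ef | kill_wu 181155` first shows what would be killed; check the PIDs before setting `DO_KILL=1` and running it again.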

If this is a BOINC manager software problem then I'll happily upgrade if someone can help me.

Cheers,
Mark
[Aug 15, 2008 1:21:51 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: work unit hangs

Mark,

This has always been happening with HPF2 both under the old/discontinued UD agent and on BOINC, whatever version.

If you are not particularly "preferentially" attached to HPF2, I suggest creating an additional device profile and deselecting this project. Then attach the devices that have shown the hangs to that new profile.

Let us know.

Rob
[Aug 15, 2008 1:40:24 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: work unit hangs

I have had a number of jobs hang on my Linux box. I have found that stopping and restarting boinc nearly always gets the job to continue. I think I will make a cron job to do that every day and see if that fixes the problem.
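
A daily restart of that kind might look like the sketch below. How BOINC is started and stopped differs between installs, so `/etc/init.d/boinc` is only a placeholder default, and the script name, the `run` argument and the 04:00 schedule are arbitrary choices made for illustration.

```shell
#!/bin/sh
# restart-boinc.sh: stop and then start the BOINC client.
# BOINC_INIT is a placeholder; point it at whatever starts BOINC on your box.
BOINC_INIT=${BOINC_INIT:-/etc/init.d/boinc}

restart_boinc() {
    # Pause briefly between stop and start so the client can exit cleanly.
    "$BOINC_INIT" stop && sleep "${RESTART_PAUSE:-5}" && "$BOINC_INIT" start
}

# Only act when called with "run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    restart_boinc
fi

# Example crontab entry (add with `crontab -e`), daily at 04:00:
#   0 4 * * * /path/to/restart-boinc.sh run
```

Work since the last checkpoint is lost on each restart, so a once-a-day schedule trades a little progress for not having to watch for hangs.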
[Aug 15, 2008 1:51:54 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: work unit hangs

Somewhere on this forum there was someone doing that. Trouble is, with these hangs the CPU use looks normal and the elapsed time keeps counting in BOINCmgr; CPU use stays at a constant 100% even when throttled, while the % progress is frozen.
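
Since CPU use alone does not give this kind of hang away, a watchdog would have to compare the progress figure over time instead. A minimal sketch, assuming you can produce two snapshot files with one "task_name fraction_done" pair per line (how they are produced is left open, and the name `stalled_tasks` is made up):

```shell
#!/bin/sh
# Print the names of tasks whose progress did not advance between two snapshots.
# $1 = older snapshot file, $2 = newer snapshot file.
# Each line of a snapshot: <task_name> <fraction_done>   (assumed format)
stalled_tasks() {
    awk 'NR == FNR { old[$1] = $2; next }                       # remember old progress
         ($1 in old) && ($2 + 0 <= old[$1] + 0) { print $1 }    # no advance since then
        ' "$1" "$2"
}
```

The snapshots need to be taken far enough apart that a healthy task is sure to have moved its progress figure; otherwise slow but working tasks are reported as stalled.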
----------------------------------------
[Edit 1 times, last edit by Sekerob at Aug 15, 2008 2:14:57 PM]
[Aug 15, 2008 2:13:54 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: work unit hangs

In most of the hangs I have had the CPU utilization dropped to 50% and the job just sat there. I did have one that sucked a full CPU and the time to completion didn't decline.
[Aug 15, 2008 9:20:13 PM]