Total posts in this thread: 20
This topic has been viewed 3626 times and has 19 replies
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
work unit hangs

Hi,

I run a number of different Linux distributions on multi processor machines. My problem is that I notice on a weekly basis that individual work unit processes will hang. I've just gone through the machines and found three hanging work units on three different machines. I'd like to get rid of this manual process.

Here's an example on one machine. I look to see if any work units are over 24 hours old, since that machine processes units in about 3-4 hours.

$ ps -ef|grep wcg_|grep `date '+%b'`
service 4134 780 4 Jul26 pts/0 03:06:43 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049261055200505251647.jp2
service 4135 4134 0 Jul26 pts/0 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049261055200505251647.jp2
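
The date-based grep above depends on how ps prints the start time. A minimal sketch of an alternative, assuming a procps `ps` that supports the `etimes` (elapsed seconds) output field, which older distributions may lack. The function names `filter_old_wcg` and `list_old_wcg` are hypothetical:

```shell
#!/bin/sh
# Print the PIDs of wcg_ processes whose elapsed time exceeds a limit.
# Reads "PID ELAPSED_SECONDS COMMAND..." lines on stdin, so the filter
# can be tried on canned input before pointing it at a live machine.
filter_old_wcg() {
    max_secs=$1
    awk -v max="$max_secs" '$3 ~ /wcg_/ && $2 + 0 > max + 0 { print $1 }'
}

# List long-running wcg_ processes on this machine (default: 24 hours).
# Assumption: this ps accepts the "etimes" field; the "=" suffixes
# suppress the header line.
list_old_wcg() {
    ps -eo pid=,etimes=,args= | filter_old_wcg "${1:-86400}"
}
```

Running `list_old_wcg 86400` would print every wcg_ PID older than 24 hours. Both the worker and its child match the filter, so the lower-numbered parent PID is the one to act on.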

I then kill the hung process (e.g. kill 4134), and the work unit restarts and continues successfully. There's very little to go on in the log file.

$ grep X0000049261055200505251647 ../log.txt
2008-07-26 09:08:38 [World Community Grid] [file_xfer] Started download of file X0000049261055200505251647_X0000049261055200505251647.jp2
2008-07-26 09:08:40 [World Community Grid] [file_xfer] Finished download of file X0000049261055200505251647_X0000049261055200505251647.jp2
2008-07-26 14:54:37 [World Community Grid] Starting X0000049261055200505251647_1
2008-07-26 14:54:37 [World Community Grid] Starting task X0000049261055200505251647_1 using hcc1 version 603
2008-07-29 09:22:06 [World Community Grid] Task X0000049261055200505251647_1 exited with zero status but no 'finished' file
2008-07-29 09:22:06 [World Community Grid] Restarting task X0000049261055200505251647_1 using hcc1 version 603


The 09:22 entry is the result of my sending the kill signal.


I'm not running the latest client because I had build issues. This particular machine is running Debian Etch, but others are running old Linux distros.


Here's the top of the log file

2008-06-25 14:11:25 [---] Starting BOINC client version 5.10.21 for i686-pc-linux-gnu
2008-06-25 14:11:25 [---] log flags: task, file_xfer, sched_ops
2008-06-25 14:11:25 [---] Libraries: libcurl/7.16.0 OpenSSL/0.9.8d zlib/1.2.3
2008-06-25 14:11:25 [---] Data directory: /home/service/boinc/BOINC
2008-06-25 14:11:25 [---] Processor: 8 GenuineIntel Intel(R) Xeon(R) CPU X5355 @ 2.66GHz [Family 6 Model 15 Stepping 7]
2008-06-25 14:11:25 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
2008-06-25 14:11:25 [---] OS: Linux: 2.6.18-6-amd64
2008-06-25 14:11:25 [---] Memory: 13.70 GB physical, 5.00 GB virtual
2008-06-25 14:11:25 [---] Disk: 19.69 GB total, 9.06 GB free
2008-06-25 14:11:25 [---] Local time is UTC +0 hours
2008-06-25 14:11:25 [---] Already attached to http://www.worldcommunitygrid.org/



When I next have one of these processes, should I gather any particular evidence and paste it here (for instance a debug stack or an strace)?

Overall I'm returning about 200 units per day. So I'm not saying this is a common problem but I'd like to resolve it, since when it happens one of the processors will sit idle until I kill the hung work unit.

Cheers,
Mark
[Jul 29, 2008 8:51:32 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: work unit hangs


Here's an example on one machine. I look to see if any work units are over 24 hours old, since that machine processes units in about 3-4 hours.

$ ps -ef|grep wcg_|grep `date '+%b'`
service 4134 780 4 Jul26 pts/0 03:06:43 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049261055200505251647.jp2
service 4135 4134 0 Jul26 pts/0 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000049261055200505251647.jp2


Hi Mark,

<get out clause>
It is a long time since I used Debian or Linux so my reading of the ps output may be totally wrong!
</get out clause>

Looking at the output, I am not sure that it is a hung process. It shows 03:06:43 of accumulated run time for process 4134. If you state that a WU normally takes 3-4 hours, then this could still be running.

Could you not use boincmgr instead to see whether something is still running? Remember that BOINC automatically manages the order of the buffered work through its scheduler function so that every task can meet its assigned deadline. If you have been sent a rush job, it will preempt all other running jobs and cause them to pause.

Also if you have real users or applications on your box, it could be that there is something else that is using up the core time and WCG is not getting a look in.
[Jul 29, 2008 9:53:58 AM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Re: work unit hangs

Hi Stuart,

Thanks for your reply. The ps output shows that those jobs were started on Jul 26th and I ran the ps this morning, Jul 29th, and so they've not finished in 3 days, instead of the usual 3 hours.

I didn't mention it, but another sure sign I use to establish whether all BOINC processes are working is uptime's load average and top. Both showed that only 7 wcg processes were actively doing anything, whereas there should be 8 (2 quad-core processors).

I will check how boincmgr represents those processes when one next occurs.

This particular system does almost nothing else but boinc. No users. No apps.

Cheers,
Mark
[Jul 29, 2008 12:20:32 PM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Re: work unit hangs

Hi,

I have a workunit that is hung in this manner right now. It's reported by boincmgr as 94.801% complete (10 mins to run). The other 7 have moved on in the last hour, this one hasn't.

$ cd /home/service/boinc/BOINC/slots/1
$ ls -lrt
total 84
-rw-r--r-- 1 service service 123 2008-07-29 07:28 X0000048921338200505121737.jp2
-rw-r--r-- 1 service service 101 2008-07-29 07:28 wcg_hcc1_img_6.03_i686-pc-linux-gnu
-rw-r--r-- 1 service service 87 2008-07-29 07:28 UHN_stacked2.tga
-rw-r--r-- 1 service service 87 2008-07-29 07:28 pbibmablk.tga
-rw-r--r-- 1 service service 3958 2008-07-29 07:28 init_data.xml
-rw-r--r-- 1 service service 96 2008-07-29 07:28 img_out.h5
-rw-r--r-- 1 service service 87 2008-07-29 07:28 HCC_LOGO.tga
-rw-r--r-- 1 service service 110 2008-07-29 07:28 graphics_app
-rw-r--r-- 1 service service 87 2008-07-29 07:28 boinc_wcg_skin_w-logo.tga
-rw-rw-rw- 1 service service 2481376 2008-07-29 07:28 boinc_hcc1_1
-rw-r--r-- 1 service service 14 2008-07-29 11:20 wcg_hcc.state
-rw-r--r-- 1 service service 126 2008-07-29 11:20 wcg_checkpoint.dat
-rw-r--r-- 1 service service 14 2008-07-29 11:20 wcg_checkpoint_03.ckp
-rw-r--r-- 1 service service 3900 2008-07-29 11:20 wcg_checkpoint_02.ckp
-rw-r--r-- 1 service service 3900 2008-07-29 11:20 wcg_checkpoint_01.ckp
-rw-r--r-- 1 service service 3900 2008-07-29 11:20 wcg_checkpoint_00.ckp
-rw-r--r-- 1 service service 1406 2008-07-29 11:20 stderr.txt
-rw-r--r-- 1 service service 3900 2008-07-29 11:20 cp2.raw
-rw-r--r-- 1 service service 3900 2008-07-29 11:20 cp1.raw
-rw-r--r-- 1 service service 3900 2008-07-29 11:20 cp0.raw


You can see that none of the checkpoint and status files have updated since 11:20 yesterday. I see that all the other units have updated status files within the last 15 minutes.
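
The stale-timestamp check described here could be scripted. A rough sketch, assuming each running slot directory holds a `wcg_checkpoint.dat` as in the listing above, and a `find` that supports `-mmin` (GNU and BSD find do). `stale_slots` is a hypothetical helper name:

```shell
#!/bin/sh
# Print every slot directory whose wcg_checkpoint.dat has not been
# modified for more than MAX_AGE_MIN minutes (default: 60).
stale_slots() {
    slots_dir=$1
    max_age_min=${2:-60}
    for f in "$slots_dir"/*/wcg_checkpoint.dat; do
        # Skip the unexpanded pattern when no slot has a checkpoint file.
        [ -e "$f" ] || continue
        # find prints the path only when the file is older than the limit.
        if [ -n "$(find "$f" -mmin +"$max_age_min")" ]; then
            dirname "$f"
        fi
    done
}
```

On this machine the call would be `stale_slots /home/service/boinc/BOINC/slots 60`. The 60-minute default is only a guess based on the other units updating within 15 minutes.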

There's nothing obvious to me in the boinc messages at around that time, except perhaps that it downloaded a new unit and did a benchmark:-

2008-07-29 11:17:07 [World Community Grid] Computation for task X0000048921188200505121739_0 finished
2008-07-29 11:17:07 [World Community Grid] Starting X0000048920826200505050855_1
2008-07-29 11:17:07 [World Community Grid] Starting task X0000048920826200505050855_1 using hcc1 version 603
2008-07-29 11:17:09 [World Community Grid] [file_xfer] Started upload of file X0000048921188200505121739_0_0
2008-07-29 11:17:16 [World Community Grid] [file_xfer] Finished upload of file X0000048921188200505121739_0_0
2008-07-29 11:17:16 [World Community Grid] [file_xfer] Throughput 35352 bytes/sec
2008-07-29 11:24:39 [---] Running CPU benchmarks
2008-07-29 11:24:39 [---] Suspending computation - running CPU benchmarks
2008-07-29 11:25:11 [---] Benchmark results:
2008-07-29 11:25:11 [---] Number of CPUs: 8
2008-07-29 11:25:11 [---] 2015 floating point MIPS (Whetstone) per CPU
2008-07-29 11:25:11 [---] 4468 integer MIPS (Dhrystone) per CPU
2008-07-29 11:25:12 [---] Resuming computation
2008-07-29 11:40:41 [World Community Grid] Computation for task X0000048921506200505121734_0 finished
2008-07-29 11:40:41 [World Community Grid] Starting X0000048920926200505050853_0
2008-07-29 11:40:41 [World Community Grid] Starting task X0000048920926200505050853_0 using hcc1 version 603
2008-07-29 11:40:43 [World Community Grid] [file_xfer] Started upload of file X0000048921506200505121734_0_0
2008-07-29 11:40:50 [World Community Grid] [file_xfer] Finished upload of file X0000048921506200505121734_0_0
2008-07-29 11:40:50 [World Community Grid] [file_xfer] Throughput 18236 bytes/sec


Here are the hung work unit's processes:

$ ps -ef|grep wcg_ | grep Jul
service 30861 1294 11 Jul29 pts/0 03:16:21 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000048921338200505121737.jp2
service 30862 30861 0 Jul29 pts/0 00:00:00 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000048921338200505121737.jp2


There are no symbols in the executable, but here are the backtraces from gdb:

$ gdb -p 30861
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Attaching to process 30861
Reading symbols from /home/service/boinc/BOINC/projects/www.worldcommunitygrid.org/wcg_hcc1_img_6.03_i686-pc-linux-gnu...(no debugging symbols found)...done.
Using host libthread_db library "/lib/libthread_db.so.1".

warning: shared library handler failed to enable breakpoint

warning: Lowest section in system-supplied DSO at 0xffffe000 is .hash at ffffe0b4
(no debugging symbols found)
0x0823709b in ?? ()
(gdb) bt
#0 0x0823709b in ?? ()
#1 0xffe39120 in ?? ()
#2 0xffe391a8 in ?? ()
#3 0x082367e9 in ?? ()
#4 0xffe39120 in ?? ()
#5 0x00000020 in ?? ()
#6 0xffe39120 in ?? ()
#7 0xffe38f1c in ?? ()
#8 0x00002000 in ?? ()
#9 0x00000000 in ?? ()
(gdb) quit
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /home/service/boinc/BOINC/projects/www.worldcommunitygrid.org/wcg_hcc1_img_6.03_i686-pc-linux-gnu, process 30861



service@kuckoo4:~/boinc/BOINC/slots/1$ gdb -p 30862
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Attaching to process 30862

warning: process 30862 is a cloned process
Reading symbols from /home/service/boinc/BOINC/projects/www.worldcommunitygrid.org/wcg_hcc1_img_6.03_i686-pc-linux-gnu...(no debugging symbols found)...done.
Using host libthread_db library "/lib/libthread_db.so.1".

warning: shared library handler failed to enable breakpoint

warning: Lowest section in system-supplied DSO at 0xffffe000 is .hash at ffffe0b4
(no debugging symbols found)
0x0826eb0d in ?? ()
(gdb) bt
#0 0x0826eb0d in ?? ()
#1 0x08235554 in ?? ()
#2 0x083a1840 in ?? ()
#3 0x00000001 in ?? ()
#4 0x000007d0 in ?? ()
#5 0x0000788c in ?? ()
#6 0x00000000 in ?? ()
(gdb) quit


An strace of the processes shows no activity in the main worker process and a repeated poll call in the child.


$ strace -p 30861
Process 30861 attached - interrupt to quit
Process 30861 detached
$ strace -p 30862
Process 30862 attached - interrupt to quit
[ Process PID=30862 runs in 32 bit mode. ]
getppid() = 30861
poll([{fd=6, events=POLLIN}], 1, 2000) = 0
getppid() = 30861
poll([{fd=6, events=POLLIN}], 1, 2000) = 0
getppid() = 30861
poll( <unfinished ...>
Process 30862 detached


fd 6 is a pipe linking the two processes.
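
On Linux, one way to confirm what a descriptor refers to is the `/proc` filesystem. A small sketch, assuming `/proc` is mounted and `readlink` is available. `fd_target` is a hypothetical helper:

```shell
#!/bin/sh
# Print what file descriptor FD of process PID points at.
# A pipe shows up as "pipe:[inode]"; descriptors in two processes that
# report the same inode number are ends of the same pipe.
fd_target() {
    readlink "/proc/$1/fd/$2"
}
```

Here that would be `fd_target 30862 6`, run as the process owner or root, then compared against the descriptors of 30861.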



The stderr.txt contains the following

$ cat stderr.txt
In ExtractGlcmFeatures: End of 0 iteration of outer loop.
In ExtractGlcmFeatures: End of 1 iteration of outer loop.
In ExtractGlcmFeatures: End of 2 iteration of outer loop.
In ExtractGlcmFeatures: End of 3 iteration of outer loop.
In ExtractGlcmFeatures: End of 4 iteration of outer loop.
In ExtractGlcmFeatures: End of 5 iteration of outer loop.
In ExtractGlcmFeatures: End of 6 iteration of outer loop.
In ExtractGlcmFeatures: End of 7 iteration of outer loop.
In ExtractGlcmFeatures: End of 8 iteration of outer loop.
In ExtractGlcmFeatures: End of 9 iteration of outer loop.
In ExtractGlcmFeatures: End of 10 iteration of outer loop.
In ExtractGlcmFeatures: End of 11 iteration of outer loop.
In ExtractGlcmFeatures: End of 12 iteration of outer loop.
In ExtractGlcmFeatures: End of 13 iteration of outer loop.
In ExtractGlcmFeatures: End of 14 iteration of outer loop.
In ExtractGlcmFeatures: End of 15 iteration of outer loop.
In ExtractGlcmFeatures: End of 16 iteration of outer loop.
In ExtractGlcmFeatures: End of 17 iteration of outer loop.
In ExtractGlcmFeatures: End of 18 iteration of outer loop.
In ExtractGlcmFeatures: End of 19 iteration of outer loop.
In ExtractGlcmFeatures: End of 20 iteration of outer loop.
In ExtractGlcmFeatures: End of 21 iteration of outer loop.
In ExtractGlcmFeatures: End of 22 iteration of outer loop.
In ExtractGlcmFeatures: End of 23 iteration of outer loop.


This looks very similar to the stderr.txt of the working processes, except that most of them haven't reached iteration 23 yet.


Can anyone help please? I'll keep this process around for any further diagnostics.

Cheers,
Mark
[Jul 30, 2008 12:22:42 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Re: work unit hangs

Mark,

Do you run all the WCG projects or just HCC? Have you seen this occur with any of the other WCG projects on this machine? Do you use the BOINC CPU throttle? It is OK to try manually suspending and resuming that task to get it going again. When this happens again, check whether the task was suspended and resumed shortly before it hung, as it was this time for the CPU benchmarks.
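
One way to do that check is to list the suspend events from the client log and compare them with the last update time of the checkpoint files. A minimal sketch, assuming the log format shown earlier in this thread (a 19-character timestamp at the start of each line). `suspend_events` is a hypothetical helper:

```shell
#!/bin/sh
# Print the timestamps of "Suspending computation" entries in a BOINC
# client log. The optional second argument is a date prefix such as
# 2008-07-29 that restricts the output to one day.
suspend_events() {
    awk -v day="${2:-}" '
        /Suspending computation/ && (day == "" || index($0, day) == 1) {
            print substr($0, 1, 19)
        }
    ' "$1"
}
```

For the log quoted above, `suspend_events log.txt 2008-07-29` should list the 11:24:39 benchmark suspension, a few minutes after the 11:20 checkpoint timestamps.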

Thanks,
armstrdj
[Jul 30, 2008 2:29:07 PM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Re: work unit hangs

Hi armstrdj,

Mostly I only run HCC. I have run the other projects and not noticed hangs with them. The only exception is an Itanium IA64 machine that produces invalid results for everything except the RICE project. My goal is to contribute to HCC though.

I use the 95% CPU throttle (default config?)

The task is still hung at 94.801%. Its status in boincmgr was "Running", so I clicked "Suspend" on it. It went to "Task suspended", and immediately one of the queued tasks started running. So I suspended all of the queued tasks and clicked "Resume" on my hung one. It's now showing as "Waiting to run". It's been like that for 20 minutes.

In my earlier posts I've included every mention of that work unit in the log file, plus the log messages from around the time that the work unit's status files were last updated. That does show "2008-07-29 11:24:39 [---] Suspending computation - running CPU benchmarks". The work unit was last updated at 11:20. However, it doesn't resume, unlike the 7 other work units that were running at that time.

How do I get in contact with the authors of the HCC work unit code?

Cheers,
Mark
[Jul 31, 2008 8:24:50 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: work unit hangs

The way to do this, a mostly proven method for HPF2, is:

1. Suspend Network Connection (In case a client is not only attached to WCG)
2. Suspend in the Projects tab all projects, the last to suspend being WCG
3. Wait a minute
4. Reverse the 3, 2, 1 steps

That usually causes the hung job to return to last checkpoint and continue.

I always check in the process manager (Task Manager in Windows) whether CPU time is being consumed.

Switch the 95% throttle to 100%. The 95% setting is close to meaningless, as it makes a job run for 95/100 of the time in whole seconds, so it will probably run for about 19 seconds and pause for 1 second. It has been reported in the past that the BOINC throttle caused jobs to hang or break.

It's very interesting that the failure to resume happened right when the benchmark ran. The techs sometimes speak of race conditions. The throttle could be the culprit: the throttle pause may have kicked in during the same second that the benchmark started.

WCG is the intermediary for communicating bugs to the scientists. The forum is the best place, but if you wish to go off-line, use the "Contact Us" link on e.g. the WCG home page.

PS: Regarding the throttle: for Windows I use, and thus obviously recommend, the third-party freeware ThreadMaster plus the companion ThreadMaster GUI (see the Start Here FAQ). The latter will install the former and provides very smooth throttling on all Windows versions from W2K through W2K8, including Vista. The BOINC implementation is crude; hopefully a refined, working one will be offered one day.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 2 times, last edit by Sekerob at Jul 31, 2008 8:56:58 AM]
[Jul 31, 2008 8:45:35 AM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Re: work unit hangs

Hi Sekerob,

The machine that I've got this hung work unit on at the moment is not one that I can sever or suspend the network connection to. However, I followed steps 2, 3 and 4 of your plan. No success. I waited for some units to complete, and the hung unit remained in state "Waiting to run" despite there being an idle CPU. The hung process hasn't consumed any significant CPU time since it last updated its state files a couple of days ago.

When I see this happen again I'll be sure to check the logs to find out if it coincides with the benchmark runs.

I'm not completely sure I like the idea of no throttling at all. I like the idea that, small though it is, the cpu gets brief pauses which perhaps give it a slight respite from the heat generated from 100% cpu load. Given that the default setting is 95% I feel more comfortable with that.

Cheers,
Mark
[Jul 31, 2008 11:55:24 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: work unit hangs

Hi Mark,

Default 95%? I'm unaware of any WCG preset profile to use that value and just checked on a clean test ID. The Maximum Output profile is the only one using 100%,

Use no more than: 100 % of processor time

where the others use substantially lower percentages (60%). See the FAQ for what the presets are:

http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=17033

It was a suggested test to see whether the hang would recur with HCC.

ciao

PS: I honestly didn't know how the client responds to 95%; I just followed the logic. I have now tested that value, and BOINC effectively rounds it down: it runs for 9 seconds and pauses for 1, so in practice it is 90%.
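
The arithmetic behind the two figures in this exchange, as a small sketch. `duty_cycle_pct` is a hypothetical helper that only computes the ratio; it says nothing about how the BOINC client actually schedules its pauses:

```shell
#!/bin/sh
# Effective CPU share, in whole percent, for a cycle of RUN_S seconds
# of work followed by PAUSE_S seconds of pause.
duty_cycle_pct() {
    run_s=$1
    pause_s=$2
    echo $(( 100 * run_s / (run_s + pause_s) ))
}

duty_cycle_pct 19 1   # the expected cycle: prints 95
duty_cycle_pct 9 1    # the observed cycle: prints 90
```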
[Jul 31, 2008 12:10:46 PM]
marky1124
Cruncher
Joined: Jan 10, 2005
Post Count: 29
Re: work unit hangs

Hi Sekerob,

My apologies if I'm wrong about the 95%. I may have picked that up from the older United Devices profiles or something. I've no idea; it's lost in the mists of time for me.

I will try setting one of the machines to 100% and try to figure out whether that prevents it occurring. I see the issue most weeks, but across a farm of about 40 machines, so I'm not sure how long I'll need to run before I can conclude that setting it to 100% fixed it.

As I was writing this I noticed that one of the other machines (2 processors) had a hung process. I tried all the same suspend/resume ideas. No joy. I checked the log, and its hang coincided with a suspend/benchmark. It was 87% complete when it hung. I've now set that machine from 95% to 100%. I'll keep track of things and see if your suggestion is correct.

Cheers,
Mark
[Jul 31, 2008 2:51:50 PM]