Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
WUs that lock up by themselves, all projects on thin clients

I'm having a strange problem with my farm: every now and then a WU stops processing, or does not start after a task switch. Typically 7 out of 16 machines hit this problem during a night.

The machines are 1.6 GHz Pentium M systems with 256 MB of memory. All 16 machines boot LTSP over the network and keep their work on an NFS share.

For example:
WU: dddt0101a0216_ZINC05185479-0000_02_01
stops after 05:10:47, which is at 59.566%,
on device ws214: http://www.worldcommunitygrid.org/ms/device/v...d=346669&deviceType=B


BOINC Manager indicates this task is running, but there is no progress; BoincView indicates a CPU efficiency of 0.000%.

The only way to reactivate such a task is to suspend it with BOINC Manager, then find the task on the Linux shell:
-- result of ps -ax --

32040 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32041 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32042 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32043 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32044 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp


Now send a kill -9 32040, then resume the task from BOINC Manager, and it will continue processing.
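
A rough stall detector along these lines could flag such tasks automatically. A sketch only, assuming the wcg_* process names shown above; it merely reports, it does not kill anything:

#!/bin/sh
# Sample the utime+stime counters of all wcg_* science apps from /proc
# twice; a PID whose CPU jiffies do not advance in 60 s is the
# "running but 0.000% efficiency" case described above.
sample() {
    for stat in /proc/[0-9]*/stat; do
        set -- $(cat "$stat" 2>/dev/null)
        case $2 in
            "(wcg_"*) echo "$1 $((${14} + ${15}))" ;;   # pid utime+stime
        esac
    done | sort
}
sample > /tmp/wcg.1
sleep 60
sample > /tmp/wcg.2
# Lines identical in both samples are PIDs that burned no CPU time.
comm -12 /tmp/wcg.1 /tmp/wcg.2 | while read pid jiffies; do
    echo "possibly stuck PID $pid"
done

Once a PID is confirmed stuck, the suspend / kill -9 / resume sequence above brings the task back.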

This problem does not occur with SIMAP or SETI work units on the same machines. It occurs at random. If there are multiple projects on the same machine, it can happen that a WCG unit remains in "Waiting to run" while it is actually still in memory, and therefore cannot be started again.

Is this just a problem on my machines, or is it a more general problem? There is a gap between the work I should be doing for the projects and the work actually done, because some systems hang idle for 24 hours before I notice there is a problem. On Monday I found 9 out of 16 machines stalled due to this type of error.
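
To close that gap, the client's command-line tool could poll each box from the server. A sketch only: boinc_cmd and --get_results are the tool/command of this client generation (later clients call them boinccmd and --get_tasks), the host list is made up, and the exact output field names may vary by version:

#!/bin/sh
# Poll each thin client's task list over BOINC's GUI RPC so a stalled
# box is noticed within minutes rather than after 24 idle hours.
# Hypothetical host list; extend it to all 16 workstations.
for ws in ws201 ws202 ws203 ws214; do
    echo "== $ws =="
    # --passwd must match the gui_rpc_auth.cfg on the client
    boinc_cmd --host "$ws" --passwd "$RPC_PASSWD" --get_results \
        | grep -E 'name|fraction done|CPU time'
done

Each client would also have to allow remote RPCs from the server (remote_hosts.cfg) for this to work.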

Can anyone help me solve this problem?

Edit: while typing this post another one went down: faah2534_ZINC01669772_xmd03400_01_0
on ws203, after running the WU for 8:21:41, at 80.239%.
----------------------------------------
[Edit 1 times, last edit by Former Member at Oct 23, 2007 2:03:27 PM]
[Oct 23, 2007 1:56:58 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7850
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

Pentium M, hmmmm. Are these laptops? If so, could you have heat problems? Is it consistently the same machines? Do they quit at random times, or is there a pattern? Just some thoughts.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Oct 23, 2007 9:06:19 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

Hello tuxnl,
This is a new problem to me. I am not a Linux expert, but I expect the experts will want more information.
For example, how often do you switch between projects?
Do all your computers run 3 projects?
Can you post the Messages section for one of your computers up to the point where the problem occurred?
What Linux distro are you running?

Lawrence
[Oct 23, 2007 9:13:20 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

There are no heat problems: the CPUs run between 40 and 60 degrees Celsius, while the spec allows over 100 degrees Celsius. These systems are industrial PCs, which should not produce too much heat. To check whether the PCs are OK I stress them with BOINC; that is the only way I can be sure the processor runs at full speed with a lot of memory access.

Pattern? No: on some systems it takes 5 days before the first problem, on others it happens on the first day. It is more or less random, but at the moment only on these machines. The Celerons running Windows have no problems, and neither do the AMD machines.

Project switches are at the normal 1-hour interval. To be sure it is not the project switching that is wrong, I took the systems off SETI; SIMAP has no work, so nothing is coming from there at the moment. So only WCG is crunching.

The Linux version is LTSP 4.2, installed on a SUSE 10.0 server.

The log shows nothing special; I'll check if I can get on remotely.

top - 23:42:33 up 12 days, 9:06, 0 users, load average: 0.00, 0.00, 0.00
Tasks: 40 total, 2 running, 38 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3% user, 0.0% system, 0.0% nice, 99.7% idle, 0.0% IO-wait
Mem: 246988k total, 136972k used, 110016k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 45648k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 15 0 1428 476 420 S 0.0 0.2 0:00.71 init
2 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0

-bash-2.05b# uname -a
Linux ws203 2.6.20.9-ltsp-1 #1 PREEMPT Wed May 9 06:08:26 EDT 2007 i686 unknown

-bash-2.05b# ps ax | grep wcg
21505 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21506 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21507 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21508 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21509 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
22891 root 1560 S grep wcg


From the Messages tab of BOINC Manager:
Tue 23 Oct 2007 06:05:52 PM CEST|World Community Grid|Task faah2534_ZINC01669772_xmd03400_01_0 exited with zero status but no 'finished' file
Tue 23 Oct 2007 06:05:52 PM CEST|World Community Grid|If this happens repeatedly you may need to reset the project.
Tue 23 Oct 2007 06:06:01 PM CEST|World Community Grid|Restarting task faah2534_ZINC01669772_xmd03400_01_0 using faah version 542
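
Since all the clients keep their BOINC directories on the NFS share, that message could be counted for the whole farm in one go. A sketch, assuming a hypothetical /srv/boinc/ws* layout; stdoutdae.txt is the client's standard log file:

# Count the "no 'finished' file" restarts per workstation; adjust
# /srv/boinc/ws2* to the real NFS paths of the client directories.
grep -c "exited with zero status but no 'finished' file" \
    /srv/boinc/ws2*/stdoutdae.txt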
----------------------------------------
[Edit 1 times, last edit by Former Member at Oct 23, 2007 9:46:58 PM]
[Oct 23, 2007 9:38:24 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

I didn't check my cluster for 1 day. The result:
6 systems have a lockup due to being "preempted".
5 systems look like they are running, but have an efficiency of 0.000%.

So out of 16 systems that could be processing WCG, only 5 are actually producing output!

Why does this happen only with WCG work units???
[Oct 25, 2007 7:21:28 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

tuxnl, we need the message log to be able to diagnose.

How does this "preempted" manifest itself? If a job is not finished and another is started, e.g. due to project switching, it would show that. It reads like WCG is not the only project on those clients, so whatever the switching time (60 minutes is the default), it would do that. The recommendation is to run switching at e.g. 4-5 hours, so lots of checkpoints get saved; BOINC will still ensure that each project gets its weighted share of time.

The 0.000% efficiency is presumably what you see in the Tasks window Progress column. Are those tasks counting up CPU time, and using any? The science apps run at 'nice' priority, and I understand that on Linux some setting has to be changed to make them really use the full idle cycles... I can't remember at the moment what that setting was.

BUT, the "why only WCG" part is puzzling.

The one message you posted earlier, about "zero status but no finished file", is covered in an FAQ. One cause is the system/OS automatic time synchronisation: BOINC does not like times being adjusted while it is running, particularly backward, even by a second. I've switched it off and adjust the time manually if it drifts too far from internet time.
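
For the switch and checkpoint intervals, a global_prefs_override.xml in each client's BOINC data directory would pin them locally, regardless of what the project websites allow. A sketch only: the tag names are BOINC's standard global-preference tags, but check them against your client version, and /var/lib/boinc is an assumed data-directory path:

# Write the override file on each client (or once on the NFS share).
cat > /var/lib/boinc/global_prefs_override.xml <<'EOF'
<global_preferences>
    <cpu_scheduling_period_minutes>240</cpu_scheduling_period_minutes>
    <disk_interval>600</disk_interval>
</global_preferences>
EOF

The client should pick it up after a restart, or after 'Read local prefs file' in the Manager's Advanced menu.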
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Oct 25, 2007 7:54:07 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

Hey, Sekerob is also here.

The "preempted": it could be SIMAP, they dumped some work on these systems. I'm closing the work fetch for SIMAP now.

Task switch is 60 min; write to disk is 60 min, to reduce network stress and to make sure this could not be the problem.

The 0.000% CPU efficiency is from BoincView. After my restart it goes back up to 100.0 or 99.96.

I'll look into the time sync; I don't care what the time is on this cluster.

Which log do you want? In the Messages tab of BOINC Manager there is no logging which shows this problem.
[Oct 25, 2007 9:05:49 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

> Hey, Sekerob is also here.

> The "preempted": it could be SIMAP, they dumped some work on these systems. I'm closing the work fetch for SIMAP now.

That explains it.

> Task switch is 60 min; write to disk is 60 min, to reduce network stress and to make sure this could not be the problem.

If at all, use switch times of 4 hours plus on multi-project machines; it's really more efficient in several ways. And the disk save is 60 seconds, not minutes. I use 600 seconds, again for efficiency.

> The 0.000% CPU efficiency is from BoincView. After my restart it goes back up to 100.0 or 99.96.

Yellow or red lines then. BV is my one-stop shop to watch the lot ;D

> I'll look into the time sync; I don't care what the time is on this cluster.

Let us know if it makes those messages disappear.

> Which log do you want? In the Messages tab of BOINC Manager there is no logging which shows this problem.

The ones showing in the Messages tab, ideally from 5-10 lines before to 5-10 lines after the 'suspicious' items or hang times... Still the question: is it HPF2 balking, or other WCG projects too?

Just reading about Ubuntu Gutsy (7.10?) and the much better font rendering in the BOINC Manager. Going to get one of those LiveCDs burned and see how it takes a liking to me. After the Vista exposure, the absence of WoW is deafening.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 2 times, last edit by Sekerob at Oct 25, 2007 10:16:43 AM]
[Oct 25, 2007 10:10:44 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs that lock up by themselves, all projects on thin clients

The disk save is at 3600 seconds, aka 60 min. That is not settable from the WCG site, but the SIMAP site was helpful.

Yes, they are yellow lines, and dark green ones.

DDDT and FAAH WUs have the same problems as HPF2.

My post of Oct 23, 9:38:24 PM was a FAAH unit that stopped processing.

The strange thing is that after a restart the work unit is completed and sent in without a failure notice.
[Oct 25, 2007 11:13:34 AM]