World Community Grid Forums
Thread Status: Active. Total posts in this thread: 9
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline

I'm having a strange problem with my farm. Every now and then a WU stops processing, or does not start after a task switch. Typically 7 out of 16 machines have this problem during a night.
----------------------------------------
The machines are 1.6 GHz Pentium M with 256 MB memory. All 16 machines boot LTSP via the network and have their work on an NFS drive.

For example, WU dddt0101a0216_ZINC05185479-0000_02_01 stops after 05:10:47, which is at 59.566%. It is on device ws214:
http://www.worldcommunitygrid.org/ms/device/v...d=346669&deviceType=B

BOINC Manager indicates this task is running, but there is no progress; BoincView indicates CPU efficiency 0.000%. The only way to re-activate this task is to suspend it with BOINC Manager, then find the task on the Linux shell:

-- result of ps -ax --
32040 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32041 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32042 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32043 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp
32044 root 294552 S wcg_dddt_autodock_5.10_i686-pc-linux-gnu -dp

Now send a kill -9 32040, then resume the task from BOINC Manager, and it will continue processing.

This problem does not occur with SIMAP or SETI work units on the same machines. It occurs at random. If there are more projects on the same machine, it could be that a WCG unit remains in "Waiting to run" while it is actually still in memory, and therefore cannot be started again.

Is this just a problem on my machines, or is this a more general problem? There is a gap between the work I'm doing for the projects and the work actually done, because some systems hang idle for 24 hours before I notice there is a problem. On Monday I found 9 out of 16 machines stalled due to this type of error. Can anyone help me solve this problem?

Edit: while typing this post another one went down: faah2534_ZINC01669772_xmd03400_01_0 on ws203, after running the WU for 8:21:41, at 80.239%.
[Edit 1 times, last edit by Former Member at Oct 23, 2007 2:03:27 PM]
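The manual suspend/kill/resume workaround described above can be sketched as a small shell script. This is an untested sketch under the assumption (taken from the ps listing) that the lowest-numbered wcg_* PID is the parent process to kill; the task still has to be suspended in BOINC Manager first and resumed afterwards.

```shell
#!/bin/sh
# Sketch of the manual workaround: pick the lowest-numbered wcg_* PID
# (assumed to be the parent, as in the ps listing above) and kill it.
# Suspend the task in BOINC Manager before running, resume it after.
stalled_pid=$(ps ax | grep '[w]cg_' | awk '{print $1}' | sort -n | head -n 1)
if [ -n "$stalled_pid" ]; then
    echo "would kill $stalled_pid"
    # kill -9 "$stalled_pid"   # uncomment once the task is suspended
fi
```

The `grep '[w]cg_'` trick keeps the grep process itself out of the match, so the script never kills its own pipeline.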
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7850 Status: Offline Project Badges:

Pentium M, hmmmm. Are these laptops? Could you have heat problems if so? Is it consistently the same machines? Do they quit at random times, or is there a pattern? Just some thoughts.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline

Hello tuxnl,
This is a new problem to me. I am not a Linux expert, but I expect that the experts will want more information. For example: how often do you switch between projects? Do all your computers run 3 projects? Can you post the Messages section for one of your computers up until when the problem occurred? What Linux distro are you running?
Lawrence
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline

There are no heat problems: the CPUs run between 40 and 60 degrees Celsius, and the spec is above 100 degrees Celsius. These systems are industrial PCs which should not produce too much heat. To check that the PCs are OK I stress them with BOINC; it is the only way I can be sure the processor is running at full speed with a lot of memory access.
----------------------------------------
Pattern? No. On some systems it takes 5 days before the first problem, on others it happens on the first day. It is more or less random, but only on these machines at the moment: the Celerons running Windows have no problems, and the AMD machines also have no problem.

Project switches are on the normal 1-hour basis. To be sure it is not the project switching that's wrong, I took the systems off SETI; SIMAP has no work, so there is nothing coming from there at the moment. So it's only WCG that is crunching.

The Linux version is LTSP 4.2, installed on a SuSE 10.0 server. The log shows nothing special; I'll check if I can get on remotely.

top - 23:42:33 up 12 days, 9:06, 0 users, load average: 0.00, 0.00, 0.00
Tasks: 40 total, 2 running, 38 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3% user, 0.0% system, 0.0% nice, 99.7% idle, 0.0% IO-wait
Mem: 246988k total, 136972k used, 110016k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 45648k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 15 0 1428 476 420 S 0.0 0.2 0:00.71 init
2 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0

-bash-2.05b# uname -a
Linux ws203 2.6.20.9-ltsp-1 #1 PREEMPT Wed May 9 06:08:26 EDT 2007 i686 unknown
-bash-2.05b# ps ax | grep wcg
21505 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21506 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21507 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21508 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
21509 root 294508 S wcg_faah_autodock_5.42_i686-pc-linux-gnu -d
22891 root 1560 S grep wcg

From the Messages tab of BOINC Manager:
Tue 23 Oct 2007 06:05:52 PM CEST|World Community Grid|Task faah2534_ZINC01669772_xmd03400_01_0 exited with zero status but no 'finished' file
Tue 23 Oct 2007 06:05:52 PM CEST|World Community Grid|If this happens repeatedly you may need to reset the project.
Tue 23 Oct 2007 06:06:01 PM CEST|World Community Grid|Restarting task faah2534_ZINC01669772_xmd03400_01_0 using faah version 542
[Edit 1 times, last edit by Former Member at Oct 23, 2007 9:46:58 PM]
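The "running but 0.000% efficiency" state described in this thread can be checked from the shell without BoincView. A sketch, assuming a standard Linux /proc layout: a task whose cumulative CPU jiffies stop advancing while BOINC still reports it "Running" matches the symptom.

```shell
#!/bin/sh
# Sketch: read a PID's cumulative CPU jiffies (utime + stime) from
# /proc. Sample twice a minute apart; no change means the process is
# consuming no CPU, i.e. it looks stalled like the tasks above.
cpu_jiffies() {
    # Fields 14 and 15 of /proc/<pid>/stat are utime and stime.
    awk '{print $14 + $15}' "/proc/$1/stat"
}

# Hypothetical usage (PID 21505 taken from the ps listing above):
# t1=$(cpu_jiffies 21505); sleep 60; t2=$(cpu_jiffies 21505)
# [ "$t1" = "$t2" ] && echo "task looks stalled"
```

This only reads /proc, so it is safe to run on a live cluster; it does not touch the BOINC client.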
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline

I didn't check my cluster for 1 day. The result: 6 systems have a lockup due to being "preempted", and 5 systems look like they are running but have an efficiency of 0.000%. So out of 16 systems that could be processing WCG, only 5 are actually producing output! Why does this only happen with WCG work units?
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline

tuxnl, we need the message log to be able to diagnose.
----------------------------------------
How does this "preempted" manifest itself? If a job is not finished and another is started, e.g. due to project switching, it would show that. It reads like WCG is not the only project on those clients, so whatever the switching time (60 minutes is the default), it would do that. The recommendation is to run switching at e.g. 4-5 hours, so lots of checkpoints get saved; BOINC will still ensure that each project gets its weighted share of time.

The efficiency 0.000% is presumably what you see in the Tasks window Progress column. Are those tasks counting up CPU time and using any? The sciences run at 'nice' priority, and I understand that on Linux some setting has to be changed to get them to really use the full idle cycles... I can't remember at the moment what that setting was.

BUT, the "why only WCG" part is puzzling. The message you posted earlier about "zero status but no 'finished' file" is covered in an FAQ. One cause is the system/OS automatic time synchronisation: BOINC does not like the time being adjusted while it is running, particularly backward, even by a second. I've switched it off and manually adjust the time if it drifts too far from internet time.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline

Hey, Sekerob also here.
The "preempted" could be SIMAP; they dumped some work on these systems, so I'm closing work fetch for SIMAP now. Task switch is 60 min. Write to disk is 60 min, to reduce network stress and to make sure this could not be the problem. CPU efficiency 0.000% is from BoincView; after my restart it goes back up to 100.0 or 99.96. I'll look into the time sync; I don't mind what the time is on this cluster. Which log do you want? In the Messages tab of BOINC Manager there is no logging which shows this problem.
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline

> Hey, Sekerob also here. The "preempted" could be SIMAP; they dumped some work on these systems, so I'm closing work fetch for SIMAP now.
That explains it.

> Task switch is 60 min. Write to disk is 60 min, to reduce network stress and to make sure this could not be the problem.
If at all, use switch times of 4 hours plus on multi-project machines... it's really more efficient in several ways. The disk save is 60 seconds, not minutes. I use 600 seconds, again for increased efficiency.

> CPU efficiency 0.000% is from BoincView; after my restart it goes back up to 100.0 or 99.96.
Yellow or red lines then. BV is my one-stop shop to watch the lot ;D

> I'll look into the time sync; I don't mind what the time is on this cluster.
Let us know if it makes those messages disappear.

> Which log do you want? In the Messages tab of BOINC Manager there is no logging which shows this problem.
The entries showing in the Messages tab, ideally from 5-10 before to 5-10 after the 'suspicious' items or hang times.... Still the question: is it HPF2 baulking, or other WCG projects too?

Just reading about Ubuntu Gutsy (7.10?) and the much better font rendering in the BOINC Manager. Going to get one of those LiveCDs burned and see how it takes a liking to me. After Vista exposure, the absence of WoW is deafening.
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 2 times, last edit by Sekerob at Oct 25, 2007 10:16:43 AM]
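The switch-time and disk-save settings Sekerob suggests can be put in BOINC's global_prefs_override.xml, which overrides the website preferences on one host. A sketch, assuming a standard BOINC client install (the file lives in the BOINC data directory, and the client must re-read preferences or be restarted to pick it up):

```shell
#!/bin/sh
# Sketch: write a 4-hour task switch and 600-second disk-write interval
# into BOINC's local preferences override file. File name and element
# names assume a standard BOINC client; adjust the path to your BOINC
# data directory on the NFS share.
cat > global_prefs_override.xml <<'EOF'
<global_preferences>
   <cpu_scheduling_period_minutes>240</cpu_scheduling_period_minutes>
   <disk_interval>600</disk_interval>
</global_preferences>
EOF
```

On an LTSP farm like the one in this thread, a file written once on the NFS-mounted work directory would apply to every node that shares it.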
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline

The disk save is at 3600 seconds, aka 60 minutes; that is not possible to set from the WCG site, but the SIMAP site was helpful.
Yes, they are yellow lines, and dark green ones. DDDT and FAAH WUs have the same problems as HPF2. My post of Oct 23 9:38:24 PM was a FAAH unit that stopped processing. The strange thing is that after a restart the work unit is completed and sent in without a failure notice.