World Community Grid - View Thread - SMP Linux, Boinc, Rice suddenly has problems.


Thread Status: Active
Total posts in this thread: 7

This topic has been viewed 774 times and has 6 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


SMP Linux, Boinc, Rice suddenly has problems.

My system:
dual Xeon (dual cores) on Centos 5.2, Boinc latest version.

Until now my tasks ran close to their allotted times, but a few days ago I must have received 3 bummers, all of them rice. They hardly move, managing a piddling 25% in 3 days of work. Restarting Boinc and even rebooting the server didn't help. Those three block 3 of the 4 CPUs; the fourth CPU seems to work normally, crunching through several tasks while the rices barely creep.

If I look at `top' output, however, I see the jobs all scheduling fine, jumping to the top when they can. The rices have 11:35+ minutes of CPU time (showing a total of 2:30 hours in Boinc) since the last reboot. Here are the IDs, just in case they are of the "test" sort (http://www.worldcommunitygrid.org/forums/wcg/printpost?post=178880) that is supposed to improve accounting:

This is the corresponding `top' output, it doesn't even contain all CPU time because I've rebooted in between:

My kernel version: 2.6.18-92.1.6.el5PAE #1 SMP Wed Jun 25 14:21:46 EDT 2008 i686 i686 i386 GNU/Linux

I run VMware server on that machine. AFAIK the latest kernels come without a fixed clock tick (which previously caused problems with VMserver), so perhaps Boinc hasn't caught up with that? (The forums don't contain anything along those lines.) I could go back to an older kernel just for comparison, but of course I also have work to do ...
BTW, the Boinc CPU benchmark fails with "error". I hadn't tested this before, so I don't know if this is connected to the above or "normal". Turning on benchmark traces doesn't produce any significant info IMHO.

Questions:
- what debug output would help you?
- I would like to get rid of these jobs, should I just 'Abort' them (some XP user may have more luck with them)?

Please use my private email to reach me, since I'm not checking this forum regularly.

PS: in order to give those tasks a chance (7 days left) I aborted them. The new jobs seem to behave vigorously. I'll keep an eye on their numbers.

----------------------------------------
[Edit 2 times, last edit by Former Member at Sep 17, 2008 6:36:23 PM]

[Sep 17, 2008 10:53:53 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Followup: SMP Linux, Boinc, Rice suddenly has problems.

When I aborted the 3 rices, the 4th CPU task also was just finishing, so I now have comparisons, since after that I received 1 hpf (V6.03) and again 3 rices (V6.17) almost at the same time, each of them close to 8 hours run time.
After approx 1 hour all of them had used (according to top) the CPUs equally, between 3:30 and 3:55 CPU minutes. Time to completion, however, is already totally different: hpf indicates 1 hour (10%) done, while all rices only gained ~ 0.75%, and the remaining time hardly budged.
Methinks that something is badly wrong with this batch of rice tasks. They use CPU normally but don't really progress at all.
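
To put the figures above side by side, here is a rough sketch of the arithmetic. It assumes progress is linear in elapsed time, which Boinc tasks do not guarantee, and the function name `projected_total_hours` is only illustrative, not anything from Boinc itself.

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Naive projection: total run time if progress stayed linear.

    This is an assumption for illustration only; real tasks may
    speed up or slow down as they go.
    """
    if fraction_done <= 0:
        raise ValueError("fraction_done must be positive")
    return elapsed_hours / fraction_done


# Figures quoted in this thread (elapsed time, fraction done).
hpf_total = projected_total_hours(1.0, 0.10)         # hpf: 10% after about 1 hour
rice_total = projected_total_hours(1.0, 0.0075)      # rice: ~0.75% after about 1 hour
old_rice_total = projected_total_hours(72.0, 0.25)   # first batch: 25% in 3 days

print(f"hpf projected total:        {hpf_total:.1f} h")
print(f"rice projected total:       {rice_total:.1f} h")
print(f"first rice batch projected: {old_rice_total / 24:.1f} days")
```

On these numbers the hpf task lands near its expected run time of about 8 hours, while the rices project to well over a hundred hours each.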

Looking at my results page, I find that several recently returned tasks show very low CPU times: compared with other comparable jobs, the CPU times and the expected Boinc points differ by a factor of 5, but the points awarded to me are in the "normal" (high) range. For example, the last hpf (that finished together with the aborted rices) shows only 57 CPU minutes, but will probably yield around 60 Boinc points:
(workunitId=38801270, mine is the one with 0.57 CPU time).

Something is seriously wrong with how Boinc measures CPU time on my machine, which doesn't really matter as long as the tasks are worked correctly and in time. But in the case of recent rice tasks something seems to waste 80% of the used CPU time without producing any progress.

Could one of the maintainers of the rice tasks please have a look at this?

[Sep 17, 2008 8:38:38 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Followup: SMP Linux, Boinc, Rice suddenly has problems.

When did you start using the tickless kernel?

[Sep 17, 2008 9:42:31 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Followup: SMP Linux, Boinc, Rice suddenly has problems.

I am actually not even sure that this one is. I just remember discussions about tickless kernels in the CentOS forums with regard to the problems some VMserver users had (VMware suggests setting kernel parameters such as clock=pit, etc). I know that in the early days of CentOS 5 a kernel with reduced tick frequency was available in some optional repository, because the new default of 1000 ticks/s was way too high for VMs.
The last time I updated the system, including the kernel, was approx 4 weeks ago. I then found that it had switched from the PAE kernel back to the normal one (but I have 8G memory), so I reconfigured grub to boot the PAE kernel of the same version. At the same time I upgraded the VMserver. I'm not sure exactly when the irregularities in the CPU times reported by Boinc began (usually I just let it run 24/7), but when that rice batch failed to finish soon after the update, it caught my attention.

[Sep 17, 2008 10:38:55 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Followup: SMP Linux, Boinc, Rice suddenly has problems.

If you can find out the actual tickless status of your kernel, and any relevant settings, then I will be able to punt this one over to the techs to have a look at.
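
One possible way to check this is sketched below. It assumes the distribution ships the kernel build configuration as `/boot/config-<release>` (common, but not guaranteed) and that the relevant build options are `CONFIG_NO_HZ` and the `CONFIG_HZ*` family. The helper names are made up for illustration. Boot parameters such as clock=pit can be read separately from `/proc/cmdline`.

```python
import platform
import re


def tick_settings(config_text):
    """Collect CONFIG_NO_HZ* and CONFIG_HZ* options that are set.

    Lines such as '# CONFIG_NO_HZ is not set' are comments in the
    kernel config and are deliberately skipped.
    """
    settings = {}
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_(?:NO_HZ\w*|HZ\w*))=(.+)$", line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def is_tickless(settings):
    """True only if the kernel was built with CONFIG_NO_HZ=y."""
    return settings.get("CONFIG_NO_HZ") == "y"


def running_kernel_config_path():
    """Conventional location of the config for the running kernel (assumed)."""
    return "/boot/config-" + platform.release()


# Made-up sample text, not taken from the machine in this thread.
sample = """
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_NO_HZ is not set
"""
found = tick_settings(sample)
print(found, "tickless:", is_tickless(found))
```

To run it against a real system, read the file named by `running_kernel_config_path()` and pass its contents to `tick_settings`, then post the result together with the contents of `/proc/cmdline`.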

We did have one similar-ish report, but it wasn't specific enough to take further. So, anything that allows the WCG techs to reproduce this problem will be a tremendous help.

Thank you.

[Sep 17, 2008 10:44:58 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Followup: SMP Linux, Boinc, Rice suddenly has problems.

I am having this same problem.

I have dual AMD CPUs, running Debian 2.6 SMP kernel. BOINC v. 5.10.45 and WCG jobs:

Rice 6.17 and HPF2 v 5.20 with WUs respectively:

R00120_b649ed0a7915a533c4c504d55d06a31e_02_000_7 using rice version 617 (hangs at 0%)
lx873_00028_2 (hangs after 8.5%)

This is the 3rd time I've tried running the WCG projects and the jobs freeze. I have tried suspending them for a while then resuming them, but it does not work.

Can anyone offer any support? Is there a certain set of logs I can grab and hand over to the application developers or techs?

Thanks!

-JRinFlorida

[Sep 19, 2008 2:23:10 AM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline

Re: Followup: SMP Linux, Boinc, Rice suddenly has problems.

To make sure that the tasks are restarted (and carry on, if there is no other more serious problem), the most reliable way is to stop Boinc and restart it.

If they are still hanging at the same stage after restarting, abort them, and once they are reported as "Error" in your Results Status page, click on the word "Error" and cut/paste the content of the error report(s) into your post.

Cheers. Jean.

[Sep 19, 2008 3:27:01 AM]
