| World Community Grid Forums
Thread Status: Active Total posts in this thread: 7
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My system: dual Xeon (dual cores) on CentOS 5.2, Boinc latest version.

Until now my tasks ran close to their allotted times, but a few days ago I must have received 3 bummers, all of them rice. They hardly move, piddling 25% in 3 days of work. Restarting Boinc and even rebooting the server didn't help. Those three block 3 of the 4 CPUs; the fourth CPU seems to work normally, crunching through several tasks while the rices barely creep.

If I look at `top' output, however, I see the jobs all scheduling fine, jumping to the top when they can. The rices have 11:35+ minutes of CPU time (showing a total of 2:30 hours in Boinc) since last reboot.

Here are the IDs, just in case they are of the "test" sort (http://www.worldcommunitygrid.org/forums/wcg/printpost?post=178880) that is supposed to improve accounting:

17-Sep-2008 06:58:13 [World Community Grid] Restarting task R00121_1f4b1bf2200613a9be9bb5e18a7fe7cf_00_18 using rice version 617
17-Sep-2008 06:58:13 [World Community Grid] Restarting task R00123_10f7f828418ea2579d76891c9c14fa4a_03_5 using rice version 617
17-Sep-2008 06:58:14 [World Community Grid] Restarting task R00119_1f9679589374c8a8d59d44d076674fd3_01_004_12 using rice version 617

This is the corresponding `top' output; it doesn't even contain all CPU time because I've rebooted in between:

3606 gh 35 19 85536 62m 4508 R 3 0.6 14:45.07 wcg_hpf2_rosett
3856 gh 35 19 14732 10m 4488 R 3 0.1 11:35.58 wcg_rice_6.17_i
3853 gh 35 19 12044 9644 2828 R 3 0.1 11:35.22 wcg_rice_6.17_i
3859 gh 35 19 12996 9.9m 3808 R 3 0.1 11:42.74 wcg_rice_6.17_i

My kernel version:

2.6.18-92.1.6.el5PAE #1 SMP Wed Jun 25 14:21:46 EDT 2008 i686 i686 i386 GNU/Linux

I run VMware server on that machine. AFAIK the latest kernels come without a fixed clock tick (which previously caused problems with VMserver), so perhaps Boinc hasn't caught up with that? (The forums don't contain anything along those lines.) I could go back to an older kernel just for comparison, but of course I also have work to do ...

BTW, the Boinc CPU benchmark fails with "error". I hadn't tested this before, so I don't know if this is connected to the above or "normal". Turning on benchmark traces doesn't yield any significant info IMHO.

Questions:
- What debug output would help you?
- I would like to get rid of these jobs; should I just 'Abort' them (some XP user may have more luck with them)?

Please use my private email to notify me, since I'm not checking this forum regularly.

PS: in order to give those tasks a chance (7 days left) I aborted them. The new jobs seem to behave vigorously. I'll keep an eye on their numbers.

[Edit 2 times, last edit by Former Member at Sep 17, 2008 6:36:23 PM] |
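The mismatch between the two clocks described above can be put into numbers. The following is only a minimal sketch: it assumes the `top' TIME+ field has the form minutes:seconds.hundredths, and that "2:30 hours" means 2 h 30 min as shown by Boinc. The function names are made up for illustration.

```python
def top_time_to_seconds(field: str) -> float:
    """Convert a top TIME+ value such as '11:35.58' to seconds.

    Assumes the minutes:seconds.hundredths form seen in the post.
    """
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + float(seconds)


def top_to_boinc_ratio(top_field: str, boinc_hours: float) -> float:
    """CPU time seen by top, as a fraction of the time Boinc reports."""
    return top_time_to_seconds(top_field) / (boinc_hours * 3600.0)


if __name__ == "__main__":
    # TIME+ values of the three rice processes quoted in the post.
    for pid, field in [(3856, "11:35.58"), (3853, "11:35.22"), (3859, "11:42.74")]:
        ratio = top_to_boinc_ratio(field, boinc_hours=2.5)
        print(f"pid {pid}: top sees {ratio:.1%} of the time Boinc shows")
```

Under these assumptions, `top' accounts for well under a tenth of the time Boinc displays for each rice task, which is the discrepancy the post is pointing at.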
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
When I aborted the 3 rices, the 4th CPU task was also just finishing, so I now have comparisons: after that I received 1 hpf (V6.03) and again 3 rices (V6.17) almost at the same time, each of them estimated at close to 8 hours run time.

After approx 1 hour all of them had used the CPUs equally (according to top), between 3:30 and 3:55 CPU minutes. Time to completion, however, is already totally different: hpf indicates 1 hour (10%) done, while all rices only gained ~0.75%, and the remaining time hardly budged.

Methinks that something is badly wrong with this batch of rice tasks. They use CPU normally but don't really progress at all.

Looking at my results page, I find that several recent tasks reported very low CPU times; compared with other comparable jobs, the expected Boinc points differ by a factor of 5, but the points awarded to me are in the "normal" (high) range. For example, the last hpf (which finished together with the aborted rices) shows only 57 CPU minutes, but will probably yield around 60 Boinc points (workunitId=38801270, mine is the one with 0.57 CPU time).

Something is seriously wrong with how Boinc measures CPU time on my machine, which doesn't really matter as long as the tasks are worked correctly and in time. But in the case of the recent rice tasks something seems to waste 80% of the used CPU time without producing any progress. Could one of the maintainers of the rice tasks please have a look at this? |
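To make the difference in progress concrete, the figures from this post can be extrapolated linearly. This is only a rough sketch: it assumes progress stays proportional to elapsed time, which real tasks need not do, and the function name is hypothetical.

```python
def projected_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linearly extrapolate the total run time from the progress so far."""
    if percent_done <= 0:
        raise ValueError("no progress yet, cannot extrapolate")
    return elapsed_hours * 100.0 / percent_done


if __name__ == "__main__":
    # Figures quoted in the post: after about 1 hour, hpf was 10% done
    # and each rice task about 0.75% done.
    hpf = projected_total_hours(1.0, 10.0)
    rice = projected_total_hours(1.0, 0.75)
    print(f"hpf:  about {hpf:.0f} hours in total")
    print(f"rice: about {rice:.0f} hours in total")
```

With those inputs the hpf task projects to roughly 10 hours and each rice task to more than 130 hours, against an initial estimate of close to 8 hours for all of them.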
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
When did you start using the tickless kernel?
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I am actually not even sure whether this one is tickless. I just remember discussions about tickless kernels in the CentOS forums with regard to the problems some VMserver users had (VMware suggests setting kernel parameters such as clock=pit, etc.). I know that in the early days of CentOS 5 a kernel with reduced tick frequency was available in some optional repository, because the new default of 1000 ticks/s was way too high for VMs.

The last time I updated the system, including the kernel, was approx 4 weeks ago. I then found that it had switched from the PAE kernel back to the normal one (but I have 8G memory), so I reconfigured grub to boot the PAE kernel of the same version. At the same time I upgraded the VMserver. I'm not sure exactly when the irregularities in Boinc's CPU time reporting began (usually I just let it run 24/7), but when that rice batch didn't finish soon after the update, it caught my attention. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
If you can find out the actual tickless status of your kernel, and any relevant settings, then I will be able to punt this one over to the techs to have a look at.
We did have one similar-ish report, but it wasn't specific enough to take further. So, anything that allows the WCG techs to reproduce this problem will be a tremendous help. Thank you. |
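One way to gather the tickless status and related settings might be the sketch below. It rests on several assumptions: that the distribution ships its kernel config as /boot/config-<release>, that CONFIG_NO_HZ and CONFIG_HZ are the relevant options, and that clock=, clocksource= and divider= are the boot parameters worth reporting. The function names are invented for illustration.

```python
import os
import platform
import re


def tick_settings(config_text: str) -> dict:
    """Pick CONFIG_NO_HZ and CONFIG_HZ out of kernel config text.

    Lines such as '# CONFIG_NO_HZ is not set' do not match and are skipped.
    """
    settings = {}
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_NO_HZ|CONFIG_HZ)=(.*)$", line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def clock_parameters(cmdline: str) -> list:
    """Return boot parameters that influence timekeeping (assumed list)."""
    prefixes = ("clock=", "clocksource=", "divider=")
    return [token for token in cmdline.split() if token.startswith(prefixes)]


if __name__ == "__main__":
    # Assumed location of the kernel config; not every distribution has it.
    config_path = "/boot/config-" + platform.release()
    if os.path.exists(config_path):
        with open(config_path) as handle:
            print(config_path, tick_settings(handle.read()))
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as handle:
            print("boot parameters:", clock_parameters(handle.read()))
```

If CONFIG_NO_HZ is missing from the result, the option is either unset or unknown to that kernel version; CONFIG_HZ gives the configured tick frequency.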
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I am having this same problem.

I have dual AMD CPUs, running a Debian 2.6 SMP kernel, BOINC v. 5.10.45, and these WCG jobs: Rice 6.17 and HPF2 v 5.20, with WUs respectively:

R00120_b649ed0a7915a533c4c504d55d06a31e_02_000_7 using rice version 617 (hangs at 0%)
lx873_00028_2 (hangs after 8.5%)

This is the 3rd time I've tried running the WCG projects and the jobs freeze. I have tried suspending them for a while and then resuming them, but it does not work.

Can anyone offer any support? Is there a certain set of logs I can grab and hand over to the application developers or techs?

Thanks!
-JRinFlorida |
||
|
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3716 Status: Offline
|
To make sure that the tasks are restarted (and carry on if there is no other, more serious problem), the most reliable way is to stop Boinc and restart it.

If they are still hanging at the same stage after restarting, abort them, and once they are reported as "Error" in your Results Status page, click on the word "Error" and cut/paste the content of the error report(s) into your post.

Cheers. Jean. |
||
|
|
|