World Community Grid Forums
Thread Status: Active Total posts in this thread: 5
MrKermit
Advanced Cruncher Joined: Jun 13, 2009 Post Count: 95
Hi All,

We have been having trouble with the filesystem being corrupted when running BOINC v6.6.36 and 6.2.?. Today (using 6.2.x) we had hundreds of nodes reporting "attempt to access beyond the end of the disk" messages and had to run fsck on 300 nodes to fix it. With 6.3.x we were seeing corruption bad enough that nodes couldn't return their data without errors (and sometimes a whole lot of 'em).

We are running RH5.x and have a hard drive dedicated to the filesystem the nodes run BOINC from. Since our OS runs from a RAM disk, there's literally nothing else writing to this drive when the corruption occurs.

Is anyone else experiencing anything like this? We don't know if it's a particular project's work, or BOINC, or ??? but over the course of days we start to see large numbers of nodes with some level of corruption in the BOINC working directory.

If we can get this fixed, we can probably automate a lot more nodes joining in for short chunks of time between work, rather than only during the maintenance windows and spare time they jump in on now. Clocking down the RAM and CPU dynamically is another riddle on our list, to minimize the extra energy costs while still crunching with the power we were paying for anyway.

Thanks for any insights!

MrKermit
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043
Hi MrKermit,

I think the swiftest way to get attention is to send a mail to support@worldcommunitygrid.org, f.a.o. knreed. He once set all your devices to quota 1, as you were eating through the whole supply and causing the Reliable/Repair queue to back up to the point that work was no longer going out to normal clients. Since you caught it, I guess you've suspended until resolution. I seem to remember that last time it was your filesystem corrupting, too.

Don't use 6.3 (old, old alpha). If anything, I think you're much better served by 5.10.45, also because it does not create these special BOINC accounts. The mass roll-out doc is still based on 5.10: http://www.worldcommunitygrid.org/bg/BOINCMassInstall.pdf

WCG
Please help to make the Forums an enjoyable experience for All!
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504
Mr Kermit,

Can you contact support@worldcommunitygrid.org as Sekerob suggested? I will get your email and we can start figuring out what happened. BOINC and the research applications only use standard file system calls, so I'd be surprised if anything BOINC-specific were causing your issue. It would also help us to know whether you are mounting the filesystem from a SAN or actually using local hard drives.

Thanks,
Kevin
MrKermit
Advanced Cruncher Joined: Jun 13, 2009 Post Count: 95
Kevin,

We are just running from a local disk in each machine, nothing fancy. It's pretty darn predictable, so I'll open up a ticket and see if we can zero in on the cause. At least it isn't something obvious or "common" :)

Thanks!

MrKermit
Dotsch
Advanced Cruncher Joined: Feb 12, 2006 Post Count: 100
I have never had such issues with BOINC on any type of filesystem.
From the errors you are reporting, I guess that the root cause is a bad hard disk. Is it possible to exchange the drive, or move the data to another drive or a USB stick?