MrKermit
Advanced Cruncher
Joined: Jun 13, 2009
Post Count: 95
Ext3 Corruption problem

Hi All,

We have been having trouble with the filesystem being corrupted when running BOINC v6.6.36 and 6.2.?. Today (using 6.2.x) we had hundreds of nodes reporting "attempt to access beyond the end of the disk" messages and had to fsck 300 nodes to fix it. With 6.3.x we were seeing corruption bad enough that nodes couldn't return their data without errors (and sometimes a whole lot of 'em).

We are running RH5.x and have a hard drive dedicated to the filesystem the nodes use to run BOINC from. Since our OS runs from RAMDISK, there's literally nothing else writing to this drive when the corruption occurs.

Is anyone else experiencing anything like this? We don't know if it's a particular project's work, or BOINC, or something else, but over the course of days we start to see large numbers of nodes with some level of corruption in the BOINC working directory.

If we can get this fixed, we can probably automate a lot more nodes joining in for short chunks of time between work, rather than only in the maintenance windows and spare time we have them jumping in on now. Clocking down the RAM and CPU dynamically is another riddle on our list, to minimize the extra energy costs while still crunching with the power we were paying for anyway.

Thanks for any insights!
MrKermit
[Mar 19, 2010 3:26:18 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Ext3 Corruption problem

Hi MrKermit,

I think the quickest way to get attention is to send a mail to support@worldcommunitygrid.org, f.a.o. knreed. He once set all your devices to quota 1 because you were eating through the whole supply and causing the Reliable/Repair queue to back up to the point that work was no longer going out to normal clients. Since you caught it, I guess you've suspended until it is resolved.

I seem to remember that last time it was your filesystem corrupting too.

Don't use 6.3 (a very old alpha). If anything, I think you're much better served by 5.10.45, also because it does not create these special BOINC accounts. The mass roll-out doc is still based on 5.10: http://www.worldcommunitygrid.org/bg/BOINCMassInstall.pdf
[Mar 19, 2010 8:43:22 AM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Ext3 Corruption problem

Mr Kermit,

Can you contact support@worldcommunitygrid.org as Sekerob suggested? I will get your email and we can start figuring out what happened. BOINC and the research applications only use standard file system calls, so I'd be surprised if anything BOINC-specific were causing your issue. It would also help us to know whether you are mounting the filesystem from a SAN or actually using local hard drives.

Thanks,

Kevin
[Mar 19, 2010 1:12:54 PM]
MrKermit
Advanced Cruncher
Joined: Jun 13, 2009
Post Count: 95
Re: Ext3 Corruption problem

Kevin,

We are just running from a local disk in each machine, nothing fancy. It's pretty darn predictable, so I'll open up a ticket and see if we can zero in on the cause. At least it isn't something obvious or "common" :)

Thanks!
MrKermit
[Mar 20, 2010 1:42:51 AM]
Dotsch
Advanced Cruncher
Joined: Feb 12, 2006
Post Count: 100
Re: Ext3 Corruption problem

I have never had such issues with BOINC on any type of filesystem.

From the errors you are reporting, I guess that the root cause is a bad hard disk.
Is it possible to exchange the drive, or move the data to another drive or a USB stick?
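
To see whether the errors cluster on particular machines (which would point at individual drives rather than at BOINC), something like the following might help. It is only a rough sketch: the function name `count_hits`, the node names and the sample log lines are made up for illustration, and the regular expression is an assumption that loosely matches the message quoted earlier in this thread, so adjust it to whatever your logs actually contain.

```python
import re
from collections import Counter

# Assumption: loosely matches the message quoted in this thread and
# similarly worded variants; change it to match your real log text.
PATTERN = re.compile(
    r"attempt to access beyond (?:the )?end of (?:the )?(?:device|disk)",
    re.IGNORECASE,
)


def count_hits(lines_by_node):
    """Return a Counter mapping node name -> number of matching log lines."""
    hits = Counter()
    for node, lines in lines_by_node.items():
        matches = sum(1 for line in lines if PATTERN.search(line))
        if matches:
            hits[node] = matches
    return hits


if __name__ == "__main__":
    # Hypothetical sample data; in practice, read each node's collected log.
    sample = {
        "node01": ["kernel: attempt to access beyond end of device", "ok"],
        "node02": ["all fine"],
        "node03": [
            "attempt to access beyond the end of the disk",
            "Attempt to access beyond end of device",
        ],
    }
    # Print the worst-affected nodes first.
    for node, count in count_hits(sample).most_common():
        print(node, count)
```

If the same handful of nodes keeps showing up at the top, swapping their drives first would be a cheap test.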
[Mar 21, 2010 2:45:48 PM]