| World Community Grid Forums
Thread Status: Active Total posts in this thread: 9
se29592
Cruncher Joined: Jun 4, 2009 Post Count: 8 Status: Offline |
For some weeks I have been getting a very high error rate in my computations: both Errors and Invalids, across a number of projects. This was not the case earlier with the same system and software. I have now disabled all projects where I get errors:
Drug Search for Leishmaniasis
The Clean Energy Project - Phase 2
Help Cure Muscular Dystrophy - Phase 2
Human Proteome Folding - Phase 2
FightAIDS@Home
This started around mid-September. Before Sep 19 I was averaging 20k points per day, whereas since then I have been getting about 5k points of valid results per day.
AMD Phenom(tm) II X6 1090T Processor
Fedora Core 14 - x86_64
I've tried reducing the clock speed (although it was not overclocked before either), with no change. This is not related to the SELinux problems others have reported, since I have resolved those with SELinux exceptions.
I think this requires someone with full database access to investigate and find out what kind of pattern is associated with these failures. It could of course be this particular system, but I doubt it. Could changes have been made that cause the MSWin x86 baseline to drift slightly, making my previously OK system lose the votes? Or what else can it be?
I can see no suspicious code update in this period.
$ rpm --last -qa boinc\*
$ rpm --last -qa |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Have you run a virus scan and some system checks? This could be due to HDD errors or RAM errors, which is why system checks are worth running. Any bluescreens or power failures?
You may want to try reinstalling WCG as well. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello se29592,
Going from 20,000 points to 5,000 points a day is indeed astonishing. There are always some errors caused by malformed work units (such as Batch 40 in DSFL a few days ago) and by input data that the algorithm cannot handle correctly. But that does not explain a problem that reduces your output by three quarters. Nobody else has reported such a drastic change.
So let us try a slow, thoughtful problem-solving technique. Reduce the data that needs to be analysed by setting BOINC to run only 1 process per computer (not per core) and eliminating the extra cache. In fact, cut the cache to just 0.1 days, which will mean no more than 1 work unit waiting to run. Then allow all projects to run. This should let you run at 100% speed without worrying about temperature, and build up a picture of just where things are going wrong without being overloaded with data.
We ordinarily run BOINC as fast as possible with good intentions, but when problems occur it can be like an auto accident, where things go wrong more quickly than we can process information.
I look forward to a report.
Lawrence |
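The settings suggested in this post (one running task, a 0.1-day cache) can also be applied through BOINC's local preferences override file instead of the website. The following is only a minimal sketch: the helper name `build_override` is hypothetical, and it assumes the client accepts the `max_ncpus_pct`, `work_buf_min_days` and `work_buf_additional_days` elements inside `global_preferences` and rounds the percentage to a whole number of cores.

```python
import xml.etree.ElementTree as ET


def build_override(total_cores: int, tasks: int = 1, cache_days: float = 0.1) -> str:
    """Return the text of a global_prefs_override.xml limiting BOINC to `tasks` cores.

    build_override is a hypothetical helper; only the XML element names
    are taken from BOINC's preferences format.
    """
    if total_cores < 1 or tasks < 1 or tasks > total_cores:
        raise ValueError("need 1 <= tasks <= total_cores")
    # BOINC expresses the core limit as a percentage of all cores.
    pct = 100.0 * tasks / total_cores
    root = ET.Element("global_preferences")
    ET.SubElement(root, "max_ncpus_pct").text = f"{pct:.2f}"
    # Minimum work buffer: 0.1 days keeps the queue nearly empty.
    ET.SubElement(root, "work_buf_min_days").text = f"{cache_days}"
    # No additional buffer on top of the minimum.
    ET.SubElement(root, "work_buf_additional_days").text = "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Six cores, as on the Phenom II X6 described in this thread.
    print(build_override(total_cores=6))
```

The printed text would be saved as `global_prefs_override.xml` in the BOINC data directory (its location varies by distribution), after which the client has to re-read its preferences or be restarted.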
||
|
|
se29592
Cruncher Joined: Jun 4, 2009 Post Count: 8 Status: Offline |
Hi Lawrence,
I will try to change one thing at a time and see what comes out of it. My first action is to continue running the system as before, but selecting only the projects where I have not seen any problems, to try to confirm my theory that the problems are connected to some specific projects and not to e.g. memory or CPU problems (which are more likely to hit all projects, but not guaranteed to do so). I have not noticed any instabilities in the system, but I'm not stressing it very much when I am at the console.
Your confirmation that this is an isolated anomaly makes me more confident in continuing to look for an error at the system level. Ironically, I discovered the drastic change when I found out that I had been running at reduced (power-saving) speed continuously.
I'll update the thread when there is more information to share.
It would be interesting to be able to query the full result status back in time. The limited searches available on the Result Status page do not provide (at least not easily) enough information on result history.
/Nils |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Why don't you post an actual Result Log [check several if the fail codes vary] and what's printed in the message/event log of BOINC when these tasks fail?
Signal 11? SIGSEGV? Fedora's firewall, or any other IP/port scanning or guarding software, needs to let IP 127.0.0.1 (localhost) and port 31416 through. If that's continually scanned or obstructed, your tasks will fail: randomly, frequently, or always. Also try crunching with BOINC network activity suspended. Intermittent WiFi is known to upset BOINC too.
Of course, none of this explains why it is not happening when you run e.g. HCC or Clean Water (both, I think, are integer-intensive computations). So maybe the FPU is intermittently failing, but then HFCC would have to be failing too, since it uses the same program (science engine) as FAAH.
Can you define "reduced power"? Lower CPU cycles, or lower % CPU time, i.e. the default 60% (known to cause DSFL to fail for some)? Maybe a power-save profile for BOINC affects the cycles of the CPU itself, which is why, at least in Ubuntu, I've locked it to max cycles. (I would expect cycling down to respond with a delay.)
--//-- |
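A quick way to see whether anything is blocking the localhost port mentioned in this post is to attempt a plain TCP connection to it. This is a rough sketch only: `port_reachable` is a hypothetical helper, and a successful connection shows only that something accepts connections on 127.0.0.1:31416, not that the BOINC client itself is healthy.

```python
import socket


def port_reachable(host: str = "127.0.0.1", port: int = 31416, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    port_reachable is a hypothetical helper; 127.0.0.1 and 31416 are the
    address and port named in the post above.
    """
    try:
        # create_connection raises OSError on refusal, timeout or filtering.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "reachable" if port_reachable() else "NOT reachable"
    print(f"127.0.0.1:31416 is {state}")
```

Running it a few times while tasks are crunching may help tell a firewall that always blocks the port from one that interferes only intermittently.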
||
|
|
se29592
Cruncher Joined: Jun 4, 2009 Post Count: 8 Status: Offline |
Well, the log files are in the database for those interested, and the errors vary. I see some SIGSEGVs, for example.
Reduced power in my book means using a different frequency governor. This should not normally be visible to the application, so I would not expect it to have any effect on application stability, but reducing the internal clock frequency could potentially increase system stability. I will wait with posting more information until I have something useful to post. |
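To make "a different frequency governor" concrete, the governor currently in use by each core can be read from sysfs on Linux systems that expose the cpufreq interface. The sketch below assumes the usual `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` layout; `read_governors` is a hypothetical helper, and it returns an empty dictionary on systems without cpufreq.

```python
from pathlib import Path


def read_governors(base: str = "/sys/devices/system/cpu") -> dict:
    """Map each CPU directory name (e.g. 'cpu0') to its cpufreq governor.

    read_governors is a hypothetical helper; the sysfs layout under
    `base` is an assumption and may be absent on some systems.
    """
    governors = {}
    for gov_file in sorted(Path(base).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        # gov_file is <base>/cpuN/cpufreq/scaling_governor, so go up two levels.
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


if __name__ == "__main__":
    for cpu, governor in read_governors().items():
        print(f"{cpu}: {governor}")
```

Comparing the output before and after changing the power profile shows whether the change actually reached every core.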
||
|
|
se29592
Cruncher Joined: Jun 4, 2009 Post Count: 8 Status: Offline |
During my investigations I started seeing indications of a SELinux problem with my set-up. During one of my reboots a system SELinux relabel took place, and the problem appears to be solved.
Cheers, /Nils |
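Indications of an SELinux problem usually show up as AVC denial records in the audit log. The sketch below filters such records from lines of text. It is not the method the poster used: `avc_denials` is a hypothetical helper, and it assumes the common record layout in which `avc: denied { ... }` is followed later on the same line by a `comm="..."` field.

```python
import re

# Assumed layout: ... avc:  denied  { perm ... } for pid=... comm="name" ...
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}.*?comm="(?P<comm>[^"]+)"'
)


def avc_denials(lines, needle: str = ""):
    """Return (command, [permissions]) for each denial line containing `needle`.

    avc_denials is a hypothetical helper; pass needle="boinc" to keep only
    records that mention BOINC paths or contexts.
    """
    hits = []
    for line in lines:
        match = AVC_RE.search(line)
        if match and needle in line:
            hits.append((match.group("comm"), match.group("perms").split()))
    return hits


if __name__ == "__main__":
    import sys

    # Reads audit records from standard input; reading the log needs root.
    for comm, perms in avc_denials(sys.stdin, needle="boinc"):
        print(f"{comm}: denied {' '.join(perms)}")
```

If denials for BOINC processes disappear after a relabel, that is consistent with mislabelled files having been the cause.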
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello se29592,
I hope that solves the problem. I have been interested in hearing how SELinux works for PC users for more than half a decade now. Lawrence |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
From time to time, I notice that after some Linux updates (Ubuntu 10.04 LTS), the error rate can increase. For this reason, even if Ubuntu does not request it, I reboot the system after some specific updates (e.g. lib, pam, ...).
I don't have a formal rationale for the reboot criterion; it is more or less based on experience (and feeling). Yves |
||
|
|
|