World Community Grid Forums

Category: Support

Forum: BOINC Agent Support

Thread: Computational errors in various projects [AMD/Linux x86_64]

Thread Status: Active
Total posts in this thread: 9

This topic has been viewed 5659 times and has 8 replies

se29592
Cruncher
Joined: Jun 4, 2009
Post Count: 8
Status: Offline


Computational errors in various projects [AMD/Linux x86_64]

For some weeks now I have been getting a very high error rate in my computations, both Errors and Invalids, in a number of projects. This was not the case earlier with the same system and software. I have now disabled all projects where I get errors:

Drug Search for Leishmaniasis
The Clean Energy Project - Phase 2
Help Cure Muscular Dystrophy - Phase 2
Human Proteome Folding - Phase 2
FightAIDS@Home

This started somewhere in mid-September. Before Sep 19 I was averaging 20k points per day, whereas since then I have been getting about 5k points of valid results per day.

AMD Phenom(tm) II X6 1090T Processor
Fedora Core 14 - x86_64

I've tried reducing the clock speed (although it was not overclocked before either) with no change.
This is not related to the SELinux problems others have reported, since I have resolved those with SELinux exceptions.

I think this requires someone with full database access to investigate and find out what kind of patterns are associated with these failures. It could of course be this particular system, but I doubt it. Could changes have been made that cause the MSWin x86 baseline to drift slightly, making my previously OK system lose the votes? Or what else could it be?

I can see no suspicious code update in this period.

$ rpm --last -qa boinc\*
boinc-client-doc-6.10.58-3.r22930svn.fc14     Tue 22 Feb 2011 02:13:50 PM CET
boinc-client-static-6.10.58-3.r22930svn.fc14  Tue 22 Feb 2011 02:13:48 PM CET
boinc-client-devel-6.10.58-3.r22930svn.fc14   Tue 22 Feb 2011 02:13:42 PM CET
boinc-manager-6.10.58-3.r22930svn.fc14        Tue 22 Feb 2011 02:13:29 PM CET
boinc-client-6.10.58-3.r22930svn.fc14         Tue 22 Feb 2011 02:13:03 PM CET

$ rpm --last -qa 
system-config-printer-1.2.8-2.fc14            Fri 30 Sep 2011 11:41:37 AM CEST
libdhash-0.4.3-5.fc14                         Fri 30 Sep 2011 11:41:36 AM CEST
libini_config-0.6.2-5.fc14                    Fri 30 Sep 2011 11:41:35 AM CEST
libcollection-0.6.1-5.fc14                    Fri 30 Sep 2011 11:41:34 AM CEST
libpath_utils-0.2.1-5.fc14                    Fri 30 Sep 2011 11:41:33 AM CEST
libref_array-0.1.2-5.fc14                     Fri 30 Sep 2011 11:41:31 AM CEST
system-config-printer-libs-1.2.8-2.fc14       Fri 30 Sep 2011 11:41:29 AM CEST
firefox5-5.0-1.fc14                           Wed 28 Sep 2011 05:32:42 PM CEST
xulrunner5-5.0-1.fc14                         Wed 28 Sep 2011 05:32:38 PM CEST
PackageKit-gtk-module-0.6.12-4.fc14           Wed 28 Sep 2011 02:30:28 PM CEST
dbus-glib-0.86-4.fc14                         Wed 28 Sep 2011 02:30:26 PM CEST
clearlooks-compact-gnome-theme-1.5-3.fc12     Wed 28 Sep 2011 02:25:43 PM CEST
fpaste-0.3.7-1.fc14                           Tue 27 Sep 2011 08:26:48 AM CEST
k3b-libs-2.0.2-5.fc14                         Tue 27 Sep 2011 08:26:46 AM CEST
k3b-2.0.2-5.fc14                              Tue 27 Sep 2011 08:26:45 AM CEST
k3b-common-2.0.2-5.fc14                       Tue 27 Sep 2011 08:26:44 AM CEST
nss-devel-3.12.10-4.fc14                      Mon 26 Sep 2011 09:26:49 PM CEST
openldap-devel-2.4.23-10.fc14                 Mon 26 Sep 2011 09:26:47 PM CEST
libcurl-devel-7.21.0-10.fc14                  Mon 26 Sep 2011 09:26:44 PM CEST
libsoup-devel-2.32.2-2.fc14                   Mon 26 Sep 2011 09:26:42 PM CEST
alsa-plugins-pulseaudio-1.0.24-2.fc14         Mon 26 Sep 2011 09:26:41 PM CEST
qt-webkit-4.7.4-2.fc14                        Mon 26 Sep 2011 09:26:39 PM CEST
libcurl-7.21.0-10.fc14                        Mon 26 Sep 2011 09:26:36 PM CEST
openldap-2.4.23-10.fc14                       Mon 26 Sep 2011 09:26:34 PM CEST
qt-x11-4.7.4-2.fc14                           Mon 26 Sep 2011 09:26:32 PM CEST
qt-4.7.4-2.fc14                               Mon 26 Sep 2011 09:26:27 PM CEST
nss-3.12.10-4.fc14                            Mon 26 Sep 2011 09:26:25 PM CEST
pcre-8.10-2.fc14                              Mon 26 Sep 2011 09:26:21 PM CEST
nss-tools-3.12.10-4.fc14                      Mon 26 Sep 2011 09:26:20 PM CEST
gnupg2-2.0.18-1.fc14                          Mon 26 Sep 2011 09:26:18 PM CEST
curl-7.21.0-10.fc14                           Mon 26 Sep 2011 09:26:16 PM CEST
qt-webkit-4.7.4-2.fc14                        Mon 26 Sep 2011 09:26:15 PM CEST
foomatic-4.0.8-3.fc14                         Mon 26 Sep 2011 09:26:12 PM CEST
libsoup-2.32.2-2.fc14                         Mon 26 Sep 2011 09:26:10 PM CEST
foomatic-filters-4.0.8-3.fc14                 Mon 26 Sep 2011 09:26:09 PM CEST
qt-x11-4.7.4-2.fc14                           Mon 26 Sep 2011 09:26:06 PM CEST
qt-4.7.4-2.fc14                               Mon 26 Sep 2011 09:26:01 PM CEST
libcurl-7.21.0-10.fc14                        Mon 26 Sep 2011 09:25:59 PM CEST
openldap-2.4.23-10.fc14                       Mon 26 Sep 2011 09:25:57 PM CEST
nss-3.12.10-4.fc14                            Mon 26 Sep 2011 09:25:55 PM CEST
nss-sysinit-3.12.10-4.fc14                    Mon 26 Sep 2011 09:25:54 PM CEST
ntfsprogs-2011.4.12-5.fc14                    Fri 23 Sep 2011 02:28:58 PM CEST
ntfs-3g-2011.4.12-5.fc14                      Fri 23 Sep 2011 02:28:56 PM CEST
xorg-x11-drv-savage-2.3.2-3.fc14              Tue 20 Sep 2011 10:15:33 PM CEST
unique-1.1.6-3.fc14                           Tue 20 Sep 2011 10:15:33 PM CEST
librsvg2-2.32.0-4.fc14                        Tue 20 Sep 2011 10:15:32 PM CEST
ql2400-firmware-5.06.01-1.fc14                Fri 16 Sep 2011 05:02:59 PM CEST
setup-2.8.28-2.fc14                           Fri 16 Sep 2011 05:02:56 PM CEST
ql2500-firmware-5.06.01-1.fc14                Fri 16 Sep 2011 05:02:54 PM CEST
rsyslog-4.6.3-3.fc14                          Fri 16 Sep 2011 05:02:50 PM CEST
python-boto-2.0-1.fc14                        Wed 14 Sep 2011 11:07:23 PM CEST
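
To narrow a listing like the one above to the suspect period, something along these lines might help. It is only a sketch: the function name installed_since is made up for illustration, and it assumes the English date layout that rpm printed above (weekday, day, month, year after the package name).

```shell
# Print package names from `rpm --last -qa` style input that were installed
# on or after a cutoff date given as YYYYMMDD.
# Assumed line layout: name  Weekday DD Mon YYYY HH:MM:SS AM/PM TZ
installed_since() {
    awk -v cutoff="$1" '
        BEGIN {
            # Map English month abbreviations to month numbers.
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
            for (i = 1; i <= 12; i++) month[names[i]] = i
        }
        NF >= 5 && ($4 in month) {
            # Build a sortable YYYYMMDD number from year, month and day.
            stamp = sprintf("%04d%02d%02d", $5, month[$4], $3)
            if (stamp + 0 >= cutoff + 0) print $1
        }
    '
}

# Example: rpm --last -qa | installed_since 20110915
```

Comparing plain YYYYMMDD numbers avoids depending on how the local date command parses time zone names such as CEST.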

[Oct 24, 2011 10:16:27 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

Have you run a virus scan as well as several system checks? This could be due to hdd errors or errors in the RAM so this is why you should run some system checks. Any bluescreens or power failures?

You may want to try to reinstall wcg as well.

[Oct 24, 2011 2:14:09 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

Hello se29592,
Going from 20,000 points to 5,000 points a day is indeed astonishing. There are always some errors caused by malformed work units (such as Batch 40 in DSFL a few days ago) and by input data that the algorithm cannot handle correctly. But that does not explain a problem that reduces your output by three quarters. Nobody else has reported such a drastic change.

So let us try a slow, thoughtful problem solving technique. Reduce the data that needs to be analysed by reducing BOINC to run only 1 process per computer (not per core) and eliminate the extra cache. In fact, cut the cache to just 0.1 days, which will mean no more than 1 work unit waiting to run. Then allow all projects to run. This should allow you to run at 100% speed without worrying about temperature.

This should allow you to build up a picture of just where things are going wrong without overloading you with data. We ordinarily run BOINC as fast as possible with good intentions, but when problems occur it can be like an auto accident where things go wrong more quickly than we can process information.
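
One possible way to apply the settings suggested above locally is a global_prefs_override.xml file in the BOINC data directory. The sketch below only prints such a file; the function name one_task_prefs is made up, and the data directory path in the usage comment is an assumption that may differ on your system. The same limits can also be set through the preferences pages instead.

```shell
# Print a global_prefs_override.xml body that limits BOINC to roughly one
# task on a machine with the given number of cores, with a 0.1-day buffer
# and no additional cache.
one_task_prefs() {
    cores="$1"
    # Percentage of processors to use; nudged up slightly so that
    # cores * pct / 100 does not fall just below 1, and capped at 100.
    pct=$(LC_ALL=C awk -v n="$cores" \
        'BEGIN { p = 100 / n + 0.01; if (p > 100) p = 100; printf "%.2f", p }')
    cat <<EOF
<global_preferences>
   <max_ncpus_pct>$pct</max_ncpus_pct>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0</work_buf_additional_days>
</global_preferences>
EOF
}

# Example (as root; /var/lib/boinc is an assumed data directory):
#   one_task_prefs 6 > /var/lib/boinc/global_prefs_override.xml
#   boinccmd --read_global_prefs_override
```

Removing the override file and re-reading the preferences should bring back the normal settings once the investigation is done.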

I look forward to a report.

Lawrence

[Oct 24, 2011 11:05:39 PM]

se29592
Cruncher
Joined: Jun 4, 2009
Post Count: 8
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

Hi Lawrence,

I will try to change one thing at a time and see what comes out of it. My first action is to continue running the system as before, but selecting only the projects where I have not seen any problems, to try to confirm my theory that the problems are connected to some specific projects and not to e.g. memory or CPU problems (which are more likely to hit all projects, but not guaranteed to do so).

I have not noticed any instabilities in the system, but I'm not stressing it very much when I am at the console. Your confirmation that this is an isolated anomaly makes me more confident in continuing to look for an error at the system level.

Ironically I discovered the drastic change when I found out that I have been running at reduced (power saving) speed continuously.

I'll update the thread when there is more information to share. It would be interesting to be able to query the full result status back in time. The limited searches available on the Result Status page do not provide (at least not easily) enough information on result history.

/Nils

[Oct 25, 2011 7:50:50 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

Why don't you post an actual Result Log [scan multiple if there are variations in the fail codes] and what's printed in the message/event log of BOINC when these tasks fail?

Signal 11? SIGSEGV? Fedora's firewall or any other IP/port scanning or guarding software needs to let IP 127.0.0.1 (localhost) and port 31416 through. If that is continually scanned or obstructed, your tasks will fail: randomly, frequently, or always. Try crunching with BOINC network activity suspended as well. Intermittent Wi-Fi is known to upset BOINC too.
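
A quick way to see whether anything is listening on that port at all could look like this. It is a rough sketch: port_is_listening is a made-up helper that only reads `ss -ltn` output, and it says nothing about whether a firewall is interfering.

```shell
# Read `ss -ltn` style output on stdin and succeed if some socket is in
# LISTEN state on the given port (default 31416, the port mentioned above).
port_is_listening() {
    awk -v port="${1:-31416}" '
        $1 == "LISTEN" {
            # The port is the last colon-separated part of the local address.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Example: ss -ltn | port_is_listening && echo "something listens on 31416"
```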

All of this of course does not explain why it is not happening when you run e.g. HCC or Clean Water (both, I think, are integer-intensive computations), so maybe the FPU is intermittently failing, but then HFCC would have to be failing too, and that uses the same program (science engine) as FAAH.

Can you define "reduced power"? Lower CPU cycles, or lower % CPU time, such as the default 60% (known to cause DSFL to fail for some)? Maybe a power-save profile affects the cycles of the CPU itself while BOINC runs, which is why I have, at least in Ubuntu, locked it to max cycles. (I would expect cycling down to respond with a delay.)

--//--

[Oct 25, 2011 8:48:49 AM]

se29592
Cruncher
Joined: Jun 4, 2009
Post Count: 8
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

Well, the log files are in the database for those interested, and the errors vary. I see some SIGSEGVs, for example.

Reduced power in my book means using a different frequency governor. This should normally not be visible to the application, so I would not expect it to have any effect on application stability, but reducing the internal clock frequency could potentially increase system stability.
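
For reference, the governor in use can usually be read from sysfs on Linux systems with cpufreq support. A small sketch, where governor_summary is a made-up name and the optional argument exists only so the function can be pointed at another directory:

```shell
# Summarize which cpufreq governor each CPU is using, as "count governor".
# The default base path is the usual Linux sysfs location.
governor_summary() {
    base="${1:-/sys/devices/system/cpu}"
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

# Example: governor_summary
```

An empty result would suggest that cpufreq scaling is not active or not exposed on that system.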

I will hold off on posting more information until I have something useful to post.

[Oct 29, 2011 1:32:01 PM]

se29592
Cruncher
Joined: Jun 4, 2009
Post Count: 8
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

During my investigations I started seeing indications of an SELinux problem with my set-up. During one of my reboots a system-wide SELinux relabel took place, and the problem appears to be solved.
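
For anyone wanting to check for similar indications, counting AVC denials in the audit log is one place to start. This is only a sketch, not what I actually ran: count_avc_denials is a made-up helper, and the log path in the comment is the usual auditd default, which may differ on your system.

```shell
# Count SELinux AVC denial lines on stdin that mention a given string
# (default "boinc", case-insensitive).
count_avc_denials() {
    grep 'avc:.*denied' | grep -c -i -- "${1:-boinc}"
}

# Example (as root): count_avc_denials < /var/log/audit/audit.log
# A full relabel at the next boot can be requested with:
#   touch /.autorelabel    # then reboot
```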

Cheers,

/Nils

[Nov 2, 2011 7:22:36 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

Hello se29592,
I hope that solves the problem. I have been interested in hearing how SELinux works for PC users for more than half a decade now.

Lawrence

[Nov 2, 2011 8:47:03 AM]

KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline


Re: Computational errors in various projects [AMD/Linux x86_64]

From time to time, I notice that the error rate can increase after some Linux updates (Ubuntu 10.04 LTS). For this reason, even if Ubuntu does not request it, I reboot the system after some specific updates (e.g. lib, pam, ...).
I don't have a formal rationale for when to reboot; it is more or less based on experience (and feeling).
Yves

[Nov 2, 2011 10:59:51 AM]