World Community Grid - View Thread

World Community Grid Forums

Category: Active Research

Forum: Africa Rainfall Project

Thread: unclear Invalids

Thread Status: Active
Total posts in this thread: 21

This topic has been viewed 5630 times and has 20 replies

erich56
Senior Cruncher
Austria
Joined: Feb 24, 2007
Post Count: 300
Status: Offline

unclear Invalids

Recently, two of my finished and uploaded WUs were classified as invalid.
See here:

https://www.worldcommunitygrid.org/ms/viewBoi...By=sentTime&pageNum=1

which is too bad after 35 and 28 hours of CPU time.
Unfortunately, stderr does not say what the problem was.
Does anyone have an idea how I can find out?

I hate to waste that many hours of CPU time :-(

----------------------------------------
[Edit 1 times, last edit by erich56 at Aug 19, 2021 12:55:57 PM]

[Aug 19, 2021 12:55:27 PM]

Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1413
Status: Offline

Re: unclear Invalids

I had a few unexpected errors with no clear reason. I suppose it has to do with a memory access violation.

The only error on my results-list still there is: https://www.worldcommunitygrid.org/ms/device/...og.do?resultId=1855097844

If I remember correctly, at that time BOINC had just started a 4-core ATLAS task from LHC@home . . . coincidence?

[Aug 19, 2021 1:37:55 PM]

sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107
Status: Offline

Re: unclear Invalids

Check if your D drive has enough space.
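A minimal sketch of that free-space check, assuming Python is available on the machine. The directory and the 2 GiB threshold are placeholders for illustration, not figures from the project; point `data_dir` at wherever your BOINC data actually lives.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def has_enough_space(path: str, needed_gib: float) -> bool:
    """True if the drive holding `path` has at least `needed_gib` GiB free."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    data_dir = "."      # placeholder: use your BOINC data directory (e.g. on D:)
    needed_gib = 2.0    # placeholder threshold, not an official requirement
    print(f"Free space: {free_gib(data_dir):.1f} GiB")
    print("Enough space" if has_enough_space(data_dir, needed_gib) else "Low on space")
```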

For most ARP1 invalids and some errors, there is possibly something wrong with your RAM; run memtest to check. If the CPU is overclocked or undervolted, then maybe it was pushed too far; reduce the clocks a little more.

Some invalids I had were from faulty non-ECC RAM. I got new ECC UDIMM DDR3 for my AMD FX 4100, Asus M5A97 R2.0, Debian 11. Much better, with no more invalids. Uptime is 70 days with 1 corrected memory error logged so far. Note: CPU, motherboard, and RAM must all support ECC to use ECC.

[Aug 19, 2021 5:33:20 PM]

erich56
Senior Cruncher
Austria
Joined: Feb 24, 2007
Post Count: 300
Status: Offline

Re: unclear Invalids

SSD has enough space.
RAM: 8 GB, DDR3, non-ECC, has undergone Memtest recently for a different reason. Test was okay.
The mainboard is an old Fujitsu D3041, Chipset Intel G41
Processor is an old Intel Core2 Quad Q9550 @ 2.83GHz, no overclocking.

So maybe this old system is not the optimal one for ARP?

[Aug 19, 2021 6:18:51 PM]

Acibant
Advanced Cruncher
USA
Joined: Apr 15, 2020
Post Count: 126
Status: Offline

Re: unclear Invalids

See here:

https://www.worldcommunitygrid.org/ms/viewBoi...By=sentTime&pageNum=1

We can't see your devices and work units even with that link. Can you click on the "Error" link next to one of the work units in question and copy and paste the content in this thread so we can see the exact error messages?

[Aug 19, 2021 7:09:53 PM]

erich56
Senior Cruncher
Austria
Joined: Feb 24, 2007
Post Count: 300
Status: Offline

Re: unclear Invalids

Oh, sorry, I was not aware that the page I linked to cannot be seen by others.
Anyway, the Error link shows the following:

Result Name: ARP1_ 0027300_ 086_ 0--
<core_client_version>7.14.3</core_client_version>
<![CDATA[
<message>
couldn't start app: Can't get shared memory segment name: shmget() failed</message>
]]>

What makes me wonder, though, is that it says "couldn't start app ..."
and still the task ran for 35 hours and the other one for 28 hours.
For me, this does not fit together, does it?

[Aug 19, 2021 8:38:52 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline

Re: unclear Invalids

erich56

You have given us some specs, but what units were you actually running at the time?

ARP can be very intensive.

Mike

[Aug 19, 2021 9:08:03 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1334
Status: Offline

Re: unclear Invalids

erich56

If those tasks really used that much CPU time doing something constructive, that's a very strange "log file." It looks like the sort of report one gets when the BOINC run-time needs to report an issue during set-up (no startup information, et cetera)...

I presume you're running some flavour of Windows, as that seems to be the place that throws up errors of that type (and not just for ARP - my Linux boxes keep getting retries for all projects where the single original task was on Windows and failed!)

There have been various discussions about that issue. I haven't followed them closely (no Windows systems!) but I seem to recall that it was a mixture of client version and the number of shared memory segments already allocated to processes - users with plenty of memory spare were getting the error (which implies a table size limit somewhere was being hit.)

Given that, Mike's "what else was it doing at the time" query is likely to be pertinent.

But I'll return to my original remark - if it had managed to start the ARP1 application properly I'd've expected to see at least the initial INFO lines and the "Starting WRFMain" line. Does the BOINC log on your system indicate that the workunit(s) in question ever checkpointed? In fact, checking said log for all lines containing the relevant work unit names might be revealing!
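A rough sketch of that log search in Python, in case it helps. It assumes the event log has been saved to a plain text file; the file name is hypothetical, and the work unit name is the Result Name quoted earlier in the thread with the display spaces removed, which is also an assumption. As far as I know, the client only writes checkpoint messages when the `checkpoint_debug` log flag is enabled, so an empty checkpoint list does not by itself prove that nothing checkpointed.

```python
from pathlib import Path
from typing import Iterable, List


def lines_for_workunit(log_lines: Iterable[str], workunit: str) -> List[str]:
    """Return the log lines that mention the given work unit name."""
    needle = workunit.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


def checkpoint_lines(log_lines: Iterable[str]) -> List[str]:
    """Return only the lines that mention a checkpoint."""
    return [line for line in log_lines if "checkpoint" in line.lower()]


if __name__ == "__main__":
    # Hypothetical file name: a saved copy of the BOINC event log.
    log_path = Path("boinc_event_log.txt")
    # Assumed work unit name (Result Name from the thread, spaces removed).
    workunit = "ARP1_0027300_086"
    if log_path.exists():
        with log_path.open(encoding="utf-8", errors="replace") as fh:
            hits = lines_for_workunit(fh, workunit)
        print(f"{len(hits)} line(s) mention {workunit}")
        for line in checkpoint_lines(hits):
            print(line)
    else:
        print(f"{log_path} not found; save the event log there first")
```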

Good luck troubleshooting - hopefully all will become clear at some point.

Cheers - Al.

[Aug 19, 2021 11:08:17 PM]

sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107
Status: Offline

Re: unclear Invalids

7.14.3 is old; you can update the BOINC client. If this computer's BOINC version is already 7.16.11 or newer, then I guess either you are looking at someone else's status, or at a different computer of yours, or this may be a very old result from an old version of BOINC. Check the date and time.
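A small sketch of that version check, assuming the result log contains a `<core_client_version>` tag like the one quoted earlier in the thread. The helper names are made up for illustration. Versions are compared as tuples of integers, because plain string comparison would misorder something like 7.9 and 7.16.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, ...]


def client_version(stderr_text: str) -> Optional[Version]:
    """Extract the BOINC client version from a result log, or None if absent."""
    match = re.search(
        r"<core_client_version>\s*(\d+(?:\.\d+)*)\s*</core_client_version>",
        stderr_text,
    )
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_older_than(version: Version, reference: Version) -> bool:
    """True if `version` sorts strictly before `reference`."""
    return version < reference


if __name__ == "__main__":
    log_text = "<core_client_version>7.14.3</core_client_version>"
    found = client_version(log_text)
    if found is None:
        print("No client version found in the log")
    else:
        print(found, "is older than 7.16.11:", is_older_than(found, (7, 16, 11)))
```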

The storage drive may be corrupted and may need a filesystem check on every drive letter.

Possibly do a multi-core memtest86+ and/or a Prime95 stress test. Run it for longer, several hours. Some memory errors only show up when the memory has heated up and has been running for a long time. Check the HDD/SSD health with S.M.A.R.T. tools. But in the end, the old computer could just be failing.

[Aug 19, 2021 11:38:44 PM]

Acibant
Advanced Cruncher
USA
Joined: Apr 15, 2020
Post Count: 126
Status: Offline

Re: unclear Invalids

7.14.3 is old, can update BOINC client.

Unfortunately, WCG's branded client offered on this site (at least for Windows) is still on that version. They can also configure the URL that the client checks for updates, so that it does not report the existence of a newer version until they give the green light themselves on their servers. Fortunately, erich56, you can download a newer version here and install it right over the old version. The work units in progress won't be lost, though they will revert to the last point where progress was saved (checkpoint).

[Aug 20, 2021 12:33:30 AM]
