World Community Grid Forums
Category: Active Research | Forum: Africa Rainfall Project | Thread: unclear Invalids
Thread Status: Active Total posts in this thread: 21
erich56
Senior Cruncher Austria Joined: Feb 24, 2007 Post Count: 294 Status: Offline Project Badges: |
Recently, two of my finished and uploaded WUs were classified as invalid.
See here: [link], which is too bad after 35 and 28 hours of CPU time. Unfortunately, stderr does not say what the problem was. Does anyone have an idea how I can find out? I hate to waste that many hours of CPU time :-(
[Edit 1 times, last edit by erich56 at Aug 19, 2021 12:55:57 PM]
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1317 Status: Offline Project Badges: |
I had a few unexpected errors with no clear reason. I suppose it has to do with a memory access violation.
The only error still on my results list is: https://www.worldcommunitygrid.org/ms/device/...og.do?resultId=1855097844
If I remember correctly, BOINC had just started a 4-core ATLAS task from LHC@home at that time . . . coincidence?
||
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges: |
Check if your D drive has enough space.
For most ARP1 invalids and some errors, there is possibly something wrong with your RAM; run memtest to check. If the CPU is overclocked or undervolted, it may have been pushed too far; back the settings off a little. Some invalids I had were from faulty non-ECC RAM. I got new ECC UDIMM DDR3 for my AMD FX 4100, Asus M5A97 R2.0, Debian 11. Much better, with no more invalids. Uptime is 70 days with 1 corrected memory error logged so far. Note: CPU, motherboard, and RAM must all support ECC to use ECC.
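The free-space check suggested above can be scripted. This is a minimal sketch in Python, assuming you know which drive holds the BOINC data directory; the path and the threshold used below are illustrative assumptions, not values published by the project.

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def has_enough_space(path, required_gib):
    """True if the volume holding `path` has at least `required_gib` GiB free."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # Assumption: replace this with the drive that holds your BOINC data,
    # for example "D:\\" on Windows. The 5 GiB threshold is arbitrary.
    path = os.getcwd()
    print(f"{free_gib(path):.1f} GiB free on the volume containing {path}")
    print("enough space:", has_enough_space(path, 5))
```

BOINC also applies its own disk limits from the computing preferences, so a drive with free space can still be "full" from the client's point of view.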
||
|
erich56
Senior Cruncher Austria Joined: Feb 24, 2007 Post Count: 294 Status: Offline Project Badges: |
The SSD has enough space.
RAM: 8 GB, DDR3, non-ECC; it underwent Memtest recently for a different reason, and the test was okay. The mainboard is an old Fujitsu D3041 with the Intel G41 chipset. The processor is an old Intel Core2 Quad Q9550 @ 2.83GHz, no overclocking. So maybe this old system is not the optimal one for ARP?
||
|
Acibant
Advanced Cruncher USA Joined: Apr 15, 2020 Post Count: 126 Status: Offline Project Badges: |
We can't see your devices and work units even with that link. Can you click on the "Error" link next to one of the work units in question and copy and paste the content in this thread so we can see the exact error messages?
||
|
erich56
Senior Cruncher Austria Joined: Feb 24, 2007 Post Count: 294 Status: Offline Project Badges: |
Oh, sorry, I was not aware that the page I linked to cannot be seen by others.
Anyway, the Error link shows the following:
Result Name: ARP1_ 0027300_ 086_ 0--
<core_client_version>7.14.3</core_client_version>
<![CDATA[
<message>
couldn't start app: Can't get shared memory segment name: shmget() failed</message>
]]>
What makes me wonder, though, is that it says "couldn't start app ..." and still the task ran for 35 hours and the other one for 28 hours. To me, this does not fit together, does it?
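When several results need checking, the two useful fields in a log like the one above (client version and message) can be pulled out mechanically. This is a rough sketch that treats the log as plain text and uses regular expressions; the function names are made up for illustration, and the sample is the text quoted in this post.

```python
import re


def parse_result_log(text):
    """Extract the client version and the <message> body from a result log."""
    version = re.search(
        r"<core_client_version>(.*?)</core_client_version>", text
    )
    message = re.search(r"<message>(.*?)</message>", text, re.DOTALL)
    return {
        "client_version": version.group(1).strip() if version else None,
        "message": message.group(1).strip() if message else None,
    }


def failed_before_start(message):
    """True if the message says the client could not launch the application."""
    return message is not None and message.lower().startswith(
        "couldn't start app"
    )


SAMPLE = """<core_client_version>7.14.3</core_client_version>
<![CDATA[
<message>
couldn't start app: Can't get shared memory segment name: shmget() failed</message>
]]>"""

if __name__ == "__main__":
    info = parse_result_log(SAMPLE)
    print(info["client_version"])
    print(info["message"])
    print("failed before start:", failed_before_start(info["message"]))
```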
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12160 Status: Offline Project Badges: |
erich56
You have given us some specs, but what units were you actually running at the time? ARP can be very intensive. Mike |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 884 Status: Recently Active Project Badges: |
erich56
If those tasks really used that much CPU time doing something constructive, that's a very strange "log file." It looks like the sort of report one gets when the BOINC run-time needs to report an issue during set-up (no startup information, et cetera)...
I presume you're running some flavour of Windows, as that seems to be the place that throws up errors of that type (and not just for ARP - my Linux boxes keep getting retries for all projects where the single original task was on Windows and failed!). There have been various discussions about that issue. I haven't followed them closely (no Windows systems!) but I seem to recall that it was a mixture of client version and the number of shared memory segments already allocated to processes - users with plenty of memory spare were getting the error (which implies a table size limit somewhere was being hit.)
Given that, Mike's "what else was it doing at the time" query is likely to be pertinent.
But I'll return to my original remark - if it had managed to start the ARP1 application properly I'd've expected to see at least the initial INFO lines and the "Starting WRFMain" line. Does the BOINC log on your system indicate that the workunit(s) in question ever checkpointed? In fact, checking said log for all lines containing the relevant work unit names might be revealing!
Good luck troubleshooting - hopefully all will become clear at some point.
Cheers - Al.
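The log search suggested above (all lines that mention a given work unit) could be done along these lines. This is only a sketch: the log path is something you have to supply yourself (the BOINC client normally keeps its event log as a text file in its data directory), and the work unit name and sample lines below are invented for illustration.

```python
def lines_for_workunit(log_lines, workunit_name):
    """Return the log lines that mention the work unit (case-insensitive)."""
    needle = workunit_name.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


def read_log(path):
    """Read a text log, replacing any bytes that do not decode cleanly."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readlines()


if __name__ == "__main__":
    # Hypothetical sample lines; a real run would use read_log(<your log path>).
    sample = [
        "19-Aug-2021 08:00:01 Starting task ARP1_0000000_000_0\n",
        "19-Aug-2021 08:00:05 Starting task OTHER_0000001_0\n",
        "19-Aug-2021 09:10:42 Task arp1_0000000_000_0 checkpointed\n",
    ]
    for line in lines_for_workunit(sample, "ARP1_0000000_000"):
        print(line)
```

If no "checkpointed" lines turn up, note that the client may only log checkpoints when the corresponding debug flag is enabled in its logging configuration, so their absence alone proves little.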
||
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges: |
7.14.3 is old; you can update the BOINC client. If this computer's BOINC version is already 7.16.11 or newer, then I guess you are either looking at someone else's status, or at a different computer of yours, or this may be a very old result from an old version of BOINC. Check the date and time.
The storage drive could be corrupt and may require filesystem checks on all drive letters. Possibly run a multi-core memtest86+ and/or a Prime95 stress test. Run it for longer, several hours. Some memory errors only show up when the system has heated up and been running for a long time. Check the HDD/SSD health with S.M.A.R.T. tools. But in the end, the old computer could just be failing.
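For anyone checking several machines, the version comparison mentioned above (7.14.3 against 7.16.11) is easy to get wrong with plain string comparison, because "7.9" would sort after "7.16" as text. A small sketch that compares the numeric parts instead; the function names are illustrative, and it assumes purely numeric dotted versions.

```python
def parse_version(text):
    """Turn a dotted version such as '7.14.3' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_older(installed, minimum):
    """True if `installed` is strictly older than `minimum`."""
    return parse_version(installed) < parse_version(minimum)


if __name__ == "__main__":
    print(is_older("7.14.3", "7.16.11"))
```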
||
|
Acibant
Advanced Cruncher USA Joined: Apr 15, 2020 Post Count: 126 Status: Offline Project Badges: |
"7.14.3 is old, can update BOINC client."
Unfortunately, WCG's branded client offered on this site (at least for Windows) is still on that version, and they can configure the update-check URL so that it does not report the existence of a newer version until they give the green light themselves on their servers. Fortunately, erich56, you can download a newer version here and install it right over the old version. The work units in progress won't be lost, though they will revert to the last point where progress was saved (checkpoint).
||
|
|