Thread Status: Active
Total posts in this thread: 27
Posts: 27   Pages: 3
This topic has been viewed 4491 times and has 26 replies.
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: CPU failure ?

KerSamson wrote:
Hi adriverhoef,
IKWYM (I know what you mean) ;)
However, I do not see why the problem would be related to the contents of the BOINC data directory.
Nevertheless, if nothing else helps, I will try it.
Cheers,
Yves

OK Yves, let me explain my thinking: a BOINC directory is tied to the CPU/hardware and to the operating system (with its programs, libraries and so on) on which BOINC runs. If you swap the BOINC directories between the two machines (which should be possible as long as the OSs are compatible), you don't need to install a complete operating system to find out where the differences, if any, lie.

So, if you swap the BOINC directories, will the invalids move to the other machine along with the BOINC directory, or will they stay on the same machine?
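The swap itself can be done with any archiving tool. As a minimal sketch, assuming the BOINC client is stopped on both machines first and Python 3 is available, the following packs a data directory into a tarball and unpacks it on the other machine. The function names are hypothetical, and the location of the data directory (often /var/lib/boinc-client with the Debian/Ubuntu packages) is an assumption to check on each machine.

```python
import tarfile
from pathlib import Path


def pack_boinc_dir(data_dir: str, archive_path: str) -> None:
    """Pack a stopped BOINC data directory into a gzip-compressed tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to the data directory itself,
        # so the archive can be unpacked under any target path.
        tar.add(str(Path(data_dir)), arcname=".")


def unpack_boinc_dir(archive_path: str, target_dir: str) -> None:
    """Unpack an archive created by pack_boinc_dir into target_dir."""
    dst = Path(target_dir)
    dst.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "tar_filter"):
            # Python 3.12+ (and security backports): the "tar" filter keeps
            # normal permission bits but refuses paths outside target_dir.
            tar.extractall(str(dst), filter="tar")
        else:
            tar.extractall(str(dst))
```

Write the archive outside the data directory, copy it to the other machine, move that machine's own directory aside, and unpack into the original path before restarting the client. File ownership is worth checking afterwards, since the sketch does not set it explicitly.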
----------------------------------------
[Edit 1 times, last edit by adriverhoef at Jul 20, 2019 9:28:24 AM]
[Jul 20, 2019 9:13:25 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ?

Hi adriverhoef,
would a project detach / fresh boinc install / project attach have a similar impact?
My feeling is that you have raised an interesting possible cause, since the BOINC directory is very old (about 10 years) and has been migrated several times through multiple hardware and OS upgrades:
- Athlon II x2, Athlon II x4, Ryzen 2700
- Ubuntu 10.04, Ubuntu 14.04, LinuxMint 17, Ubuntu 18.04

Cheers,
Yves
----------------------------------------
[Jul 20, 2019 9:25:38 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: CPU failure ?

At startup, BOINC prints to its log the versions of several libraries it uses. Are they the same on the two machines (Debian vs Ubuntu)? If not, I would check whether the errors (invalids) follow the OS.
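A small sketch of that comparison, assuming the startup log contains a line with "Libraries:" followed by name/version tokens (such as libcurl/... and OpenSSL/...), which is how recent Linux clients usually report them. The exact format is an assumption here, so check it against your own log; the helper names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def parse_libraries(log_text: str) -> Dict[str, str]:
    """Return {library: version} from the first 'Libraries:' line of a BOINC log."""
    for line in log_text.splitlines():
        if "Libraries:" in line:
            libs: Dict[str, str] = {}
            for token in line.split("Libraries:", 1)[1].split():
                token = token.strip("()+")  # e.g. "(+libidn2/x.y)" -> "libidn2/x.y"
                if "/" in token:
                    name, version = token.split("/", 1)
                    libs[name] = version
            return libs
    return {}


def diff_libraries(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """List libraries whose version differs, or that appear on only one machine."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Feed it the log text from each machine (with the Debian/Ubuntu packages the client log is usually stdoutdae.txt in the data directory, another assumption) and print diff_libraries(parse_libraries(log_a), parse_libraries(log_b)); an empty result means the reported versions match.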
[Jul 20, 2019 2:14:58 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: CPU failure ?

KerSamson wrote:
Hi adriverhoef,
would a project detach / fresh boinc install / project attach have a similar impact?

On the offending device? We can't tell yet, Yves, because we don't know what is causing the invalids or where the problem lies.

Doneske also made a good suggestion that you could start investigating with.

You mention that you have a very old BOINC directory. The interesting question is indeed whether the age of the BOINC directory (i.e. the contents of its configuration files) matters at all. I would say, and hope, that it doesn't, since that would mean more people are probably affected by this.

The oldest BOINC directory I can find at home dates from February 2018.
[Jul 20, 2019 3:13:05 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ?

Hi Doneske,
Thank you!
I will check the libraries next week (the machines are located in my office).
Cheers,
Yves
----------------------------------------
[Jul 20, 2019 6:25:05 PM]
Country Bumkin
Cruncher
Australia
Joined: May 14, 2008
Post Count: 14
Re: CPU failure ?

A complete guess, but use the package manager to confirm that the Ubuntu machine has the amd64-microcode package installed and up to date. It is installed as standard on Linux Mint 19, but check that it is up to date.
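On Debian-based systems (Debian, Ubuntu, Linux Mint) the installed state and version of a package can be read with dpkg-query. A minimal Python sketch, assuming dpkg-query is on the PATH; the helper names are hypothetical, and the sketch only reports the installed version, so comparing it with the newest available one is still left to the package manager (for example apt-cache policy amd64-microcode).

```python
import subprocess
from typing import Optional, Tuple


def parse_dpkg_status(output: str) -> Tuple[bool, Optional[str]]:
    """Parse 'dpkg-query -W -f=${Status} ${Version}' output into (installed, version)."""
    parts = output.split()
    # ${Status} is three words, e.g. "install ok installed"; the version follows.
    if len(parts) >= 4 and parts[2] == "installed":
        return True, parts[3]
    return False, None


def query_package(name: str = "amd64-microcode") -> Tuple[bool, Optional[str]]:
    """Ask dpkg whether a package is installed (Debian-based systems only)."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Status} ${Version}", name],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:  # package unknown to dpkg
        return False, None
    return parse_dpkg_status(result.stdout)
```

Running query_package() on both machines and comparing the two results is enough to tell whether they carry the same microcode package version.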
----------------------------------------
Regards C Bumkin
----------------------------------------
[Edit 1 times, last edit by Country Bumkin at Jul 22, 2019 3:05:23 AM]
[Jul 22, 2019 3:04:21 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ?

Update #1
----
After a close check and comparison of the two Ryzen 2700 machines:
- Libraries are OK
- Microcode is OK
- CPU release/masks are identical
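One way to make such a comparison repeatable is to save /proc/cpuinfo from each machine and diff a few identifying fields. The sketch below assumes the Linux x86 /proc/cpuinfo layout ("key : value" lines for each logical CPU) and picks model name, family, model, stepping and the loaded microcode revision; which fields were actually compared in the thread is not stated, so this choice is an assumption.

```python
from typing import Dict, Set, Tuple, List

FIELDS = ("model name", "cpu family", "model", "stepping", "microcode")


def cpu_signature(cpuinfo_text: str) -> Dict[str, Set[str]]:
    """Collect the distinct values of selected /proc/cpuinfo fields over all logical CPUs."""
    sig: Dict[str, Set[str]] = {f: set() for f in FIELDS}
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in sig:
            sig[key].add(value.strip())
    return sig


def compare_cpus(text_a: str, text_b: str) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return only the fields whose values differ between the two machines."""
    a, b = cpu_signature(text_a), cpu_signature(text_b)
    return {f: (sorted(a[f]), sorted(b[f])) for f in FIELDS if a[f] != b[f]}
```

An empty result from compare_cpus means the two CPUs report the same identity and microcode revision.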

Yesterday I let the machine run out of work. Afterwards:
- Project reset (clean boincdata directory)
- Machine detached from WCG
- Reboot
- Machine reattached to WCG.
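The same sequence can be driven from the command line with boinccmd. A sketch, assuming boinccmd can reach the local client and that you have the account key for the project; the project URL below is an assumption to check against the one shown in BOINC Manager. The stages have to be run separately: wait until all queued tasks are finished and reported after "drain", and reboot between "clean" and "attach".

```python
import subprocess
from typing import Dict, List

# Assumed World Community Grid project URL; use the exact URL your client shows.
WCG_URL = "http://www.worldcommunitygrid.org/"


def reset_plan(project_url: str, account_key: str) -> Dict[str, List[List[str]]]:
    """Group the boinccmd calls into stages that must be run one after another."""
    return {
        # Stop fetching new tasks so the queue can run dry.
        "drain": [["boinccmd", "--project", project_url, "nomorework"]],
        # Discard the project's files, then detach from the project.
        "clean": [
            ["boinccmd", "--project", project_url, "reset"],
            ["boinccmd", "--project", project_url, "detach"],
        ],
        # After the reboot: attach again using the account key.
        "attach": [["boinccmd", "--project_attach", project_url, account_key]],
    }


def run_stage(plan: Dict[str, List[List[str]]], stage: str, dry_run: bool = True) -> None:
    """Print each command of one stage; execute it only when dry_run is False."""
    for cmd in plan[stage]:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

For example, run_stage(reset_plan(WCG_URL, key), "drain") only prints the command; pass dry_run=False once the printed commands look right.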

The machine computed SCC WUs for a couple of hours:
- so far, about 10% of the work has been validated
- the rest is either invalid or still pending.

The next step will be to switch to LinuxMint.
If that does not help, it must be a CPU failure or a very unusual RAM failure.
----------------------------------------
[Jul 26, 2019 8:04:19 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: CPU failure ?

Bad or intermittent RAM or CPU caches; even the swap file could be sitting on a corrupt area of the disk.
[Jul 26, 2019 12:00:40 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline
Project Badges:
Reply to this Post  Reply with Quote 
Re: CPU failure ?

With 16 GB of RAM, swap usage stays at 0.
In the meantime I have about 60 valid WUs, about 25 invalid WUs and about 120 pending WUs.
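Because pending results have no verdict yet, the more telling figure is the invalid share among the results that have already been checked. A quick worked calculation with the approximate counts above (they are "about" figures, so the percentages are rough):

```python
def invalid_share(valid: int, invalid: int, pending: int) -> tuple:
    """Return (invalid share of checked results, share of all results still pending)."""
    decided = valid + invalid  # pending results have no verdict yet
    total = decided + pending
    return invalid / decided, pending / total


inv, pend = invalid_share(valid=60, invalid=25, pending=120)
print(f"{inv:.0%} of validated results invalid, {pend:.0%} still pending")
# prints: 29% of validated results invalid, 59% still pending
```

So roughly three out of ten checked results were still failing validation at that point, with most of the work not yet checked.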
----------------------------------------
[Jul 26, 2019 5:06:26 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ? - Maybe solved ?

Update #2
---
Good news :) and an interesting case.
After the BOINC data directory was purged (project reset) at about 2019-07-26 00:30 UTC, the machine generated only 21 invalid WUs (all of them during the first day); the rest of the work performed (SCC and ZIKA) is valid.

Many thanks to Adri for suggesting that a possible cause (what exactly still remains open) could lie in the contents of the BOINC data directory, including the BOINC client files.
As mentioned, this directory was about 10 years old and had gone through multiple migrations over the years due to BOINC upgrades, hardware changes and OS changes (from Ubuntu 10.04 through 12.04, 14.04 and LinuxMint 17 to LinuxMint 19).
Thank you all for your valuable inputs.
I would be very interested in feedback from the TechTeam, just to understand (if possible) what happened over the years.

Cheers,
Yves
---
PS: I will keep a close eye on what this machine does in the near future.
----------------------------------------
[Edit 2 times, last edit by KerSamson at Jul 28, 2019 7:45:35 AM]
[Jul 28, 2019 7:30:02 AM]