Thread Status: Active
Total posts in this thread: 10
This topic has been viewed 1433 times and has 9 replies
jmcgaw
Advanced Cruncher
US
Joined: Feb 2, 2007
Post Count: 54
Why "No Reply"?

I don't check my results often and it appears that I really should. Just looked and I've got uncounted hundreds of projects hung on my most prolific machine with "No Reply" as the status. This has been going on since October 2nd but my other machines are still cooking along so it shouldn't be a network problem. I can't find any explanation of what this might mean or what might cause it, let alone a fix. Any help on offer?
[Oct 16, 2019 3:10:27 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7662
Re: Why "No Reply"?

Sometimes machines hang for no apparent reason (although there is a reason which may be hard to track down.) It is a good policy to just reboot them from time to time. I am guessing that if you reboot the machine in question it will go back to processing units just as before.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Oct 16, 2019 3:23:21 PM]
jmcgaw
Advanced Cruncher
US
Joined: Feb 2, 2007
Post Count: 54
Re: Why "No Reply"?

No, that won't help. The computer with the problem has been rebooted several times over the past two weeks, what with updates and all, and that never kicked the accounting process in the arse. I can't even figure out what the "no reply" means. The system has been crunching units all the time and sending them off (thus the hundreds of units just sitting there on the server), and I do glance at that at least once a day. How can I find out, precisely, what this error means? Is there something wrong at this end or the other end?

UPDATE: MORE WEIRDNESS!

I just did a closer look at the device results page and found a real anomaly in that there are TWO entries for the problem machine (named Beauty). One shows the last return from the "bottom" Beauty as 10/02/2019 and the "top" Beauty as 10/16/2019 (today). I have not a single clue as to what could have happened here as there has only ever been a single machine with that name. Now the question is, is there some way to merge the results to keep the overall returns correct or, failing that, some way to remove reference to the "bottom" instance and sacrifice my stats?


----------------------------------------
[Edit 1 times, last edit by jmcgaw at Oct 16, 2019 8:11:19 PM]
[Oct 16, 2019 7:51:24 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 798
Re: Why "No Reply"?

jmcgaw said:
Now the question is, is there some way to merge the results to keep the overall returns correct or, failing that, some way to remove reference to the "bottom" instance and sacrifice my stats?

Ha!
The ability to merge duplicate device/host IDs into one has been a requested feature on WCG for years, but it hasn't been a priority for them. Other BOINC projects [that don't use custom or legacy code on the back end] have that feature by default. I have a duplicate pair I'd like to merge as well. I hope this functionality becomes a reality on WCG. I'm holding my breath.

Regarding your "No Reply" question, I bet the work units that show "No Reply" all correspond to the original "Beauty" device that last returned results on 10/2/19. Have you changed any hardware on Beauty that would cause the WCG server to reject it and create the new Beauty device? In other words, have you upgraded RAM, installed a new hard drive or SSD, or updated to a newer Windows 10 version?
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Oct 16, 2019 9:16:24 PM]
jmcgaw
Advanced Cruncher
US
Joined: Feb 2, 2007
Post Count: 54
Re: Why "No Reply"?

The only thing I can think of was not a hardware change: I had a glitch and had to restore a 2-day-old image of the W10 OS over the existing one. Not knowing what the project's software uses to identify a client I can't know if that would be enough to confuse the situation. The restored image came from Beauty and was restored onto Beauty.
[Oct 16, 2019 10:13:14 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 798
Re: Why "No Reply"?

jmcgaw said:
The only thing I can think of was not a hardware change: I had a glitch and had to restore a 2-day-old image of the W10 OS over the existing one. Not knowing what the project's software uses to identify a client I can't know if that would be enough to confuse the situation. The restored image came from Beauty and was restored onto Beauty.

That would do it. There's a "client_state.xml" file in C:\ProgramData\BOINC that holds a counter which is incremented each time the BOINC client talks to the server. If the server sees a counter that is lower than the one it last recorded, it'll abandon ship and create a new device/host ID. So when you restored an image that was two days older, it restored an older client_state.xml file with a previously used counter. Any work units in progress would simply never be heard from again since they were deleted (which explains the "No Reply" status).
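To make the counter idea concrete, here is a minimal Python sketch, not the actual WCG or BOINC server code. It assumes the counter is the <rpc_seqno> element inside each <project> block of client_state.xml and that the file parses as ordinary XML; the project URL, host ID, and numbers in SAMPLE are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_rpc_seqnos(state_xml: str) -> dict[str, int]:
    """Map each project's master URL to its RPC sequence counter.

    Assumption: the counter lives in <rpc_seqno> inside <project>.
    """
    root = ET.fromstring(state_xml)
    seqnos = {}
    for project in root.iter("project"):
        url = project.findtext("master_url", default="").strip()
        seqno = project.findtext("rpc_seqno")
        if url and seqno is not None:
            seqnos[url] = int(seqno)
    return seqnos


def looks_like_restored_copy(client_seqno: int, server_seqno: int) -> bool:
    """Toy version of the check: a counter that went backwards is suspicious."""
    return client_seqno < server_seqno


# Hypothetical excerpt of a client_state.xml restored from an older image.
SAMPLE = """<client_state>
  <project>
    <master_url>https://example.org/project/</master_url>
    <hostid>12345</hostid>
    <rpc_seqno>410</rpc_seqno>
  </project>
</client_state>"""

if __name__ == "__main__":
    restored = read_rpc_seqnos(SAMPLE)["https://example.org/project/"]
    # The server last saw 455 from the machine before the image restore.
    print(looks_like_restored_copy(restored, server_seqno=455))
```

Reading the value before and after an image restore is a quick way to see whether the counter went backwards.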

I think to avoid this in the future, if there's ever a need to re-image the computer, make sure BOINC is completely uninstalled and the C:\ProgramData\BOINC folder is deleted, then reinstall BOINC, all before the machine even connects to the Internet.

In the meantime, join me in holding our breath for the ability to merge devices. :P
----------------------------------------
[Edit 1 times, last edit by hchc at Oct 17, 2019 1:35:35 AM]
[Oct 17, 2019 1:34:19 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1672
Re: Why "No Reply"?

That is one of the reasons why I store the BOINC data directory outside of the C: partition.
If the BOINC data is on D:, for example, an image restore of C: will not affect it, including client_state.xml.
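As a small illustration of that rule, the following Python sketch checks whether a given data directory sits on the drive that gets re-imaged. The function name and the D:\boincdata path are hypothetical; only C:\ProgramData\BOINC comes from this thread.

```python
from pathlib import PureWindowsPath


def on_system_drive(data_dir: str, system_drive: str = "C:") -> bool:
    """Return True if data_dir would be overwritten by an image restore
    of system_drive (comparison is by drive letter only)."""
    return PureWindowsPath(data_dir).drive.upper() == system_drive.upper()


if __name__ == "__main__":
    for path in (r"C:\ProgramData\BOINC", r"D:\boincdata"):
        status = "at risk" if on_system_drive(path) else "safe"
        print(f"{path}: {status} when C: is restored from an image")
```

This only compares drive letters, so it says nothing about two drive letters that share one physical disk.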
Cheers,
Yves
----------------------------------------
[Oct 17, 2019 7:36:30 AM]
jmcgaw
Advanced Cruncher
US
Joined: Feb 2, 2007
Post Count: 54
Re: Why "No Reply"?

That is something I should probably look into although this is the first time (that I recall) that anything like this has happened to me. It might prove to be a problem on a few of my stripped-down "crunching-only" machines which have only a single disk and partition remaining. Or maybe I'll just forget about the whole thing and not worry about lost credit -- I'm not competing with anybody but myself and as long as the units get crunched then the science is being done and that is the only real reason I spend my $$$ on electricity toward the projects.

Then again, I wonder if the boinc data could be stored on one of the redundant Drobo NAS units?...
[Oct 17, 2019 1:34:31 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 798
Re: Why "No Reply"?

I have a couple "crunching only" machines. Everything is on one partition on a 16 GB USB flash drive. If/when that drive stops working, I'll just reinstall on a new one, and it should match the device/host ID as long as I give it the same hostname and OS. In other words, I don't bother even doing backups of that machine, since worst case I just lose the work units in progress.
[Oct 18, 2019 10:14:38 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1672
Status: Offline
Project Badges:
Reply to this Post  Reply with Quote 
Re: Why "No Reply"?

@hchc
I systematically apply some segregation principles when configuring systems.

For Windows-based systems, e.g.:
> C: System only: OS + Applications, without any data
> D: Data, including boincdata directory
> E: Documentation material, installation directories ($setup$.exe, .msi, ...)
> F: Entertainment, e.g. .mp3, .mp4, ...
This strategy makes it possible to prioritize backup activities differently, to generate images of C: without any data, just system stuff, etc.

On Unix/Linux-based systems, I follow the recommended standard Linux file system structure. Except for VMs without much data, I always configure the machine with a separate /home partition. This lets me reinstall a machine "from scratch" without jeopardizing my data, as long as I never allow the installation script to partition the disks on its own.
In this case, the BOINC data is stored in /var/lib/boinc-client.
You need to be careful when reinstalling a system: even if /var resides on a dedicated partition, the /var directory structure is fully reset during a new system installation (I made that mistake one day and lost my boincdata directory).
To avoid this problem, I stop the BOINC project using BOINC Manager:
1/ No new work
2/ Suspend project (as soon as all WUs have been processed)
3/ Shut down BOINC.
Afterwards I copy /var/lib/boinc-client to /home.
After the new system installation, I copy boinc-client back to /var/lib before installing BOINC on the new configuration. Normally, assuming that you took proper care of ownership, group membership, and access rights, the machine should be operational immediately after the BOINC installation. Afterwards, do not forget to reactivate the project using BOINC Manager:
1/ Resume project
2/ Allow new tasks.
Hopefully, the explanation is clear enough.
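For anyone who prefers the command line, the steps above can be sketched as a dry-run plan in Python. This is an assumption-laden sketch, not what I described doing in BOINC Manager: it uses the boinccmd tool (project operations nomorework, suspend, resume, allowmorework, and --quit) instead of the Manager, and "cp -a" to keep ownership and permissions. The project URL and the backup path under /home are placeholders, and boinccmd may need to be run from the data directory or given the GUI RPC password. The script only prints the commands; it executes nothing.

```python
import shlex

DATA_DIR = "/var/lib/boinc-client"


def before_reinstall(project_url: str, backup_dir: str):
    """Steps to run on the old system, as (note, command) pairs."""
    return [
        ("Stop fetching new work",
         ["boinccmd", "--project", project_url, "nomorework"]),
        ("Wait until all WUs have been processed and reported", None),
        ("Suspend the project",
         ["boinccmd", "--project", project_url, "suspend"]),
        ("Shut down the BOINC client", ["boinccmd", "--quit"]),
        # backup_dir must not exist yet, otherwise cp nests the copy inside it
        ("Copy the data directory, keeping owner and permissions",
         ["sudo", "cp", "-a", DATA_DIR, backup_dir]),
    ]


def after_reinstall(project_url: str, backup_dir: str):
    """Steps to run on the freshly installed system."""
    return [
        # DATA_DIR must not exist yet, i.e. BOINC is not installed so far
        ("Copy the data back before installing BOINC",
         ["sudo", "cp", "-a", backup_dir, DATA_DIR]),
        ("Install BOINC and check ownership of the data directory", None),
        ("Resume the project",
         ["boinccmd", "--project", project_url, "resume"]),
        ("Allow new tasks",
         ["boinccmd", "--project", project_url, "allowmorework"]),
    ]


def print_plan(steps) -> None:
    """Print each step as a comment plus the command, if there is one."""
    for note, command in steps:
        print(f"# {note}")
        if command is not None:
            print(shlex.join(command))


if __name__ == "__main__":
    url = "https://example.org/project/"      # placeholder project URL
    backup = "/home/boinc-client-backup"      # placeholder backup location
    print_plan(before_reinstall(url, backup))
    print_plan(after_reinstall(url, backup))
```

Keeping the plan as data and printing it first makes it easy to review each command before running anything as root.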
Happy crunching,
Yves
---
PS: Regarding Windows, I already worked out this segregation strategy back in the MS-DOS era (more than 30 years ago).

PPS: After trying different ideas for Unix/Linux systems, I now create an installation repository in /home/sw-depot for the software I install manually. Likewise, if the machine has to run virtual machines (VMs), I create a dedicated partition, /VM, for storing the virtual disks (.vdi, .vmdk).
----------------------------------------
[Oct 19, 2019 8:17:51 AM]