Thread Status: Active
Total posts in this thread: 16
Posts: 16   Pages: 2   [ 1 2 | Next Page ]
This topic has been viewed 3573 times and has 15 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

I've been getting occasional crashes of wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu during a daily scheduled, automated reboot. Right before the reboot, a "suspend" is issued via boinccmd for the WCG project (my only BOINC project). The crash appears to be the direct result of issuing the suspend, but it does not happen every time. There are actually 3 (or 4) crashes, presumably one per running task. After the reboot, BOINC recovers nicely, and the damage is limited to a few lost CPU cycles. I haven't actually captured a core dump because of SELinux issues, but that could presumably be corrected.

I haven't notified Red Hat / Fedora of the problem, since this seems to be a WCG program that is crashing. Is there anyone at WCG that would be interested in pursuing this, or is this too trivial to bother with, since it isn't really causing a significant problem?

I'm not experiencing any other problems with my boinc runs. I know that the suspend isn't even necessary - I could just let the reboot kill boinc, and it would recover, but I'm trying to close down boinc properly, instead of just killing it. This method has been successfully running for about 1 year, and only recently started occasionally crashing.
[Jan 10, 2019 6:51:17 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7851
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

Since you have been running for about a year with no problems, I suspect (guessing) the problem might be hardware related on your end, perhaps a flaky memory module or a memory heat issue. I know these can be really difficult to pin down because they are so intermittent.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jan 10, 2019 9:46:41 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

My CPU temps are very low, and a fast (10-minute) BIOS memory check shows no problems. If this were a hardware problem I would expect random failures in other processes, but I'm not seeing that, nor am I seeing any invalid BOINC results returned to WCG; there is just this crash when a suspend is done (sometimes). There is one other intermittent bug I'm seeing: when I exit my email client ("evolution"), on rare occasions gnome-shell will crash. If there is any connection between the two, I'd expect it to be a software one, not hardware. I'm running Fedora Linux, the latest release, which may have nightly upgrades, so it isn't exactly a stable software platform and maybe not the best for BOINC.
[Jan 11, 2019 12:36:05 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

Instead of suspend, why not just stop boinc with boinccmd? That will end the processes and not leave them suspended when the reboot happens.
[Jan 11, 2019 12:45:40 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

As an analogy, in a computer program you would first close all files and then exit the program; you wouldn't exit before closing the files, which would leave the output buffers unwritten. Not knowing the details of the BOINC processing, it seemed to me that doing a suspend first was the graceful, proper way to terminate BOINC. I know BOINC error recovery is very robust and the suspend isn't necessary, but it seems like there may be some wasted CPU cycles when just stopping the BOINC client: the processing done after the last checkpoint up until the ungraceful exit might be wasted (without the suspend). Not knowing the BOINC internals, my guess could well be wrong. And yes, you have to do a resume after the reboot, which I have handled automatically in a reboot script.

To clarify a point in my previous reply, I'm running the current "stable" Fedora release (29) - not the under development "rawhide" release.
[Jan 11, 2019 1:17:44 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2355
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

Have you considered stopping BOINC by using the more general way (with 'systemctl')?
# systemctl stop boinc-client.service
This should take care of stopping BOINC properly.

Regarding 'boinccmd' and the 'suspend' directive: I think 'suspend' should be used in conjunction with 'resume'.
----------------------------------------
[Edit 1 times, last edit by adriverhoef at Jan 12, 2019 3:50:37 PM]
[Jan 12, 2019 3:49:46 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

Stopping with boinccmd is the way the BOINC developers provide to stop the program. Filesystems today tend to take care of writing in-memory data to disk during shutdown as a way of maintaining filesystem integrity, except of course in the case of ungraceful shutdowns. Suspend doesn't do anything except make the science processes non-dispatchable to the processor (especially if LAIM, "leave applications in memory", is specified in the options).

I just did a suspend under Fedora 29 and it worked just fine on a 4-core and an 8-core system. I didn't do a reboot, just a suspend/resume. If you are using a script to perform the function, is there enough time for the suspend to complete before the next command executes (assuming the next command is shutdown)? Maybe put a wait of a few seconds between commands. Does it crash if you just do a suspend/resume in BOINC Manager?
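
One way to put a bounded wait between the suspend and the next command is a small polling helper. This is only a sketch: wait_for and no_task_executing are hypothetical names, and the "active_task_state: EXECUTING" text is an assumption about what "boinccmd --get_tasks" prints, so check it against the output on your own system first.

```shell
#!/bin/sh
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or the
# timeout expires. Returns 0 on success, 1 on timeout.
wait_for() {
    wf_timeout=$1
    shift
    wf_waited=0
    until "$@"; do
        if [ "$wf_waited" -ge "$wf_timeout" ]; then
            return 1
        fi
        sleep 1
        wf_waited=$((wf_waited + 1))
    done
    return 0
}

# Hypothetical check: succeeds once no task reports itself as executing.
# ASSUMPTION: the task list contains "active_task_state: EXECUTING"
# for tasks that are still running.
no_task_executing() {
    ! boinccmd --get_tasks | grep -q 'active_task_state: EXECUTING'
}

# Usage (commented out so the definitions can be sourced safely):
# boinccmd --project http://www.worldcommunitygrid.org suspend
# wait_for 30 no_task_executing || echo "tasks still running after 30 s" >&2
```

The timeout matters in a reboot script: if the check never succeeds, the script still moves on instead of hanging forever.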
[Jan 12, 2019 8:23:21 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

I have used suspend in boincmgr, and it has never crashed on me. It also does not crash most of the time when the suspend is done by boinccmd in a cron script.

My script (more or less) to stop boinc before doing a system update and then a reboot:

boinccmd --project http://www.worldcommunitygrid.org suspend
loop and sleep until no boinc transfers are active
systemctl stop boinc-client.service
loop and sleep, waiting until the boinc process is gone
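
A rough shell sketch of the four steps above, not the actual script. The function names, the process name "boinc", the poll interval, and the idea of counting "name:" lines in the output of "boinccmd --get_file_transfers" are all assumptions for illustration.

```shell
#!/bin/sh
PROJECT_URL="http://www.worldcommunitygrid.org"
BOINC_PROCESS="boinc"   # ASSUMPTION: name of the client process on this system
POLL_SECONDS=5          # ASSUMPTION: arbitrary poll interval

# Print the number of active file transfers.
# ASSUMPTION: each transfer is listed with one "name:" line.
active_transfers() {
    boinccmd --get_file_transfers | grep -c 'name:' || true
}

stop_boinc() {
    # 1. Suspend the project.
    boinccmd --project "$PROJECT_URL" suspend
    # 2. Loop and sleep until no transfers are active.
    while [ "$(active_transfers)" -gt 0 ]; do
        sleep "$POLL_SECONDS"
    done
    # 3. Stop the client through systemd.
    systemctl stop boinc-client.service
    # 4. Loop and sleep until the client process is gone.
    while pgrep -x "$BOINC_PROCESS" >/dev/null; do
        sleep "$POLL_SECONDS"
    done
}

# Usage (run as root, e.g. from cron, before the update and reboot):
# stop_boinc
```

Neither loop has a timeout here, so a stuck transfer would block the reboot; a real script would probably want an upper bound on both waits.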

My script (more or less) to resume boinc after reboot:

(boinc is started automatically by systemd at reboot)
loop and sleep, waiting until the boinc process is up
boinccmd --project http://www.worldcommunitygrid.org resume
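
The resume side could be sketched like this. Again the function name, the process name "boinc", and the retry limits are assumptions. The retry around the resume is there because the process being up does not necessarily mean the client is already accepting boinccmd requests; it relies on boinccmd returning a non-zero exit status when it cannot reach the client, which I have not verified.

```shell
#!/bin/sh
PROJECT_URL="http://www.worldcommunitygrid.org"
BOINC_PROCESS="boinc"   # ASSUMPTION: name of the client process on this system
POLL_SECONDS=2          # ASSUMPTION: arbitrary poll interval
MAX_TRIES=10            # ASSUMPTION: arbitrary retry limit for the resume

resume_boinc() {
    # Loop and sleep until the client process (started by systemd) is up.
    until pgrep -x "$BOINC_PROCESS" >/dev/null; do
        sleep "$POLL_SECONDS"
    done
    # Retry the resume until it succeeds or the retry limit is reached.
    rb_tries=0
    until boinccmd --project "$PROJECT_URL" resume; do
        rb_tries=$((rb_tries + 1))
        if [ "$rb_tries" -ge "$MAX_TRIES" ]; then
            return 1
        fi
        sleep "$POLL_SECONDS"
    done
    return 0
}

# Usage (after the reboot):
# resume_boinc || echo "could not resume the project" >&2
```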

I would like to know whether stopping BOINC ("boinccmd --quit") quits gracefully (does final checkpoints before actually quitting); the documentation doesn't say. It probably does do a final checkpoint, in which case suspend / resume isn't necessary, and suspend may not even force a final checkpoint.
[Jan 12, 2019 9:22:59 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

Using boincmgr, I suspended WCG, selected a long-running task, opened its properties, and noted the "slot" number, say "3". Then in the /var/lib/boinc/slots/3 directory I ran "ls -l" and saw that the checkpoint file was NOT updated when I did the suspend. I therefore conclude that "suspend / resume" does NOT "force a final checkpoint", so I'm reworking my scripts to remove the suspend / resume.
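
The same check can be scripted instead of eyeballing "ls -l". A minimal sketch, assuming GNU stat and assuming the checkpoint files can be picked out with the glob "*check*" (the file name differs per application, so adjust the pattern); snapshot_mtimes is a made-up helper name.

```shell
#!/bin/sh
# snapshot_mtimes SLOTS_DIR
# Prints "<mtime-in-epoch-seconds> <path>" for every file matching
# *check* in each slot directory (ASSUMPTION: that glob finds the
# checkpoint files). Prints nothing if there are no matches.
snapshot_mtimes() {
    sm_dir=$1
    for sm_file in "$sm_dir"/*/*check*; do
        [ -e "$sm_file" ] || continue
        printf '%s %s\n' "$(stat -c %Y "$sm_file")" "$sm_file"
    done
}

# Usage: compare the snapshot before and after the suspend.
# before=$(snapshot_mtimes /var/lib/boinc/slots)
# boinccmd --project http://www.worldcommunitygrid.org suspend
# sleep 10
# after=$(snapshot_mtimes /var/lib/boinc/slots)
# if [ "$before" = "$after" ]; then
#     echo "no checkpoint file was touched by the suspend"
# fi
```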

However, there is still an intermittent bug in wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu, or in the system routines which it calls.
[Jan 12, 2019 11:16:32 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Crashing on "suspend": wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu

Simply quitting BOINC (with no suspend/resume) causes final checkpoints to be written, so that no CPU cycles are wasted. After stopping BOINC, listing the checkpoints with "ls -ltr /var/lib/boinc/slots/*/*check*" showed that all checkpoints had timestamps within 12 seconds of one another.
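
To get that spread as a number rather than reading it off the listing, something like the following could be used. It is a sketch that assumes GNU stat and reuses the "*check*" glob from the ls command; checkpoint_spread is a made-up name.

```shell
#!/bin/sh
# checkpoint_spread SLOTS_DIR
# Prints the difference in seconds between the oldest and newest
# modification time among files matching SLOTS_DIR/*/*check*.
# Returns 1 (and prints nothing) if no such file exists.
checkpoint_spread() {
    cs_dir=$1
    cs_oldest=""
    cs_newest=""
    for cs_file in "$cs_dir"/*/*check*; do
        [ -e "$cs_file" ] || continue
        cs_time=$(stat -c %Y "$cs_file")
        if [ -z "$cs_oldest" ] || [ "$cs_time" -lt "$cs_oldest" ]; then
            cs_oldest=$cs_time
        fi
        if [ -z "$cs_newest" ] || [ "$cs_time" -gt "$cs_newest" ]; then
            cs_newest=$cs_time
        fi
    done
    [ -n "$cs_oldest" ] || return 1
    echo $((cs_newest - cs_oldest))
}

# Usage (after the client has been stopped):
# checkpoint_spread /var/lib/boinc/slots
```

A small spread right after the stop suggests every running task wrote its checkpoint as part of the shutdown; it says nothing about tasks that were not running at the time.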
[Jan 13, 2019 1:34:18 PM]