| World Community Grid Forums
Thread Status: Active Total posts in this thread: 16
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I've been getting occasional crashes of wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu when doing a daily scheduled, automated, reboot. Right before the reboot, using boinccmd, a "suspend" is issued for the WCG project (my only boinc project). The crash appears to be the direct result of issuing the suspend. It does not happen every time. There are actually 3 (or 4) crashes, corresponding to the running tasks I presume. After the reboot, boinc recovers nicely, and damage is limited to a few lost CPU cycles. I haven't actually captured a core dump because of SELinux issues, but that could presumably be corrected.
I haven't notified Red Hat / Fedora of the problem, since this seems to be a WCG program that is crashing. Is there anyone at WCG that would be interested in pursuing this, or is this too trivial to bother with, since it isn't really causing a significant problem? I'm not experiencing any other problems with my boinc runs. I know that the suspend isn't even necessary - I could just let the reboot kill boinc, and it would recover, but I'm trying to close down boinc properly, instead of just killing it. This method has been successfully running for about 1 year, and only recently started occasionally crashing. |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7851 Status: Offline Project Badges:
|
Since you have been running for about a year with no problems, I suspect (guessing) the problem might be hardware related on your end. Perhaps a flaky memory module or a memory heat issue. I know these can be really difficult to pin down because they are so intermittent.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My CPU temps are very low, and a fast (10-minute) BIOS memory check shows no problems. If this were a hardware problem I would expect there to be random failures in other processes, but I'm not seeing that, nor am I seeing any invalid boinc results returned to WCG - just this crash when a suspend is done (sometimes). There is one other intermittent bug I'm seeing - when I go to exit my email client ("evolution"), on rare occasions gnome-shell will crash. If there is any connection between the two, I'd expect it to be a software one, not hardware. I'm running Fedora Linux, the latest release, which may have nightly upgrades - so it isn't exactly a stable software platform - maybe not the best for boinc.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Instead of suspend, why not just stop boinc with boinccmd? That will end the processes and not leave them suspended when the reboot happens.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
As an analogy, in a computer program, you would first close all files, then exit the program - you wouldn't just exit the program before closing the files, which would leave the output buffers unwritten. Not knowing the details of the boinc processing, it seemed to me that doing a suspend first was the graceful, proper way to terminate boinc. I know boinc error recovery is very robust and the suspend isn't necessary, but without the suspend, just stopping the boinc client might waste some CPU cycles - the processing done after the last checkpoint up until the ungraceful exit. Not knowing the boinc internals, my guess could well be wrong. And yes, you have to do a resume after the reboot, which I have handled automatically in a reboot script.
To clarify a point in my previous reply, I'm running the current "stable" Fedora release (29) - not the under development "rawhide" release. |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2355 Status: Offline Project Badges:
|
Have you considered stopping BOINC by using the more general way (with 'systemctl')?
----------------------------------------
Regarding 'boinccmd' and the 'suspend' directive: I think 'suspend' should be used in conjunction with 'resume'. [Edit 1 times, last edit by adriverhoef at Jan 12, 2019 3:50:37 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Stopping with boinccmd is the way the BOINC developers provide to stop the program. Filesystems today tend to take care of writing in-memory data to the disk during shutdown as a way of maintaining filesystem integrity, except obviously in the case of ungraceful shutdowns. Suspend doesn't do anything except make the science processes non-dispatchable to the processor (especially if LAIM, "leave applications in memory", is specified in the options). I just did a suspend under Fedora 29 and it worked just fine on a 4-core and an 8-core system. I didn't do a reboot, just a suspend/resume. If you are using a script to perform the function, is there enough time to allow for the suspend to complete before the next command executes (assuming the next command is shutdown)? Maybe put a wait of a few seconds between commands. Does it crash if you just do a suspend/resume in boinc manager?
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have used suspend in boincmgr, and it has never crashed on me. It also does not crash most of the time when the suspend is done by boinccmd in a cron script.
My script (more or less) to stop boinc before doing a system update and then a reboot:
    boinccmd --project http://www.worldcommunitygrid.org suspend
    loop and sleep until no boinc transfers are active
    systemctl stop boinc-client.service
    loop and sleep, waiting until the boinc process is gone
My script (more or less) to resume boinc after reboot:
    (boinc is started automatically by systemd at reboot)
    loop and sleep, waiting until the boinc process is up
    boinccmd --project http://www.worldcommunitygrid.org resume
I would like to know if stopping boinc ("boinccmd --quit") quits gracefully (does final checkpoints before actually quitting) - the documentation doesn't say. It probably does do a final checkpoint, and using suspend / resume isn't necessary and maybe doesn't even force a final checkpoint. |
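For anyone wanting to try the "loop and sleep" parts of the script outlined above, here is a minimal sketch of how they might look. It is not the poster's actual script. The function names (wait_until, boinc_gone, boinc_up, stop_boinc), the POLL_SECS and MAX_TRIES settings, and the assumption that the client process is named "boinc" (overridable through BOINC_PROC) are all illustrative. The wait for active transfers is left out because the thread does not say how it was detected.

```shell
#!/usr/bin/env bash
# Sketch only: a polling helper plus the stop step from the post.
# Assumptions (not from the thread): the process name "boinc",
# the 5-second poll interval and the 60-try limit.

# wait_until CMD [ARGS...]: run CMD every POLL_SECS seconds until it
# succeeds. Returns 0 on success, 1 after MAX_TRIES failed attempts.
wait_until() {
    local tries=0
    until "$@"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "${MAX_TRIES:-60}" ]; then
            return 1
        fi
        sleep "${POLL_SECS:-5}"
    done
    return 0
}

# True when no process with the assumed client name is running.
boinc_gone() { ! pgrep -x "${BOINC_PROC:-boinc}" >/dev/null; }

# True when the client process is running (for the post-reboot side).
boinc_up() { pgrep -x "${BOINC_PROC:-boinc}" >/dev/null; }

# Stop the service, then wait until the client process has exited.
stop_boinc() {
    systemctl stop boinc-client.service
    wait_until boinc_gone
}
```

A cron job would source this and call `stop_boinc` before the update and reboot. After the reboot, `wait_until boinc_up` could be followed by the resume command from the post. The timeout is there so a hung client cannot block the reboot forever.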
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Using boincmgr, I suspended WCG. Then selected a long running task. Then got properties of that task, and noted the "slot" number, say "3". Then in the /var/lib/boinc/slots/3 directory did a "ls -l" and noted that the checkpoint file was NOT updated when I did the suspend. Therefore, I'm concluding that "suspend / resume" does NOT "force a final checkpoint". So I'm reworking my scripts to remove the suspend / resume.
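The manual check described above (look at the slot directory and see whether the checkpoint file changed) could be scripted roughly as follows. This is a sketch under assumptions: the function name checkpoints_newer_than and the marker-file approach are illustrative, the "*check*" file name pattern is borrowed from the ls command used later in this thread, and -mindepth/-maxdepth assume GNU find.

```shell
#!/usr/bin/env bash
# Sketch only: list checkpoint files modified after a marker file.
# Assumption: checkpoint files match "*check*" directly inside each
# slot directory, as in /var/lib/boinc/slots/3/.

# checkpoints_newer_than MARKER [SLOTS_DIR]
# Prints the path of every matching file whose modification time is
# later than MARKER's. Prints nothing if no checkpoint was rewritten.
checkpoints_newer_than() {
    local marker=$1
    local slots_dir=${2:-/var/lib/boinc/slots}
    find "$slots_dir" -mindepth 2 -maxdepth 2 -type f \
        -name '*check*' -newer "$marker"
}
```

To repeat the experiment: touch a marker file, issue the suspend, wait a little, then run `checkpoints_newer_than /path/to/marker`. Empty output would match the observation above that suspend did not update the checkpoint.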
However, there is still an intermittent bug in wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu, or in the system routines which it calls. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Simply quitting boinc (with no suspend/resume) causes final checkpoints to be written, so that no CPU cycles are wasted. After stopping boinc, listing the checkpoints with "ls -ltr /var/lib/boinc/slots/*/*check*" showed all checkpoints with the same timestamp, within 12 seconds.
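The "within 12 seconds" observation above can also be measured rather than read off the ls listing. The following is a rough sketch, not something from the thread: the function name checkpoint_spread is made up, the "*check*" pattern is the one from the ls command above, and `stat -c %Y` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: report how far apart the checkpoint timestamps are.

# checkpoint_spread [SLOTS_DIR]
# Prints the number of seconds between the oldest and the newest
# checkpoint file found in the slot directories.
checkpoint_spread() {
    local slots_dir=${1:-/var/lib/boinc/slots}
    # stat -c %Y prints each file's mtime in seconds since the epoch.
    stat -c %Y "$slots_dir"/*/*check* |
        sort -n |
        awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR) print last - first }'
}
```

Run right after stopping the client, a small number would suggest that all tasks checkpointed together at shutdown. A large number would suggest at least one task did not write a final checkpoint.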
|
||
|
|
|