Thread Status: Active
Total posts in this thread: 8
This topic has been viewed 1911 times and has 7 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Losing work on reboot

When I reboot my main system, according to the progress bar of the tasks list, I'm losing some of the work that's been done. This is better than the old problem of coming back from a reboot to discover projects are now reporting "Computation Error", but it's still a problem.

Here are two screen shots. The first is just before a recent reboot, the second just after.





FAH, ARP, some of the MCM tasks, and the one MIP task all show a loss of work.

Under Computing Preferences, I've got "Request tasks to checkpoint at most every X seconds" set to 15. The pop-up says that this saves the project state to disk so that the work can be continued from that point, but it's clear this isn't actually happening.

So - how do I get BOINC to save the work that's been done prior to a reboot? This is especially important for ARP, as this project takes forever to run, and really should be broken into smaller parts. Nonetheless, I'll continue to run it as long as I'm able to reboot without losing a noticeable chunk of what's already been crunched.
[Nov 11, 2019 1:19:55 PM]
Falconet
Master Cruncher
Portugal
Joined: Mar 9, 2009
Post Count: 3315
Status: Offline
Re: Losing work on reboot

ARP checkpoints every 12.5% of progress. This could mean several hours of lost work.
MIP and FAAH2 can have some long periods without checkpointing, though nothing like ARP (I've seen MIP not checkpointing for 45 minutes, for instance).
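
To put rough numbers on that, here is a minimal sketch of the arithmetic, assuming a task that can only resume from fixed checkpoints spaced 12.5% apart. The function name and the 24-hour total runtime in the usage note are hypothetical and only serve as illustration.

```python
import math


def progress_lost(progress, step=0.125):
    """Return the fraction of a task lost if it restarts from the last
    fixed checkpoint. `progress` is the fraction done (0.0 to 1.0) and
    `step` is the assumed spacing between checkpoints (12.5% here)."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")
    # The last checkpoint is the largest multiple of `step` not above progress.
    last_checkpoint = math.floor(progress / step) * step
    return progress - last_checkpoint


# Hypothetical example: a task at 36% of an assumed 24-hour runtime.
lost_hours = progress_lost(0.36) * 24
```

With these assumed numbers, rebooting at 36% sends the task back to 25%, which is a bit more than two and a half hours of crunching.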

MCM doesn't have that problem to my knowledge.

The setting you mention doesn't mean it will order the work units to checkpoint every 15 seconds. It simply tells work units not to checkpoint more often than once every 15 seconds.

Work units checkpoint according to how each application is programmed.
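
A small simulation may make the distinction clearer. This is only a sketch of the idea, not BOINC's actual code: it assumes the application decides when it is able to checkpoint (in the BOINC API an application typically asks `boinc_time_to_checkpoint()` at such points), and the preference only filters out checkpoints that would come too soon after the previous one. The function name and the sample timings are made up.

```python
def checkpoints_written(safe_points, min_interval):
    """Return the times (seconds since task start) at which a checkpoint
    is actually written.

    safe_points  -- times at which the application is able to checkpoint
    min_interval -- the "checkpoint at most every X seconds" preference
    """
    written = []
    last = 0  # treat the task start as the previous checkpoint
    for t in safe_points:
        # The preference can only suppress a checkpoint, never create one.
        if t - last >= min_interval:
            written.append(t)
            last = t
    return written


# An application that could checkpoint every 5 seconds is throttled to 15.
frequent = checkpoints_written(range(5, 65, 5), 15)

# An application that only reaches a safe point once an hour is unaffected.
rare = checkpoints_written([3600, 7200], 15)
```

In the second case the 15-second setting changes nothing, which is the ARP situation: the client cannot make a task checkpoint between the points its own code provides.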

ARP is the only project where you need to worry about lost work, because of the 8 fixed checkpoints each work unit has.
----------------------------------------


- AMD Ryzen 5 1600AF 6C/12T 3.2 GHz - 85W
- AMD Ryzen 5 2500U 4C/8T 2.0 GHz - 28W
- AMD Ryzen 7 7730U 8C/16T 3.0 GHz
[Nov 11, 2019 2:45:27 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Losing work on reboot

Thanks for the info. That's a rather odd design, especially for ARP, given how long it takes to reach 1/8 of the work.

It seems the client could really use an option to force all the projects to write a checkpoint. That way, when a user needs to do a reboot, he can suspend the work, write the checkpoints, reboot, and return to continue without work loss.
[Nov 11, 2019 4:44:32 PM]
hiimebm
Senior Cruncher
United States
Joined: Oct 19, 2014
Post Count: 305
Status: Offline
Re: Losing work on reboot

If you need to turn off your computer for the night but don't want to get reverted to the last checkpoint, enable the "Keep non-gpu tasks in memory" option in Options and put your PC to sleep instead.
[Nov 11, 2019 5:44:57 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Losing work on reboot

Suspension (or hibernation, if necessary) is better than powering down. Only re-boot when you need to (though I know that Windoze forces re-boots far too often). If you use the Properties button in the advanced BOINC Manager view it will tell you when a WU last checkpointed, so you might want to make sure that ARP isn't going to lose too much before you reboot.
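
The same check can be scripted rather than clicked through. The sketch below assumes the text report printed by `boinccmd --get_tasks` lists a "current CPU time" and a "checkpoint CPU time" for each task, in the layout shown in the sample; the sample text, task names, and numbers are invented for illustration, so check the output of your own client version before relying on it.

```python
import re

# Invented sample in the layout assumed for `boinccmd --get_tasks` output.
SAMPLE = """\
======== Tasks ========
1) -----------
   name: ARP1_example_task_0
   current CPU time: 20000.0
   checkpoint CPU time: 14000.0
   fraction done: 0.35
2) -----------
   name: MCM1_example_task_1
   current CPU time: 5000.0
   checkpoint CPU time: 4990.0
   fraction done: 0.60
"""


def unsaved_cpu_seconds(report):
    """Map each task name to the CPU seconds spent since its last checkpoint."""
    tasks = []
    for line in report.splitlines():
        if re.match(r"^\d+\) -+", line):
            tasks.append({})  # a numbered separator starts a new task record
            continue
        key, sep, value = line.strip().partition(": ")
        if sep and tasks:
            tasks[-1][key] = value
    needed = ("name", "current CPU time", "checkpoint CPU time")
    return {
        t["name"]: float(t["current CPU time"]) - float(t["checkpoint CPU time"])
        for t in tasks
        if all(k in t for k in needed)
    }


unsaved = unsaved_cpu_seconds(SAMPLE)
```

In the invented sample the ARP task has 6000 CPU seconds that a reboot would throw away, while the MCM task has only 10, so waiting for the next ARP checkpoint would be the better choice.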

Don't blame the developers for the long time between checkpoints in ARP. They're stuck with what they have. The techs do try to do their best for us, within the constraints imposed on them.

There is a solution involving virtual machines, but that's too much for most users. I don't bother. I use suspend and only reboot when I need to (maybe once every year or two).

All of these things have been discussed recently elsewhere in these fora. Check them out if you want more detail.
[Nov 11, 2019 6:29:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Losing work on reboot

Not only does one need to reboot under Windows more often than whatever flavor of Unix your boxes are running, it's also the only way to shake loose stuck tasks, such as the one I've currently got that says it's 100% done, but won't finish. Suspend/Hibernate doesn't solve either need.

I don't really care how checkpoints are implemented; it's a poor design, especially for such a project where each task takes so long to run.

A better approach for the future would be shorter tasks, more checkpoints, and the ability for the client to force an unscheduled checkpoint. I don't know what percentage of users run under Windows, but I'll bet it's enough to make it worthwhile to address these design shortcomings. Wasting my CPU cycles on poor design doesn't endear me to donating them.
[Nov 11, 2019 7:31:11 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Losing work on reboot

If a task is totally stuck, kill it with the task manager. The name of the executable together with the time it's been running should easily identify it. Or you can try restarting the manager but leaving BOINC running. Or you can suspend/resume that one task, with LAIM off if you want it to reset to the previous checkpoint. Get more creative, don't just use the big red switch approach that used to make computers a big joke - it's not funny.
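
For anyone who prefers the command line, the suspend/resume step for a single task can be sketched like this. It relies on the `boinccmd --task URL task_name {suspend | resume | abort}` form of the command-line tool; the helper name, the project URL, and the task name below are placeholders, and the commands are only built here, not executed.

```python
def task_command(project_url, task_name, op):
    """Build the argument list for a per-task boinccmd operation."""
    if op not in ("suspend", "resume"):
        raise ValueError("op must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, op]


# Placeholder values: substitute the project URL and task name of the stuck task.
URL = "https://www.worldcommunitygrid.org/"
TASK = "ARP1_example_task_0"

suspend_cmd = task_command(URL, TASK, "suspend")
resume_cmd = task_command(URL, TASK, "resume")

# To run them for real, pass each list to subprocess.run(cmd, check=True).
```

Whether the task restarts from its last checkpoint on resume depends on the leave-in-memory setting, as described above.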

Your whining about checkpoints simply shows that you don't understand the way the software works. If you don't like it, don't run that sub-project. Nothing is going to change (unless you start using a VM and dump the whole VM, but that still leaves you with the 'stuck task' problem).

Life isn't always the way you'd like it to be. Sorry, but that's reality for you. Sometimes it sucks.
[Nov 11, 2019 10:55:01 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Losing work on reboot

Wow - could you be more condescending, please? I'm willing to bet I've been working with and supporting computers longer than you've been alive.

I've tried everything you listed, repeatedly, and the only thing that unsticks a task that shows "-----" for the time remaining running BOINC 7.14.2 under Windows 10 is to reboot the system.

Identifying design problems and making suggestions as to how to improve the implementation isn't whining. Everything changes, which is why new versions of the software are periodically released that address problems users identify. If that also requires changes to the way the task data is structured for the client to use when processing it, so be it.

Life isn't always the way you'd like it to be? That's really deep. You'll notice in this thread that the first reply, from Falconet, was informative and helpful, and yours stands in stark contrast. Save the lectures for your minions, the sad lot they certainly must be. We're done here.
[Nov 13, 2019 2:58:14 PM]