Thread Status: Active
Total posts in this thread: 14
This topic has been viewed 4107 times and has 13 replies
BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 638
Status: Offline
Checkpoint Stuff

Looking at checkpoint stuff on an Xubuntu machine.

I look at the ARP WU properties and see 02:02:38 in CPU time since last checkpoint. I presume if this machine crashes I lose this 2 hours of processing.

Now if I want to reboot this machine, for whatever reason, what is the best way? I thought if I told the WCG project to update itself it would do checkpoints. Not so.

How about if I suspend the WCG project and then reboot: do I still lose the 2 hours?
[Aug 16, 2020 5:19:08 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Status: Offline
Re: Checkpoint Stuff

ARP checkpoints every 12.5% of progress.
If you have to or want to reboot without much loss, suspend the ARP tasks that have not started yet, and suspend the running ones just after they reach 12.5%, 25%, 37.5%, 50%, 62.5%, 75% or 87.5% of progress, or wait until they have finished.
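
For what it's worth, the suspend points listed above can be worked out with a few lines of Python. This is only a sketch: it assumes the 12.5% step quoted above, the function names are made up for illustration, and it does not talk to BOINC at all.

```python
from typing import List

CHECKPOINT_STEP = 12.5  # percent of progress between ARP checkpoints, as stated above


def checkpoint_marks(step: float = CHECKPOINT_STEP) -> List[float]:
    """Progress percentages at which a checkpoint is expected (100% excluded)."""
    count = int(100 / step)
    return [step * i for i in range(1, count)]


def last_checkpoint(progress: float, step: float = CHECKPOINT_STEP) -> float:
    """Progress at the most recent expected checkpoint (0.0 if none yet)."""
    return (progress // step) * step


def progress_at_risk(progress: float, step: float = CHECKPOINT_STEP) -> float:
    """Percentage points of work that would be redone after a reboot right now."""
    return progress - last_checkpoint(progress, step)


if __name__ == "__main__":
    print(checkpoint_marks())
    print(last_checkpoint(40.0), progress_at_risk(40.0))
```

So a task showing 40% would fall back to the 37.5% checkpoint, while one showing 37.6% would lose almost nothing.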
[Aug 16, 2020 5:35:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Checkpoint Stuff

BOINCTasks can be used to automatically suspend running tasks at the next checkpoint. Beforehand, switch LAIM off so the job actually unloads from memory, and suspend all 'Ready to Start' tasks, or else they will start when the running tasks suspend.

Notably, BOINCTasks has a slight 1-2 minute delay after the checkpoint before running tasks are actually stopped. According to the discussion at https://forum.efmer.com/index.php?topic=1340.0 you can set an alert rule in BOINCTasks for when a task suspends.

FYI

edit: text appearing when mouse is put over Update button in BOINC Manager:
"Report all completed tasks, get latest credit, get latest preferences, and possibly get more tasks"
----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 17, 2020 9:57:48 AM]
[Aug 17, 2020 9:00:02 AM]
BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 638
Status: Offline
Re: Checkpoint Stuff

Bottom line: I lose the 2 hours. Maybe the Berkeley people should give the users more control over everything. Not much I can do about that.

Crystal Pellet: what you describe sounds like the default percentages... and where could I change these? I would have to guess when a checkpoint occurs or has finished so as to not lose too much.

lavaflow: I don't use BOINCTasks, and you lost me with LIAM. I looked at BOINCTasks. Not for me. I only have 4 IPs in the basement. Takes 30 seconds to check.

I read the alert rule thread. It presumes you are actually watching all this, or that the machine you are on is crunching during idle time. The machine I am on does not crunch. It only runs BOINC Manager when and if I start it to "look in" on the crunchers in the basement.

In my case I may lose more since I don't check on all this all the time. It actually happened when I tried to micro-manage to get an ARP to start before other tasks by suspending them. Forgot about it. This is why I got into the RPC thing.

But thanks for the replies and ideas.
[Aug 17, 2020 5:06:58 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Checkpoint Stuff

LAIM, not LIAM: Leave application in memory (when suspended).

People watch their machines when planning to reboot for their monthly Windows fix dose... that is the purpose of the functionality, but it also covers upgrading BOINC itself. On an 8-, 16- or 32-core machine I'd personally never do this, but with a single ARP running, I set it to run to the next checkpoint at the 3-hour interval (12.5% can't be changed; no checkpoint interval can be changed, as it solely happens at the end of a simulation step) and then switch to other sciences with short checkpoint intervals. Not really watching then at all... just reboot/upgrade.
[Aug 17, 2020 5:16:51 PM]
BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 638
Status: Offline
Re: Checkpoint Stuff

Yes, I am watching 1 machine (Xubuntu) to do some debugging, which is how this thread started. It hangs at times, so I want to go to 20.04 LTS and the latest BOINC and then see. Every time I check, it is ARPing.

My only Win10 machine is a lowly Duo Core Centrino; I disabled automatic rebooting after updates, and it gets no ARP. Hmmm, it just came to me: change that Xubuntu machine to the same profile.

So I have my answer.
[Aug 17, 2020 7:23:26 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Status: Offline
Re: Checkpoint Stuff

Quote: "Crystal Pellet: what you describe sounds like the default percentages... and where could I change these? I would have to guess when a checkpoint occurs or finished so as to not lose too much."

The ARP-tasks have the longest time between checkpoints. It's the nature of the project.
One ARP task analyses the meteorological data of 2 days, starting with the 1st of July 2019.
Meanwhile we are at about the 8th of August 2019.
Each 2-day period is divided into 6-hour blocks, so there are 8 blocks of data from those 2 days.
Therefore a checkpoint cannot be made before a 6-hour data block has been analysed.
You can't change anything there yourself; you can only choose a WCG sub-project with a shorter checkpoint interval, like MCM or OPN.
With BOINC Manager you can monitor the %-progress and with WU-properties you'll see when a checkpoint is made.
You could also use a log_flag in cc_config.xml: <checkpoint_debug>1</checkpoint_debug>
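
In case it helps, here is a small Python sketch that prints a minimal cc_config.xml containing only that flag. The nesting (`<cc_config>` containing `<log_flags>`) follows the usual BOINC client configuration layout; the function name is made up, and the sketch deliberately only prints the XML rather than writing a file, since the BOINC data directory differs between installs and an existing cc_config.xml may already hold other options that should be merged by hand.

```python
import xml.etree.ElementTree as ET


def build_cc_config() -> str:
    """Return a minimal cc_config.xml body with checkpoint_debug enabled."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    flag = ET.SubElement(log_flags, "checkpoint_debug")
    flag.text = "1"
    # ET.indent only exists on Python 3.9+; without it the XML is a single line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config())
```

The client has to re-read its config files (or be restarted) before the flag takes effect; after that, checkpoint messages should show up in the event log.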
[Aug 17, 2020 7:28:53 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Checkpoint Stuff

You can't make checkpoint intervals shorter, but you can make them longer with the Write to Disk control. I've set it to 600 seconds at most. If you ran machines with 16, 32 or more threads, the writing at the 60-second default with short-checkpoint-interval projects would be more or less perpetual. I don't want to know what that would do to an SSD or USB drive, but a shortened lifespan might be in the cards.
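
To put rough numbers on that, a back-of-the-envelope sketch in Python. It assumes the worst case, where every thread runs a task that checkpoints at every opportunity the Write to Disk interval allows; real tasks will usually write less often. The function name is invented for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_checkpoint_writes_per_day(threads: int, interval_s: int) -> int:
    """Upper bound on checkpoint writes per day across all threads."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Each thread can write at most once per interval.
    return threads * (SECONDS_PER_DAY // interval_s)


if __name__ == "__main__":
    for interval in (60, 600):
        print(interval, max_checkpoint_writes_per_day(32, interval))
```

Under those assumptions a 32-thread machine could do up to 46,080 checkpoint writes a day at 60 seconds, against 4,608 at 600 seconds.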
[Aug 17, 2020 8:08:45 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline
Re: Checkpoint Stuff

Quote: "Not want to know what that would do to an SSD or USB drive, but a shortened lifespan might be in the cards."

I have experience with a USB drive that got a couple of ARP units (my mistake). You are absolutely correct in thinking that ARP will basically destroy the USB stick because of the amount of writes (I think). All of my machines running off USB sticks have ARP excluded. ARP is only allowed on machines with a regular hard drive.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Aug 18, 2020 12:38:24 AM]
floyd
Cruncher
Joined: May 28, 2016
Post Count: 47
Status: Offline
Re: Checkpoint Stuff

Quote: "You are absolutely correct in thinking the ARP will basically destroy the USB stick because of amount of writes (I think). All of my machines running off of USB sticks have ARP excluded."
I had a closer look at that a while ago. Sorry, I don't recall all the details now. An ARP task writes between 4 and 5 gigabytes to disk, and that is because it rewrites the whole data set at every checkpoint, mostly unmodified. That's 500 megabytes every time and even more at the end.
What you should be aware of is that OPN is even worse if you let it checkpoint at will. Recent observation showed it writing about 1 gigabyte per task (+/- 300 megabytes), and that's only one hour of work on my Ryzen 3600. That one could do 150 tasks every day. Many people seem to run OPN exclusively these days and at default settings it could be a killer.
I also tried reducing the checkpoint frequency through BOINC but found that this doesn't seem to be a hard limit. Sometimes tasks still write checkpoints when I don't expect them so I don't feel comfortable relying on that.
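
The figures above can be sanity-checked with a bit of arithmetic, sketched here in Python. The 500 MB per checkpoint, roughly 1 GB (+/- 300 MB) per OPN task and 150 tasks per day come from this post; the count of 8 writes per ARP task (one per 12.5% step, including the final one) and treating 1 GB as 1000 MB are assumptions made for the sketch, and the function names are made up.

```python
ARP_WRITES_PER_TASK = 8       # assumption: one write per 12.5% step, final one included
ARP_MB_PER_CHECKPOINT = 500   # from the post
OPN_TASKS_PER_DAY = 150       # from the post


def task_write_mb(writes: int, mb_per_write: int) -> int:
    """Total megabytes written by one task."""
    return writes * mb_per_write


def daily_write_mb(mb_per_task: int, tasks_per_day: int) -> int:
    """Total megabytes written per day for a given task throughput."""
    return mb_per_task * tasks_per_day


if __name__ == "__main__":
    print("ARP MB per task:", task_write_mb(ARP_WRITES_PER_TASK, ARP_MB_PER_CHECKPOINT))
    for mb_per_task in (700, 1000, 1300):  # 1 GB +/- 300 MB
        print("OPN MB per day:", daily_write_mb(mb_per_task, OPN_TASKS_PER_DAY))
```

That gives about 4 GB per ARP task before the larger final write, which fits the 4 to 5 GB observed, and somewhere between roughly 105 and 195 GB per day for OPN at 150 tasks a day.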
----------------------------------------
[Edit 1 times, last edit by floyd at Aug 18, 2020 10:04:56 AM]
[Aug 18, 2020 9:57:08 AM]