World Community Grid - View Thread - A couple of questions about checkpointing

World Community Grid Forums

Category: Support

Forum: BOINC Agent Support

Thread Status: Active
Total posts in this thread: 6

This topic has been viewed 945 times and has 5 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


A couple of questions about checkpointing

I came here again because I am a menopausal insomniac who noticed, while waiting for an HCC checkpoint, that it didn't save when I thought it should've. I couldn't figure out why.

I have a couple of questions after reading "Project Checkpoint Saving - How to Minimize Progress Loss on Close/Restart" in the FAQs.

With BOINC, the default minimum disk write setting in the device profile is 60 seconds. This value can be increased (Write to disk at most every: 999 seconds), BUT increasing it will postpone the checkpoint saving as programmed into the science application. E.g. setting 999 seconds with Genome Comparison, which saves around every 600 seconds, would delay the checkpoint save until the next opportunity, i.e. at around 1,200 seconds. For programs that do checkpoint saves for each segment/attempt/seed completed, the save is postponed until permitted by the profile setting, i.e. at the first opportunity after the example 999 seconds. Generally the default of 60 seconds should be fine for almost everyone, unless one wants to reduce disk I/O.

Question #1
BOINC had many opportunities to do a disk write (48%, 49%, 50%, etc.) but didn't. My computer isn't fast enough to do a whole HCC percentage point in 60 seconds; it takes 2-7 minutes for each percentage point to occur. If I'm interpreting the above quote correctly (and it is entirely possible that I am not), a disk write should happen within the following 60 seconds after each HCC percentage point. So why are my disk writes taking 15-20 minutes to occur?

Question #2
What is the rationale for BOINC's default 60-second-minimum rule instead of the option of immediate disk writing? (I say 'option' because I realize that with some projects immediate disk writing would be undesirable.)

FYI, in case it matters (although I can't imagine why it would)...
• I no longer have two work units running concurrently. I am back to only one tab/work unit again.
• Processor: Intel Pentium 4
• CPU: 3.06GHz
• OS: Windows XP, Home Edition, SP2
• Memory: 1022.79 MB physical, 2.41 GB virtual
• Disk: 223.58 GB total, 95.96 GB free
• Antivirus: Norton 2008
• Firewall: ZA 6.5.731.000

[Nov 5, 2007 11:32:39 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: A couple of questions about checkpointing

Where you are going wrong: applications don't try to checkpoint at every percentage point.

The way it works: each science application has points during the execution of a work unit where it is convenient to checkpoint. Some can checkpoint almost whenever they want to (this is rare). Such applications are set up by WCG to checkpoint about every 10 minutes. Other applications aren't so flexible. They take what opportunities they can, usually after a round of computation has completed and the amount of data that needs saving is small.

So, how does this mesh with the disk write limitation? If the disk write limit says it is too soon to write to disk, then the checkpoint opportunity is missed. You have to wait for the next one. By the time the next opportunity comes, if you have now gone past the limit, the checkpoint occurs (and the disk write) and the timer is reset. And so on.

So, as long as the application's checkpoint opportunities come less often than the disk write limit allows, every checkpoint opportunity will be taken.
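The rule described above can be sketched as a small simulation. This is only an illustration, not BOINC's actual code: the function name `taken_checkpoints` and its parameters are made up for this example, and it assumes the timer starts at zero when the work unit starts and that an opportunity arriving exactly at the limit is allowed.

```python
def taken_checkpoints(opportunities, min_interval):
    """Return the opportunity times (in seconds) at which a checkpoint is written.

    opportunities: ascending times at which the application could checkpoint.
    min_interval:  the "Write to disk at most every N seconds" setting.
    Assumption (not from BOINC's source): the timer starts at 0 with the work unit.
    """
    taken = []
    last_write = 0
    for t in opportunities:
        if t - last_write >= min_interval:
            # Enough time has passed: checkpoint, write to disk, reset the timer.
            taken.append(t)
            last_write = t
        # Otherwise the opportunity is missed and we wait for the next one.
    return taken


# An application that can checkpoint every 600 seconds, as in the FAQ example.
every_600 = list(range(600, 3601, 600))

print(taken_checkpoints(every_600, 999))  # [1200, 2400, 3600]
print(taken_checkpoints(every_600, 60))   # [600, 1200, 1800, 2400, 3000, 3600]
```

With the 999-second setting every second opportunity is skipped, which matches the roughly 1,200-second delay mentioned in the FAQ. With the default 60 seconds nothing is skipped, because the opportunities are further apart than the limit.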

[Nov 5, 2007 12:03:45 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: A couple of questions about checkpointing

Didactylos:

...each science application has points during the execution of a work unit where it is convenient to checkpoint...

So how is the checkpointing programmed for HCC?

[Nov 5, 2007 12:13:05 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: A couple of questions about checkpointing

It checkpoints after each filter round*. HCC is a little unusual in that the first round takes about 5 minutes, but this increases and by the end of the work unit it is 30 minutes to an hour (depending on your computer's speed, of course).

* This means it just needs to save the result of the filter, not all the complex data needed half way through filtering. And when resuming from a checkpoint, it just needs to start on the next filter, instead of trying to pick up half way.
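To see why the gap between HCC checkpoints keeps growing, here is a rough sketch. The round lengths below are invented for illustration (they are not measured HCC timings); they only follow the pattern described above, starting near 5 minutes and getting longer. The function name `checkpoint_times` is likewise hypothetical.

```python
from itertools import accumulate


def checkpoint_times(round_minutes):
    """Return elapsed minutes at which each round ends, i.e. each checkpoint opportunity."""
    # A checkpoint is only possible when a round finishes, so the checkpoint
    # times are simply the running total of the round lengths.
    return list(accumulate(round_minutes))


# Hypothetical round lengths in minutes, NOT real HCC data.
rounds = [5, 8, 12, 17, 23, 30]

print(checkpoint_times(rounds))  # [5, 13, 25, 42, 65, 95]
```

Every round here is longer than the 60-second disk write limit, so no opportunity is skipped. The wait between two checkpoints is just the length of the round in between, which is how a 15-20 minute wait can happen even with the default setting.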

[Nov 5, 2007 12:32:02 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: A couple of questions about checkpointing

Thank you.

[Nov 5, 2007 12:37:16 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: A couple of questions about checkpointing

Start Here FAQ updated adding the HCC checkpoint observations.

[Nov 5, 2007 1:16:43 PM]