World Community Grid Forums
Thread Status: Active Total posts in this thread: 14
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
If you see checkpointing happening sooner than your setting, capture the result log and forward it to the techs, together with the event log and your settings (which are in the global_prefs or global_prefs_override files). It should not happen.
Checkpoints can be monitored very well by forcing them to be recorded in the event log as well, via the cc_config.xml log flag <checkpoint_debug>1</checkpoint_debug>. The entries in the event log should then match those in the result log.

How it (should) work: the app checkpoints internally at whatever simulation step ends and asks the client: may I write now? With a 10-minute setting (600 seconds), if 6 minutes have passed, it should skip the write. Supposing the next one is at 12 minutes, it won't ask again, since it was told the interval the first time and that interval remains in force until the end of the task or a client/task restart; it will write that one. The one at 18 minutes should be skipped again, the one at 24 minutes written, and so on. It has happened that somehow the initial question was not asked and the write was done anyway at every programmatic step completion; that turned out to be a compile parameter error.

And yes, I was thinking the short tasks are likely worse per unit of time. Also, ARP may look bad at the data upload level, but if you consider 100 MB for 24 hours of crunching, that's less than 5 MB per crunch hour.
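The skip logic described above can be sketched as follows. This is a deliberate simplification for illustration, not actual BOINC client or app code; the function name and structure are made up.

```python
# Sketch of the write-to-disk (WtD) decision described above: the app
# checkpoints internally at fixed simulation steps, but only writes to
# disk when the client's "at most every N seconds" interval has elapsed.
# Hypothetical simplification -- not real BOINC source.

def writes_for(step_times, interval):
    """Given the times (in seconds) at which internal checkpoints occur
    and the client's checkpoint interval, return which ones are written
    to disk. The interval is learned once and stays in force."""
    last_write = 0
    written = []
    for t in step_times:
        if t - last_write >= interval:  # enough time since the last write?
            written.append(t)
            last_write = t
    return written

# Internal steps every 6 minutes, with a 600-second (10-minute) setting:
print(writes_for([360, 720, 1080, 1440], 600))  # [720, 1440]
```

With the 600-second setting, only the 12- and 24-minute checkpoints get written, matching the worked example above. The buggy behavior described (writing at every step, as if the question were never asked) corresponds to an effective interval of 0.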
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1316 Status: Offline
Regarding unexpected checkpoints and intervals...
Some tasks comprise multiple runs (either implicitly, as in MIP1 with its sequences and structures, or explicitly, as in OPN1 with its multiple "jobs" per work unit). A checkpoint at the end of a sequence or "job" may be mandatory (its content may be part of what is returned as results, and/or it may be needed to manage program flow), in which case it won't pay any attention to whatever the client's checkpoint interval might be.

I have experienced the above at first hand... I was doing some performance testing with a 30-minute checkpoint interval (to try to avoid perf stat monitoring checkpointing code paths!) and MIP1 jobs with lots of short sequences were still checkpointing every 50 seconds or so!

Another situation in which an unexpected checkpoint might appear is if the application does some data-wrangling before starting its main processing and decides to checkpoint to preserve that work. An example is the initial checkpoint in OPN1, which happens after running AutoGRID at the start of the work unit.

And finally, changes made to the checkpoint interval won't be seen by tasks that have already started; as lavaflow suggested, a task only asks the client once and uses that interval value for its entire run (or, at least, that is how things seem to work!). And that sticks, even across a client restart: if the job had already started, it seems to continue using the original interval.

Cheers - Al.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Let's not misunderstand: the internal checkpointing happens regardless, in memory, as that is the input to the next step. What we're talking about here is the writing to disk, or WtD.
Sticking beyond a restart is news to me. Kind of irritating if your client sits with the default 60 seconds on 16 threads, you change it to 600 seconds, and after a restart the jobs that were already running continue with WtD 'at most' every 60 seconds.

I'll parse my event log file (stdoutdae.txt) when I get the chance and filter out the start/end/checkpoint entries to see what's going on with the 3 sciences that are always running.

25185 World Community Grid 8/19/2020 5:53:50 AM [cpu_sched] Starting task OPN1_0007959_04514_0 using opn1 version 717 in slot 5 Desktop-03
25209 World Community Grid 8/19/2020 5:54:18 AM [checkpoint] result OPN1_0007959_04514_0 checkpointed

Certainly looks like that initial checkpoint is written out within the first minute.
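A filter like the one described could look like this. It's a sketch: the message tags to keep are assumed from the log lines quoted above, and the function name is made up.

```python
# Sketch: filter the start/checkpoint entries out of the BOINC event log
# (stdoutdae.txt). The tags below are assumptions based on the entries
# quoted above, not an exhaustive list.

TAGS = ("Starting task", "[checkpoint]")  # hypothetical selection of tags

def filter_log(lines):
    """Keep only the lines containing one of the interesting tags."""
    return [ln for ln in lines if any(tag in ln for tag in TAGS)]

sample = [
    "8/19/2020 5:53:50 AM [cpu_sched] Starting task OPN1_0007959_04514_0",
    "8/19/2020 5:53:55 AM [unrelated] some other client message",
    "8/19/2020 5:54:18 AM [checkpoint] result OPN1_0007959_04514_0 checkpointed",
]
for entry in filter_log(sample):
    print(entry)
```

To run it against a real log, read stdoutdae.txt with `open(...).readlines()` and pass the result to `filter_log`; comparing the timestamps of successive `[checkpoint]` entries per task then shows the effective write interval.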
floyd
Cruncher Joined: May 28, 2016 Post Count: 47 Status: Offline
> A checkpoint at the end of a sequence or "job" may be mandatory (as its content may be part of what is returned as results and/or it may be needed to manage program flow), in which case it won't pay any attention to whatever the client's checkpoint intervals might be

That was one of my thoughts. Another one is that I'm not sure what "Request tasks to checkpoint at most every x seconds" exactly means. What are those seconds? CPU time? Run time? Usually there won't be much difference, but could it even mean wall clock time? That could change behaviour a lot. However, even if I knew, that wouldn't make the effect realistically predictable, so I can't rely on this option to significantly reduce disk writes. And as long as I'm not convinced that something is actually wrong, I won't bother anyone about it.

> changes made to the checkpoint interval won't be seen by tasks that have already started

I noticed that too. To me it means I can't set a long checkpoint interval to limit disk writes and then reduce it again when I wish to have a checkpoint soon.

> that sticks, even across a client restart -- if the job had started, it seems to continue using the original interval

I'm not sure if I've seen it that way. But it's been a while.