World Community Grid Forums
Thread Status: Active Total posts in this thread: 14
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
If you see checkpointing happening sooner than your setting, capture the result log and forward it to the techs, together with the event log and your settings (which are in the global_prefs or global_prefs_override files). It should not happen.
Checkpoints can be monitored very well by forcing them to be recorded in the event log as well, via the cc_config.xml log flag <checkpoint_debug>1</checkpoint_debug>. The entries in the event log should then match those in the result log.

How it (should) work: the app checkpoints internally at whatever simulation step ends and asks the client: may I write now? With a 10-minute setting (600 seconds), if 6 minutes have passed, it should skip the write. Supposing the next one is at 12 minutes, it won't ask again, since it was told the interval the first time and that interval remains in force until the end of the task or a client/task restart; it will write that one. The one at 18 minutes should be skipped again, the one at 24 minutes written, and so on. It has happened that somehow the initial question was not asked and the write was done anyway at every programmatic step completion; that turned out to be a compile parameter error.

And yes, I was thinking the short tasks are likely worse per unit of time. Also, ARP may look bad at the data upload level, but if you consider 100 MB for 24 hours of crunching, that's less than 5 MB per crunch hour.
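The skip logic described above can be sketched as follows. This is a deliberate simplification for illustration, not actual BOINC client or app code; the function name and structure are made up.

```python
# Sketch of the write-to-disk (WtD) decision described above: the app
# checkpoints internally at fixed simulation steps, but only writes to
# disk when the client's "at most every N seconds" interval has elapsed.
# Hypothetical simplification -- not real BOINC source.

def writes_for(step_times, interval):
    """Given the times (in seconds) at which internal checkpoints occur
    and the client's checkpoint interval, return which ones are written
    to disk. The interval is learned once and stays in force."""
    last_write = 0
    written = []
    for t in step_times:
        if t - last_write >= interval:  # enough time since the last write?
            written.append(t)
            last_write = t
    return written

# Internal steps every 6 minutes, with a 600-second (10-minute) setting:
print(writes_for([360, 720, 1080, 1440], 600))  # [720, 1440]
```

With the 600-second setting, only the 12- and 24-minute checkpoints get written, matching the worked example above. The buggy behavior described (writing at every step, as if the question were never asked) corresponds to an effective interval of 0.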
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1316 Status: Offline
Regarding unexpected checkpoints and intervals...
Some tasks comprise multiple runs (either implicitly, as in MIP1 with its sequences and structures, or explicitly, as in OPN1 with its multiple "jobs" per work unit). A checkpoint at the end of a sequence or "job" may be mandatory (its content may be part of what is returned as results, and/or it may be needed to manage program flow), in which case it won't pay any attention to whatever the client's checkpoint interval might be.

I have experienced the above at first hand... I was doing some performance testing with a 30-minute checkpoint interval (to try to avoid perf stat monitoring checkpointing code paths!) and MIP1 jobs with lots of short sequences were still checkpointing every 50 seconds or so!

Another situation in which an unexpected checkpoint might appear is if the application does some data-wrangling before starting its main processing and decides to checkpoint to preserve that work. An example is the initial checkpoint in OPN1, which happens after running AutoGRID at the start of the work unit.

And finally, changes made to the checkpoint interval won't be seen by tasks that have already started; as lavaflow suggested, a task only asks the client once and uses that interval value for its entire run (or, at least, that is how things seem to work!). And that sticks, even across a client restart: if the job had already started, it seems to continue using the original interval.

Cheers - Al.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Let's not misunderstand: the internal checkpointing happens regardless, in memory, as that is the input to the next step. What we're talking about here is the writing to disk, or WtD.
Sticking beyond a restart is news to me. Kind of irritating if your client sits with the default 60 seconds on 16 threads, you change it to 600 seconds, and after a restart the jobs that were already running continue with WtD 'at most' every 60 seconds.

I'll parse my event log file (stdoutdae.txt) when I get the chance and filter out the start/end/checkpoint entries to see what's going on with the 3 sciences that are always running.

25185 World Community Grid 8/19/2020 5:53:50 AM [cpu_sched] Starting task OPN1_0007959_04514_0 using opn1 version 717 in slot 5 Desktop-03
25209 World Community Grid 8/19/2020 5:54:18 AM [checkpoint] result OPN1_0007959_04514_0 checkpointed

Certainly looks like that initial checkpoint is written out within the first minute.
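A filter like the one described could look like this. It's a sketch: the message tags to keep are assumed from the log lines quoted above, and the function name is made up.

```python
# Sketch: filter the start/checkpoint entries out of the BOINC event log
# (stdoutdae.txt). The tags below are assumptions based on the entries
# quoted above, not an exhaustive list.

TAGS = ("Starting task", "[checkpoint]")  # hypothetical selection of tags

def filter_log(lines):
    """Keep only the lines containing one of the interesting tags."""
    return [ln for ln in lines if any(tag in ln for tag in TAGS)]

sample = [
    "8/19/2020 5:53:50 AM [cpu_sched] Starting task OPN1_0007959_04514_0",
    "8/19/2020 5:53:55 AM [unrelated] some other client message",
    "8/19/2020 5:54:18 AM [checkpoint] result OPN1_0007959_04514_0 checkpointed",
]
for entry in filter_log(sample):
    print(entry)
```

To run it against a real log, read stdoutdae.txt with `open(...).readlines()` and pass the result to `filter_log`; comparing the timestamps of successive `[checkpoint]` entries per task then shows the effective write interval.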
floyd
Cruncher Joined: May 28, 2016 Post Count: 47 Status: Offline
> A checkpoint at the end of a sequence or "job" may be mandatory (as its content may be part of what is returned as results and/or it may be needed to manage program flow), in which case it won't pay any attention to whatever the client's checkpoint intervals might be

That was one of my thoughts. Another one is that I'm not sure what "Request tasks to checkpoint at most every x seconds" exactly means. What are those seconds? CPU time? Run time? Usually there won't be much difference, but could it even mean wall clock time? That could change behaviour a lot. However, even if I knew, that wouldn't make the effect realistically predictable, so I can't rely on this option to significantly reduce disk writes. And as long as I'm not convinced that something is actually wrong, I won't bother anyone about it.

> changes made to the checkpoint interval won't be seen by tasks that have already started

I noticed that too. To me it means I can't set a long checkpoint interval to limit disk writes and then reduce it again when I wish to have a checkpoint soon.

> that sticks, even across a client restart -- if the job had started, it seems to continue using the original interval

I'm not sure if I've seen it that way. But it's been a while.