World Community Grid Forums
Thread Status: Active Total posts in this thread: 30

Jean-David Beyer
Senior Cruncher USA Joined: Oct 2, 2007 Post Count: 339

I got two work units this morning. One of them claims that 6:37:10 CPU time was the most recent checkpoint, and that the current CPU time was 6:50:45. 25.729% done. So it must have done two checkpoints, right? The other one is about the same.

hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865

Jean-David Beyer said: "I got two work units this morning. One of them claims that 6:37:10 cpu time was the most recent checkpoint, and that the current cpu time was 6:50:45. 25.729% done. So it must have done two checkpoints, right? The other one is about the same."

Yep. It checkpointed at 12.5% and 25.0%.
37.5% 50.0% 62.5% 75.0% 87.5% 99.0%
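The rule of thumb in this reply can be written down as a small calculation. The sketch below assumes an even 12.5% step between checkpoints, as described in the reply; the function names are made up for illustration and are not part of BOINC or the WCG application.

```python
# Assumption: a checkpoint is written at every 12.5% of progress.
CHECKPOINT_STEP_PERCENT = 12.5


def checkpoints_completed(percent_done):
    """Return how many checkpoints should have been written so far."""
    return int(percent_done // CHECKPOINT_STEP_PERCENT)


def last_checkpoint_percent(percent_done):
    """Return the progress value at which the latest checkpoint was taken."""
    return checkpoints_completed(percent_done) * CHECKPOINT_STEP_PERCENT


# The work unit from the first post reported 25.729% done.
print(checkpoints_completed(25.729))
print(last_checkpoint_percent(25.729))
```

For the 25.729% figure from the first post this gives two checkpoints, the latest at 25.0%, which agrees with the answer above.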

Former Member
Cruncher Joined: May 22, 2018 Post Count: 0

WARNING: What I've done could cause the job to blow up... having files open on a live job!

In the job slot, the stderr.txt file gives some basic checkpoint info:

INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[07:36:16] INFO: Checkpoint taken at 2018-07-03_06:00:00
[11:24:57] INFO: Checkpoint taken at 2018-07-03_12:00:00
[15:20:40] INFO: Checkpoint taken at 2018-07-03_18:00:00
[18:03:06] INFO: Checkpoint taken at 2018-07-04_00:00:00

wcg_checkpoint.dat logs the names of the checkpoint files:

wcg_wrf.state wcg_checkpoint_00.ckp
wrfrst_d01 wcg_checkpoint_01.ckp
wrfrst_d02 wcg_checkpoint_02.ckp
wrfrst_d03 wcg_checkpoint_03.ckp

This ties in with the 85-90 MB .ckp files, going by the listed names. Net: this tells you how far the job has progressed, but the percentage does so too. A checkpoint is written at every 12.5% of progress.
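To illustrate how the stderr.txt lines map onto progress, here is a minimal sketch that reads the checkpoint lines and converts the latest model time into a percentage. It assumes the run starts at model time 2018-07-03_00:00:00 and covers 48 model hours, which is consistent with one checkpoint per 12.5% but is not stated in the log itself. `checkpoint_progress`, `MODEL_START` and `MODEL_SPAN` are illustrative names. Given the warning above, it is safer to run this on a copy of the file.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[07:36:16] INFO: Checkpoint taken at 2018-07-03_06:00:00".
CHECKPOINT_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] INFO: Checkpoint taken at "
    r"(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})"
)
MODEL_TIME_FORMAT = "%Y-%m-%d_%H:%M:%S"

# Assumptions, not taken from the log: model start time and total model span.
MODEL_START = datetime(2018, 7, 3, 0, 0, 0)
MODEL_SPAN = timedelta(hours=48)


def checkpoint_progress(stderr_text):
    """Return (checkpoint_count, percent_done_at_last_checkpoint)."""
    matches = CHECKPOINT_RE.findall(stderr_text)
    if not matches:
        return 0, 0.0
    last_model_time = datetime.strptime(matches[-1][1], MODEL_TIME_FORMAT)
    percent = 100.0 * ((last_model_time - MODEL_START) / MODEL_SPAN)
    return len(matches), percent


sample = """\
Starting WRFMain
[07:36:16] INFO: Checkpoint taken at 2018-07-03_06:00:00
[11:24:57] INFO: Checkpoint taken at 2018-07-03_12:00:00
[15:20:40] INFO: Checkpoint taken at 2018-07-03_18:00:00
[18:03:06] INFO: Checkpoint taken at 2018-07-04_00:00:00
"""
print(checkpoint_progress(sample))
```

With the four checkpoint lines quoted above, the last model time is 24 hours into the assumed 48-hour span, so the sketch reports four checkpoints and 50% progress.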

Jean-David Beyer
Senior Cruncher USA Joined: Oct 2, 2007 Post Count: 339

Curious why the dates of those files, for each process, are the same. I thought there would be some execution time between the checkpoints, unless they all get touched each time one is written.

So most, so far, seem about 90 Megabytes.

      83 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_00.ckp
92575412 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_01.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_02.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_03.ckp
     135 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint.dat
      83 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_00.ckp
92575412 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_01.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_02.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_03.ckp
     135 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint.dat
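The listing above only shows modification times to the minute. As a rough sketch, the following Python helper lists the same files with their sizes and full modification timestamps, which may help show whether the files in one slot really are written together. The function name `list_checkpoint_files` is invented for this example; the slot path is the one from the listing and will differ on other machines.

```python
from datetime import datetime
from pathlib import Path


def list_checkpoint_files(slot_dir):
    """Return (name, size_in_bytes, modified_time) for each wcg_checkpoint* file."""
    rows = []
    for path in sorted(Path(slot_dir).glob("wcg_checkpoint*")):
        info = path.stat()
        rows.append((path.name, info.st_size, datetime.fromtimestamp(info.st_mtime)))
    return rows


def format_rows(rows):
    """Format the rows one per line, with seconds in the timestamp."""
    return [f"{size:>10} {mtime:%b %d %H:%M:%S} {name}" for name, size, mtime in rows]


# Example path from the post; an empty list is returned if it does not exist.
for line in format_rows(list_checkpoint_files("/home/boinc/slots/4")):
    print(line)
```

The sketch only reads file metadata and does not open the checkpoint files themselves.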

alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317

Jean-David Beyer said: "Curious why the dates of those files, for each process, are the same. I thought there would be some execution time between the checkpoints, unless they all get touched each time one is written. So most, so far, seem about 90 Megabytes."

      83 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_00.ckp
92575412 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_01.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_02.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_03.ckp
     135 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint.dat
      83 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_00.ckp
92575412 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_01.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_02.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_03.ckp
     135 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint.dat

If you look at the stdout.txt file in the same directory, you will see that it works on three domains: number 1 displays a timing delta every 36 model seconds, number 2 every 12 model seconds, and number 3 every 4 model seconds. If you plough through the file to any 15-minute model time point, you will see it writes three wrfout files, and if you make your way on to a 12-hour model time point, you will see that it writes a restart file for each domain. So three domains, three checkpoint files.

That stdout.txt file is a mine of timing information!
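To give a feel for how much timing output that implies, the short sketch below counts the timing entries each domain would produce in a given stretch of model time. It assumes exactly one timing entry per step per domain, using the 36, 12 and 4 model-second steps quoted above; the names are illustrative only.

```python
# Model seconds between timing entries for each domain, as quoted in the post.
DOMAIN_STEP_SECONDS = {1: 36, 2: 12, 3: 4}


def timing_entries(model_seconds):
    """Return the number of timing entries per domain in the given model time."""
    return {
        domain: model_seconds // step
        for domain, step in DOMAIN_STEP_SECONDS.items()
    }


# Entries between two consecutive 15-minute wrfout points.
print(timing_entries(15 * 60))
# Entries in 6 model hours, the spacing of the checkpoints in the stderr.txt log.
print(timing_entries(6 * 3600))
```

Under that assumption, a 15-minute stretch of model time yields 25, 75 and 225 entries for domains 1, 2 and 3 respectively.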

Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594

The reference to slots/4 and slots/5 could well refer to 2 different units on different threads, because the threads are referred to as slots on my i7-3770. That would indicate 5 items written for each checkpoint and 2 units checkpointing at almost the same time.

Each unit covers a 48-hour period in 2018, presumably so that they can check predictions against what actually happened. As there are checkpoints at 12.5% intervals, that means 6-hour intervals within the 48-hour time period.

Lavaflow's data seems to indicate a fluctuating time period to execute each 6-hour segment.

Mike

rbwalton
Cruncher Joined: Dec 11, 2008 Post Count: 4

Hmm. Now I see why I have 20 hours of work left on my workunit. Eight hours ago, when I turned the computer on, I had 16 hours left. And yes, the thing had to re-boot for internet security program updates.

Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594

Re-booting can be a problem, but I set it to advise when it wants to update. Not all updates require a re-boot but I assume they all do.

I allow the update and, if it requires a re-boot, I suspend the re-boot and let any arp1 units carry on to their next checkpoint, at which point I suspend each unit, allowing mcm1 or mip1 to start working (so as not to waste any processing time). When all arp1 units have reached their next checkpoint, I re-boot the machine.

It can take up to 3 hours from start to finish (or longer for slow machines), but you don't need to keep monitoring it, as you can estimate when each unit is likely to reach its checkpoint.

Mike
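The estimate mentioned above can be sketched as a short calculation. This is only a rough guide: it assumes every 12.5% segment costs about the same CPU time as the average of the segments completed so far, whereas earlier posts suggest the time per segment fluctuates. The function names are illustrative, and the figures come from the first post in this thread (last checkpoint at 6:37:10 CPU time, current CPU time 6:50:45, two checkpoints done).

```python
def parse_hms(text):
    """Convert 'H:MM:SS' into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Convert a number of seconds into 'H:MM:SS'."""
    hours, rest = divmod(int(round(total_seconds)), 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def cpu_time_to_next_checkpoint(last_checkpoint_cpu, current_cpu, checkpoints_done):
    """Estimate the CPU seconds left until the next checkpoint.

    Assumption: the next segment takes the average CPU time of the
    segments already completed.
    """
    if checkpoints_done < 1:
        raise ValueError("need at least one completed checkpoint")
    last = parse_hms(last_checkpoint_cpu)
    per_segment = last / checkpoints_done
    since_last = parse_hms(current_cpu) - last
    return max(0.0, per_segment - since_last)


remaining = cpu_time_to_next_checkpoint("6:37:10", "6:50:45", 2)
print(format_hms(remaining))
```

For those figures the average segment is 3:18:35 of CPU time and 0:13:35 has passed since the last checkpoint, so the estimate is roughly 3:05:00 of CPU time until the next one. Wall-clock time will be longer if the unit is not getting a full core.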

Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403

In the Result Log all checkpoints have a time stamp. Would it be possible to add a time stamp to the INFO line: Starting WRFMain

Example Result Log:

INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[15:07:01] INFO: Checkpoint taken at 2018-07-03_06:00:00
[22:25:44] INFO: Checkpoint taken at 2018-07-03_12:00:00
[05:11:27] INFO: Checkpoint taken at 2018-07-03_18:00:00
[10:23:57] INFO: Checkpoint taken at 2018-07-04_00:00:00
[15:47:08] INFO: Checkpoint taken at 2018-07-04_06:00:00
[22:05:58] INFO: Checkpoint taken at 2018-07-04_12:00:00
[03:55:10] INFO: Checkpoint taken at 2018-07-04_18:00:00
[08:27:12] INFO: Checkpoint taken at 2018-07-05_00:00:00
INFO: Simulation complete compressing output.
08:30:43 (30294): called boinc_finish(0)
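To show what can and cannot be read from such a log, here is a minimal sketch that computes the wall-clock gap between consecutive checkpoint lines. The log carries only a time of day, so the sketch assumes that less than 24 hours passes between two checkpoints and adds a day whenever the clock appears to go backwards. The name `checkpoint_gaps` is illustrative.

```python
import re
from datetime import timedelta

# Captures the time-of-day stamp on each checkpoint line.
STAMP_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\] INFO: Checkpoint taken at")
ONE_DAY = timedelta(days=1)


def checkpoint_gaps(log_text):
    """Return the wall-clock time between consecutive checkpoints.

    Assumption: consecutive checkpoints are less than 24 hours apart,
    so a negative difference means the clock passed midnight once.
    """
    stamps = [
        timedelta(hours=int(h), minutes=int(m), seconds=int(s))
        for h, m, s in STAMP_RE.findall(log_text)
    ]
    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        gap = later - earlier
        if gap < timedelta(0):
            gap += ONE_DAY
        gaps.append(gap)
    return gaps


result_log = """\
Starting WRFMain
[15:07:01] INFO: Checkpoint taken at 2018-07-03_06:00:00
[22:25:44] INFO: Checkpoint taken at 2018-07-03_12:00:00
[05:11:27] INFO: Checkpoint taken at 2018-07-03_18:00:00
[10:23:57] INFO: Checkpoint taken at 2018-07-04_00:00:00
"""
for gap in checkpoint_gaps(result_log):
    print(gap)
```

Because the "Starting WRFMain" line has no time stamp, the duration of the first segment cannot be derived this way, which is the gap the request above points at.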

Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594

Crystal Pellet,

It is not a published item, but you can obtain the information from the elapsed time.

Mike