Thread Status: Active
Total posts in this thread: 30
This topic has been viewed 7906 times and has 29 replies.
Jean-David Beyer
Senior Cruncher
USA
Joined: Oct 2, 2007
Post Count: 339
Status: Offline
Re: Checkpoint? No checkpoint

I got two work units this morning.

One of them claims that 6:37:10 cpu time was the most recent checkpoint, and that the current cpu time was 6:50:45. 25.729% done. So it must have done two checkpoints, right?

The other one is about the same.
[Dec 11, 2019 6:59:28 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline
Re: Checkpoint? No checkpoint

Jean-David Beyer said:
I got two work units this morning.

One of them claims that 6:37:10 cpu time was the most recent checkpoint, and that the current cpu time was 6:50:45. 25.729% done. So it must have done two checkpoints, right?

The other one is about the same.

Yep. It checkpointed at 12.5% and 25.0%. The remaining checkpoints are due at:

37.5%
50.0%
62.5%
75.0%
87.5%
99.0%
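
The count can be sketched in a few lines of Python. This is only an illustration that assumes evenly spaced 12.5% steps as described in this thread; the names `CHECKPOINT_STEP`, `checkpoints_done` and `last_checkpoint_percent` are made up for the example and are not part of BOINC or the project application.

```python
# Assumption from this thread: one checkpoint per 12.5% of progress.
CHECKPOINT_STEP = 12.5


def checkpoints_done(percent_done: float) -> int:
    """Number of checkpoints already passed at the given progress."""
    return int(percent_done // CHECKPOINT_STEP)


def last_checkpoint_percent(percent_done: float) -> float:
    """Progress value at which the most recent checkpoint was taken."""
    return checkpoints_done(percent_done) * CHECKPOINT_STEP


# The work unit quoted above was 25.729% done.
print(checkpoints_done(25.729), last_checkpoint_percent(25.729))
```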
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Dec 11, 2019 7:13:07 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Checkpoint? No checkpoint

WARNING: What I've done could cause the job to blow up... having files open on a live job!

In the job slot, the stderr.txt file gives some basic checkpoint info:

INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[07:36:16] INFO: Checkpoint taken at 2018-07-03_06:00:00
[11:24:57] INFO: Checkpoint taken at 2018-07-03_12:00:00
[15:20:40] INFO: Checkpoint taken at 2018-07-03_18:00:00
[18:03:06] INFO: Checkpoint taken at 2018-07-04_00:00:00

wcg_checkpoint.dat logs the names of the checkpoint files:

wcg_wrf.state
wcg_checkpoint_00.ckp
wrfrst_d01
wcg_checkpoint_01.ckp
wrfrst_d02
wcg_checkpoint_02.ckp
wrfrst_d03
wcg_checkpoint_03.ckp

This ties in with the .ckp files of 85-90 MB or so, going by the listed names.

Net: this tells you how far the job has progressed, but the percentage does too. A checkpoint file is written at every 12.5% of progress.
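
For anyone who prefers to read the log with a script, here is a minimal Python sketch that pulls the checkpoint lines out of a copy of stderr.txt. It assumes the line format is exactly as shown above and that the job started from the beginning (no restored state); `parse_checkpoints` and `SAMPLE` are names invented for this example. Working on a copy of the file rather than the live one keeps clear of the risk mentioned in the warning.

```python
import re
from datetime import datetime

# Matches lines such as:
# [07:36:16] INFO: Checkpoint taken at 2018-07-03_06:00:00
CHECKPOINT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] INFO: Checkpoint taken at "
    r"(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\s*$"
)


def parse_checkpoints(log_text):
    """Return a list of (wall_clock_string, model_datetime) per checkpoint."""
    found = []
    for line in log_text.splitlines():
        match = CHECKPOINT_RE.match(line.strip())
        if match:
            model_time = datetime.strptime(match.group(2), "%Y-%m-%d_%H:%M:%S")
            found.append((match.group(1), model_time))
    return found


# Log excerpt copied from this post.
SAMPLE = """INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[07:36:16] INFO: Checkpoint taken at 2018-07-03_06:00:00
[11:24:57] INFO: Checkpoint taken at 2018-07-03_12:00:00
[15:20:40] INFO: Checkpoint taken at 2018-07-03_18:00:00
[18:03:06] INFO: Checkpoint taken at 2018-07-04_00:00:00
"""

checkpoints = parse_checkpoints(SAMPLE)
# Assumption: 12.5% of progress per checkpoint, as observed in this thread.
print(len(checkpoints), "checkpoints, roughly", len(checkpoints) * 12.5, "% done")
print("latest model time:", checkpoints[-1][1])
```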
[Dec 11, 2019 7:14:16 PM]
Jean-David Beyer
Senior Cruncher
USA
Joined: Oct 2, 2007
Post Count: 339
Status: Offline
Re: Checkpoint? No checkpoint

Curious why the dates of those files, for each process, are the same. I thought there would be some execution time between the checkpoints, unless they all get touched each time one is written.

So most, so far, seem about 90 Megabytes.

83 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_00.ckp
92575412 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_01.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_02.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_03.ckp
135 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint.dat
83 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_00.ckp
92575412 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_01.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_02.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_03.ckp
135 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint.dat
[Dec 11, 2019 9:48:09 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Checkpoint? No checkpoint

Jean-David Beyer said:
Curious why the dates of those files, for each process, are the same. I thought there would be some execution time between the checkpoints, unless they all get touched each time one is written.

So most, so far, seem about 90 Megabytes.

83 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_00.ckp
92575412 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_01.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_02.ckp
87327860 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint_03.ckp
135 Dec 11 13:33 /home/boinc/slots/4/wcg_checkpoint.dat
83 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_00.ckp
92575412 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_01.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_02.ckp
87327860 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint_03.ckp
135 Dec 11 13:24 /home/boinc/slots/5/wcg_checkpoint.dat

If you look at the stdout.txt file in the same directory you will see that it works on three domains; number 1 displays a timing delta every 36 model seconds, number 2 displays a timing delta every 12 model seconds and number 3 every 4 model seconds.

If you plough through the file to any 15-minute model time point you will see it writes three wrfout files, and if you make your way on to a 12-hour model time point you will see that it writes a restart file for each domain.

So three domains, three checkpoint files.

That stdout.txt file is a mine of timing information!
[Dec 12, 2019 1:25:58 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Checkpoint? No checkpoint

The reference to slots/4 and slots/5 could well refer to 2 different units on different threads because the threads are referred to as slots on my i7-3770.

That would indicate 5 items written for each checkpoint and 2 units checkpointing at almost the same time.

Each unit covers a 48 hour period in 2018, presumably so that they can check predictions against what actually happened. As there are checkpoints at 12.5% intervals that means 6 hour intervals within the 48 hour time period.
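
The arithmetic can be checked with a short Python sketch. The model start time of 2018-07-03 00:00 is an assumption inferred from the checkpoint logs posted in this thread (first checkpoint at 2018-07-03_06:00:00, last at 2018-07-05_00:00:00), and all names below are invented for the illustration.

```python
from datetime import datetime, timedelta

MODEL_SPAN_HOURS = 48        # model period covered by one unit, per this post
CHECKPOINT_FRACTION = 0.125  # a checkpoint every 12.5% of progress
# Assumption: start time inferred from the logs posted in this thread.
MODEL_START = datetime(2018, 7, 3, 0, 0, 0)


def checkpoint_interval_hours():
    """Model hours between checkpoints: 48 h x 12.5% = 6 h."""
    return MODEL_SPAN_HOURS * CHECKPOINT_FRACTION


def checkpoint_model_times(start=MODEL_START):
    """Model times at which checkpoints would fall."""
    step = timedelta(hours=checkpoint_interval_hours())
    count = round(1 / CHECKPOINT_FRACTION)
    return [start + step * i for i in range(1, count + 1)]


print(checkpoint_interval_hours(), "model hours between checkpoints")
for model_time in checkpoint_model_times():
    print(model_time.strftime("%Y-%m-%d_%H:%M:%S"))
```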

Lavaflow's data seems to indicate a fluctuating time period to execute each 6 hour segment.

Mike
[Dec 12, 2019 12:20:42 PM]
rbwalton
Cruncher
Joined: Dec 11, 2008
Post Count: 4
Status: Offline
Re: Checkpoint? No checkpoint

Hmm. Now I see why I have 20 hours of work left on my workunit. Eight hours ago when I turned the computer on, I had 16 hours left. And yes, the thing had to re-boot for internet security program updates.
[Dec 16, 2019 4:15:58 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Checkpoint? No checkpoint

Re-booting can be a problem, but I set it to advise when it wants to update. Not all updates require a re-boot but I assume they all do.

I allow the update and, if it requires a re-boot, I postpone the re-boot and let any arp1 units carry on to their next checkpoint. As each unit reaches its checkpoint I suspend it, which allows mcm1 or mip1 to start working (so as not to waste any processing time). When all arp1 units have reached their next checkpoint, I re-boot the machine.

It can take up to 3 hours from start to finish (or longer for slow machines) but you don't need to keep monitoring it as you can estimate when each unit is likely to reach its checkpoint.
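
A rough way to make that estimate is sketched below in Python. It assumes progress is linear in time, which is only an approximation since the time per 6-hour segment fluctuates, and it assumes a checkpoint every 12.5%. The function name and constant are invented for the example.

```python
import math

# Assumption from this thread: one checkpoint per 12.5% of progress.
CHECKPOINT_STEP = 12.5


def seconds_to_next_checkpoint(percent_done, elapsed_seconds):
    """Estimate seconds until the next checkpoint, assuming linear progress."""
    if percent_done <= 0:
        raise ValueError("need some progress to extrapolate from")
    next_checkpoint = (math.floor(percent_done / CHECKPOINT_STEP) + 1) * CHECKPOINT_STEP
    next_checkpoint = min(100.0, next_checkpoint)
    seconds_per_percent = elapsed_seconds / percent_done
    return (next_checkpoint - percent_done) * seconds_per_percent


# Figures quoted earlier in the thread: 25.729% done after 6:50:45 of CPU time.
elapsed = 6 * 3600 + 50 * 60 + 45
remaining = seconds_to_next_checkpoint(25.729, elapsed)
print(f"about {remaining / 3600:.1f} hours to the next checkpoint")
```

Treat the result as a guide for when to look again, not as a promise.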

Mike
[Dec 16, 2019 11:15:12 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Status: Offline
Re: Checkpoint? No checkpoint

In the Result Log all checkpoints have a time stamp.
Would it be possible to add a time stamp to the INFO line: Starting WRFMain?

Example Result Log:
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[15:07:01] INFO: Checkpoint taken at 2018-07-03_06:00:00
[22:25:44] INFO: Checkpoint taken at 2018-07-03_12:00:00
[05:11:27] INFO: Checkpoint taken at 2018-07-03_18:00:00
[10:23:57] INFO: Checkpoint taken at 2018-07-04_00:00:00
[15:47:08] INFO: Checkpoint taken at 2018-07-04_06:00:00
[22:05:58] INFO: Checkpoint taken at 2018-07-04_12:00:00
[03:55:10] INFO: Checkpoint taken at 2018-07-04_18:00:00
[08:27:12] INFO: Checkpoint taken at 2018-07-05_00:00:00
INFO: Simulation complete compressing output.
08:30:43 (30294): called boinc_finish(0)
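
The wall-clock time between checkpoints can be worked out from those stamps with a small Python sketch. Because the log carries only a time of day, it assumes each gap is shorter than 24 hours and that the job was not suspended in between; `checkpoint_gaps` and `RESULT_LOG` are names made up for this example.

```python
import re
from datetime import datetime, timedelta

STAMP_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] INFO: Checkpoint taken at")


def checkpoint_gaps(log_text):
    """Wall-clock gaps between consecutive checkpoint lines."""
    stamps = [
        datetime.strptime(match.group(1), "%H:%M:%S")
        for match in map(STAMP_RE.match, log_text.splitlines())
        if match
    ]
    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        gap = later - earlier
        if gap < timedelta(0):
            # The clock passed midnight between the two checkpoints.
            gap += timedelta(days=1)
        gaps.append(gap)
    return gaps


# Checkpoint lines copied from the Result Log above.
RESULT_LOG = """[15:07:01] INFO: Checkpoint taken at 2018-07-03_06:00:00
[22:25:44] INFO: Checkpoint taken at 2018-07-03_12:00:00
[05:11:27] INFO: Checkpoint taken at 2018-07-03_18:00:00
[10:23:57] INFO: Checkpoint taken at 2018-07-04_00:00:00
[15:47:08] INFO: Checkpoint taken at 2018-07-04_06:00:00
[22:05:58] INFO: Checkpoint taken at 2018-07-04_12:00:00
[03:55:10] INFO: Checkpoint taken at 2018-07-04_18:00:00
[08:27:12] INFO: Checkpoint taken at 2018-07-05_00:00:00
"""

gaps = checkpoint_gaps(RESULT_LOG)
for gap in gaps:
    print(gap)
```

The first segment cannot be timed this way, since the Starting WRFMain line has no stamp.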
[Dec 27, 2019 9:04:12 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Checkpoint? No checkpoint

Crystal Pellet

It is not a published item but you can obtain the information from elapsed time.

Mike
[Dec 27, 2019 6:56:58 PM]