Sekerob
Project Checkpoint Saving - How to Minimize Progress Loss on Close/Restart

Complex models make it impractical to save a project's status files to disk continuously. Some projects would grind to a halt and make the PC less responsive. WCG's objective is to prevent that and keep the process strictly 'in the background and unobtrusive'. For that reason 'Checkpoints' have been introduced to secure the data already computed. A checkpoint is also a good moment to close the BOINC client without substantial progress loss on restart. Checkpoints are not always easy to spot, but they can be observed and are summarized as follows, assuming the default setting of 60 seconds for 'Write to disk (WTD) at most every:' ++:

  • Computing for Sustainable Water (CFSW): A checkpoint is written to disk every few minutes with default client settings, at approximately every 0.5% progress. (See Sample Video and FAQ for description)
  • Say No to Schistosoma (SN2S): A checkpoint is written to disk, if the client settings permit, at the end of each job included in the work unit. An included job can take from less than a minute to half an hour depending on the speed of the device. (See Sample Video and FAQ for description)
  • GO Fight Against Malaria (GFAM): A checkpoint is written to disk, if the client settings permit, at the end of each job included in the work unit. An included job can take from less than a minute to half an hour depending on the speed of the device. (See Sample Video and FAQ for description)
  • Drug Search for Leishmaniasis (DSFL): A checkpoint is written to disk, if the client settings permit, at the end of each job included in the work unit. An included job can take from less than a minute to half an hour depending on the speed of the device. (See Sample Video and FAQ for description)
  • Computing for Clean Water (C4CW): A checkpoint is written to disk every 0.5% progress. Depending on target size and speed of host, checkpoints are usually no more than a few minutes apart. (See Sample Image and FAQ for description)
  • The Clean Energy Project - Phase 2 (CEP2): Each task contains 16 jobs. At the end of each 'job', a checkpoint is written to disk. The first two complete relatively quickly; those after can take considerable time, up to multiple hours. Tasks are presently (July 23, 2010) limited to a maximum of 12 CPU hours. (See Sample Image and FAQ for description)
  • Discovering Dengue Drugs - Together - Phase 2 (DDDT2): Checkpoints are recorded every 2% progress for the B and C types of work units and every 10% for the A type (10 'loops'). They are spaced from a few minutes to hours apart depending on type [A-B-C] and speed of device. The initial checkpoint at start can also take longer. (See FAQ for description)
  • Help Cure Muscular Dystrophy, Phase 2 (HCMD2): Checkpoints are saved at each 'position', which can take from just seconds to several hours depending on the speed of the device and the complexity of the individual position. Colour changes are purely random, so on an outside chance the same colour could be assumed even if a checkpoint was reached. (See Sample Image and FAQ for description)
  • Help Fight Childhood Cancer (HFCC): Checkpointing can be observed by watching the progress percent in the graphics. A checkpoint is imminent when this value starts going back slightly and then up again until the best energy point has been computed, after which progress is continuous until nearing the next checkpoint, repeating the cycle. Activate message log checkpoint recording for convenience. (See FAQ for description)
  • The Clean Energy Project (CEP1) (Complete): This project has 3 types of tasks identified by an A, B or C in the result name. Checkpoints occur approximately every 2.5% progress, thus at 2.5, 5.0, 7.5 etc. On the longer jobs that can work out to a considerable time. (See Sample Image and FAQ for description)
  • Nutritious Rice for the World (NRW): This project has a fixed run time, with checkpoints spaced from under 1 minute to several minutes apart. (See Sample Image and FAQ for description)
  • Help Conquer Cancer (HCC): This project checkpoints after each filter round. Usually the first round takes about 5 minutes to complete and write a checkpoint save. Thereafter the timespan increases progressively towards the end of the work unit, where it can take 30 - 60 minutes. (See Sample Image and FAQ for description)
  • AfricanClimate@Home, Phase 1 (Complete): There are approximately 56 checkpoints evenly spread over the duration of the task, which is long to very long. (See Sample Image [N.A.] and FAQ for description)
  • Discovering Dengue Drugs - Together, Phase 1 (DDD-T) (Complete): The checkpoints are in incremental steps of 0.2 to 0.25% progress, thus the relative loss on shutdown and resume is minimal. (See Sample Image and FAQ for description)
  • Help Cure Muscular Dystrophy, Phase 1 (Complete): A checkpoint occurs when the PDB code / Protein Symbols at the bottom left of the 'I' screen change and the 2 proteins in the main graph assume the same colors. Color changes are purely random, so on an outside chance the same color could be assumed even if a checkpoint was reached. Watch the PDB code change for absolute indication! (See Sample Image and FAQ for description)
  • Genome Comparison (Complete): Approximately every 10 minutes, irrespective of computer speed. (See Sample Image and FAQ for description)
  • Help Defeat Cancer (Complete): At 25% intervals; writes large files. (See Sample Image and FAQ for description)
  • Human Proteome Folding, Phase 2 (HPF2): A checkpoint occurs after each structure attempt, which can vary in run time due to its non-deterministic nature. The values indicated in the 3 spheres on the left side of the graphics (Solvation, Res-Res-Pair and Hydrogen bonds) are not an optimal indicator, so it is best to activate and observe the checkpoint log to know when one happened. (See Sample Image for BOINC Agent and FAQ for description)
  • FightAIDS@Home, Update 2007 (FA@H experiment 11 and on): A checkpoint occurs when the green line in the Best Energy C graph has reached the end and returns to the beginning, whilst the graph is rescaled and a red line is added indicating the path of the previous attempt. (See Sample Image and FAQ for description)
All times noted vary depending on the speed of the host computer (including temperature-driven reduction or increase of CPU cycles)!
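
For the projects above that checkpoint at fixed progress steps, the progress at risk when closing the client can be estimated from the step size. This is only a rough Python sketch, assuming checkpoints fall exactly on multiples of the step; the function name is made up for illustration and is not part of BOINC:

```python
import math

def progress_at_risk(progress_pct, step_pct):
    """Return (last checkpoint %, % lost on restart).

    Assumes checkpoints fall exactly on multiples of step_pct,
    which is a simplification of what the science applications do.
    """
    last_checkpoint = math.floor(progress_pct / step_pct) * step_pct
    return last_checkpoint, progress_pct - last_checkpoint

# CEP1 checkpoints roughly every 2.5%, so a task at 6.3% would restart from 5.0%.
last, lost = progress_at_risk(6.3, 2.5)
print(f"restart from {last:.1f}%, losing about {lost:.1f}%")
```

The smaller the step (e.g. 0.5% for C4CW), the less there is to lose whenever the client is closed.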

Some of these checkpoints translate to some or considerable flickering of the hard drive light for a number of seconds, but only if the files to save are large. BOINC uses the same checkpointing! The graphical screens for this agent have fewer detailed elements or, depending on the operating system, are not available. To see them, go to the Tasks tab of the 'Advanced View' (version >= 5.8), select the work unit in progress and hit the 'Show Graphics' button in the left margin. If the button is greyed out, no graphics are available.

Accurate time intervals cannot be given between checkpoints as they are very dependent on the speed of the computer and how much idle time is available to perform the calculations.

From client 6.4 onward one can see the last checkpoint for a task by opening the BOINC Manager GUI, Advanced View, and selecting the Tasks tab and a running task. Then click the Properties button in the left margin, which opens an info screen such as the next sample, highlighting CPU time and Elapsed time (wall clock).



When running BOINC, if you want to see an actual record of checkpointing in the Message Tab, you can create a file called cc_config.xml (see Core Client Configuration) and turn on logging with <task_debug>1</task_debug>. Restart BOINC or, from 5.8.11 and up, use the 'Advanced' menu > 'Read Config File' to activate the additional message recording. To deactivate, replace the 1 with 0 and repeat the restart/read action. Set your BOINC throttle at 100%, otherwise you will get a ton of output (other utilities can emulate the throttle outside of BOINC!).
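
For those who would rather generate the file than type it, here is a minimal Python sketch of what cc_config.xml could contain. It assumes the usual layout, where log flags sit inside a <log_flags> element under <cc_config>; the function name build_cc_config is made up for illustration. Check the Core Client Configuration page before relying on it.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1 or 0."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name, enabled in flags.items():
        ET.SubElement(log_flags, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")

# Turn on the flag mentioned above; pass False instead to switch it off again.
config_text = build_cc_config({"task_debug": True})
print(config_text)
```

Save the printed text as cc_config.xml in the BOINC data directory, then restart or use 'Read Config File' as described.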

From BOINC version 5.8.16, checkpoint recording in the message log can be activated with the <checkpoint_debug>1</checkpoint_debug> log flag inserted into the cc_config.xml file. This and many other log flags can be added or changed with a text editor and take effect immediately via the 'Advanced' > 'Read Config File' menu option. Sample message log output:
2-3-2007 0:29:55|World Community Grid|[checkpoint_debug] result faah1368_d116n672_x2BPW_00_2 checkpointed
2-3-2007 0:34:51|World Community Grid|[checkpoint_debug] result faah1368_d116n693_x2BPW_01_1 checkpointed
2-3-2007 0:46:48|World Community Grid|[checkpoint_debug] result faah1368_d116n693_x2BPW_01_1 checkpointed
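
To turn such log lines into actual checkpoint intervals, something like the following Python sketch could be used. It assumes the pipe-separated layout shown above and a day-month-year timestamp (the order may differ with your regional settings); the function name is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

def checkpoint_intervals(log_lines, time_format="%d-%m-%Y %H:%M:%S"):
    """Map each result name to the seconds between its consecutive checkpoints.

    time_format is an assumption; the day/month order depends on locale.
    """
    last_seen = {}
    intervals = defaultdict(list)
    for line in log_lines:
        line = line.strip()
        if "[checkpoint_debug]" not in line or not line.endswith("checkpointed"):
            continue
        timestamp_text, _project, message = line.split("|", 2)
        # message looks like: "[checkpoint_debug] result <name> checkpointed"
        result_name = message.split()[2]
        stamp = datetime.strptime(timestamp_text, time_format)
        if result_name in last_seen:
            gap = (stamp - last_seen[result_name]).total_seconds()
            intervals[result_name].append(gap)
        last_seen[result_name] = stamp
    return dict(intervals)

sample = [
    "2-3-2007 0:29:55|World Community Grid|[checkpoint_debug] result faah1368_d116n672_x2BPW_00_2 checkpointed",
    "2-3-2007 0:34:51|World Community Grid|[checkpoint_debug] result faah1368_d116n693_x2BPW_01_1 checkpointed",
    "2-3-2007 0:46:48|World Community Grid|[checkpoint_debug] result faah1368_d116n693_x2BPW_01_1 checkpointed",
]
print(checkpoint_intervals(sample))
```

For the sample above, only the second result checkpointed twice, 717 seconds (just under 12 minutes) apart.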

Another little-known, yet absolute, verification of checkpoints is to visit the BOINC\Slots\ directory's Slot0, Slot1, Slot2 folders, where BOINC stores data for the tasks being computed, and check the date/time the files were last saved. Per the sample below, a FAAH job started at 8:28 last checkpointed at 9:07.
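
The same check can be scripted. This is a rough Python sketch that reports the most recent file modification time in each slot folder. The slots path in it is purely a hypothetical example (adjust it to your installation), and the newest file in a slot is not guaranteed to be the checkpoint file, so treat the result as an indication only.

```python
import os
from datetime import datetime

def last_write_per_slot(slots_root):
    """Return {slot folder name: datetime of its most recently modified file}."""
    newest = {}
    for slot_name in sorted(os.listdir(slots_root)):
        slot_path = os.path.join(slots_root, slot_name)
        if not os.path.isdir(slot_path):
            continue
        mtimes = [
            os.path.getmtime(os.path.join(slot_path, name))
            for name in os.listdir(slot_path)
            if os.path.isfile(os.path.join(slot_path, name))
        ]
        if mtimes:  # skip empty slot folders
            newest[slot_name] = datetime.fromtimestamp(max(mtimes))
    return newest

if __name__ == "__main__":
    # Hypothetical location; change to wherever BOINC keeps its slots folder.
    slots_root = r"C:\Program Files\BOINC\slots"
    if os.path.isdir(slots_root):
        for slot, stamp in last_write_per_slot(slots_root).items():
            print(slot, stamp.strftime("%H:%M:%S"))
```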



With BOINC, the default minimum disk write interval in the device profile is 60 seconds. This value can be increased (for example, Write to disk at most every: 999 seconds), BUT increasing it will postpone the checkpoint saving as programmed into the science application. E.g. setting 999 seconds with Genome Comparison, which saves around every 600 seconds, would delay the checkpoint save until the next opportunity, i.e. at around 1,200 seconds. For programs that save a checkpoint for each segment/attempt/seed completed, the save is postponed until permitted by the profile setting, i.e. at the first opportunity after the example 999 seconds. Generally the default of 60 seconds should be fine for almost everyone, unless one wants to reduce disk I/O.
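
The Genome Comparison example can be worked through with a small Python sketch. This is a simplified model, not BOINC's actual code: it assumes the application offers a checkpoint at fixed moments and the client permits a save only when at least the configured number of seconds has passed since the previous save (counting from task start). The function name is made up for illustration.

```python
def checkpoint_times(opportunities, min_interval):
    """Return the opportunity times (seconds) at which a save is permitted.

    Simplified model: a save is allowed once min_interval seconds have
    passed since the previous save, with the clock starting at task start.
    """
    saved = []
    last_write = 0
    for t in opportunities:
        if t - last_write >= min_interval:
            saved.append(t)
            last_write = t
    return saved

# Genome Comparison offers a checkpoint roughly every 600 seconds.
opportunities = list(range(600, 3601, 600))
print(checkpoint_times(opportunities, 60))   # default setting
print(checkpoint_times(opportunities, 999))  # setting raised to 999 seconds
```

Under these assumptions the default of 60 seconds saves at every opportunity, while 999 seconds skips every other one, so the first save lands at 1,200 seconds, as described above.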

This information is subject to change. Graphics could be updated or more granular checkpoint saving applied. New project checkpoint info will be added as and when it becomes known.

It is recommended to exit Windows or other OSes properly through the shutdown menu, so the correct close-down routines are performed to secure any progress files for agents still running. Using the Power Off / Reset button always runs the risk of losing or damaging progress files. For Vista it is recommended to wait up to 60 seconds after exiting the agent, unless the tweak mentioned in the Vista: How To... post is applied. The optimal approach is to visit the BOINCmgr Advanced View, Advanced menu, and choose "Shut Down connected client..." prior to system restart or power down. This will end both the application and the (protected) service in a controlled manner.

++ "Write to Disk at most every:" has been relabeled in client 7 and up to "Tasks checkpoint to disk at most every:"

Tips:


  • Hibernation (or sleep mode) is a good way To Not Incur Progress Loss On Resume/Power On, and is covered in a separate post in this 'Start Here' forum.

  • When electing not to crunch during computer use, i.e. running in an idle/resume setup with the "while computer is in use" option deselected and "Leave application in memory when suspended" (LAIM) not selected, the science applications unload from memory each time the device receives mouse/keyboard input. The application then restarts from the last checkpoint after the idle/resume time has passed. If enough memory is available, say 1 GB base plus 256 MB for each processor core in the computer, it is better to switch the LAIM option on. Generally [for most members] BOINC computing while using the computer is unnoticeable. BOINC runs at the lowest priority, giving way to all user input and to other applications, which normally run at normal priority. A separate topic will be developed explaining the full LAIM functionality.
    [Edit 58 times, last edit by Former Member at May 4, 2012 1:08:03 PM]
    [Jan 22, 2007 7:58:18 PM]