World Community Grid Forums
Category: Completed Research Forum: FightAIDS@Home Thread: Time between saves or checkpoints |
Thread Status: Active Total posts in this thread: 23
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Is anyone else seeing a big increase in the time between saves or checkpoints for this project? I was seeing times of a few minutes to maybe 20 minutes before today. Now I'm seeing times of a few to several hours (or as much as 50% completed).
|
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
It has been advised that the HPF2 work units are very demanding, such that a single segment or chunk will take a very long time, hence the restart (check)points lying far apart in time. Imagine an HPF2 with the number of segments of an HPF1, then going bust in the 11th hour: a big waste. The startup problems prove the foresight was right not to cram too many into one.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
----------------------------------------
Please help to make the Forums an enjoyable experience for All! [Edit 1 times, last edit by Sekerob at Jul 11, 2006 6:15:13 AM] |
||
|
Wikkel
Cruncher Joined: Dec 28, 2005 Post Count: 9 Status: Offline Project Badges: |
same here...
I have three devices running, and all of them are processing a FAAH chunk that runs for less than 2 hours up to a progress of 50%. Then the progress falls back to 48%, runs up to 50%, falls back to 48%, etc. When a device is power cycled the chunk starts all over again at 0%. |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
FAAH work units are segmented entirely differently, into much smaller pieces of 1 or 2%. Progress should be seen steadily with time. In the UD Agent ** progress can be seen under the 'i' button in steps of 0.1% (the graph screen).
The 48, 50, 48, 50 etc. suggests it's attempting to complete a segment of 2% or thereabouts, but not succeeding. I have never seen this phenomenon before and certainly have not heard of anyone seeing it on 3 machines of 1 user at the same time. FAAH is generally very steady. If it does this for an extended time, I'd abort by killing the UD_7xxxxxx.exe process in Task Manager and fetch a new WU (Work Unit), which it should do automatically. But let's see if a WCG technician picks up on this first.
** Guessing you have UD Agent, not BOINC. Power cycling runs the risk of damaging the WU; it is best to properly exit UD first.
EDIT 12.55pm 7.11.06: I'M SEEING THE SAME NOW BY COINCIDENCE ON THE 'i' SCREEN OF A FAAH ON UD. It progresses slowly from 47.7 to 50.0%, then skips back to 47.7. It has now cycled through that 4 times in the last 2 hours. Unless I get a quick reply, I'm sending this one to the eternal hunting grounds, in the prescribed manner. DEVICE IS 160465, FAAH v 4.0.3.4, WU download approx. 01:21am UTC
[Edit 7 times, last edit by Sekerob at Jul 11, 2006 11:25:48 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The graph falling back to a lower percent completed is explained here: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=6129#49541
But I do not understand why FAAH would not have a checkpoint set well above 0%. Lawrence |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I am running FAAH under UD Agent on two dedicated machines 24/7. I usually connect to the Internet 2 or 3 times a day to send completed WUs. Yesterday, one work unit took about 10 hours with only one checkpoint after about 5 hours. If someone is running a slightly slower processor and/or is doing other work on the machine, it is entirely possible that the program would never reach the first checkpoint in an 8-10 hour work day. If these extended checkpoints are really necessary, then the project may lose a very significant amount of processing time.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
FAAH checkpoints as often as it can.
The problem with checkpointing at any random point is that the program maintains massive, fiendishly complicated data structures and call stacks in memory, and saving or restoring this state is non-trivial and would need huge checkpoint files. Instead, it waits until it reaches the end of a particular part of the computation, where it can save a relatively small file. Your concern is valid, and it is part of the reasoning behind the minimum requirements for the project. Very occasionally we have to advise someone to leave their computer on for longer than usual while FAAH gets past a tricky spot. |
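For anyone curious what "checkpoint only at the end of a segment" looks like in practice, here is a minimal sketch in Python. It is not the actual FAAH code: the function name `run_with_checkpoints`, the JSON checkpoint file, and the idea of a single running `result` value are all assumptions made for illustration. The point it shows is that the saved state stays small because it is written only between segments, and that anything done inside an unfinished segment is lost on restart.

```python
import json
import os


def run_with_checkpoints(segments, checkpoint_path, process_segment):
    """Process segments in order, saving a small checkpoint after each one.

    Hypothetical sketch: `process_segment(result, segment)` stands in for the
    long computation and returns the updated result.
    """
    state = {"next_segment": 0, "result": 0}

    # On restart, resume from the last completed segment (if any).
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_segment"], len(segments)):
        # Work inside a segment is never saved; an interruption here
        # means this segment is redone from its beginning.
        state["result"] = process_segment(state["result"], segments[i])
        state["next_segment"] = i + 1

        # Write to a temporary file first, then swap it into place, so an
        # interruption during the save cannot leave a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["result"]
```

With this layout, a work unit made of many short segments loses at most a few minutes on a restart, while one with a single very long segment can lose hours, which matches what is being reported in this thread.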
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Didactylos:
Thanks for the explanation. I understand the reason, but it's the magnitude of the change that is difficult to fathom. I was seeing segment sizes of 1-2% which would correspond to checkpoints of about 12 minutes on my slow machine. Now I'm seeing a segment size of 50% or a checkpoint of about 7 hours on average. BTW, there is a post on another thread of a user who completes 40+% of a work unit and starts over at 0% after a proper exit and restart. Could be the same problem. |
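As a rough check on those numbers, here is a small sketch in Python. The 14 hour total runtime is an assumption, back-calculated from "50% takes about 7 hours"; the helper name `checkpoint_interval_minutes` is made up for illustration, and it assumes progress is roughly linear in time.

```python
def checkpoint_interval_minutes(segment_percent, total_runtime_hours):
    """Estimate minutes between checkpoints if one is saved per segment.

    Assumes progress is linear in time, which is only an approximation.
    """
    return total_runtime_hours * 60 * segment_percent / 100


# Assumed total runtime of about 14 hours on the slow machine.
small = checkpoint_interval_minutes(1.5, 14)   # segments of 1-2%
large = checkpoint_interval_minutes(50, 14)    # one segment of 50%
print(f"1.5% segment: about {small:.0f} minutes between checkpoints")
print(f"50% segment: about {large / 60:.0f} hours between checkpoints")
```

Under that assumption the two observations are consistent with the same machine speed, and the segment size alone accounts for the jump from minutes to hours.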
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
FAAH work units do vary in size. There are a few absolute whoppers out there.
|
||
|
Wikkel
Cruncher Joined: Dec 28, 2005 Post Count: 9 Status: Offline Project Badges: |
Ok, thank you all for the additional info.
2 out of 3 devices seem to be past the 50% barrier after 4 hours of crunching. So, it seems I was a bit impatient..... |
||
|
|