World Community Grid

Hi Pirxx,
This depends on the project code and on the work unit. With the old HPF project, checkpoints were written every few minutes, so you never lost even 1% of a work unit's progress. FAAH is much less predictable. The computation has to reach a certain point in the code before it can checkpoint, and how quickly a work unit gets there is highly variable. A checkpoint is written every time the green line (in the graphic) reaches the right edge and starts over. The first time this happens, a red line is drawn. For a few work units this never happens; for others it happens every few minutes. Typically it happens 3 or 4 times an hour, but there is no guarantee. So you can have 0, 1, 2, ... 30+ checkpoints per work unit. There is just no telling.
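
To make the "has to reach a certain point in the code" idea concrete, here is a minimal Python sketch. It is not the FAAH code, just an illustration under assumptions: the function name `run_work_unit`, the `phase_lengths` list, and the `stop_after_steps` interruption switch are all made up for the example. State is saved only at a phase boundary, so anything done inside an unfinished phase is lost when the run is interrupted.

```python
import json
import os


def run_work_unit(phase_lengths, checkpoint_path, stop_after_steps=None):
    """Run phases in order, saving a checkpoint only when a phase completes.

    Hypothetical illustration: phase_lengths[i] is the number of steps in
    phase i, and stop_after_steps simulates the client being shut down.
    Returns True if the work unit finished, False if it was interrupted.
    """
    # Resume from the last checkpoint, if one was ever written.
    start_phase = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start_phase = json.load(f)["next_phase"]

    steps_done = 0
    for phase in range(start_phase, len(phase_lengths)):
        for _ in range(phase_lengths[phase]):
            if stop_after_steps is not None and steps_done >= stop_after_steps:
                # Interrupted mid-phase: progress in this phase is lost.
                return False
            steps_done += 1  # stand-in for one unit of real computation

        # Phase boundary: the only place where state is safe to save.
        # Write to a temp file first so a crash cannot corrupt the checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_phase": phase + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return True
```

With phases of very uneven length, a work unit whose first phase is long can be interrupted before any checkpoint exists at all, while one with many short phases checkpoints constantly. That is the 0 versus 30+ spread described above.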

As Didactylos says, the only way to be sure of capturing an image of a work unit at any arbitrary point in its progress is to store hundreds of megabytes containing the whole process together with all its memory arrays from virtual memory, which just does not seem worth the trouble.

HDC is more consistent than FAAH, but it does some massive streaming I/O to store checkpoints and uses much more disk space than FAAH. It probably could not checkpoint at all without this, but it causes some people a lot of trouble, so they have to avoid HDC. The advance word is that the upcoming projects include some with much smaller requirements, but we shall just have to wait and see what their checkpointing behavior is like.

Lawrence