World Community Grid Forums
Thread Status: Active | Total posts in this thread: 16
pvh513
Senior Cruncher | Joined: Feb 26, 2011 | Post Count: 260 | Status: Offline
I got my first repair job today. My two wingmen were both pending verification and when I looked at the details I found that one had checkpointed twice and the other had not. This looked suspicious to me. I know it is still very early days, but it is something to keep an eye out for.
https://secure.worldcommunitygrid.org/ms/devi....do?workunitId=1213206587
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
This seems to be true: for all my pending-verification results, the wingman checkpointed while mine didn't.
|
||
|
|
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I don't find this to be true. I've had wingmen with several checkpoints be validated just fine, and some of my own WUs with checkpoints went valid as normal too. I've even seen a wingman with a heartbeat problem get validated OK.
Where I have seen errors is where there is a checkpoint that has gone "backwards". But that's so weird that it must surely reflect something not right somewhere anyway. Just my observations. Proves nothing.
----------------------------------------
pvh513
Senior Cruncher | Joined: Feb 26, 2011 | Post Count: 260 | Status: Offline
Well, it proves that checkpointing can work correctly under some circumstances. I have seen that too. But whether it always works correctly is still an open question I think...
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Today I've seen 4 workunits go into PVer state, all in batch 00289. In all cases, my wingman had at least one instance of checkpoints going backwards.
This is one example:

10000 query sequences compared.
Checkpoint restored: 10115
Checkpoint restored: 10060
Checkpoint restored: 10115
10500 query sequences compared.

I'll try to catch their eventual outcome. Can a tech explain why this strange checkpoint behaviour occurs, please?
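For what it's worth, one common way a checkpoint can appear to go "backwards" is a non-atomic, in-place checkpoint write: if the process dies mid-write, the next restart can restore an older or torn value. This is purely a hypothetical sketch (these function names are mine, not the actual ugm1 or BOINC code); the usual remedy is to write to a temporary file and rename it into place:

```python
# Hypothetical illustration of safe vs unsafe checkpoint writes.
# None of these names come from the ugm1 application.
import os
import tempfile

def save_unsafe(path, count):
    # In-place overwrite: a crash mid-write can leave the old value,
    # a truncated file, or a mix of old and new bytes behind.
    with open(path, "w") as f:
        f.write(str(count))

def save_atomic(path, count):
    # Write to a temp file in the same directory, flush and fsync it,
    # then rename over the old file. rename/replace is atomic on POSIX,
    # so a later restart sees either the old or the new checkpoint,
    # never a partially written one.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    with os.fdopen(fd, "w") as f:
        f.write(str(count))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def restore(path):
    # Read back the last saved progress counter.
    with open(path) as f:
        return int(f.read())
```

If the real application overwrites its checkpoint in place, a kill or power loss at the wrong moment would explain a restore landing below the last progress point.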
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
A constructive comment, a suggestion for the technicians: if the result log were timestamped, as it is for other sciences, and the <checkpoint_debug> log flag were set, you could match the time an event occurs in the result log against the event log. I found just one result in PVer, where the wingman had a regressive checkpoint logged -and- a non-detrimental heartbeat issue; on my own side, not a single one across 12 pages of completed results, though none are assimilated yet.
Result Name: ugm1_ugm1_00327_1108_0

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Unable to open checkpoint file starting from 0
500 query sequences compared.
1000 query sequences compared.
1500 query sequences compared.
2000 query sequences compared.
2500 query sequences compared.
3000 query sequences compared.
3500 query sequences compared.
4000 query sequences compared.
4500 query sequences compared.
5000 query sequences compared.
5500 query sequences compared.
6000 query sequences compared.
6500 query sequences compared.
7000 query sequences compared.
7500 query sequences compared.
8000 query sequences compared.
8500 query sequences compared.
9000 query sequences compared.
9500 query sequences compared.
10000 query sequences compared.
10500 query sequences compared.
08:50:06 (6444): No heartbeat from client for 30 sec - exiting
08:50:06 (6444): timer handler: client dead, exiting
Checkpoint restored: 10826
11000 query sequences compared.
11500 query sequences compared.
12000 query sequences compared.
12500 query sequences compared.
13000 query sequences compared.
13500 query sequences compared.
14000 query sequences compared.
14500 query sequences compared.
15000 query sequences compared.
15500 query sequences compared.
16000 query sequences compared.
Checkpoint restored: 16069
Checkpoint restored: 16016
16500 query sequences compared.
17000 query sequences compared.
17500 query sequences compared.
18000 query sequences compared.
18500 query sequences compared.
19000 query sequences compared.
19500 query sequences compared.
20000 query sequences compared.
20500 query sequences compared.
21000 query sequences compared.
21500 query sequences compared.
22000 query sequences compared.
22500 query sequences compared.
23000 query sequences compared.
Run complete, CPU time: 17285.640947
19:41:02 (5968): called boinc_finish
</stderr_txt>
]]>

Still waiting with bated breath to also see the OS info reported in the log or on the result status pages, for more DIY analysis.
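To illustrate the timestamping suggestion above: a sketch of a logging helper (the `log` function and its format are my own assumption, not anything WCG or BOINC actually provides) that prefixes each result-log line with a wall-clock stamp, so it could be matched against the client's event log:

```python
# Hypothetical sketch of timestamped result-log output.
import sys
import time

def log(msg, stream=sys.stdout):
    # Prefix each message with local wall-clock time, e.g.
    # "2011-03-05 08:50:06 Checkpoint restored: 10826",
    # so the line can be correlated with the BOINC event log.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    stream.write(f"{stamp} {msg}\n")
```

With lines like these, a regressive "Checkpoint restored" could be tied to a specific client event (suspend, restart, heartbeat loss) rather than guessed at.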
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Here's a wingman with just a single checkpoint-restore in a workunit that went Invalid. Mine and the repair job went Valid.
Result Name: ugm1_ugm1_00330_1312_1

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Unable to open checkpoint file starting from 0
500 query sequences compared.
1000 query sequences compared.
1500 query sequences compared.
. . .
21000 query sequences compared.
21500 query sequences compared.
22000 query sequences compared.
Checkpoint restored: 22294
22500 query sequences compared.
23000 query sequences compared.
Run complete, CPU time: 6521.856046
14:50:34 (5312): called boinc_finish
</stderr_txt>
]]>

So you don't have to have regressive checkpoint-restores to end up with Invalid.
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Earlier I'd seen 4 workunits go into PVer state, all in batch 00289, where in each case my wingman had at least one instance of checkpoints going backwards. Those cases and 2 others have all gone Valid for me and a repair job, and Invalid for the wingman with the regressive checkpoint-restores. I hope you can discover the cause and solve it, techs.
----------------------------------------
PMH_UK
Veteran Cruncher, UK | Joined: Apr 26, 2007 | Post Count: 786 | Status: Offline
Other examples of checkpoint backups:
----------------------------------------
ugm1_ugm1_00376_0722_1
...
6000 query sequences compared.
Checkpoint restored: 6304
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
6500 query sequences compared.

----------------------------------------
ugm1_ugm1_00332_1499_1
...
19000 query sequences compared.
Checkpoint restored: 19258
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
19500 query sequences compared.

----------------------------------------
ugm1_ugm1_00261_0011_0
...
4000 query sequences compared.
Checkpoint restored: 4001
Checkpoint restored: 3808
4000 query sequences compared.

----------------------------------------
ugm1_ugm1_00051_1226_0
...
11500 query sequences compared.
Checkpoint restored: 11662
Checkpoint restored: 11784
Checkpoint restored: 11547
Checkpoint restored: 11547
Checkpoint restored: 11547
Checkpoint restored: 11658
12000 query sequences compared.

Paul.
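For anyone who wants to scan their own stderr output for this pattern, here is a small sketch (the function name is mine, and the line formats it matches are assumed from the excerpts quoted in this thread) that flags "Checkpoint restored" values lower than the last progress point:

```python
# Hypothetical scanner for regressive checkpoint restores in a
# ugm1-style stderr log. Line formats are assumed from forum excerpts.
import re

def regressive_restores(stderr_text):
    """Return (previous, restored) pairs where a 'Checkpoint restored'
    value is lower than the last progress point seen in the log."""
    last = None
    hits = []
    for line in stderr_text.splitlines():
        line = line.strip()
        m = re.match(r"Checkpoint restored: (\d+)", line)
        if m:
            cur = int(m.group(1))
            if last is not None and cur < last:
                hits.append((last, cur))
            last = cur
            continue
        m = re.match(r"(\d+) query sequences compared", line)
        if m:
            last = int(m.group(1))
    return hits
```

Running it over the third example above (restored 4001, then 3808) reports the single backward step from 4001 to 3808.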
----------------------------------------
pvh513
Senior Cruncher | Joined: Feb 26, 2011 | Post Count: 260 | Status: Offline
I have had several WUs fail after a power failure where it looks like the job didn't restart correctly / at all after the power came back on:
https://secure.worldcommunitygrid.org/ms/devi....do?workunitId=1217672773 (mine is ugm1_ugm1_00352_0005_1, obviously). There is no "called boinc_finish" message at the end: .... I guess it could be that the power went down just when it was writing the checkpoint, but I had several of those, so I must have been very unlucky then... I have also seen instances where the "Checkpoint restored:" message seemed to be jumping backwards. With a bit more statistics I am getting convinced that WUs that are restarted from a checkpoint have a fairly significant probability of failing. It's not 100%, but it is certainly not negligible either.
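If the power really did go down mid-write, a checkpoint format that can detect a torn file would let the application fall back cleanly (e.g. restart from zero) instead of failing. A hypothetical sketch, not the actual application code: store the counter with a checksum and write it via an atomic rename, then reject any file whose checksum does not verify:

```python
# Hypothetical crash-safe checkpoint: counter + SHA-256 checksum,
# written via atomic rename. Not the actual ugm1 implementation.
import hashlib
import os
import struct

def save_checkpoint(path, count):
    payload = struct.pack("<q", count)          # 8-byte little-endian counter
    digest = hashlib.sha256(payload).digest()   # 32-byte checksum
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload + digest)
        f.flush()
        os.fsync(f.fileno())                    # force bytes to disk
    os.replace(tmp, path)                       # atomic on POSIX

def load_checkpoint(path):
    # Returns the saved counter, or None if the file is missing or was
    # torn by a crash mid-write (short read or checksum mismatch).
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    payload, digest = blob[:8], blob[8:]
    if len(payload) < 8 or hashlib.sha256(payload).digest() != digest:
        return None
    return struct.unpack("<q", payload)[0]
```

A loader like this would turn "WU dies after power failure" into "WU restarts from scratch", which wastes time but keeps the result Valid.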