World Community Grid Forums
Thread Status: Active | Total posts in this thread: 16
pvh513
Senior Cruncher | Joined: Feb 26, 2011 | Post Count: 260 | Status: Offline
I got my first repair job today. My two wingmen were both pending verification and when I looked at the details I found that one had checkpointed twice and the other had not. This looked suspicious to me. I know it is still very early days, but it is something to keep an eye out for.
https://secure.worldcommunitygrid.org/ms/devi....do?workunitId=1213206587
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
This seems to be true: for all my pending-verification results, the wingman checkpointed while mine didn't.
|
||
|
|
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I don't find this to be true. I've had wingmen with several checkpoints be validated just fine, and some of my own WUs with checkpoints went valid as normal too. I've even seen a wingman with a heartbeat problem get validated OK.
Where I have seen errors is where there is a checkpoint that has gone "backwards". But that's so weird that it must surely reflect something not right somewhere anyway. Just my observations. Proves nothing.
----------------------------------------
pvh513
Senior Cruncher | Joined: Feb 26, 2011 | Post Count: 260 | Status: Offline
Well, it proves that checkpointing can work correctly under some circumstances. I have seen that too. But whether it always works correctly is still an open question I think...
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Today I've seen 4 workunits go into PVer state, all in batch 00289. In all cases, my wingman had at least one instance of checkpoints going backwards.
This is one example:

10000 query sequences compared.
Checkpoint restored: 10115
Checkpoint restored: 10060
Checkpoint restored: 10115
10500 query sequences compared.

I'll try to catch their eventual outcome. Can a tech explain why this strange checkpoint behaviour occurs, please?
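For what it's worth, one common way a checkpoint can appear to go "backwards" is a non-atomic, in-place checkpoint write: if the process dies mid-write, the next restart can restore an older or torn value. This is purely a hypothetical sketch (these function names are mine, not the actual ugm1 or BOINC code); the usual remedy is to write to a temporary file and rename it into place:

```python
# Hypothetical illustration of safe vs unsafe checkpoint writes.
# None of these names come from the ugm1 application.
import os
import tempfile

def save_unsafe(path, count):
    # In-place overwrite: a crash mid-write can leave the old value,
    # a truncated file, or a mix of old and new bytes behind.
    with open(path, "w") as f:
        f.write(str(count))

def save_atomic(path, count):
    # Write to a temp file in the same directory, flush and fsync it,
    # then rename over the old file. rename/replace is atomic on POSIX,
    # so a later restart sees either the old or the new checkpoint,
    # never a partially written one.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    with os.fdopen(fd, "w") as f:
        f.write(str(count))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def restore(path):
    # Read back the last saved progress counter.
    with open(path) as f:
        return int(f.read())
```

If the real application overwrites its checkpoint in place, a kill or power loss at the wrong moment would explain a restore landing below the last progress point.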
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
A constructive comment, a suggestion for the technicians: if the result log were timestamped, as it is for other sciences, and the <checkpoint_debug> log flag were set, you could match the time an event occurs in the result log against the event log. I found just one result in PVer, where the wingman had a regressive checkpoint logged -and- a non-detrimental heartbeat issue; on my own side, not a single one across 12 pages of completed results, though none are assimilated yet.
Result Name: ugm1_ugm1_00327_1108_0

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Unable to open checkpoint file starting from 0
500 query sequences compared.
1000 query sequences compared.
1500 query sequences compared.
2000 query sequences compared.
2500 query sequences compared.
3000 query sequences compared.
3500 query sequences compared.
4000 query sequences compared.
4500 query sequences compared.
5000 query sequences compared.
5500 query sequences compared.
6000 query sequences compared.
6500 query sequences compared.
7000 query sequences compared.
7500 query sequences compared.
8000 query sequences compared.
8500 query sequences compared.
9000 query sequences compared.
9500 query sequences compared.
10000 query sequences compared.
10500 query sequences compared.
08:50:06 (6444): No heartbeat from client for 30 sec - exiting
08:50:06 (6444): timer handler: client dead, exiting
Checkpoint restored: 10826
11000 query sequences compared.
11500 query sequences compared.
12000 query sequences compared.
12500 query sequences compared.
13000 query sequences compared.
13500 query sequences compared.
14000 query sequences compared.
14500 query sequences compared.
15000 query sequences compared.
15500 query sequences compared.
16000 query sequences compared.
Checkpoint restored: 16069
Checkpoint restored: 16016
16500 query sequences compared.
17000 query sequences compared.
17500 query sequences compared.
18000 query sequences compared.
18500 query sequences compared.
19000 query sequences compared.
19500 query sequences compared.
20000 query sequences compared.
20500 query sequences compared.
21000 query sequences compared.
21500 query sequences compared.
22000 query sequences compared.
22500 query sequences compared.
23000 query sequences compared.
Run complete, CPU time: 17285.640947
19:41:02 (5968): called boinc_finish
</stderr_txt>
]]>

Still waiting with bated breath to also see the OS info reported in the log or on the result status pages, for more DIY analysis.
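To illustrate the timestamping suggestion above: a sketch of a logging helper (the `log` function and its format are my own assumption, not anything WCG or BOINC actually provides) that prefixes each result-log line with a wall-clock stamp, so it could be matched against the client's event log:

```python
# Hypothetical sketch of timestamped result-log output.
import sys
import time

def log(msg, stream=sys.stdout):
    # Prefix each message with local wall-clock time, e.g.
    # "2011-03-05 08:50:06 Checkpoint restored: 10826",
    # so the line can be correlated with the BOINC event log.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    stream.write(f"{stamp} {msg}\n")
```

With lines like these, a regressive "Checkpoint restored" could be tied to a specific client event (suspend, restart, heartbeat loss) rather than guessed at.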
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Here's a wingman with just a single checkpoint-restore in a workunit that went Invalid. Mine and the repair job went Valid.
Result Name: ugm1_ugm1_00330_1312_1

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Unable to open checkpoint file starting from 0
500 query sequences compared.
1000 query sequences compared.
1500 query sequences compared.
. . .
21000 query sequences compared.
21500 query sequences compared.
22000 query sequences compared.
Checkpoint restored: 22294
22500 query sequences compared.
23000 query sequences compared.
Run complete, CPU time: 6521.856046
14:50:34 (5312): called boinc_finish
</stderr_txt>
]]>

So you don't have to have regressive checkpoint-restores to end up with Invalid.
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Earlier I'd seen 4 workunits go into PVer state, all in batch 00289, where in each case my wingman had at least one instance of checkpoints going backwards. Those cases and 2 others have all gone Valid for me and a repair job, and Invalid for the wingman with the regressive checkpoint-restores. I hope you can discover the cause and solve it, techs.
----------------------------------------
PMH_UK
Veteran Cruncher, UK | Joined: Apr 26, 2007 | Post Count: 786 | Status: Offline
Other examples of checkpoint backups:
----------------------------------------
ugm1_ugm1_00376_0722_1
...
6000 query sequences compared.
Checkpoint restored: 6304
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
Checkpoint restored: 6160
6500 query sequences compared.

----------------------------------------
ugm1_ugm1_00332_1499_1
...
19000 query sequences compared.
Checkpoint restored: 19258
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
Checkpoint restored: 19084
19500 query sequences compared.

----------------------------------------
ugm1_ugm1_00261_0011_0
...
4000 query sequences compared.
Checkpoint restored: 4001
Checkpoint restored: 3808
4000 query sequences compared.

----------------------------------------
ugm1_ugm1_00051_1226_0
...
11500 query sequences compared.
Checkpoint restored: 11662
Checkpoint restored: 11784
Checkpoint restored: 11547
Checkpoint restored: 11547
Checkpoint restored: 11547
Checkpoint restored: 11658
12000 query sequences compared.

Paul.
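For anyone who wants to scan their own stderr output for this pattern, here is a small sketch (the function name is mine, and the line formats it matches are assumed from the excerpts quoted in this thread) that flags "Checkpoint restored" values lower than the last progress point:

```python
# Hypothetical scanner for regressive checkpoint restores in a
# ugm1-style stderr log. Line formats are assumed from forum excerpts.
import re

def regressive_restores(stderr_text):
    """Return (previous, restored) pairs where a 'Checkpoint restored'
    value is lower than the last progress point seen in the log."""
    last = None
    hits = []
    for line in stderr_text.splitlines():
        line = line.strip()
        m = re.match(r"Checkpoint restored: (\d+)", line)
        if m:
            cur = int(m.group(1))
            if last is not None and cur < last:
                hits.append((last, cur))
            last = cur
            continue
        m = re.match(r"(\d+) query sequences compared", line)
        if m:
            last = int(m.group(1))
    return hits
```

Running it over the third example above (restored 4001, then 3808) reports the single backward step from 4001 to 3808.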
----------------------------------------
pvh513
Senior Cruncher | Joined: Feb 26, 2011 | Post Count: 260 | Status: Offline
I have had several WUs fail after a power failure where it looks like the job didn't restart correctly / at all after the power came back on:
https://secure.worldcommunitygrid.org/ms/devi....do?workunitId=1217672773 (mine is ugm1_ugm1_00352_0005_1, obviously). There is no "called boinc_finish" message at the end: .... I guess it could be that the power went down just when it was writing the checkpoint, but I had several of those, so I must have been very unlucky then... I have also seen instances where the "Checkpoint restored:" message seemed to be jumping backwards. With a bit more statistics I am getting convinced that WUs that are restarted from a checkpoint have a fairly significant probability of failing. It's not 100%, but it is certainly not negligible either.
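If the power really did go down mid-write, a checkpoint format that can detect a torn file would let the application fall back cleanly (e.g. restart from zero) instead of failing. A hypothetical sketch, not the actual application code: store the counter with a checksum and write it via an atomic rename, then reject any file whose checksum does not verify:

```python
# Hypothetical crash-safe checkpoint: counter + SHA-256 checksum,
# written via atomic rename. Not the actual ugm1 implementation.
import hashlib
import os
import struct

def save_checkpoint(path, count):
    payload = struct.pack("<q", count)          # 8-byte little-endian counter
    digest = hashlib.sha256(payload).digest()   # 32-byte checksum
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload + digest)
        f.flush()
        os.fsync(f.fileno())                    # force bytes to disk
    os.replace(tmp, path)                       # atomic on POSIX

def load_checkpoint(path):
    # Returns the saved counter, or None if the file is missing or was
    # torn by a crash mid-write (short read or checksum mismatch).
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    payload, digest = blob[:8], blob[8:]
    if len(payload) < 8 or hashlib.sha256(payload).digest() != digest:
        return None
    return struct.unpack("<q", payload)[0]
```

A loader like this would turn "WU dies after power failure" into "WU restarts from scratch", which wastes time but keeps the result Valid.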