| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 6
|
|
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Had a main fuse blow last night. Powered up and found this morning that one task had gone back to the beginning [I think I have seen 1 or 2 reports of this before] and was projecting 5 days of computing after 3 hours of running. Looking in the log, the text editor protested that the file contained illegal characters, which show as:

[04:17:33] [INFO] Checkpoint complete.
000000000000000000000000000000000000000000000000000000000000000000 00\00\00\00\00\00No previous checkpoint file present, assuming this is a new run.
[05:01:38] [INFO] Checkpoint complete.

The stderr.txt reports similar, with illegal chars:

Writing checkpoint at step 61330.
[04:33:33] INFO: Running initial simulation
Reading checkpoint file state.cpt generated: Mon Apr 11 04:17:33 2016
000000000000000000000000000000000000000000000000000000000000000000 00INFO: No state to restore. Start from the beginning.
[04:56:24] INFO: Running initial simulation
Writing checkpoint at step 450.

Letting it run, as the TTC is reducing faster than the clock, but I presume it will take 19 hours to complete. Unsure if there's even a point in having it continue; it is now 12% complete and reporting normal checkpointing:

Mon 11 Apr 2016 09:35:50 AM CEST | World Community Grid | [checkpoint] result HST1_000601_000071_AC0021_T400_F00096_S00001_1 checkpointed

None of the other running HST tasks have this issue. |
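The `\00` runs in the log excerpts above look like NUL bytes. For anyone wanting to check their own slot logs, here is a minimal Python sketch that reports where such runs sit in a file. The function names and the `min_length` threshold are hypothetical, chosen for illustration; nothing here comes from BOINC or the HST1 application.

```python
import re

# One or more consecutive NUL bytes.
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes, min_length: int = 4):
    """Return (offset, length) for each run of NUL bytes of at least min_length."""
    return [
        (match.start(), len(match.group()))
        for match in NUL_RUN.finditer(data)
        if len(match.group()) >= min_length
    ]


def find_nul_runs_in_file(path: str, min_length: int = 4):
    """Read a log file in binary mode and report its NUL runs."""
    with open(path, "rb") as handle:
        return find_nul_runs(handle.read(), min_length)
```

Calling `find_nul_runs_in_file("stderr.txt")` on a copy of the slot log would list the byte offset and length of each run, which can then be lined up against the neighbouring timestamps.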
||
|
|
Mumak
Senior Cruncher Joined: Dec 7, 2012 Post Count: 477 Status: Offline
|
Seems like the power went down after the checkpoint file was written, but before the file buffers were flushed to disk.
BOINC (or the app) might do a flush after writing a checkpoint (or write using write-through), but I think that for projects writing large or very frequent checkpoints (and especially on SSDs) this might reduce performance. [Edit 1 times, last edit by Mumak at Apr 11, 2016 8:42:25 AM] |
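To illustrate the flush idea in the post above, here is a minimal Python sketch of a common pattern: write to a temporary file, `fsync` it, then rename it over the old checkpoint. This is not how BOINC or the HST1 app is known to write `state.cpt`; the function name `write_checkpoint_atomically` and the `.ckpt-` prefix are hypothetical.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, payload: bytes) -> None:
    """Replace `path` with `payload` so a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its buffers to disk
        os.replace(tmp_path, path)  # swap the new file in under the old name
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # On POSIX systems, also sync the directory entry that records the rename.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The cost is the one raised in the post: every checkpoint waits for the disk, so large or frequent checkpoints pay for the extra safety.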
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
It's a bit of a puzzle, as the client's own stdoutdae.txt log also has an illegal entry, an identical sequence, immediately after a 4:38:16 entry for an exclusive app run [BOINC is paused when apt-get does its nightly package library update]. It goes:

Resuming computation
(then on a new line, i.e. no time stamp) \00\00\00\00 etc. etc.

And then the next entry, at 4:56:21:

Starting BOINC client...
Linux 4.20.35 Kernel, BOINC 7.6.31 for x86_64-pc-linux-gnu
(Loosely copy-typed info)

So where stdout.txt in the slot log gives a time range of between 4:17:33 and 5:01:38, and stderr.txt closes in further to between 4:33:33 and 4:56:24, the stdoutdae.txt indicates the last checkpoint was at 4:17:34, i.e. the power failure came well after it. A stroke of misfortune, maybe; put on record in case others see this happening. It was just one of the 2 HST1 tasks running on that particular machine, out of 10 in total on 2 machines; the second machine is W8.1. The 2 on the Linux had their checkpointing well in sync, just seconds apart and been doing that for about 13 hours. |
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline
|
"The 2 on the Linux had their checkpointing well in sync, just seconds apart and been doing that for about 13 hours."
Can you compare the Windows vs. Linux performance? I don't think it has been mentioned yet. (Mint 18 has me interested.) |
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
A comparison is in the making for when I next boot into W10-64, also running at the stock 2.4 GHz [see other thread in this forum]. I did notice that the T400 tasks ran about 2 hours shorter than the T300 under Linux, so it's a case of making sure to get the same Tnnn, for a like-for-like comparison of experiment tasks.
|
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Notwithstanding, the task completed and was not dismissed outright: it is in PVal with the full computing time plus the time before the restart from zero, on both Elapsed and CPU time (19 hours + 13 hours). Now the wait is on the first wingman, to see whether this becomes a PVer > Invalid cycle.
|
||
|
|
|