Thread Status: Active
Total posts in this thread: 6
This topic has been viewed 1863 times and has 5 replies
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
HST1 stdout.txt contains illegal characters after power-out / resume.

Had a main fuse blow last night. Powered up this morning and found that one task had gone back to the beginning [I think I have seen 1 or 2 reports of this before] and, after 3 hours of running, was projecting 5 days of computing. On opening the log, the text editor complained that the file contains illegal characters, which show as:

[04:17:33] [INFO] Checkpoint complete.
000000000000000000000000000000000000000000000000000000000000000000
00\00\00\00\00\00No previous checkpoint file present, assuming this is a new run.
[05:01:38] [INFO] Checkpoint complete.

The stderr.txt shows a similar run of illegal characters:

Writing checkpoint at step 61330.
[04:33:33] INFO: Running initial simulation

Reading checkpoint file state.cpt generated: Mon Apr 11 04:17:33 2016

000000000000000000000000000000000000000000000000000000000000000000
00INFO: No state to restore. Start from the beginning.
[04:56:24] INFO: Running initial simulation
Writing checkpoint at step 450.

Letting it run, as the TTC is dropping faster than the clock, but I presume it will take 19 hours to complete. Unsure whether there is even a point in letting it continue; it is now 12% complete and reporting normal checkpointing:

Mon 11 Apr 2016 09:35:50 AM CEST | World Community Grid | [checkpoint] result HST1_000601_000071_AC0021_T400_F00096_S00001_1 checkpointed

None of the other running HST have this issue.
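For anyone who wants to check their own slot logs for the same damage, here is a rough sketch. It assumes the "illegal characters" are NUL bytes (0x00), which is how the editor's `\00` rendering reads. The function names and the 4-byte threshold are hypothetical choices for illustration, not anything BOINC provides.

```python
import re


def find_nul_runs(data: bytes, min_len: int = 4):
    """Return (offset, length) for each run of at least min_len NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def scan_log(path: str, min_len: int = 4):
    """Read a log file as raw bytes and report its NUL runs."""
    with open(path, "rb") as handle:
        return find_nul_runs(handle.read(), min_len)


# Small in-memory example shaped like the stdout.txt excerpt above.
prefix = b"[04:17:33] [INFO] Checkpoint complete.\n"
sample = prefix + b"\x00" * 68 + b"No previous checkpoint file present\n"
runs = find_nul_runs(sample)
```

Calling `scan_log("stdout.txt")` in a slot directory would return an empty list for a clean log and one `(offset, length)` pair per damaged region otherwise.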
[Apr 11, 2016 7:47:16 AM]
Mumak
Senior Cruncher
Joined: Dec 7, 2012
Post Count: 477
Re: HST1 stdout.txt contains illegal characters after power-out / resume.

Seems like the power down happened after the checkpoint file was written, but before the file buffers were flushed to disk.
BOINC (or the app) could flush after writing a checkpoint (or write using write-through), but I think that for projects that write large checkpoints or checkpoint very frequently (and especially on SSDs) this might reduce performance.
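To make the flush idea concrete, here is a minimal sketch of the usual write-to-temp, fsync, then rename pattern. It is not how BOINC or the HST1 app actually writes `state.cpt`; the function name `write_checkpoint_durably` and the `.ckpt-` prefix are made up for illustration.

```python
import os
import tempfile


def write_checkpoint_durably(path: str, payload: bytes) -> None:
    """Replace the file at path with payload so that a power cut leaves
    either the old file or the new one, not a half-written mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push it to the disk
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # On POSIX systems, also sync the directory entry for the rename.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Each `os.fsync` call waits for the storage device, which is the performance cost mentioned above for large or frequent checkpoints.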
[Edit 1 times, last edit by Mumak at Apr 11, 2016 8:42:25 AM]
[Apr 11, 2016 8:41:00 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Re: HST1 stdout.txt contains illegal characters after power-out / resume.

It's a bit of a puzzle, as the client's own stdoutdae.txt log also has an illegal entry, an identical sequence, immediately after an entry at 4:38:16 saying an exclusive app ran [BOINC is paused when apt-get does its nightly package library update]. It goes:

Resuming computation (then on new line i.e. no time stamp)
\00\00\00\00 etc etc.

And then next entry at

4:56:21 Starting BOINC client... Linux 4.20.35 Kernel, BOINC 7.6.31 for x86_64-pc-linux.gnu

(Loosely copy typed info)

So where stdout.txt in the slot gives a time range between 4:17:33 and 5:01:38, and stderr.txt narrows it to between 4:33:33 and 4:56:24, stdoutdae.txt indicates the last checkpoint was at 4:17:34, i.e. the power failure came well after it.
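The narrowing can be worked through as a small sketch: the outage is bounded by the latest timestamp written before the garbage and the earliest timestamp written after it. This takes the times quoted in this thread at face value and assumes the 4:38:16 stdoutdae.txt entry was written before the power went; the dictionary labels and the helper `outage_window` are only illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def parse(ts: str) -> datetime:
    return datetime.strptime(ts, FMT)


def outage_window(before: dict, after: dict):
    """Latest 'still running' time and earliest 'running again' time."""
    earliest = max(parse(ts) for ts in before.values())
    latest = min(parse(ts) for ts in after.values())
    if earliest > latest:
        raise ValueError("log timestamps are inconsistent")
    return earliest, latest


# Last timestamps seen before the illegal characters in each log.
before_outage = {
    "stdout.txt": "04:17:33",
    "stderr.txt": "04:33:33",
    "stdoutdae.txt": "04:38:16",  # assumption: written before the outage
}
# First timestamps seen after the illegal characters in each log.
after_outage = {
    "stdoutdae.txt": "04:56:21",
    "stderr.txt": "04:56:24",
    "stdout.txt": "05:01:38",
}

last_checkpoint = parse("04:17:34")
earliest, latest = outage_window(before_outage, after_outage)
print("Outage between", earliest.strftime(FMT), "and", latest.strftime(FMT))
print("Time since last checkpoint: at least", earliest - last_checkpoint,
      "and at most", latest - last_checkpoint)
```

Under those assumptions the power went between 4:38:16 and 4:56:21, so roughly 21 to 39 minutes after the last recorded checkpoint.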

A stroke of misfortune, maybe; putting it on record in case others see this happening. It was just one of the 2 HST1 tasks running on that particular machine, out of 10 in total on 2 machines; the second machine is W8.1. The 2 on the Linux machine had their checkpointing well in sync, just seconds apart, and had been doing that for about 13 hours.
[Apr 11, 2016 12:23:37 PM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Re: HST1 stdout.txt contains illegal characters after power-out / resume.

"The 2 on the Linux had their checkpointing well in sync, just seconds apart and been doing that for about 13 hours."

Can you compare the Windows v. Linux performance? I don't think it has been mentioned yet. (Mint 18 has me interested.)
[Apr 11, 2016 6:03:50 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Re: HST1 stdout.txt contains illegal characters after power-out / resume.

A comparison is in the making for when I next boot into W10-64, also running at 2.4 GHz stock [see other thread in this forum]. I did notice that T400 tasks ran about 2 hours shorter than T300 under Linux, so it's a matter of making sure to compare the same Tnnn, i.e. like-for-like experiment tasks.
[Apr 11, 2016 6:18:27 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Re: HST1 stdout.txt contains illegal characters after power-out / resume.

Notwithstanding, the task completed and was not outright dismissed... it is in PVal, with the full computing time plus the time before the restart from zero counted on both Elapsed and CPU time (19 hours + 13 hours). Now the wait is on the first wingman, to see whether this becomes a PVer > Invalid cycle.
[Apr 12, 2016 8:04:26 AM]