Thread Status: Active
Total posts in this thread: 51
This topic has been viewed 8013 times and has 50 replies
Coleslaw
Veteran Cruncher
USA
Joined: Mar 29, 2007
Post Count: 1343
Status: Offline
Re: New HFCC Beta test

Hate to bring this up....but I picked up two more...LOL
----------------------------------------

[Jul 23, 2010 3:22:24 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New HFCC Beta test

And it happened again - the checkpoint restart error is back:
One i7 crashed with three beta tasks at about 87%. Two of them restarted successfully from checkpoint, but BETA_HFCC_n1_00009198_n1_0000 restarted from 0% while the elapsed time value remained valid. So when it finishes, it will have taken nearly twice as long as usual, because 87% was calculated twice.
Had this several times with the last HFCC production version, so it's no surprise. Maybe some day writing checkpoints will be made crash proof... ;-)
[Jul 23, 2010 4:57:02 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: New HFCC Beta test

astrolab,

follow the HCC result X0000032380305200405261401_0 in the log. You see it start near the top and resume near the bottom, but no suspending or pausing in between. Something is not complete in between. i.e. it's happening to the other science apps too.

22/07/2010 11:33:35 AM|World Community Grid|Starting X0000032380305200405261401_0
22/07/2010 11:33:35 AM|World Community Grid|Starting task X0000032380305200405261401_0 using hcc1 version 608

... bunches of log entries...

22/07/2010 5:05:05 PM|World Community Grid|Resuming task X0000032380305200405261401_0 using hcc1 version 608
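
To line up every entry for one result in a long log, a small script can help. The following is only a minimal sketch, assuming the message log has been saved as text in the pipe-separated form shown above (timestamp|project|message). The function name `task_events` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%d/%m/%Y %I:%M:%S %p"  # matches "22/07/2010 11:33:35 AM"


def task_events(log_text, task_name):
    """Return (timestamp, message) pairs for log lines that mention task_name."""
    events = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3 or task_name not in parts[2]:
            continue  # skip elided or unrelated lines
        stamp = datetime.strptime(parts[0].strip(), LOG_TIME_FORMAT)
        events.append((stamp, parts[2].strip()))
    return events


# The three entries quoted in the post; the elided middle part is left out.
sample = """22/07/2010 11:33:35 AM|World Community Grid|Starting X0000032380305200405261401_0
22/07/2010 11:33:35 AM|World Community Grid|Starting task X0000032380305200405261401_0 using hcc1 version 608
22/07/2010 5:05:05 PM|World Community Grid|Resuming task X0000032380305200405261401_0 using hcc1 version 608"""

events = task_events(sample, "X0000032380305200405261401_0")
for stamp, message in events:
    print(stamp.isoformat(sep=" "), message)

gap = events[-1][0] - events[0][0]
print("time between first and last entry:", gap)
```

Run against the full log, the same filter would show whether any entry for that result sits between the start and the resume.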

Mysterious... You will probably want to add a few more log flags to cc_config.xml to find out what the client CPU scheduler is doing. The manual, with the suggested flags, is at http://boinc.berkeley.edu/wiki/Cc_config.xml. Simply have the config file re-read via the advanced menu, since only log flags are being activated.

<cpu_sched>
CPU scheduler actions (preemption and resumption).
<cpu_sched_debug>
Explain CPU scheduler decisions.
<sched_op_debug>
Details of scheduler RPCs; also shows deferral intervals and other low-level info. New in 5.10.24

which then will look in the file as

<cc_config>
<log_flags>
<cpu_sched>1</cpu_sched>
<cpu_sched_debug>1</cpu_sched_debug>
<sched_op_debug>1</sched_op_debug>
</log_flags>
</cc_config>
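
If typing the XML by hand feels error-prone, the file content can also be generated. This is a rough sketch, assuming the `<log_flags>` section sits inside the `<cc_config>` root element described in the manual linked above; the helper name `build_cc_config` is hypothetical, and the script only prints the text rather than touching any BOINC directory.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Build a <cc_config> tree with one enabled entry per log flag name."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return root


# The three flags suggested in the post.
root = build_cc_config(["cpu_sched", "cpu_sched_debug", "sched_op_debug"])
ET.indent(root)  # pretty-print; needs Python 3.9 or later
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The printed text would then be saved as cc_config.xml in the BOINC data directory and re-read from the advanced menu as described.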


There's no system info or client version data, so it's a guess, but I doubt it has anything to do with HFCC science performance... the application is still the same as the one that processed the previous 26 million copies.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jul 23, 2010 6:45:23 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: New HFCC Beta test

And it happened again - the checkpoint restart error is back:
One i7 crashed with three beta tasks at about 87%. Two of them restarted successfully from checkpoint, but BETA_HFCC_n1_00009198_n1_0000 restarted from 0% while the elapsed time value remained valid. So when it finishes, it will have taken nearly twice as long as usual, because 87% was calculated twice.
Had this several times with the last HFCC production version, so it's no surprise. Maybe some day writing checkpoints will be made crash proof... ;-)

Maybe one day you can make your PC crash proof. This happens because the client is not able to finish writing critical recovery information at the crucial moment. Two tasks recovering and one not rather underlines that.

The elapsed/wallclock time is not transmitted (it is not relevant to the credit award system), only the real CPU time. The latter you can see with a 6.10 client in the task properties, by selecting the task and hitting that button on the left.
[Jul 23, 2010 6:52:51 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New HFCC Beta test


Maybe one day you can make your PC crash proof.

I always try, but MS always shoots it down again...


This happens because the client is not able to finish writing critical recovery information at the crucial moment. Two tasks recovering and one not rather underlines that.

Checkpointing is very important, so it should be crash safe anyway. E.g. what about using two sets of checkpoint files? One will always be valid even if the other is corrupted by a crash while it is being written.
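
To make the two-set idea concrete, here is a minimal sketch of alternating checkpoint slots. It is not how BOINC or the HFCC application writes checkpoints; the file suffixes, the JSON payload and the CRC32 check are all assumptions chosen for illustration.

```python
import json
import os
import tempfile
import zlib


def _slot_paths(base):
    """Hypothetical naming scheme: two slot files next to each other."""
    return [f"{base}.a", f"{base}.b"]


def save_checkpoint(base, state, generation):
    """Write state to slot generation % 2, leaving the previous generation intact."""
    path = _slot_paths(base)[generation % 2]
    payload = json.dumps({"generation": generation, "state": state}).encode()
    record = zlib.crc32(payload).to_bytes(4, "big") + payload
    with open(path, "wb") as fh:
        fh.write(record)
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to push the data to disk


def load_checkpoint(base):
    """Return (generation, state) from the newest slot whose checksum matches, or None."""
    best = None
    for path in _slot_paths(base):
        try:
            with open(path, "rb") as fh:
                record = fh.read()
        except FileNotFoundError:
            continue
        checksum, payload = record[:4], record[4:]
        if len(checksum) < 4 or zlib.crc32(payload).to_bytes(4, "big") != checksum:
            continue  # torn or corrupted write: ignore this slot
        data = json.loads(payload)
        if best is None or data["generation"] > best[0]:
            best = (data["generation"], data["state"])
    return best


with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "task.ckpt")
    save_checkpoint(base, {"progress": 0.80}, generation=1)
    save_checkpoint(base, {"progress": 0.87}, generation=2)
    # Simulate a crash in the middle of writing generation 3 (truncated slot).
    with open(_slot_paths(base)[3 % 2], "wb") as fh:
        fh.write(b"\x00\x01")
    recovered = load_checkpoint(base)
    print(recovered)
```

In this sketch the interrupted write only damages the older slot, so the restart would fall back to generation 2 instead of 0%. The cost is one extra file and a checksum per checkpoint.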
[Jul 23, 2010 7:24:24 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: New HFCC Beta test

There's a bunch of redundancy and recovery built in, one being the client_state.xml. There's a backup of that made very frequently.

Maybe discuss this at the Berkeley developers forum, since when it happens to science app A, it will happen to B. The number of posts I've seen urging members to back up their CPDN models is countless, and then people still lose hundreds of hours, i.e. it's inherent to the way BOINC works, to me not something specific to a science. Whilst checkpoint recovery is very important to anyone, what would an additional layer add in overhead and how much payback would it provide? 0.01% on a global scale? If it happens to 100 of 80,000+ tasks daily, that's the percent being considered.
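
For reference, the two figures in that last question can be checked against each other. A quick sketch of the arithmetic, using only the numbers quoted above (80,000 is treated as a lower bound for "80,000+"):

```python
crashed_tasks_per_day = 100    # figure from the post
total_tasks_per_day = 80_000   # "80,000+" in the post, so a lower bound

share = crashed_tasks_per_day / total_tasks_per_day
print(f"{share:.3%} of daily tasks")

# 0.01% is one in 10,000, so this is the daily count that figure implies.
tasks_at_001_percent = total_tasks_per_day / 10_000
print(tasks_at_001_percent, "tasks per day at 0.01%")
```

So 100 of 80,000 works out to roughly 0.125%, while 0.01% would correspond to about 8 affected tasks a day at that volume.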
[Jul 23, 2010 7:55:57 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Beta's ignored by BOINC

Per Uplinger there were 1000 tasks that went out in quorum 3 (he notes the total of 3000 in the announcement). By lunch 2,820 had already validated. The snap-up was exceptional, as were the returns, for only when a Beta task is returned will a new one be allowed to be fetched [a great motivator]... With 3000, getting one 3-5 hours after the start is pretty improbable, unless there are repairs... but this HFCC test is to me more of an affirmation than a true Beta... very few repairs.


Sek, sorry for the confusion here. There were initially only going to be 1000 tasks. We bumped that number up to 3000. These were sent out with quorum of 2.

Currently the BETA is running very well. Error rate is very low. I believe all tasks have been sent out.

-Uplinger

Quorum 2, Init 3 still? (All 8 I received had that, with only 2 having the processing delay, so the wingman got a server abort upon quorum.) There were 5,906 validated in the first 24 hours, which is impressive... be it 99% or 66% :D
[Jul 23, 2010 8:08:39 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New HFCC Beta test

There's a bunch of redundancy and recovery built in, one being the client_state.xml. There's a backup of that made very frequently.

Maybe discuss this at the Berkeley developers forum, since when it happens to science app A, it will happen to B. The number of posts I've seen urging members to back up their CPDN models is countless, and then people still lose hundreds of hours, i.e. it's inherent to the way BOINC works, to me not something specific to a science. Whilst checkpoint recovery is very important to anyone, what would an additional layer add in overhead and how much payback would it provide? 0.01% on a global scale? If it happens to 100 of 80,000+ tasks daily, that's the percent being considered.


I think it's much less than 0.01% (it's not very likely that the system crashes exactly while a group of checkpoint files is being written, and we all know that Windows crashes next to never...). But if it happens, it's nevertheless a waste of CPU time.
And I do not blame BOINC for this error. The client_state.xml file does not hold any checkpoint information (apart from the last checkpointing time) as far as I know. It's simply a file corrupted by system failure. Too bad that it's a checkpoint file...

Sorry - could not resist writing my 100th posting... ;-)
[Jul 23, 2010 8:49:11 AM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline
Re: Beta's ignored by BOINC

There were initially only going to be 1000 tasks. We bumped that number up to 3000. These were sent out with quorum of 2.

Currently the BETA is running very well. Error rate is very low. I believe all tasks have been sent out.

-Uplinger
By 3000 in total I take it you mean 1000 tasks replicated 3 times, with a minimum quorum of 2. Out of interest, how many resends were there?
Picked up 10 overnight (last night); one was server recalled (I expect the quorum was met before it started) but the rest all validated. Nice to see the tasks go to several different systems. I saw 2 server aborts from my wingmen, so that is 3 from 30. So perhaps there are about 2700 valid or in progress.
----------------------------------------
[Edit 1 times, last edit by skgiven at Jul 24, 2010 1:38:02 AM]
[Jul 24, 2010 1:30:13 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Beta's ignored by BOINC

All my betas worked fine, even on an ancient machine that grabbed one by accident. An old Pentium picked one up and completed it in under the 4 days.
[Jul 25, 2010 8:00:11 PM]