Thread Status: Active
Total posts in this thread: 98
Posts: 98   Pages: 10   [ Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page ]
Author
Previous Thread This topic has been viewed 8497 times and has 97 replies Next Thread
TXR13
Cruncher
Canada
Joined: Dec 5, 2005
Post Count: 36
Status: Offline
Re: Monster WU on the loose...

Hi Sekerob,

I do not have checkpoint debugging enabled, so I've been looking at the checkpoint files in the slot directory for the unit. Here are the current stats:

CPU time: 1 day, 8 hours, 38 minutes, 11 seconds.
Progress: 4.402%
Current time: 07:26
All checkpoint files show a time of 03:47 earlier this morning.
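
A rough sketch of what those figures imply, using only the numbers quoted above. The linear projection is an assumption: it supposes progress grows in proportion to CPU time, which may well not hold for this work unit.

```python
from datetime import timedelta

# Figures quoted in the post above.
cpu_time = timedelta(days=1, hours=8, minutes=38, seconds=11)
progress = 4.402 / 100  # fraction of the work unit reported as done

# Assumption: progress is linear in CPU time (may not be true for this WU).
projected_total = timedelta(seconds=cpu_time.total_seconds() / progress)

# Gap between the newest checkpoint file (03:47) and the current time (07:26).
since_checkpoint = timedelta(hours=7, minutes=26) - timedelta(hours=3, minutes=47)

print(f"CPU time so far:       {cpu_time.total_seconds():.0f} s")
print(f"Projected total:       about {projected_total.days} days of CPU time")
print(f"Since last checkpoint: {since_checkpoint}")
```

If the linear assumption were right, the unit would need roughly a month of CPU time, so the projection is best read as an upper-level sanity check rather than a forecast.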

Last night, when I made the original post, the two checkpoint files (00 and 01) had different times showing, which is how I was able to track when the last checkpoint occurred. If memory serves, I checked the progress around 19:33, and the checkpoint 01 file was showing a time of 15:20, give or take a few minutes.

Incidentally, it's not like this system is underpowered. It's a 1.4GHz P3S, with the double cache. What's more, there are two of those CPUs, both running at 97-99% utilization for BOINC, with 4GB of RAM for them to play with. Benchmarking indicates that it has better floating point performance than all three of my P4s, and better integer performance than two of them. I know benchmarks can be misleading, and I've heard that BOINC's benchmarking in particular has been a little wacky. However, all of these machines are running the same version, so I'd think the wackiness would even itself out, unless it was a flat-out bias towards older machines.

EDIT: I have now activated checkpoint debugging on the machine in question. I'll report back as things develop (or don't).
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by TXR13 at May 29, 2009 2:42:18 PM]
[May 29, 2009 2:33:02 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Monster WU on the loose...

Maybe the hardware matters, but I also suspect that version 6.13 of the application does not behave exactly the same under Windows and under Linux.
Yesterday I had a few WUs under Windows which behaved exactly as knreed described in his Beta Test announcement, except that the cutoff time was 4 hours. Those WUs started from the beginning on a 4-hour basis, with smaller percentage increments, and when they reached 4 hours of runtime they were at 99.9xx % and they stopped with the expected message "Finishing early because max runtime has been exceeded" in their Result Log.
Then I switched back to Ubuntu 64 to push what I believed to be "looping" WUs and I saw completely different behavior. The WUs started as normal "short" WUs, "blocked" on some tough positions for various durations, then resumed and finished when they could. None started on a 4-hour basis, and those which exceeded that time went on to their normal end even when they passed the 4-hour limit at percentages in the 25-30 % range. The biggest one even reached 8.58 hours without stopping at 4 or 8.

All the above on a Q6600 overclocked by 20 %, so there is no question of slow machines or SSE instructions around.

Strange. I would like some comments by the techs... Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[May 29, 2009 2:39:24 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Monster WU on the loose...

Not really, since P3 in general are outliers in the total pool of volunteers

But I'm not looking at P3s at this stage. They crashed out on 611, so all machines then were P4 and upwards. If my P4 takes 21 hours for one position, then the average time for an average P4-and-above looks like 15 hours. Yes, a fast P3 might be around 40 hours, but even 15 hours is kinda long.

It'll be interesting to see what will happen to this WU and to TXR13's. Even if our machines run the distance, will we ever get validation? It'll be pretty disappointing if they hit BOINC's CPU time limit and abort.
[May 29, 2009 2:52:05 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Monster WU on the loose...

I just had a look in the client_state.xml for a running HCMD2 job; techies know how to do so too. My charts calculated a mean of 1.67 hours yesterday for the project's tasks, with a timeout of 100x... I think knreed mentioned that.

Here's the job's restrictions:

<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>
<rsc_fpops_bound>1114505924878500.000000</rsc_fpops_bound>
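
A minimal sketch confirming the 100x relationship between those two values. It only reads a pasted copy of the two lines and never touches the real client_state.xml; the helper name `read_value` is made up for illustration.

```python
import re

# The two elements exactly as quoted from client_state.xml above.
snippet = """
<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>
<rsc_fpops_bound>1114505924878500.000000</rsc_fpops_bound>
"""

def read_value(tag: str, text: str) -> float:
    """Return the numeric content of the first <tag>...</tag> found in text."""
    match = re.search(rf"<{tag}>([^<]+)</{tag}>", text)
    if match is None:
        raise ValueError(f"{tag} not found")
    return float(match.group(1))

fpops_est = read_value("rsc_fpops_est", snippet)
fpops_bound = read_value("rsc_fpops_bound", snippet)

# How many times the estimate the task may consume before hitting the bound.
ratio = fpops_bound / fpops_est
print(f"bound / estimate = {ratio:.1f}")
```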
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[May 29, 2009 2:58:45 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Monster WU on the loose...

I'd expect a position to always be allowed to finish no matter what, rather than being cut at exactly 4 or 8 hours, or dead close to it, which is what RICE is programmed to do with its very short seeds.

TXR13, I think it's a choice. Once you've checked in the quorum detail whether the wingman finished the task and how, and with your job still making progress, you could decide to abort and let some other powerhouse try the run.

My clients have now done many hundreds and have yet to come across a monster, touch wood, so they must be rarer than the discussion suggests.
----------------------------------------
[May 29, 2009 5:25:18 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
confused .

.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 2, 2009 3:52:28 AM]
[May 29, 2009 8:50:00 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Monster WU on the loose...

If only the techs could predict this, it would have been a whole lot easier to size the jobs. Let it run, as the tasks are allowed to run up to 100x their original predicted runtime.
----------------------------------------
[May 29, 2009 8:55:39 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
.

.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 2, 2009 3:52:22 AM]
[May 29, 2009 8:58:19 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Monster WU on the loose...

See my previous post above and find <rsc_fpops_est> in the client_state.xml (DON'T meddle with that file or you'll be served the penalty of a corrupted client!)

Just checking: if you find a task, say FLU, and it has
<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>, all other jobs in the queue of the same date/project will have the same value. The servers compute this projected run time daily from actual returned work, and it is then used for new work sent out. Somewhere there's a formula which allows the calculation back into seconds. The current benchmark values are part of that formula. My crawler currently computes 8:59 hours from that fpops value, and 8:54 after forcing a new benchmark, hurray.
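
As a hedged illustration of that back-calculation, the simplest possible form is operations divided by host speed. This is an assumption, not the formula the client or the crawler actually uses (which likely applies further correction factors), and the host speed below is a made-up value rather than anyone's real benchmark.

```python
def estimated_runtime_hours(fpops_est: float, host_flops: float) -> float:
    """Simplified estimate: floating-point operations / host speed, in hours."""
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return fpops_est / host_flops / 3600

FPOPS_EST = 11145059248785.0      # value quoted in the post above
HYPOTHETICAL_HOST_FLOPS = 1.0e9   # assumed benchmark result, for illustration only

hours = estimated_runtime_hours(FPOPS_EST, HYPOTHETICAL_HOST_FLOPS)
print(f"Estimated runtime at 1 GFLOPS: {hours:.2f} h")

# A 100x bound on the same host would allow 100 times that runtime.
print(f"Runtime allowed by a 100x bound: {hours * 100:.0f} h")
```

A slower benchmark result raises the estimate in direct proportion, which fits the observation that forcing a new benchmark shifted the computed figure slightly.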
----------------------------------------
[May 29, 2009 9:27:56 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Monster WU on the loose...

Checkpoint update for me:
374.855400 secs
78189.820000 secs (2.5157%)
155955.500000 secs (3.1447%)
232562.700000 secs (4.4025%)
310771.500000 secs (5.6604%)
390030.800000 secs (11.3208%)
467918.100000 secs (11.9497%)
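
A small sketch of the spacing between those checkpoints, assuming each line is the cumulative CPU time in seconds at which a checkpoint was written, with the reported progress in parentheses.

```python
# (cumulative CPU seconds, reported progress in percent) from the list above.
# The first entry has no percentage in the post, so it is recorded as None.
checkpoints = [
    (374.8554, None),
    (78189.82, 2.5157),
    (155955.5, 3.1447),
    (232562.7, 4.4025),
    (310771.5, 5.6604),
    (390030.8, 11.3208),
    (467918.1, 11.9497),
]

# Hours of CPU time between each pair of consecutive checkpoints.
intervals_hours = [
    (later[0] - earlier[0]) / 3600
    for earlier, later in zip(checkpoints, checkpoints[1:])
]
mean_interval = sum(intervals_hours) / len(intervals_hours)

for hours in intervals_hours:
    print(f"{hours:.2f} h")
print(f"mean: {mean_interval:.2f} h")
```

Under that reading, each checkpoint arrives a little under 22 hours of CPU time after the previous one, while the progress gained per interval varies considerably.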
----------------------------------------
[Edit 5 times, last edit by Former Member at Jun 2, 2009 2:01:03 PM]
[May 30, 2009 2:15:00 AM]