World Community Grid Forums
Category: Completed Research Forum: Help Cure Muscular Dystrophy - Phase 2 Forum Thread: Monster WU on the loose... |
Thread Status: Active Total posts in this thread: 98
TXR13
Cruncher Canada Joined: Dec 5, 2005 Post Count: 36 Status: Offline Project Badges: |
Hi Sekerob,
I do not have checkpoint debugging enabled, so I've been looking at the checkpoint files in the slot directory for the unit. Here are the current stats:

CPU time: 1 day, 8 hours, 38 minutes, 11 seconds
Progress: 4.402%
Current time: 07:26

All checkpoint files show a time of 03:47 earlier this morning. Last night, when I made the original post, the two checkpoint files (00 and 01) had different times showing, which is how I was able to track when the last checkpoint occurred. If memory serves, I checked the progress around 19:33, and the checkpoint 01 file was showing a time of 15:20, give or take a few minutes.

Incidentally, it's not like this system is underpowered. It's a 1.4GHz P3S, with the double cache. What's more, there are two of those CPUs, both running at 97-99% utilization for BOINC, with 4GB of RAM for them to play with. Benchmarking indicates that it has better floating point performance than all three of my P4s, and better integer performance than two of them. I know benchmarks can be misleading, and I've heard that BOINC's benchmarking in particular has been a little wacky. However, all of these machines are running the same version, so I'd think the wackiness would even itself out, unless it was a flat-out bias towards older machines.

EDIT: I have now activated checkpoint debugging on the machine in question. I'll report back as things develop (or don't).

[Edit 1 times, last edit by TXR13 at May 29, 2009 2:42:18 PM] |
||
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3715 Status: Offline Project Badges: |
Maybe the hardware matters, but I also suspect that version 6.13 of the application does not behave exactly the same under Windows and under Linux.

Yesterday I had a few WUs under Windows which behaved exactly as knreed described in his Beta Test announcement, except that the cutoff time was 4 hours. Those WUs started from the beginning on a 4-hour basis, with smaller percentage increments, and when they reached 4 hours of runtime they were at 99.9xx % and they stopped with the expected message "Finishing early because max runtime has been exceeded" in their Result Log.

Then I switched back to Ubuntu 64 to push what I believed to be "looping" WUs and I saw completely different behavior. The WUs started as normal "short" WUs, "blocked" on some tough positions for various durations, then resumed and finished when they could. None started on a 4-hour basis, and those which exceeded that time went on to their normal end, even when they passed the 4-hour limit at percentages in the 25-30 % range. The biggest one even reached 8.58 hours without stopping at 4 or 8.

All the above was on a Q6600 overclocked by 20 %, so there is no question of slow machines or SSE instructions here.

Strange. I would like some comments from the techs...

Jean. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Not really, since P3 in general are outliers in the total pool of volunteers.

But I'm not looking at P3s at this stage. They crashed out on 611, so all machines then were P4 and upwards. If my P4 takes 21 hours for one position, then the average time for an average P4-and-above looks like 15 hours. Yes, a fast P3 might be around 40 hours, but even 15 hours is kinda long.

It'll be interesting to see what happens to this WU and to TXR13's. Even if our machines run the distance, will we ever get validation? It'll be pretty disappointing if they hit BOINC's CPU time limit and abort. |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
I just had a look in the client_state.xml for a running HCMD2 job; techies know how to do so too. My charts calculated a mean of 1.67 hours yesterday for the project's tasks, with a timeout of 100x... I think knreed mentioned that.

Here are the job's restrictions:

<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>
<rsc_fpops_bound>1114505924878500.000000</rsc_fpops_bound>
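As a quick check on the 100x figure, the two values quoted above can be compared directly. This is only a sketch: the variable names are made up for illustration, and the numbers are copied from the excerpt in the post.

```python
# Values copied from the client_state.xml excerpt quoted in the post.
rsc_fpops_est = 11145059248785.0
rsc_fpops_bound = 1114505924878500.0

# Ratio of the upper bound to the estimate, both in floating-point operations.
bound_ratio = rsc_fpops_bound / rsc_fpops_est
print(f"bound / estimate = {bound_ratio:.1f}")
```

The ratio works out to 100, which matches the 100x timeout mentioned in the post.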
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All! |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
I'd expect a position always to be allowed to finish no matter what, rather than being cut at an exact 4 or 8 hours, or dead close to it, which is what RICE is programmed to do with its very short seeds.

TXR13, I think it's a choice. Once you've checked in the quorum detail whether the wingman finished the task and how, and your job is still making progress, you could decide to abort and let some other power horse try the run. My clients have now done many hundreds and have yet to come across a monster, touch wood, so they must be rarer than the discussion suggests.
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
.
[Edit 1 times, last edit by Former Member at Jun 2, 2009 3:52:28 AM] |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
If only the techs could predict this, it would be a whole lot easier to size the jobs. Let it run, as the tasks are allowed to overrun their originally predicted runtime by 100x.
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
.
[Edit 1 times, last edit by Former Member at Jun 2, 2009 3:52:22 AM] |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
See my previous post above and find <rsc_fpops_est> in the client_state.xml (DON'T meddle with that file or you'll be served the penalty of a corrupted client!)

Just checking: if you find a task, say FLU, and it has <rsc_fpops_est>11145059248785.000000</rsc_fpops_est>, all other jobs in the queue of the same date/project will have the same value. The servers compute this projected run time daily from actual returned work, and it is then used for new work sent out. Somewhere there's a formula which allows the calculation back into seconds. The current benchmark values are part of that formula. My crawler currently computed 8:59 hours from that fpop value, and 8:54 after forcing a new benchmark, hurray.
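The conversion back into seconds can be sketched roughly as below. It assumes the client divides rsc_fpops_est by the host's benchmarked floating-point speed and scales the result by a per-host correction factor. The function and parameter names are illustrative, not BOINC's own, and the host speed used here is simply back-calculated from the 8:59 figure above rather than taken from a measured benchmark.

```python
def estimated_runtime_seconds(fpops_est, host_flops, correction=1.0):
    """Rough runtime estimate: work (floating-point ops) divided by speed (ops/sec).

    Assumption: `correction` stands in for any per-host scaling the client applies.
    """
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return fpops_est / host_flops * correction


def format_hm(seconds):
    """Format a duration in seconds as H:MM, rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


fpops_est = 11145059248785.0          # value quoted in the post
quoted_seconds = 8 * 3600 + 59 * 60   # the 8:59 hours quoted in the post

# Host speed implied by the quoted estimate (hypothetical, not a real benchmark).
implied_flops = fpops_est / quoted_seconds

estimate = estimated_runtime_seconds(fpops_est, implied_flops)
print(f"implied speed: {implied_flops / 1e6:.1f} MFLOPS")
print(f"estimated runtime: {format_hm(estimate)}")
print(f"runtime at 100x the estimate: {format_hm(estimate * 100)}")
```

Under these assumptions a faster benchmark result shortens the estimate, which would be consistent with the drop from 8:59 to 8:54 after forcing a new benchmark.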
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Checkpoint update for me:
374.855400 secs
78189.820000 secs (2.5157%)
155955.500000 secs (3.1447%)
232562.700000 secs (4.4025%)
310771.500000 secs (5.6604%)
390030.800000 secs (11.3208%)
467918.100000 secs (11.9497%)

[Edit 5 times, last edit by Former Member at Jun 2, 2009 2:01:03 PM] |
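For anyone who wants to see the spacing between those checkpoints, here is a small sketch that works out the CPU time between consecutive entries. The figures are copied from the list above; the variable names are just for illustration.

```python
# (cpu_seconds, progress_percent) pairs copied from the checkpoint list above.
# The first entry was posted without a percentage, so it is stored as None.
checkpoints = [
    (374.8554, None),
    (78189.82, 2.5157),
    (155955.5, 3.1447),
    (232562.7, 4.4025),
    (310771.5, 5.6604),
    (390030.8, 11.3208),
    (467918.1, 11.9497),
]

cpu_seconds = [t for t, _ in checkpoints]

# CPU time elapsed between each pair of consecutive checkpoints.
intervals = [later - earlier for earlier, later in zip(cpu_seconds, cpu_seconds[1:])]
mean_hours = sum(intervals) / len(intervals) / 3600

for (_, pct), secs in zip(checkpoints[1:], intervals):
    print(f"reached {pct:.4f}% after {secs / 3600:.2f} h since previous checkpoint")
print(f"mean interval: {mean_hours:.2f} h")
```

Each interval comes out at roughly 21 to 22 CPU hours, while the percentage steps between checkpoints vary a lot, so the progress percentage alone is not a reliable guide to the remaining runtime.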
||
|
|