World Community Grid - View Thread - Monster WU on the loose...

World Community Grid Forums

Category: Completed Research

Forum: Help Cure Muscular Dystrophy - Phase 2 Forum

Thread: Monster WU on the loose...

Thread Status: Active
Total posts in this thread: 98

This topic has been viewed 15680 times and has 97 replies

TXR13
Cruncher
Canada
Joined: Dec 5, 2005
Post Count: 36
Status: Offline

Re: Monster WU on the loose...

Hi Sekerob,

I do not have checkpoint debugging enabled, so I've been looking at the checkpoint files in the slot directory for the unit. Here are the current stats:

CPU time: 1 day, 8 hours, 38 minutes, 11 seconds.
Progress: 4.402%
Current time: 07:26
All checkpoint files show a time of 03:47 earlier this morning.

Last night, when I made the original post, the two checkpoint files (00 and 01) had different times showing, which is how I was able to track when the last checkpoint occurred. If memory serves, I checked the progress around 19:33, and the checkpoint 01 file was showing a time of 15:20, give or take a few minutes.
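For anyone who wants to do the same check without eyeballing the directory listing, here is a minimal Python sketch that lists the files in a directory by last-modified time, newest first. The function names are made up for illustration, and the directory you pass in is an assumption: point it at whichever slot directory your client is using for the task.

```python
import time
from pathlib import Path


def files_by_mtime(directory):
    """Return (name, mtime) pairs for regular files in `directory`, newest first."""
    entries = []
    for path in Path(directory).iterdir():
        if path.is_file():
            entries.append((path.name, path.stat().st_mtime))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def print_report(directory):
    """Print each file's last-modified time and its age in hours."""
    now = time.time()
    for name, mtime in files_by_mtime(directory):
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        age_hours = (now - mtime) / 3600.0
        print(f"{stamp}  {age_hours:7.2f} h ago  {name}")
```

Calling `print_report()` with the slot directory path shows at a glance whether the checkpoint files are still being rewritten.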

Incidentally, it's not like this system is underpowered. It's a 1.4GHz P3S with the double cache. What's more, there are two of those CPUs, both running at 97-99% utilization for BOINC, with 4GB of RAM for them to play with. Benchmarking indicates that it has better floating point performance than all three of my P4s, and better integer performance than two of them. I know benchmarks can be misleading, and I've heard that BOINC's benchmarking in particular has been a little wacky. However, all of these machines are running the same version, so I'd think the wackiness would even itself out, unless it was a flat-out bias towards older machines.

EDIT: I have now activated checkpoint debugging on the machine in question. I'll report back as things develop (or don't).

----------------------------------------
[Edit 1 times, last edit by TXR13 at May 29, 2009 2:42:18 PM]

[May 29, 2009 2:33:02 PM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline

Re: Monster WU on the loose...

Maybe the hardware matters, but I also suspect that version 6.13 of the application does not behave exactly the same under Windows and under Linux.
Yesterday I had a few WUs under Windows which behaved exactly as knreed described in the Beta Test announcement, except that the cutoff time was 4 hours. Those WUs started from the beginning on a 4-hour basis, with smaller percentage increments, and when they reached 4 hours of runtime they were at 99.9xx % and they stopped with the expected message "Finishing early because max runtime has been exceeded" in their Result Log.
Then I switched back to Ubuntu 64 to push what I believed to be "looping" WUs and I saw completely different behavior. The WUs started as normal "short" WUs, "blocked" on some tough positions for various durations, then resumed and finished when they could. None started on a 4-hour basis, and those which exceeded that time went on to their normal end even when they passed the 4-hour limit at percentages in the 25-30 % range. The biggest one even reached 8.58 hours without stopping at 4 or 8.

All the above was on a Q6600 overclocked by 20 %, so there is no question of a slow machine or missing SSE instructions here.

Strange. I would like some comments from the techs... Jean.

----------------------------------------

Team--> Decrypthon -->Statistics/Join -->Thread

[May 29, 2009 2:39:24 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Monster WU on the loose...

Not really, since P3 in general are outliers in the total pool of volunteers

But I'm not looking at P3s at this stage. They crashed out on 611, so all machines then were P4 and upwards. If my P4 takes 21 hours for one position, then the average time for an average P4-and-above looks like 15 hours. Yes, a fast P3 might be around 40 hours, but even 15 hours is kinda long.

It'll be interesting to see what will happen to this WU and to TXR13's. Even if our machines run the distance, will we ever get validation? It'll be pretty disappointing if they hit BOINC's CPU time limit and abort.

[May 29, 2009 2:52:05 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Monster WU on the loose...

I just had a look in client_state.xml for a running HCMD2 job; techies know how to do so too. My charts calculated a mean of 1.67 hours yesterday for the project's tasks, with a time-out of 100x... I think knreed mentioned that.

Here are the job's restrictions:

<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>
<rsc_fpops_bound>1114505924878500.000000</rsc_fpops_bound>
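
The bound is exactly 100 times the estimate, which matches the 100x time-out. A minimal Python sketch to check the ratio, assuming the two elements are pasted into a string; the `<workunit>` wrapper and the function name `fpops_ratio` are only there for illustration:

```python
import xml.etree.ElementTree as ET

# The two elements quoted above, wrapped in a root element so they parse as XML.
# The <workunit> wrapper is an assumption made for this sketch.
FRAGMENT = """<workunit>
<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>
<rsc_fpops_bound>1114505924878500.000000</rsc_fpops_bound>
</workunit>"""


def fpops_ratio(xml_text):
    """Return rsc_fpops_bound divided by rsc_fpops_est for one fragment."""
    root = ET.fromstring(xml_text)
    est = float(root.findtext("rsc_fpops_est"))
    bound = float(root.findtext("rsc_fpops_bound"))
    return bound / est


ratio = fpops_ratio(FRAGMENT)
print(f"bound / est = {ratio:.1f}")
```

This works on a copy of the values only; it does not read or modify client_state.xml.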

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[May 29, 2009 2:58:45 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Monster WU on the loose...

I'd expect a position always to be allowed to finish no matter what, rather than being cut at an exact 4 or 8 hours, or dead close to it, as RICE is programmed to do with its very short seeds.

TXR13, I think it's a choice. Once you've checked in the quorum detail whether the wingman finished the task and how, and with your job still making progress, you could decide to abort and let some other powerhouse try the run.

My clients have now done many hundreds and have yet to come across a monster, touch wood, so they must be rarer than the discussion suggests.

[May 29, 2009 5:25:18 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


.

----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 2, 2009 3:52:28 AM]

[May 29, 2009 8:50:00 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Monster WU on the loose...

If only the techs could predict this, it would be a whole lot easier to size the jobs. Let it run, as the tasks are allowed to overrun their original predicted runtime by 100x.

[May 29, 2009 8:55:39 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


.

----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 2, 2009 3:52:22 AM]

[May 29, 2009 8:58:19 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Monster WU on the loose...

See my previous post above and find <rsc_fpops_est> in the client_state.xml (DON'T meddle with that file or you'll be served the penalty of a corrupted client!)

Just checking: if you find a task, say FLU, and it has
<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>, all other jobs in the queue of the same date/project will have the same value. The servers compute this projected run time daily from actual returned work, and it is then used for new work sent out. Somewhere there's a formula which allows the calculation back into seconds. The current benchmark values are part of that formula. My crawler currently computed 8:59 hours from that fpops value and 8:54 after forcing a new benchmark, hurray.
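
The exact formula is not given here. As a rough sketch only, one simple approximation is to divide the estimated floating-point operations by the host's benchmarked floating-point speed. The Python below assumes that approximation, ignores any correction factors the client may apply, and uses a made-up benchmark value, so it will not reproduce the 8:59 figure:

```python
def estimated_runtime_seconds(fpops_est, benchmark_flops):
    """Rough estimate: operations divided by operations per second."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark_flops must be positive")
    return fpops_est / benchmark_flops


def format_hm(seconds):
    """Format a duration in seconds as H:MM, dropping leftover seconds."""
    total_minutes = int(seconds // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


FPOPS_EST = 11145059248785.0   # value quoted in the post
HYPOTHETICAL_FLOPS = 1.0e9     # assumption: not a real host benchmark

seconds = estimated_runtime_seconds(FPOPS_EST, HYPOTHETICAL_FLOPS)
print(format_hm(seconds))
```

A faster benchmark result shortens the estimate, which is consistent with the figure dropping slightly after a new benchmark was forced.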

[May 29, 2009 9:27:56 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Monster WU on the loose...

Checkpoint update for me:
374.855400 secs
78189.820000 secs (2.5157%)
155955.500000 secs (3.1447%)
232562.700000 secs (4.4025%)
310771.500000 secs (5.6604%)
390030.800000 secs (11.3208%)
467918.100000 secs (11.9497%)
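
The gaps between these checkpoints are easier to see when converted to hours. A minimal Python sketch using only the figures listed above (the first entry has no percentage in the list, so it is stored as `None`; the function name is made up for illustration):

```python
# (CPU seconds at checkpoint, reported progress in percent) copied from the list above.
CHECKPOINTS = [
    (374.855400, None),
    (78189.820000, 2.5157),
    (155955.500000, 3.1447),
    (232562.700000, 4.4025),
    (310771.500000, 5.6604),
    (390030.800000, 11.3208),
    (467918.100000, 11.9497),
]


def intervals_hours(checkpoints):
    """Return the CPU time between consecutive checkpoints, in hours."""
    times = [seconds for seconds, _ in checkpoints]
    return [(later - earlier) / 3600.0 for earlier, later in zip(times, times[1:])]


gaps = intervals_hours(CHECKPOINTS)
for (seconds, percent), gap in zip(CHECKPOINTS[1:], gaps):
    print(f"{seconds:>10.1f} s  {percent:>8.4f} %  {gap:5.1f} h since previous checkpoint")
```

Each gap works out to roughly 21 to 22 hours of CPU time, while the progress percentage moves in uneven steps, so the percentage alone is a poor guide to the remaining runtime.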

----------------------------------------
[Edit 5 times, last edit by Former Member at Jun 2, 2009 2:01:03 PM]

[May 30, 2009 2:15:00 AM]
