Thread Status: Active
Total posts in this thread: 98
Posts: 98   Pages: 10   [ Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page ]
This topic has been viewed 8496 times and has 97 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Monster WU on the loose...

For your information, two WUs have completed on one of my systems, which runs Windows Vista. I mention them because of their names.

CMD2_ 0002-RADIA.clustersOccur-RADIA.clustersOccur_ 2383_ 0-- DAVETHOMPSON-PC Pending Validation 5/27/09 20:56:56 5/28/09 11:41:48 3.36 53.1 / 0.0
CMD2_ 0002-RADIA.clustersOccur-RADIA.clustersOccur_ 788_ 0-- DAVETHOMPSON-PC Pending Validation 5/27/09 19:13:18 5/28/09 03:37:25 1.25 19.7 / 0.0
[May 30, 2009 11:44:33 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Monster WU on the loose...

bono_vox,
I am also running a P4 HT at 2.40 GHz (Northwood, according to Everest) with two tasks, and my advice would be to run yours with two tasks as well, whichever CPU percentage you choose. A 2-core processor runs cooler with only one task because half of the processor is then idle, but on a P4 HT the whole processor is in use either way, so the temperature difference is very small or nil. I have just suspended my second task, and after a while the temperature had dropped by no more than about 1 °C, from 55 °C to 54 °C.

Before deciding to run two tasks I ran a few tests and found that overall throughput is about 25-30 % higher with two tasks than with one.
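
One way to read that figure, as a rough sketch rather than a measurement: if two tasks together deliver 1.25 to 1.30 times the work of a single task, each individual task must take about 1.5 to 1.6 times as long as it would when running alone. The function name `per_task_slowdown` is made up for this illustration.

```python
def per_task_slowdown(throughput_gain: float, tasks: int = 2) -> float:
    """Factor by which each task's run time grows when `tasks` run together.

    throughput_gain is total work per hour relative to single-task mode,
    e.g. 1.25 for "25 % higher".
    """
    return tasks / throughput_gain


# The 25-30 % range quoted in the post above.
for gain in (1.25, 1.30):
    slowdown = per_task_slowdown(gain)
    print(f"gain {gain:.2f}x -> each task takes {slowdown:.2f}x as long")
```

So individual WUs finish later in dual mode, but more of them finish per day.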

By the way, unless your type of processor is known to run hot you might consider removing the dust from your processor heatsink.

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[May 31, 2009 1:43:19 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Monster WU on the loose...

Well, for now I'll keep running 2 threads @ 50% on my HT P4s. [worried]
[May 31, 2009 3:46:17 PM]
TXR13
Cruncher
Canada
Joined: Dec 5, 2005
Post Count: 36
Status: Offline
Re: Monster WU on the loose...

I've been gone for the weekend, so here's the update on my monster workunit.

CPU Time: 3 days, 20 hours, 20 minutes, 55 seconds
Percent complete: 16.352%
Current time: 19:30
Last checkpointed: 18:02:36 May 31
Previous checkpoints at: 13:52:54 May 31; 09:45:09; 05:37:48; 01:48:21; 21:19:33 May 30; 17:09:47; 13:01:05; 12:59:53; 12:58:19; 08:51:29; 04:42:09...

Obviously, there are some short-running positions that are handled very nicely. Just as obviously, many of these positions take some time to compute.

This does not appear to be typical, as the other CMD2 unit I have running on this machine is checkpointing nicely every 20 minutes, like clockwork.
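
A rough sketch of what those figures imply, assuming (unrealistically, given the uneven positions) that progress stays linear. The checkpoint dates are an assumption too: times listed without a date are taken to fall on May 31, 2009, and the gaps are wall-clock intervals, not CPU time.

```python
from datetime import datetime, timedelta

# Figures quoted in the post above.
cpu_time = timedelta(days=3, hours=20, minutes=20, seconds=55)
percent_complete = 16.352


def projected_total(elapsed: timedelta, percent: float) -> timedelta:
    """Naive linear projection of total CPU time from progress so far."""
    return elapsed * (100.0 / percent)


# The six most recent checkpoints, newest first (dates assumed, see above).
checkpoints = [
    datetime(2009, 5, 31, 18, 2, 36),
    datetime(2009, 5, 31, 13, 52, 54),
    datetime(2009, 5, 31, 9, 45, 9),
    datetime(2009, 5, 31, 5, 37, 48),
    datetime(2009, 5, 31, 1, 48, 21),
    datetime(2009, 5, 30, 21, 19, 33),
]
gaps = [later - earlier for later, earlier in zip(checkpoints, checkpoints[1:])]

total = projected_total(cpu_time, percent_complete)
print(f"projected total CPU time: {total.total_seconds() / 86400:.1f} days")
for gap in gaps:
    print(f"gap between checkpoints: {gap}")
```

On these numbers the recent checkpoints are roughly four hours apart, and the linear projection lands at a little over three weeks of CPU time.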

Also for interest, from the client state file on the monster WU...
<rsc_fpops_est>11145059248785.000000</rsc_fpops_est>
<rsc_fpops_bound>1114505924878500.000000</rsc_fpops_bound>
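
For anyone wondering what those two values mean in practice, here is a minimal sketch. As far as I understand BOINC, `rsc_fpops_est` is the estimated number of floating-point operations for the job, `rsc_fpops_bound` is the ceiling beyond which the client aborts it, and the time limit is roughly the bound divided by the host's benchmarked speed. The 1 GFLOPS host speed below is a made-up figure for illustration, not the speed of the machine in question.

```python
# Values copied from the client state file quoted above.
RSC_FPOPS_EST = 11145059248785.0
RSC_FPOPS_BOUND = 1114505924878500.0

# Hypothetical benchmark (1 GFLOPS); substitute the host's own figure.
ASSUMED_HOST_FLOPS = 1.0e9


def fpops_to_hours(fpops: float, host_flops: float) -> float:
    """Hours needed to perform `fpops` operations at `host_flops` per second."""
    return fpops / host_flops / 3600.0


ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST
print(f"bound / estimate = {ratio:.0f}x")
print(f"estimated run time: {fpops_to_hours(RSC_FPOPS_EST, ASSUMED_HOST_FLOPS):.1f} h")
print(f"approximate abort ceiling: {fpops_to_hours(RSC_FPOPS_BOUND, ASSUMED_HOST_FLOPS):.1f} h")
```

The bound is exactly 100 times the estimate, so on any host the ceiling sits at about 100 times the expected run time.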


I'm really scratching my head on this one. I mean, I thought the techs had set a time limit so that huge monsters like this wouldn't be an issue. Is this just a weird fluke error, or what? Does anybody even know? It's not an issue for me, I'm just really curious why this particular unit seems so loaded down with tough positions, while the other one keeps clicking along and incrementing its progress every 20 minutes.
[Jun 1, 2009 2:43:53 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Monster WU on the loose...

Hello TXR13,
HCMD2 6.13 allows an optional time limit to be added to the work unit. I think knreed said a few days ago that only a small number (20+?) of old work units were left without time limits.

Lawrence
[Jun 1, 2009 3:18:51 AM]
TXR13
Cruncher
Canada
Joined: Dec 5, 2005
Post Count: 36
Status: Offline
Re: Monster WU on the loose...

Ha! That would just be my luck, pulling one of those units without the time limit... [tongue]
[Jun 1, 2009 6:00:04 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Monster WU on the loose...

I have one of these bad boyz and mine has been crunching for 1.5 hours at 0%. [worried]

Edit: 2.25 hours still 0%

I gave up on it...
CMD2_ 0002-RADIA.clustersOccur-TPM1A.clustersOccur_ 4259_ 1-- 613 Aborted 5/28/09 20:53:21 6/1/09 13:18:59 86.80 2,586.6 / 0.0
[frustrated]
[Jun 1, 2009 1:53:58 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Monster WU on the loose...

TXR13,
The machine you mention seems rather slow. Mine usually checkpoints about once a minute (if I don't limit it via the "write to disk every xxx seconds" parameter), and when it hits a tough position it stays on it for about 4 hours, which translates to about 80 hours in your case. I hope I am crunching the last of these monster WUs; it will be at about 23 hours when it passes its current position at 91.200 %. If you have such a WU, that would mean about 20 days of crunching...
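
The arithmetic behind those estimates can be sketched as follows, assuming the ratio of checkpoint intervals (about 1 minute here versus the 20 minutes TXR13 reported for his normal CMD2 unit) is a fair proxy for the speed difference between the two machines. The variable names are only for this illustration.

```python
# Checkpoint intervals on an ordinary CMD2 work unit, in minutes.
fast_checkpoint_min = 1.0    # Jean's machine
slow_checkpoint_min = 20.0   # TXR13's machine

# Assumed speed ratio between the two machines.
slowdown = slow_checkpoint_min / fast_checkpoint_min

tough_position_hours_fast = 4.0   # one tough position on the fast machine
monster_hours_fast = 23.0         # the monster WU at 91.200 % on the fast machine

tough_position_hours_slow = tough_position_hours_fast * slowdown
monster_days_slow = monster_hours_fast * slowdown / 24.0

print(f"speed ratio: {slowdown:.0f}x")
print(f"one tough position on the slow machine: {tough_position_hours_slow:.0f} h")
print(f"a comparable monster WU on the slow machine: {monster_days_slow:.1f} days")
```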

Maybe you should consider aborting it if it encounters another tough position after its current one.

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Jun 1, 2009 2:32:39 PM]
eviltoad
Senior Cruncher
Australia
Joined: Nov 5, 2005
Post Count: 190
Status: Offline
Re: Monster WU on the loose...

I've just had a monster WU error out after nearly 78 hours:
CMD2_ 0001-1HCI_ A.clustersOccur-2O72_ A.clustersOccur_ 2386_ 2-- 613 Error 5/29/09 13:58:44 6/1/09 23:22:18 77.87 1,320.9 / 0.0

because the maximum CPU time was exceeded.

The system in question is built around a Q9400 (2.66GHz) with 3GB RAM and runs Ubuntu Jaunty.

I have another such WU on the same system:
CMD2_ 0002-RADIA.clustersOccur-U2AF1A.clustersOccur_ 44_ 0--
still in progress at 91% after 68+ hours.

Is it worth keeping going?
[Jun 2, 2009 1:28:03 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Monster WU on the loose...

Hello eviltoad,
Why not keep going? Either it will time out or it will return successfully in another half day.

Lawrence
[Jun 2, 2009 1:33:52 AM]