Thread Status: Active
Total posts in this thread: 149
This topic has been viewed 12266 times and has 148 replies
mclaver
Veteran Cruncher
Joined: Dec 19, 2005
Post Count: 566
Status: Offline
Re: Second really long work unit received

I would agree with you if the ratio of claimed to granted credit remained consistent. I had 821.5 claimed / 422.8 granted on one of the 50xx units, which is only 51%. On other units on this same quad machine I typically get an 80-85% ratio of claimed to granted. That may have happened because I had four of the 50xx units going at the same time and they each took over 60 hours, but it is a far worse ratio of claimed to granted than any other work I have done.
[Aug 4, 2008 1:30:46 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Second really long work unit received

Since I touched on the subject: have you observed maximum RAM and VM use as being extraordinary on these monsters? Appended here is the table giving ranges from actual observation for the different WCG sciences (from the Start Here forum).

I've been forcing runs from the faah41xx/faah42xx range on the quad over the last few days; they showed nothing extraordinary and the credits were in the ballpark.

(Columns: result name, device, status, sent time, returned time, CPU hours, claimed / granted credit)

faah4194_TL3_MIN3_xmd19230_07_1-- 628290 Valid 08/02/2008 19:59:59 08/04/2008 04:02:15 5.07 78.0 / 74.0
faah4194_TL3_MIN3_xmd00600_06_0-- 628290 Valid 08/02/2008 19:59:59 08/04/2008 03:14:15 4.85 74.7 / 75.5
faah4194_TL3_MIN3_xmd18930_03_1-- 628290 Valid 08/02/2008 19:59:59 08/04/2008 03:14:15 4.99 76.8 / 67.4
faah4194_TL3_MIN3_xmd01160_01_0-- 628290 Valid 08/02/2008 19:59:58 08/04/2008 03:12:58 4.85 74.8 / 78.8
faah4194_TL3_MIN3_xmd05560_0A_0-- 628290 Valid 08/02/2008 19:59:58 08/03/2008 21:34:31 4.74 73.0 / 67.1
faah4194_TL3_MIN3_xmd00110_0B_0-- 628290 Valid 08/02/2008 19:59:58 08/03/2008 20:09:38 4.47 68.8 / 68.0
faah4193_TL3_Min1_xmd19980_0A_0-- 628290 Valid 08/01/2008 06:36:18 08/03/2008 07:53:24 4.32 66.5 / 75.1
faah4201_TMC126_Npl3_wH_xmd16630_03_0-- 628290 Valid 08/01/2008 02:00:45 08/03/2008 07:52:08 5.66 87.2 / 85.8
faah4201_TMC126_Npl3_wH_xmd17300_03_1-- 628290 Valid 08/01/2008 00:55:21 08/02/2008 19:01:13 5.98 92.1 / 86.6
faah4201_TMC126_Npl3_wH_xmd02960_03_1-- 628290 Valid 07/31/2008 23:31:51 08/02/2008 18:39:04 5.74 88.4 / 90.1
faah4201_TMC126_Npl3_wH_xmd11950_01_1-- 628290 Valid 07/31/2008 23:30:43 08/02/2008 15:53:10 6.20 95.5 / 91.1
faah4200_TMC126_Npl3_MIN3_xmd01090_03_0-- 628290 Valid 07/31/2008 19:51:15 08/02/2008 12:22:05 5.79 89.2 / 83.5
faah4200_TMC126_Npl3_MIN3_xmd01950_01_1-- 628290 Valid 07/31/2008 18:49:34 08/02/2008 09:40:03 6.24 96.0 / 108.2
faah4200_TMC126_Npl3_MIN3_xmd15310_03_1-- 628290 Valid 07/31/2008 17:52:33 08/02/2008 09:03:35 5.88 90.6 / 87.0
faah4200_TMC126_Npl3_MIN3_xmd13130_00_1-- 628290 Valid 07/31/2008 18:02:51 08/02/2008 09:00:43 5.69 87.7 / 94.1



----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Aug 4, 2008 1:58:48 PM]
[Aug 4, 2008 1:57:28 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Second really long work unit received

I feel sorry for people who had their machines set up to download large work queues! Mine are set for a 2-day queue, but some machines ended up with 15+ calendar days' worth of work due to the error in the WU size estimation. I should be able to sneak them in under the deadline, so for now I'll just let things take their course.

As for the credit debacle, things seem to be normal on my end. Yes, my x64 machines get less than they claim, but they always have in the past too. It's just more shocking to see 800 claimed and 600 granted vs. 80 claimed and 60 granted. Percentage-wise, it's the same though.


XS,

This is exactly why I have my parms set like this:
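(A minimal-cache setup of this kind can be written out as a BOINC global_prefs_override.xml, or set through the equivalent "Connect about every X days" / "Additional work buffer" preferences. The values in this sketch are illustrative assumptions, not a transcript of the actual settings used here:)

    <!-- global_prefs_override.xml - illustrative values only; place in the BOINC data directory -->
    <global_preferences>
        <work_buf_min_days>0.05</work_buf_min_days>                <!-- keep only about an hour of work on hand -->
        <work_buf_additional_days>0.0</work_buf_additional_days>   <!-- no extra queue beyond the running tasks -->
    </global_preferences>

After saving the file, the Manager's "Read local prefs file" menu item applies it without restarting the client.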



This does a number of things.

1) It limits the number of WU's sitting in the queue waiting to run to just the ones actually running.

2) When any WU gets within about 8 minutes of completion, a new WU is requested (I suspect it's the client doing this).

3) It keeps the request/return turnaround at its minimum (presuming, of course, your system is set to run WU's at 100%).

4) If WCG needs numerous batches processed quickly, these settings permit dispatching and starting the new WU's in the shortest period of time (at least on my system).

5) I don't have days of WU's to complete.

6) Others aren't waiting two weeks on my results to get the data for validation.

7) I don't time out by exceeding the return deadline.

8) Should my system fail for some reason, I only lose at most the number of WU's running. That's a lot better than having 10 days' worth of stuff in my queue and the scheduler having to re-send all those WU's after they go past due with no reply, which doesn't help advance the science in the most expeditious way.

9) Should WCG announce that the scheduler will be down for, say, 12 hours, I can change the value to, say, 2 days and have enough WU's to keep working until the scheduler is returned to service.

This is just how I've got my parms set up... perhaps my parms will help you too? Dunno. cool

Just remember, YMMV wink

I'm sure the CAs or the Techs here can suggest better values than the ones I'm using. I have been running with the above settings for the past two weeks, and I can only describe the behavior exhibited on my client.
----------------------------------------
[Edit 2 times, last edit by Former Member at Aug 4, 2008 4:00:38 PM]
[Aug 4, 2008 3:27:43 PM]
mclaver
Veteran Cruncher
Joined: Dec 19, 2005
Post Count: 566
Status: Offline
Re: Second really long work unit received

I do agree with you that the ratio of claimed to granted looks a lot better for the non-50xx units. When I looked at that, I noticed that my AMD 9500 does not seem to get the same ratio as my AMD 9850. The 9500 is running XP 64 and the 9850 is running 32-bit Vista. Is it possible that 32-bit Vista does a better job managing the quad interface you previously talked to me about than 64-bit XP?
[Aug 4, 2008 3:56:20 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Second really long work unit received

We looked at this from many angles when the problem occurred with the HCC project. The reason I ask about RAM/VM use is that jobs that are very large in terms of resource use could cause system thrashing when four run simultaneously. Anyone got system-use data from observation?

What was obvious at the time was that my Vista 32 HP Q6600 quad with 2 GB RAM did a dandy job, but the dual clovers crawled through, as did some other combinations. The difference is that FAAH units are large resource users and HCC units are not.

What you could test, if you still have some, is to first suspend any that have not yet started, then suspend all FAAH50xx jobs in progress except one, and see if the remaining one goes faster.

Side note: there have been requests on the BOINC forum to add a feature to control the number of concurrent jobs of the same type; e.g., four simultaneous Quantum Monte Carlo tasks would fry my box. This, though, has never been an issue with WCG before.
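(For reference, later 7.x BOINC clients did gain such a feature via an app_config.xml placed in the project directory. A sketch along these lines would cap concurrent FAAH tasks; the short app name "faah" is an assumption:)

    <!-- app_config.xml - supported by later BOINC clients, not the 5.x/6.x clients of 2008 -->
    <!-- goes in the project directory, e.g. projects/www.worldcommunitygrid.org/ -->
    <app_config>
        <app>
            <name>faah</name>                       <!-- assumed short name of the FightAIDS@Home app -->
            <max_concurrent>1</max_concurrent>      <!-- run at most one FAAH task at a time -->
        </app>
    </app_config>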
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Aug 4, 2008 4:20:47 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Second really long work unit received

We have generally received a set of work from Scripps that fit a certain profile and we have been able to estimate the duration reasonably well for that profile. Recently, this has been changing as has been dramatically demonstrated with this set of longer workunits. We will be working with Scripps to understand why this is happening and see what can be done so that we can size the workunits for distribution in a consistent way.

Some background: we get our work from Scripps in batches, and the workunits within each batch have similar characteristics. Batches tend to contain between 10,000 and 25,000 workunits. We go through 1-3 batches a day depending on the exact size of the batches, and we usually have 7-14 days' worth of work from Scripps ready to be loaded and sent to the members.

We are going to modify our processes going forward (starting today) so that, as soon as a batch is ready to be loaded, we send out a limited number of workunits from it. This work will be sent to the reliable hosts so that we can get information about its behavior as soon as possible. This will limit the impact on the member community, since we should be able to identify surprises like this before we send out tens of thousands of them.
[Aug 4, 2008 4:22:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Second really long work unit received

Knreed,

Thanks for the update. Will this change apply across the board to all project WU's, or will it be limited to a specific set of WU's within a project?
[Aug 4, 2008 4:26:41 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Second really long work unit received

This graph from the link in my sig is updated very frequently and indicates which project is a candidate for limited launch analysis:


----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Aug 4, 2008 4:41:46 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Second really long work unit received

For those of you concerned about having your result stopped because it runs too long, please look at the following to determine if you are at risk (and possibly prevent it):

Open client_state.xml in a text editor (this file is located in your BOINC installation directory, or, if you are using 6.2, in something like C:\Documents and Settings\All Users\Application Data\BOINC).

Look in <host_info> for <p_fpops>. This is the fpops value your computer got while running the BOINC benchmark.

Next look for the field <rsc_fpops_bound> within the <workunit> tag for one of these long running workunits.

The client will stop running the workunit after rsc_fpops_bound / p_fpops seconds have elapsed.
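(As a worked illustration, with numbers that are assumptions rather than values from a real host: a benchmark of 2.0e9 fpops and a bound of 4.0e14 gives a cutoff of about 55.6 hours of CPU time.)

    <!-- illustrative excerpt of client_state.xml; all numbers and names are assumptions -->
    <host_info>
        <p_fpops>2000000000.000000</p_fpops>        <!-- ~2.0e9 floating point ops/sec from the benchmark -->
    </host_info>
    <workunit>
        <name>faah5xxx_example</name>               <!-- hypothetical workunit name -->
        <rsc_fpops_bound>400000000000000.000000</rsc_fpops_bound>  <!-- 4.0e14 -->
    </workunit>
    <!-- cutoff = rsc_fpops_bound / p_fpops = 4.0e14 / 2.0e9 = 200,000 s, roughly 55.6 hours -->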

I have not tested this, so I do not know if it will work. However, you should be able to stop BOINC (stop both the client and the manager), then open both client_state.xml and client_state_prev.xml and modify the value of rsc_fpops_bound for the long-running faah workunits to something larger, like 2000000000000000. Then start things up again. This should increase the CPU limit. Unfortunately, we cannot send updated values for this from the server.
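(Continuing the illustrative excerpt above, under the same assumptions, the edited entry would look roughly like this; make the matching change in client_state_prev.xml with BOINC fully stopped.)

    <workunit>
        <name>faah5xxx_example</name>               <!-- hypothetical workunit name -->
        <rsc_fpops_bound>2000000000000000.000000</rsc_fpops_bound>  <!-- raised as described above -->
    </workunit>
    <!-- new cutoff on the assumed host: 2.0e15 / 2.0e9 = 1,000,000 s, about 278 hours -->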

We normally load workunits onto the grid with rsc_fpops_bound set to 10 times rsc_fpops_est. However, due to the much longer than expected runtimes, the limit is only about twice the actual average run time. Due to variance in the benchmark, some computers will reach the limit while others running the same workunit will be able to complete it.

We have modified the rsc_fpops_est and rsc_fpops_bound so that all new copies of the work sent out will have correct values.

Over half the workunits for these batches have already completed. I will update this post shortly with information about what percentage are hitting the CPU limit.
[Aug 4, 2008 5:03:04 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Second really long work unit received

This graph from the link in my sig is updated very frequently and indicates which project is a candidate for limited launch analysis:




Quick notes:

We have reduced the length of HPF2 work to around 8 hours on average.
We are reducing the length of FightAIDS@Home work to around 6.5 hours on average.
[Aug 4, 2008 5:05:06 PM]