Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Re: GPU Validation server issue??

The Results page is very slow, or sometimes even impossible to reach.
I have also noticed a big increase in the PV pile. I have about 1,200 WUs pending validation per rig with an HD 7970 GPU, and about half that (600 WUs) per GTX 580 GPU. At the moment that adds up to over 10,000 PV WUs in the pile across my machines. I have the feeling that the validators are drowning under the WU flow.
[Nov 19, 2012 11:21:33 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: GPU Validation server issue??

> Maybe while they're working on this, they can figure out how to exclude 6.10.58 clients from even getting any GPU WUs. My PVs are littered with 6.10.58 errors. Sometimes there are a few minutes between sent and returned, and sometimes it's days.

Same here. Been like that since the beta. Can't tell you how many times I've recommended that people update to some version of 7.
I have PVs that are 17 days old.
----------------------------------------
In 1969 I took an oath to defend and protect the U.S. Constitution against all enemies, both foreign and domestic. There was no expiration date.


[Nov 19, 2012 11:23:04 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: GPU Validation server issue??

There are a few things going on:

1) The homogeneous_app_version feature of the server code, which we use to ensure that nvidia results are compared with nvidia results, ati with ati, and cpu with cpu, did not support the 'reliable' mechanism until early Sunday morning (UTC). Old results are now getting cleared and pending validations are dropping.

2) When we released the new app version and the new workunits, we succeeded in reducing the number of results per day. However, we made a mistake that caused each row in the table to be larger than two rows were previously. This has resulted in a drop in database performance. We have corrected our error, so smaller rows are being created now, but it will take a couple of days for this change to work its way through the database and result in improved performance. We are also temporarily reducing the time that old results stay in the database in order to shrink the size of the table.

Item #2 is directly related to the slow/failed load times of the result status table.

As far as 6.10.58 vs 7.0 goes - based on the last time I ran the numbers, both saw similar rates of success running GPU work.
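
For anyone wondering what the homogeneous_app_version matching means in practice, here is a minimal Python sketch of the grouping idea only. It is not the actual BOINC/WCG validator code, and the field names (plan_class, output_hash) are invented for illustration.

from collections import defaultdict

# Illustrative sketch only -- not the real BOINC/WCG validator code.
# Assumed (invented) fields: "plan_class" is the app version class that
# produced the result (e.g. "cuda", "ati", "cpu"); "output_hash" stands in
# for whatever comparison the validator actually performs.

def group_by_app_class(results):
    """Group one workunit's returned results by app version class so that
    nvidia is only compared with nvidia, ati with ati, cpu with cpu."""
    groups = defaultdict(list)
    for r in results:
        groups[r["plan_class"]].append(r)
    return groups

def try_validate(results, quorum=2):
    """Return (class, matching results) for the first class that reaches
    quorum with identical outputs, or (None, []) if still pending."""
    for plan_class, members in group_by_app_class(results).items():
        matching = [m for m in members if m["output_hash"] == members[0]["output_hash"]]
        if len(matching) >= quorum:
            return plan_class, matching
    return None, []

# Example: two cuda copies agree, so the workunit validates within one class;
# the lone cpu copy is never compared against the cuda ones.
results = [
    {"id": 1, "plan_class": "cuda", "output_hash": "abc"},
    {"id": 2, "plan_class": "cpu",  "output_hash": "abc"},
    {"id": 3, "plan_class": "cuda", "output_hash": "abc"},
]
print(try_validate(results))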
[Nov 20, 2012 1:10:44 AM]
pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Re: GPU Validation server issue??


> As far as 6.10.58 vs 7.0 goes - based on the last time I ran the numbers, both saw similar rates of success running GPU work.

Interesting. From my small sample:
If there's a WU with a PV, an error, and an in-progress result, in around 99 out of 100 cases the error is from a 6.10.x machine.
[Nov 20, 2012 1:36:25 AM]
OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978
Re: GPU Validation server issue??

Thanks for disseminating this information. You always give us a clearer picture of what is behind the things we see.

Have you guys had any thoughts yet about how best to resolve the short cache period for work units?

I think most of us would appreciate a move to a full day's cache being made possible, but that may be on the order of >2,500 WUs per GPU for some.

We cannot know all the issues this may cause at your end, but from where I sit such a move would help us ride out the shorter outages that are inevitable with a young project.
[Nov 20, 2012 1:46:27 AM]
Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Re: GPU Validation server issue??

I do not know exactly how things work on the WCG servers, but isn't it possible to dynamically set the daily task limit per GPU for a given host?
When a host (for example my MARS machine) returns crunched WUs to the WCG servers, the GPU/CPU time and elapsed time are known (they are identical on my rigs). Over, say, 3 days an average value can be computed. From that average the daily task crunching capability of MARS can be calculated, and the server would then allocate a daily limit for MARS.
In this way every host would have its own daily limit tuned to its average capability. When I set the cache size to, say, two days, the server then knows how many tasks to send (if available).
If that host gets upgraded, its average will increase and the limit would be adapted accordingly. Some hosts would have very high daily allocations, others very small ones. On average over the whole grid there should be no change, and there would be fewer hosts waiting for WUs. To avoid continuous adjustments we could set a threshold (for example +10%) that would trigger a change in the daily allocation.
Does this make sense, or is my reasoning flawed somewhere?
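
A toy Python sketch of what I mean (the function names, window, and threshold are just examples; nothing here reflects what the WCG servers actually do):

# Toy sketch of the per-host daily limit idea -- not actual WCG/BOINC behaviour.
# Assumptions: we know the elapsed time of each task the host returned recently,
# and the host asks for a cache of "cache_days" days of work.

SECONDS_PER_DAY = 86400

def tasks_per_day(elapsed_seconds):
    """Daily throughput implied by the host's average task runtime."""
    if not elapsed_seconds:
        return 0.0
    average_runtime = sum(elapsed_seconds) / len(elapsed_seconds)
    return SECONDS_PER_DAY / average_runtime

def adapt_allocation(current_limit, elapsed_seconds, cache_days=2.0, threshold=0.10):
    """Recompute the host's allocation, but only change it when the new
    estimate differs from the current limit by more than the threshold."""
    capacity = tasks_per_day(elapsed_seconds) * cache_days
    if current_limit <= 0 or abs(capacity - current_limit) / current_limit > threshold:
        return int(capacity)
    return current_limit

# Example: tasks averaging 72 seconds imply 1200 tasks/day, so a 2-day cache
# would be 2400 tasks; that is more than 10% above the old limit of 2000,
# so the allocation adapts.
recent_runtimes = [70.0, 72.0, 74.0] * 40
print(adapt_allocation(current_limit=2000, elapsed_seconds=recent_runtimes))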
[Nov 20, 2012 7:28:11 AM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: GPU Validation server issue??


> As far as 6.10.58 vs 7.0 goes - based on the last time I ran the numbers, both saw similar rates of success running GPU work.

> Interesting. From my small sample:
> If there's a WU with a PV, an error, and an in-progress result, in around 99 out of 100 cases the error is from a 6.10.x machine.

Same here.
----------------------------------------
In 1969 I took an oath to defend and protect the U.S. Constitution against all enemies, both foreign and domestic. There was no expiration date.


[Nov 20, 2012 10:52:36 AM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Re: GPU Validation server issue??

> In this way every host would have its own daily limit tuned to its average capability. When I set the cache size to, say, two days, the server then knows how many tasks to send (if available).

The daily quota shouldn't be a problem, since AFAIK it's already so high that unless you're generating errors you shouldn't hit it.

Giving you up to 7 days of cache, on the other hand, is not such a good idea. The problem is that a fast GPU can do 1,000+ tasks/day, meaning a full 7-day cache would be 7,000+ tasks. If you have 1,000 such computers, you've got 7 million tasks in the database, and with so many rows to look through, all database traffic slows down.

By limiting the cache to, for example, 500 GPU tasks at once, a computer can still crunch 1,000+ per day, and 1,000 such computers can still crunch 7 million in a week. But at any given time there will only be 0.5 million tasks in the database, assuming no grace period and immediate validation.

With a 1-day grace period and some random waiting for validation, it will probably be somewhere between 1.5 and 2 million tasks in the database at once. While still large, that is much less likely to run into database performance problems.
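
Spelling out that arithmetic in a few lines of Python (same assumed numbers as above, purely illustrative):

# Back-of-the-envelope numbers from the post above -- illustrative only.
hosts = 1000
tasks_per_day = 1000              # per fast GPU host
cache_cap = 500                   # proposed per-host in-progress limit

# 7-day cache: a week of work sits on every host, and in the database, at once.
rows_with_7_day_cache = hosts * tasks_per_day * 7        # 7,000,000

# Capped cache: weekly throughput is unchanged ...
weekly_throughput = hosts * tasks_per_day * 7             # still 7,000,000 per week
# ... but only the capped in-progress work is in the database at any moment.
rows_with_cap = hosts * cache_cap                          # 500,000

print(rows_with_7_day_cache, weekly_throughput, rows_with_cap)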
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Nov 20, 2012 3:32:02 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GPU Validation server issue??

... and speaking of performance problems, the home of BOINC has now had to quadruple its task sizes [see the Quota? thread for discussion], and set a quota too.
[Nov 20, 2012 3:43:48 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GPU Validation server issue??

> ... and speaking of performance problems, the home of BOINC has now had to quadruple its task sizes [see the Quota? thread for discussion], and set a quota too.

What and/or who was that in response to?
BOINC has no problem with a myriad of other projects being able to grant credits. If this is in fact an issue, HCC1 should find a way to increase the run time of their WUs. If HCC1 cannot for some reason differentiate CPU from GPU... their bad. Learn to!
If it's all about the science, let's let it be that way. GPUs are faster; get rid of CPU WUs. There's HFCC for them. What? The HCC CPU'ers don't care about children?

C'mon, this argument is getting stale ;-(
[Nov 20, 2012 7:51:40 PM]