World Community Grid - View Thread - BOINC 7.0.42 - Suggest examination of client's requests to server when running GPU tasks

World Community Grid Forums

Category: Support

Forum: BOINC Agent Support


Thread Status: Active
Total posts in this thread: 3


This topic has been viewed 715 times and has 2 replies

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

BOINC 7.0.42 - Suggest examination of client's requests to server when running GPU tasks

The onslaught of the extra server traffic due to the introduction of the GPU version of HCC1 has required considerable upgrading and reconfiguration of the servers, and the techs are now working on coalescing more sub-jobs into each WU that is sent & received.

It is possible that some of the extra traffic is due to some irregularities in the way that GPGPU-enabled clients (v7.0.42 at least) handle server requests in some circumstances.

I have noticed:
- During the recent long server outage, the v7.0.42 client on my one machine that has an active GPU did not implement a strategy of backing off from network activity attempts for increasing periods of time when server contact attempts failed, and instead tried to upload the result as each WU finished.
On machines without an active GPU, I noticed in the Projects tab that communication was being deferred for at least 4h35m at times.

- Again there is no progressively-increasing deferral of network activity when a request to fetch new work fails. Instead, after such a failure, a request for new work is issued each time a result file is uploaded, which can be more than once per minute on some fast machines running GPU tasks (my HD7870 does 40-45 WU/hr, OldChap says his 7970 does twice that). Reasons for work-fetch failure that I've observed are no work available, or the machine is on its limit for the maximum number of tasks allowed in the cache.
When it is on the maximum tasks limit, it may download a new WU each time one finishes, which puts it back on the limit ...

- When BOINC clients issue a request for work after a period without fetching work, they often issue several requests, just a few minutes apart. They seem to deliberately request less work than the deficit in their work cache, then after some tasks are received they re-calculate their needs and issue another request if needed, and so on. After the first batch of requested WUs in a session is received, BOINC 7.0.42 sometimes temporarily changes the time estimates of the GPU HCC1 tasks that are Ready to Start in my cache from a realistic value (approx 12m) to 3m43s (always this value). This causes it to underestimate the amount of work in the cache by a factor of about 3, so it issues a huge 2nd request for work. This can hit the limit for the maximum number of WUs allowed for the machine, and start the mode of work-fetch behaviour described above.

I have not followed this behaviour in detail but it has always stopped occurring after a while, without my intervention.
If these extra server requests could represent a significant fraction of total server requests, I suggest it would be worthwhile investigating.
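
To put rough numbers on the estimate problem described above, here is a small arithmetic sketch. The two per-task estimates (approx 12m and 3m43s) come from the post; the cache size, the number of queued tasks and the function name are hypothetical values chosen only for illustration, and the model assumes the GPU runs one task at a time.

```python
def estimated_cache_seconds(num_tasks: int, est_seconds_per_task: float) -> float:
    """Total estimated work in the cache, assuming tasks run one at a time."""
    return num_tasks * est_seconds_per_task


REALISTIC_EST = 12 * 60    # approx 12m per GPU HCC1 task (from the post)
BOGUS_EST = 3 * 60 + 43    # the 3m43s estimate that sometimes appears (from the post)

# How far the short estimate understates the queued work
underestimate_factor = REALISTIC_EST / BOGUS_EST

# Hypothetical example: 30 tasks queued, 6 hours of work wanted in the cache
queued_tasks = 30
target_seconds = 6 * 3600

real_deficit = max(0, target_seconds - estimated_cache_seconds(queued_tasks, REALISTIC_EST))
apparent_deficit = max(0, target_seconds - estimated_cache_seconds(queued_tasks, BOGUS_EST))

print(f"underestimate factor: {underestimate_factor:.2f}")
print(f"real deficit:     {real_deficit / 3600:.2f} h")
print(f"apparent deficit: {apparent_deficit / 3600:.2f} h")
```

In this made-up case the cache is actually full, but with the short estimate the client would believe it is still several hours short and ask for more work.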

I just grabbed a copy of the stdoutdae.txt file of the GPU machine for the period covering the server outage and can park it in the clouds for viewing if needed.

[Jan 23, 2013 12:27:37 PM]

Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline

Re: BOINC 7.0.42 - Suggest examination of client's requests to server when running GPU tasks

> I have noticed:
> - During the recent long server outage, the v7.0.42 client on my one machine that has an active GPU did not implement a strategy of backing off from network activity attempts for increasing periods of time when server contact attempts failed, and instead tried to upload the result as each WU finished.
> On machines without an active GPU, I noticed in the Projects tab that communication was being deferred for at least 4h35m at times.

It's not clear here whether you mean file uploads or scheduler requests, but let's answer both...

For file uploads, every transfer will be tried once, but if there are any problems they'll get a project-wide deferral that can increase up to 12 hours. The reason for trying every transfer once is mainly that CPDN has multiple upload servers; there, a single server being down (often for a week) effectively blocked uploads to all servers, since only the "bad" server was tried. Trying every file once also handles the case where only a single file has a problem, for example due to an incorrect file handle, or possibly wrong disk quotas, so that uploading to directory 456 fails while uploading to all other directories works.

For scheduler requests, every connection error will increase the deferral; if my recollection isn't too fuzzy, it's up to 6 hours.

I'm not sure whether the deferral for file uploads worked correctly during the outage, but at least scheduler requests did have a 1+ hour deferral for me, so that seemed to work as it should...

> - Again there is no progressively-increasing deferral of network activity when a request to fetch new work fails. Instead, after such a failure, a request for new work is issued each time a result file is uploaded, which can be more than once per minute on some fast machines running GPU tasks (my HD7870 does 40-45 WU/hr, OldChap says his 7970 does twice that). Reasons for work-fetch failure that I've observed are no work available, or the machine is on its limit for the maximum number of tasks allowed in the cache.
> When it is on the maximum tasks limit, it may download a new WU each time one finishes, which puts it back on the limit ...

The deferral is reset each time a task is "ready to report", because there can be a limit on the number of tasks in progress. Some projects have very low limits; if I'm not mistaken, one allows only 1 task per CPU, so in order not to let the computer sit idle unnecessarily, the request is made immediately.

If, on the other hand, no task has finished and there's a "no work from project" reply, there is a deferral that is a random number, AFAIK between 0.5 * limit and 1.5 * limit, where "limit" starts at 10 minutes and is doubled for each successive "no work from project", up to a max of 24 hours.

So if, for example, you've selected only DDDT2, your computer will only ask for work once per day.
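
The deferral rule described above can be sketched roughly as follows. This is only a model of the description in this post, not the actual BOINC client code: the class and method names are hypothetical, and it assumes the 24-hour cap applies to "limit" itself rather than to the randomised delay.

```python
import random

MIN_LIMIT = 10 * 60         # "limit" starts at 10 minutes
MAX_LIMIT = 24 * 60 * 60    # assumed: the cap applies to "limit", not to the delay


class NoWorkBackoff:
    """Hypothetical model of the per-project "no work from project" deferral."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.limit = MIN_LIMIT

    def on_no_work(self) -> float:
        """Return a deferral in seconds, then double the limit for next time."""
        delay = self.rng.uniform(0.5 * self.limit, 1.5 * self.limit)
        self.limit = min(2 * self.limit, MAX_LIMIT)
        return delay

    def on_task_ready_to_report(self) -> None:
        """A finished task resets the deferral, so the next request is not delayed."""
        self.limit = MIN_LIMIT


backoff = NoWorkBackoff()
for attempt in range(1, 10):
    print(f"failure {attempt}: defer {backoff.on_no_work() / 60:.0f} min")

backoff.on_task_ready_to_report()
print(f"after a finished task, limit is back to {backoff.limit / 60:.0f} min")
```

Under these assumptions the limit reaches 24 hours after eight consecutive "no work" replies, which fits the "once per day" figure. It also shows why a fast GPU host never backs off: every finished task resets the limit.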

> - When BOINC clients issue a request for work after a period without fetching work, they often issue several requests, just a few minutes apart. They seem to deliberately request less work than the deficit in their work cache, then after some tasks are received they re-calculate their needs and issue another request if needed, and so on. After the first batch of requested WUs in a session is received, BOINC 7.0.42 sometimes temporarily changes the time estimates of the GPU HCC1 tasks that are Ready to Start in my cache from a realistic value (approx 12m) to 3m43s (always this value). This causes it to underestimate the amount of work in the cache by a factor of about 3, so it issues a huge 2nd request for work. This can hit the limit for the maximum number of WUs allowed for the machine, and start the mode of work-fetch behaviour described above.
>
> I have not followed this behaviour in detail but it has always stopped occurring after a while, without my intervention.

Hmm, no idea on this one...

> If these extra server requests could represent a significant fraction of total server requests, I suggest it would be worthwhile investigating.

One very simple measure to decrease the server load is to increase the deferral time on scheduler requests; with the current WCG settings it's only 11 seconds between requests. Increasing this to, for example, 1 minute would cut down on unnecessarily frequent requests and have minimal impact on anyone trying to build up a cache, even if their computer is totally empty.
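
As a rough illustration of what a longer minimum interval would do for the fast GPU hosts mentioned earlier, here is a simple upper-bound calculation. It assumes at most one scheduler request per finished task and no two requests closer together than the minimum interval; the function name is hypothetical, and the task rates are the top of the ranges quoted in this thread (45 WU/hr for the HD7870, twice that for the 7970).

```python
def max_requests_per_hour(tasks_per_hour: float, min_interval_s: float) -> float:
    """Upper bound on scheduler requests per hour for one host.

    Assumes at most one request per finished task, and never two requests
    closer together than min_interval_s seconds.
    """
    return min(tasks_per_hour, 3600 / min_interval_s)


for label, tasks_per_hour in [("HD7870", 45), ("7970", 90)]:
    for interval in (11, 60):
        rate = max_requests_per_hour(tasks_per_hour, interval)
        print(f"{label}: min interval {interval:>2}s -> at most {rate:.0f} requests/hr")
```

In this simple model the 11-second interval does not limit either card, while a 1-minute interval caps the faster card at 60 requests per hour. It ignores bursts of several requests in quick succession, which a longer interval would also suppress.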

----------------------------------------

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

[Jan 23, 2013 1:03:41 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: BOINC 7.0.42 - Suggest examination of client's requests to server when running GPU tasks

Thanks, Ingleside. At the time of the outage, the error messages seemed to be coming thick and fast, but closer examination of the stdoutdae.txt file shows that there were only 2 server transactions and 5 messages in the log, for each file. Just lots of files! There must be good reasons that the results from each WU are split into several files, but combining some of the files would reduce server overheads.

I feel that there must be some way of reducing the frequency of work fetch requests, which I expect trigger many actions at the server end. Increasing the minimum interval as you suggest would be one way I guess.

[OT] I have an unrelated BOINC question which I'll ask in a separate post. [/OT]

[Jan 23, 2013 3:42:34 PM]
