World Community Grid Forums
Thread Status: Active Total posts in this thread: 3
Rickjb
Veteran Cruncher, Australia. Joined: Sep 17, 2006. Post Count: 666.
The onslaught of extra server traffic following the introduction of the GPU version of HCC1 has required considerable upgrading and reconfiguration of the servers, and the techs are now working on coalescing more sub-jobs into each WU that is sent and received.

It is possible that some of the extra traffic is due to irregularities in the way that GPGPU-enabled clients (v7.0.42 at least) handle server requests in some circumstances. I have noticed:

- During the recent long server outage, the v7.0.42 client on my one machine that has an active GPU did not back off from network activity for increasing periods of time when server contact attempts failed, and instead tried to upload the result as each WU finished. On machines without an active GPU, I noticed in the Projects tab that communication was being deferred for at least 4h35m at times.

- Again, there is no progressively increasing deferral of network activity when a request to fetch new work fails. Instead, after such a failure, a request for new work is issued each time a result file is uploaded, which can be more than once per minute on some fast machines running GPU tasks (my HD7870 does 40-45 WU/hr; OldChap says his 7970 does twice that). Reasons for work-fetch failure that I've observed are that no work is available, or that the machine is on its limit for the maximum number of tasks allowed in the cache. When it is on the maximum-tasks limit, it may download a new WU each time one finishes, which puts it back on the limit ...

- When BOINC clients issue a request for work after a period without fetching work, they often issue several requests, just a few minutes apart. They seem to deliberately request less work than the deficit in their work cache, then after some tasks are received they recalculate their needs and issue another request if needed, and so on. After the first batch of requested WUs in a session is received, BOINC 7.0.42 sometimes temporarily changes the time estimates of the GPU HCC1 tasks that are Ready to Start in my cache from a realistic value (approx 12m) to 3m43s (always this value). This causes it to underestimate the amount of work in the cache by a factor of about 3, so it issues a huge 2nd request for work. This can hit the limit for the max number of WUs allowed for the machine, and start the mode of work-fetch behaviour described in the previous point. I have not followed this behaviour in detail, but it has always stopped occurring after a while, without my intervention.

If these extra server requests could represent a significant fraction of total server requests, I suggest it would be worthwhile investigating. I just grabbed a copy of the stdoutdae.txt file of the GPU machine for the period covering the server outage and can park it in the clouds for viewing if needed.
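The "factor of about 3" follows from the two estimates quoted in the post: 12 min is 720 s and 3m43s is 223 s, a ratio of roughly 3.2. The sketch below illustrates how such an underestimate could inflate a work request. It assumes a deliberately simplified model in which the client asks for enough work to fill a target buffer; the function name, the 6-hour buffer and the 30 queued tasks are hypothetical, and this is not BOINC's actual work-fetch code.

```python
def work_request_seconds(buffer_target_s, queued_tasks, est_runtime_s):
    """Seconds of work to request: target buffer minus estimated queued work.

    Simplified, hypothetical model; not BOINC's real work-fetch logic.
    """
    queued_s = queued_tasks * est_runtime_s
    return max(0.0, buffer_target_s - queued_s)


REALISTIC_S = 12 * 60    # approx 12 min per GPU task (from the post)
BOGUS_S = 3 * 60 + 43    # the recurring 3m43s estimate (from the post)

BUFFER_S = 6 * 3600      # hypothetical 6-hour cache setting
QUEUED = 30              # hypothetical number of tasks Ready to Start

ratio = REALISTIC_S / BOGUS_S
realistic_request = work_request_seconds(BUFFER_S, QUEUED, REALISTIC_S)
bogus_request = work_request_seconds(BUFFER_S, QUEUED, BOGUS_S)

print(f"underestimate factor: {ratio:.2f}")
print(f"request with realistic estimate: {realistic_request:.0f} s")
print(f"request with 3m43s estimate:     {bogus_request:.0f} s")
```

With these assumed numbers the cache is exactly full under the realistic estimate, so nothing is requested, while the 3m43s estimate makes the same cache look only about a third full and produces a request for 14910 s of extra work.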
Ingleside
Veteran Cruncher, Norway. Joined: Nov 19, 2005. Post Count: 974.
> I have noticed:
> - During the recent long server outage, the v7.0.42 client on my one machine that has an active GPU did not implement a strategy of backing off from network activity attempts for increasing periods of time when server contact attempts failed, and instead tried to upload the result as each WU finished. On machines without an active GPU, I noticed in the Projects tab that communication was being deferred for at least 4h35m at times.

It's not clear here whether you mean file uploads or scheduler requests, but let's answer both...

For file uploads, all transfers will be tried once, but if there are any problems they'll get a project-wide deferral that can increase up to 12 hours. The reason for trying all transfers once is mainly that CPDN has multiple upload servers, where a single server being down (often for a week) effectively blocked uploads to all servers, since only the "bad" server was tried. Trying all files once also handles the case where only a single file has a problem, for example due to an incorrect file handle, or wrong disk quotas so that uploading to directory 456 fails while uploading to all other directories works.

For scheduler requests, all connection errors will increase the deferral; if my recollection isn't too fuzzy, it's up to 6 hours.

I'm not sure whether the deferral for file uploads worked correctly during the outage, but at least scheduler requests did have a 1+ hour deferral for me, so that seemed to work as it should...

> - Again there is no progressively-increasing deferral of network activity when a request to fetch new work fails. Instead, after such a failure, a request for new work is issued each time a result file is uploaded, which can be more than once per minute on some fast machines running GPU tasks (my HD7870 does 40-45 WU/hr, OldChap says his 7970 does twice that). Reasons for work-fetch failure that I've observed are no work available, or the machine is on its limit for the maximum number of tasks allowed in the cache. When it is on the maximum tasks limit, it may download a new WU each time one finishes, which puts it back on the limit ...

The deferral is reset each time a task is "ready to report", simply because there can be a limit on the number of tasks in progress. Some projects have very low limits (if I'm not mistaken, one allows only 1 task per CPU), so to avoid letting the computer sit idle unnecessarily, the request is made immediately.

If, on the other hand, no task has finished and the reply is "no work from project", there is a deferral that is, AFAIK, some random number between 0.5 * limit and 1.5 * limit, where "limit" starts at 10 minutes and is doubled for each successive "no work from project", up to a max of 24 hours. So if, for example, you've selected to run only DDDT2, your computer will only ask for work once per day.

> - When BOINC clients issue a request for work after a period without fetching work, they often issue several requests, just a few minutes apart. They seem to deliberately request less work than the deficit in their work cache, then after some tasks are received they re-calculate their needs and issue another request if needed. Etc. After the first batch of requested WUs in a session is received, BOINC 7.0.42 sometimes temporarily changes the time estimates of the GPU HCC1 tasks that are Ready to Start in my cache from a realistic value (approx 12m) to 3m43s (always this value). This causes it to then underestimate the amount of work in the cache by a factor of about 3, so it issues a huge 2nd request for work. This can hit the limit for the max no of WUs allowed for the machine, and commence the mode of work-fetch behaviour described above. I have not followed this behaviour in detail but it has always stopped occurring after a while, without my intervention.

Hmm, no idea on this one...

> If these extra server requests could represent a significant fraction of total server requests, I suggest it would be worthwhile investigating.

One very simple measure to decrease the server load is to increase the deferral time on scheduler requests; with current WCG settings it's only 11 seconds between requests. Increasing this to, for example, 1 minute would cut down unnecessarily frequent requests and have minimal impact on anyone trying to build up a cache, even if his computer is totally empty.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
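The "no work from project" deferral described in this post can be sketched as a randomized exponential backoff. This is only a reading of the description given here (hedged there with "AFAIK"), not BOINC source code: the class and method names are hypothetical, and applying the 24-hour cap to "limit" before randomizing is an assumption.

```python
import random

START_LIMIT_S = 10 * 60    # "limit" starts at 10 minutes (from the post)
MAX_LIMIT_S = 24 * 3600    # assumed: the 24-hour cap applies to "limit"


class NoWorkBackoff:
    """Hypothetical model of the "no work from project" deferral."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.failures = 0  # successive "no work from project" replies

    def limit(self):
        """Current "limit" in seconds: doubled per failure, then capped."""
        if self.failures == 0:
            return 0
        return min(START_LIMIT_S * 2 ** (self.failures - 1), MAX_LIMIT_S)

    def on_no_work(self):
        """Record a "no work" reply and return the deferral in seconds."""
        self.failures += 1
        lim = self.limit()
        return self.rng.uniform(0.5 * lim, 1.5 * lim)

    def on_task_ready_to_report(self):
        """A finished task resets the deferral, as the post describes."""
        self.failures = 0


backoff = NoWorkBackoff(random.Random(0))  # seeded only for repeatable demos
for attempt in range(1, 11):
    deferral = backoff.on_no_work()
    print(f"failure {attempt:2d}: limit {backoff.limit():6d} s, "
          f"deferral {deferral:9.1f} s")
```

Under these assumptions "limit" reaches the 24-hour cap at the ninth successive failure (600 s doubled eight times would be 153600 s, which exceeds 86400 s). The reset in `on_task_ready_to_report` is what lets a fast GPU host that finishes a task every minute or so avoid ever building up a deferral.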
Rickjb
Veteran Cruncher, Australia. Joined: Sep 17, 2006. Post Count: 666.
Thanks, Ingleside. At the time of the outage, the error messages seemed to be coming thick and fast, but closer examination of the stdoutdae.txt file shows that there were only 2 server transactions and 5 messages in the log for each file. Just lots of files! There must be good reasons that the results from each WU are split into several files, but combining some of the files would reduce server overheads.

I feel that there must be some way of reducing the frequency of work-fetch requests, which I expect trigger many actions at the server end. Increasing the minimum interval as you suggest would be one way, I guess.

[OT] I have an unrelated BOINC question which I'll ask in a separate post. [/OT]
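As a rough check on what a longer minimum interval would buy, the figures quoted in this thread can be plugged into a simple upper bound. This sketch assumes a worst-case host that issues at most one scheduler request per finished task and is throttled only by the minimum interval between requests; the function name is hypothetical, and requests not tied to task completion are ignored.

```python
def max_requests_per_hour(tasks_per_hour, min_interval_s):
    """Upper bound on scheduler requests per hour for one host.

    Assumes at most one request per finished task, throttled by the
    minimum interval between scheduler requests.
    """
    return min(tasks_per_hour, 3600 / min_interval_s)


HOSTS = {
    "HD7870 (about 45 WU/hr)": 45,
    "7970 (about 90 WU/hr)": 90,   # "twice that", per the first post
}
INTERVALS_S = [11, 60]             # current setting vs. suggested 1 minute

for name, rate in HOSTS.items():
    for interval in INTERVALS_S:
        bound = max_requests_per_hour(rate, interval)
        print(f"{name}, {interval:2d} s interval: "
              f"at most {bound:.0f} requests/hr")
```

Under this model the 1-minute interval leaves the 45 WU/hr host untouched and trims the 90 WU/hr host from 90 to 60 requests per hour, so most of any saving would have to come from the closely spaced follow-up requests that the model leaves out.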