World Community Grid Forums
Thread Status: Active Total posts in this thread: 8
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Saw this some time this morning in the log:

06-06-07 5:10:53|World Community Grid|Scheduler request failed: couldn't resolve host name
06-06-07 5:10:53|World Community Grid|Deferring scheduler requests for 1 minutes and 0 seconds
06-06-07 5:11:53|World Community Grid|Started upload of file faah0583_d280cb330_x1hpv_00_2_1
06-06-07 5:11:54|World Community Grid|Started upload of file faah0583_d280cb330_x1hpv_00_2_0
06-06-07 5:11:54|World Community Grid|Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
06-06-07 5:11:54|World Community Grid|Reason: To fetch work
06-06-07 5:11:54|World Community Grid|Requesting 5080 seconds of new work
06-06-07 5:12:09|World Community Grid|Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
06-06-07 5:12:09|World Community Grid|Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1
06-06-07 5:12:09|World Community Grid|Scheduler request failed: couldn't resolve host name
06-06-07 5:12:09|World Community Grid|Deferring scheduler requests for 1 minutes and 0 seconds

Sometime before a WU finished, around 2:30 CET, the Web connection went south. When the WU finished, BOINC attempted to send the result up to the servers, decided it could not connect and went into a cycle of re-attempting every minute, until I hit the reset button on the router at about 7:20 CET. As it went on, it kept increasing the number of seconds of work requested, which by the time shown above had already reached 5080 seconds.

Issue 1: I thought I had read that BOINC would increase the time between re-attempts from 1 minute to 10 minutes and longer. It did not and kept trying every minute.... the log file grew 200k.

Issue 2: probably resulting from 1, in the time span of 4.5 hours or so until I discovered that the router needed a reset, the 1-minute cycles ate about 1.75 hours out of the science crunching.... the new WU had progressed only 2.75 hours. My CPU idle time rating is 93.6%.

Eventually, when the connection was re-established, the WU went up and was validated at 5:09:09 UTC, producing the below:

faah0583_ d280cb330_ x1hpv_ 00 Valid 06/05/2006 20:25:27 06/07/2006 05:09:09 7.27 58 / 58 (Moi)
faah0583_ d280cb330_ x1hpv_ 00 Valid 06/05/2006 20:24:32 06/06/2006 08:41:12 9.24 57 / 58
faah0583_ d280cb330_ x1hpv_ 00 Valid 06/05/2006 20:22:16 06/06/2006 01:59:06 4.73 122 / 58

The interim solution I see is to use the network connection Suspend option to stop BOINC (5.4.9) from sending anything, so at least the science continues uninterrupted, but that is far from ideal.
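To check the retry interval from the log itself rather than by eye, something like the following might do. It is only a sketch: the field layout (`timestamp|project|message`) is taken from the lines quoted in this post, the `yy-mm-dd` reading of the timestamp is an assumption, and the helper names `parse_line` and `failure_gaps` are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Two lines copied from the log quoted in this post.
SAMPLE = """\
06-06-07 5:10:53|World Community Grid|Scheduler request failed: couldn't resolve host name
06-06-07 5:12:09|World Community Grid|Scheduler request failed: couldn't resolve host name
"""


def parse_line(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    stamp, project, message = line.split("|", 2)
    # Assumption: the stamp is yy-mm-dd, i.e. 7 June 2006.
    when = datetime.strptime(stamp, "%y-%m-%d %H:%M:%S")
    return when, project, message


def failure_gaps(text):
    """Return the seconds between consecutive failed scheduler requests."""
    times = [
        parse_line(line)[0]
        for line in text.splitlines()
        if "Scheduler request failed" in line
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


print(failure_gaps(SAMPLE))
```

For the two sample lines the gap is 76 seconds: the 1-minute deferral plus the time the next request took to fail. Run over the whole log, a list of near-constant gaps would confirm that the interval never grew.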
WCG
Please help to make the Forums an enjoyable experience for All!
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I think I can reassure you there is no problem at all with this.

First (and most importantly) BOINC doesn't stop crunching when it makes a scheduler request. Under normal circumstances, BOINC will have a fair amount of work tucked away to be getting on with. As you can see from the log, work unit crunching was not interrupted at any time.

Second, there is a change in the latest BOINC version so it only backs off exponentially if the failure is due to the server being unavailable. If it is a local problem, BOINC will keep trying until you fix it. This means that if BOINC does run out of work, it will get new work automatically as soon as the connection is reestablished.

Your claim that time was lost is probably due to a misunderstanding of what is going on, but you haven't provided enough information to demonstrate this properly. We can look into it further if you want.
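The retry policy described in this reply can be pictured roughly as follows. This is an illustrative sketch only, not BOINC's actual code: the `Failure` type, the `next_delay` function, the doubling factor and the 4-hour cap are all assumptions. Only the 60-second base comes from the "Deferring scheduler requests for 1 minutes" lines in the log.

```python
from enum import Enum


class Failure(Enum):
    LOCAL = "local"    # e.g. "couldn't resolve host name" on the client side
    SERVER = "server"  # e.g. the project server is unavailable


BASE_DELAY = 60       # seconds; the 1-minute deferral seen in the log
MAX_DELAY = 4 * 3600  # seconds; hypothetical upper bound, not from the thread


def next_delay(failure, consecutive_failures):
    """Seconds to wait before the next scheduler request.

    Local failures retry at a fixed interval so that work resumes as soon
    as the connection returns; server failures back off exponentially.
    """
    if failure is Failure.LOCAL:
        return BASE_DELAY
    return min(BASE_DELAY * 2 ** (consecutive_failures - 1), MAX_DELAY)


for n in (1, 2, 3, 10):
    print(n, next_delay(Failure.LOCAL, n), next_delay(Failure.SERVER, n))
```

Under these assumptions a local fault is retried every 60 seconds indefinitely, which matches the behaviour in the posted log, while a server fault would be retried after 60, 120, 240 seconds and so on up to the cap.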
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Hmmmm, if my PC runs at 93.6% efficiency in terms of CPU time allocated to BOINC vs. the rest, then over the time span between the WU starting and the connection problem being resolved, I would have expected the WU to have logged 4.5 hours or so. It only progressed 2.75 hours.

The log stdoutdae.txt shows that at 02:57 CET BOINC started trying, without interruption and at 1-minute intervals, to obtain new work:

2006-06-07 02:57:38 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 02:57:38 [World Community Grid] Reason: To fetch work
2006-06-07 02:57:38 [World Community Grid] Requesting 119 seconds of new work
2006-06-07 02:57:54 [---] Project communication failed: attempting access to reference site
2006-06-07 02:57:58 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 02:57:58 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 02:58:11 [---] Access to reference site failed - check network connection or proxy configuration.

@ 3:14 CET it finishes the WU and tries to send the result whilst continuing to try to obtain new work:

2006-06-07 03:14:07 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 03:14:07 [World Community Grid] Reason: To fetch work
2006-06-07 03:14:07 [World Community Grid] Requesting 708 seconds of new work
2006-06-07 03:14:28 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 03:14:28 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 03:15:00 [---] Rescheduling CPU: application exited
2006-06-07 03:15:00 [World Community Grid] Computation for task faah0583_d280cb330_x1hpv_00_2 finished
2006-06-07 03:15:00 [World Community Grid] Starting task ex328_1B_0 using rosetta version 422
2006-06-07 03:15:01 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 03:15:01 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 03:15:18 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_0: http error
2006-06-07 03:15:18 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 03:15:18 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
2006-06-07 03:15:18 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1

@ 7:06 CET checked the computer, saw the problem and reset the router... the log continues:

2006-06-07 07:06:32 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 07:06:32 [World Community Grid] Reason: To fetch work
2006-06-07 07:06:32 [World Community Grid] Requesting 5796 seconds of new work
2006-06-07 07:06:35 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_0: http error
2006-06-07 07:06:35 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:06:35 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
2006-06-07 07:06:35 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:06:47 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 07:06:47 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 07:07:35 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:07:35 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:07:49 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 07:07:49 [World Community Grid] Reason: To fetch work
2006-06-07 07:07:49 [World Community Grid] Requesting 5844 seconds of new work
2006-06-07 07:07:51 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_0: http error
2006-06-07 07:07:51 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:07:51 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
2006-06-07 07:07:51 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:08:04 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 07:08:04 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 07:08:51 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:08:51 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:08:58 [World Community Grid] Finished upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:08:58 [World Community Grid] Throughput 7938 bytes/sec
2006-06-07 07:09:06 [World Community Grid] Finished upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:09:06 [World Community Grid] Throughput 26565 bytes/sec
2006-06-07 07:09:09 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 07:09:09 [World Community Grid] Reason: To fetch work
2006-06-07 07:09:09 [World Community Grid] Requesting 5910 seconds of new work, and reporting 1 completed tasks
2006-06-07 07:09:15 [World Community Grid] Scheduler request succeeded
2006-06-07 07:09:17 [World Community Grid] Started download of file ex442_10_ex442.fasta
2006-06-07 07:09:17 [World Community Grid] Started download of file ex442_10_ex442.psipred
2006-06-07 07:09:19 [World Community Grid] Finished download of file ex442_10_ex442.fasta
2006-06-07 07:09:19 [World Community Grid] Throughput 123 bytes/sec
2006-06-07 07:09:19 [World Community Grid] Finished download of file ex442_10_ex442.psipred
2006-06-07 07:09:19 [World Community Grid] Throughput 418 bytes/sec
2006-06-07 07:09:19 [World Community Grid] Started download of file ex442_10_ex442.psipred_ss2
2006-06-07 07:09:19 [World Community Grid] Started download of file ex442_10_aaex44203_05.075_v1_3
2006-06-07 07:09:21 [World Community Grid] Finished download of file ex442_10_ex442.psipred_ss2
2006-06-07 07:09:21 [World Community Grid] Throughput 2369 bytes/sec
2006-06-07 07:09:21 [World Community Grid] Started download of file ex442_10_aaex44209_05.075_v1_3
2006-06-07 07:09:28 [World Community Grid] Finished download of file ex442_10_aaex44203_05.075_v1_3
2006-06-07 07:09:28 [World Community Grid] Throughput 157961 bytes/sec
2006-06-07 07:09:38 [World Community Grid] Finished download of file ex442_10_aaex44209_05.075_v1_3
2006-06-07 07:09:38 [World Community Grid] Throughput 243640 bytes/sec
2006-06-07 07:09:39 [---] Rescheduling CPU: files downloaded

Okay, the 4.5 hours was really 4 hours and 10 minutes of uninterrupted, minute-by-minute logging of work fetching, joined from about 3:10 by the attempts to send the result, until 7:06. In the 4 wallclock hours the WU crunched, it progressed only 2 hours 46 minutes of CPU time, ergo 1.5 hours went somewhere south. There are only a few conclusions with which one can do anything:

1. The log shows a perpetual attempt, at 1-minute intervals, to fetch work, with an ever-increasing number of seconds of work requested (whatever that means).
2. Substantial time was lost keeping the failing comms request loop going. The "Rescheduling CPU" message has been observed many times.
3. Watching the Task Manager entry for the science project, which counts in sync with BOINC, it is very visible that CPU time stalls whilst comms between BOINC and the WCG servers take place.

Had the fetching / sending attempts backed off for longer periods, WHICH AS I UNDERSTAND IT THEY IDEALLY SHOULD HAVE DONE, the 1.5 hours lost would have been 5 minutes maybe... I've got a real 5 Mb download and 320k upload ADSL line, so sending and fetching normally take only fractions of a minute.

I'd be happy to zip the log and send it to an address of support's choice.... Meantime, I think you gave the answer... the problem was local, thus BOINC keeps on trying (without backing off exponentially???), whilst if it is server-side it would space the requests out exponentially. I think there's something to fix in the BOINC 5.4.9 client!

PS: curiously, since the introduction of the 1-week deadline switch, I've seen messages which go like:

2006-06-07 13:14:42 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-06-07 13:14:45 [---] Suspending work fetch because computer is overcommitted.

It would stop the WU in progress dead in its tracks and go to a different WU. Then, when that prevailing WU is finished, it would give the following:

2006-06-07 21:39:20 [---] Resuming round-robin CPU scheduling.
2006-06-07 21:39:20 [---] Allowing work fetch again.

Why would it persevere in attempting to fetch work most of the night when in fact, as was put, sufficient work is in cache (keep 2 days)?
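The lost-time estimate can be worked through from the figures quoted in this post. The sketch below takes them at face value (4 wallclock hours, 93.6% of CPU normally going to BOINC, 2 hours 46 minutes of CPU time logged); the variable names are made up for illustration and the result is only as good as those inputs.

```python
WALLCLOCK_S = 4 * 3600             # the "4 wallclock hours" quoted above
BOINC_SHARE = 0.936                # the quoted 93.6% efficiency
CPU_LOGGED_S = 2 * 3600 + 46 * 60  # 2 hours 46 minutes of CPU time

# CPU time the WU would be expected to log at the usual share.
expected_s = WALLCLOCK_S * BOINC_SHARE
# Difference between what was expected and what was actually logged.
shortfall_s = expected_s - CPU_LOGGED_S

print(f"expected  {expected_s / 3600:.2f} h of CPU time")
print(f"logged    {CPU_LOGGED_S / 3600:.2f} h of CPU time")
print(f"shortfall {shortfall_s / 60:.0f} minutes")
```

On these inputs the shortfall against the usual 93.6% share comes to a little under an hour, while the raw gap between wallclock and CPU time is about 1 hour 14 minutes. Either way the order of magnitude is the same.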
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
To answer your last question first: BOINC tries to fill its cache completely. This means that as soon as it has a tiny gap in the cache (and the network policy allows), it will send a scheduler request for that amount. This is why the scheduler request is usually just for a few seconds of work. However, the scheduler response gives BOINC a whole work unit, and BOINC doesn't need to phone home again for quite a while.

Now, the lost time: BOINC's activity isn't going to affect CPU time normally. However, network activity can tie up CPU, particularly if you have a cheap network card or integrated networking. If the router is fried, your computer could easily sit there bouncing messages at it all day. There's not a lot BOINC can do about this. Should it happen again, check Task Manager and see which process is tying up your CPU.
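The cache top-up described above amounts to requesting the difference between the cache target and the work already queued. A minimal sketch, assuming a 2-day cache as mentioned earlier in the thread; `work_request_seconds` and the queued-work figures are hypothetical, chosen so that the results line up with request sizes seen in the posted logs.

```python
CACHE_TARGET_S = 2 * 24 * 3600  # "keep 2 days" of work, in seconds


def work_request_seconds(cache_target_s, queued_work_s):
    """Seconds of new work to ask for: the gap in the cache, never negative."""
    return max(0, cache_target_s - queued_work_s)


# Hypothetical queue levels: as crunching continues without a successful
# fetch, the queued work shrinks and the requested amount grows.
for queued in (172_681, 172_092, 167_720):
    print(work_request_seconds(CACHE_TARGET_S, queued))
```

With these made-up queue levels the requests come out at 119, 708 and 5080 seconds, the same sizes as in the logs. That is consistent with the ever-increasing request being the unfilled gap widening while the connection was down.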
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
That piece I did not provide, but yesterday morning Task Manager showed 1 hour 19 minutes 29 seconds allocated to BOINC.EXE. This morning, with the machine never having been off since the last boot 5 days 14 hours ago (Systeminfo | Find "Tempo"), it showed 1:19:37. That's a whole 8 seconds in 24 hours, when not futilely trying to get work.

No, my network/comms hardware is fine.... I can run streaming radio from Pandora, Seattle, with it barely taking 1 or 2% CPU time. Not the close to 30% that BOINC.exe took trying its luck. My suggestion thus remains to put a mod request forward for the BOINC.exe client to go through the same backoff extension routine as it uses for the servers. This perpetual top-up-to-the-brim attempting is.... (various words from Dizionario Britannica omitted).

This tiny gap filling seems to be responsible for the Overcommitting / Round Robin switching routines as well? The estimated times of WUs, not being based on past client experience, are utterly wacky, as it literally does 99% in 9 clock hours and not the proposed 17:18:01 for each WU in cache presently.

Have a nice day, ciao
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
You're missing the point.
The network hardware will put more strain on the CPU when it isn't working properly. If you don't like 5.4.9, I suggest you talk to the BOINC developers, or downgrade.
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Hmmm, 5.4.9 as opposed to a previous version has nothing to do with it; any BOINC version has way-off-the-mark time estimates.

Fact remains that BOINC.exe ate 1:19 (hh:mm) in a 4-hour wallclock stretch, taking time away from the science crunch. My proposal was to make the BOINC agent back off over longer periods, i.e. 1 minute, 10 minutes, 100 minutes, since it has 2 days of work in cache anyway. I'm sure WCG has more clout bringing this up on the BOINC development forum than a private individual has.

So let me rephrase: because BOINC kept trying ad infinitum to contact the server, the network hardware was kept busy..... but my good network hardware barely burdens the CPU! No, I'm not missing the point. I'll seek out an answer elsewhere and report back. Sorry to have bothered you.... have a nice day anyway
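The proposed 1 / 10 / 100 minute schedule can be compared with the fixed 1-minute retry over the outage in the log (02:57 to 07:06, about 249 minutes). A rough sketch: the function names are made up, the cap at 100 minutes is an assumption (the proposal lists only three steps), and the time each failed request itself takes is ignored.

```python
OUTAGE_MIN = 249  # 02:57 to 07:06 in the posted log


def proposed_delay(attempt, base=1, factor=10, cap=100):
    """Minutes to wait after the given 0-based failed attempt: 1, 10, 100, 100, ..."""
    return min(base * factor ** attempt, cap)


def fixed_delay(attempt):
    """Minutes to wait under the observed behaviour: always 1."""
    return 1


def attempts_during(outage_min, delay_fn):
    """Count the requests sent from the start of the outage until it ends."""
    elapsed, attempts = 0, 0
    while elapsed <= outage_min:
        attempts += 1
        elapsed += delay_fn(attempts - 1)
    return attempts


print(attempts_during(OUTAGE_MIN, fixed_delay))
print(attempts_during(OUTAGE_MIN, proposed_delay))
```

Under these assumptions the fixed schedule makes about 250 attempts during the outage and the proposed one makes 5. The trade-off is that with long delays the client could wait up to 100 minutes after the connection returns before it notices.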
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
(>.<)
Obviously I wasn't clear. The removal of the exponential backoff is a new feature in 5.4.9. This is by design. If you don't like it, by all means tell the BOINC programmers.