Thread Status: Active
Total posts in this thread: 8
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
BOINC Scheduler deferring science processing

Saw this some time this morning in log:

06-06-07 5:10:53|World Community Grid|Scheduler request failed: couldn't resolve host name
06-06-07 5:10:53|World Community Grid|Deferring scheduler requests for 1 minutes and 0 seconds
06-06-07 5:11:53|World Community Grid|Started upload of file faah0583_d280cb330_x1hpv_00_2_1
06-06-07 5:11:54|World Community Grid|Started upload of file faah0583_d280cb330_x1hpv_00_2_0
06-06-07 5:11:54|World Community Grid|Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
06-06-07 5:11:54|World Community Grid|Reason: To fetch work
06-06-07 5:11:54|World Community Grid|Requesting 5080 seconds of new work
06-06-07 5:12:09|World Community Grid|Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
06-06-07 5:12:09|World Community Grid|Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1
06-06-07 5:12:09|World Community Grid|Scheduler request failed: couldn't resolve host name
06-06-07 5:12:09|World Community Grid|Deferring scheduler requests for 1 minutes and 0 seconds

Sometime before a WU finished, around 2:30 CET, the web connection went south. When the WU finished, BOINC tried to send the result up to the servers, found it could not connect, and went into a cycle of retrying every minute until I hit the reset button on the router at about 7:20 CET. As it went on, it kept increasing the number of seconds of work requested, which by the time of the log above had already reached 5080 seconds.

Issue 1: I thought I had read that BOINC would increase the time between retries from 1 minute to 10 minutes and longer. It did not, and kept trying every minute....the log file grew by 200k.
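One way to check whether the deferral interval ever grows is to pull it out of the log itself. The sketch below is a minimal example that assumes the `time|project|message` layout of the lines quoted above; the function name `deferral_intervals` is made up for illustration, and the year-month-day reading of the timestamp is an assumption based on the `2006-06-07` lines quoted later in the thread.

```python
import re
from datetime import datetime

# Matches the "Deferring scheduler requests for M minutes and S seconds" message.
DEFER_RE = re.compile(r"Deferring scheduler requests for (\d+) minutes and (\d+) seconds")

def deferral_intervals(log_lines):
    """Return (timestamp, deferral_seconds) for every deferral message found."""
    found = []
    for line in log_lines:
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a "time|project|message" line
        stamp, _project, message = parts
        match = DEFER_RE.search(message)
        if match:
            minutes, seconds = int(match.group(1)), int(match.group(2))
            # Assumption: the stamp is year-month-day ("06-06-07" = 7 June 2006).
            when = datetime.strptime(stamp, "%y-%m-%d %H:%M:%S")
            found.append((when, minutes * 60 + seconds))
    return found

sample = [
    "06-06-07 5:10:53|World Community Grid|Scheduler request failed: couldn't resolve host name",
    "06-06-07 5:10:53|World Community Grid|Deferring scheduler requests for 1 minutes and 0 seconds",
    "06-06-07 5:12:09|World Community Grid|Deferring scheduler requests for 1 minutes and 0 seconds",
]
intervals = deferral_intervals(sample)
# True only if some deferral is longer than the one before it.
backoff_grew = any(later[1] > earlier[1] for earlier, later in zip(intervals, intervals[1:]))
print(intervals, backoff_grew)
```

Run over the whole log file, a `backoff_grew` of `False` would confirm that the interval stayed at one minute throughout.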

Issue 2: probably resulting from Issue 1. In the 4.5 hours or so before I discovered that the router needed a reset, the 1-minute cycles ate about 1.75 hours out of the science crunching....the new WU had progressed only 2.75 hours. My CPU idle time rating, i.e. the share of CPU time going to BOINC, is 93.6%.

Eventually, when the connection was re-established, the WU went up and was validated at 5:09:09 UTC, producing the results below:

faah0583_d280cb330_x1hpv_00 Valid 06/05/2006 20:25:27 06/07/2006 05:09:09 7.27 58 / 58 (Moi)
faah0583_d280cb330_x1hpv_00 Valid 06/05/2006 20:24:32 06/06/2006 08:41:12 9.24 57 / 58
faah0583_d280cb330_x1hpv_00 Valid 06/05/2006 20:22:16 06/06/2006 01:59:06 4.73 122 / 58

The interim solution I see is to use the network connection Suspend option to stop BOINC (5.4.9) from sending anything, so that at least the science continues uninterrupted, but that is far from ideal.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jun 7, 2006 4:04:49 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: BOINC Scheduler deferring science processing

I think I can reassure you there is no problem at all with this.

First (and most importantly) BOINC doesn't stop crunching when it makes a scheduler request. Under normal circumstances, BOINC will have a fair amount of work tucked away to be getting on with. As you can see from the log, work unit crunching was not interrupted at any time.

Second, there is a change in the latest BOINC version so it only backs off exponentially if the failure is due to the server being unavailable. If it is a local problem, BOINC will keep trying until you fix it. This means that if BOINC does run out of work, it will get new work automatically as soon as the connection is reestablished.
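The distinction described above can be sketched as a small retry policy. This is only an illustration of the idea, not BOINC's actual code: the function name, the doubling factor, the 60-second base and the 4-hour cap are all assumptions.

```python
def next_retry_delay(failure_is_server_side, consecutive_failures,
                     base_seconds=60, cap_seconds=4 * 3600):
    """Illustrative retry policy; names and numbers are hypothetical, not BOINC's own."""
    if not failure_is_server_side:
        # Local problem (e.g. the host name cannot be resolved): fixed retry interval.
        return base_seconds
    # Server unavailable: double the wait after each failure, up to a cap.
    return min(cap_seconds, base_seconds * 2 ** (consecutive_failures - 1))

local = [next_retry_delay(False, n) for n in range(1, 6)]
server = [next_retry_delay(True, n) for n in range(1, 6)]
print(local)   # the interval stays the same
print(server)  # the interval grows with each failure
```

With a policy of this shape, a local outage produces a retry every minute for as long as it lasts, which matches the one-minute deferrals in the log.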

Your claim that time was lost is probably due to a misunderstanding of what is going on, but you haven't provided enough information to demonstrate this properly. We can look into it further if you want.
[Jun 7, 2006 5:33:05 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: BOINC Scheduler deferring science processing

Hmmmm, if my PC runs at 93.6% efficiency in terms of CPU time allocated to BOINC vs. the rest, then over the time span between the WU starting and the connection problem being resolved I would have expected the WU to have logged 4.5 hours or so. It only progressed 2.75 hours.

The log stdoutdae.txt shows that at 02:57 CET BOINC started trying without interruption, at 1-minute intervals, to obtain new work:

2006-06-07 02:57:38 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 02:57:38 [World Community Grid] Reason: To fetch work
2006-06-07 02:57:38 [World Community Grid] Requesting 119 seconds of new work
2006-06-07 02:57:54 [---] Project communication failed: attempting access to reference site
2006-06-07 02:57:58 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 02:57:58 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 02:58:11 [---] Access to reference site failed - check network connection or proxy configuration.

@ 3:14 CET it finishes the WU and tries to send the result whilst continuing to try to obtain new work:

2006-06-07 03:14:07 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 03:14:07 [World Community Grid] Reason: To fetch work
2006-06-07 03:14:07 [World Community Grid] Requesting 708 seconds of new work
2006-06-07 03:14:28 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 03:14:28 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 03:15:00 [---] Rescheduling CPU: application exited
2006-06-07 03:15:00 [World Community Grid] Computation for task faah0583_d280cb330_x1hpv_00_2 finished
2006-06-07 03:15:00 [World Community Grid] Starting task ex328_1B_0 using rosetta version 422
2006-06-07 03:15:01 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 03:15:01 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 03:15:18 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_0: http error
2006-06-07 03:15:18 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 03:15:18 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
2006-06-07 03:15:18 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1

@ 7:06 CET checked computer to see problem and reset router...log continues

2006-06-07 07:06:32 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 07:06:32 [World Community Grid] Reason: To fetch work
2006-06-07 07:06:32 [World Community Grid] Requesting 5796 seconds of new work
2006-06-07 07:06:35 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_0: http error
2006-06-07 07:06:35 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:06:35 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
2006-06-07 07:06:35 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:06:47 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 07:06:47 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 07:07:35 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:07:35 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:07:49 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 07:07:49 [World Community Grid] Reason: To fetch work
2006-06-07 07:07:49 [World Community Grid] Requesting 5844 seconds of new work
2006-06-07 07:07:51 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_0: http error
2006-06-07 07:07:51 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:07:51 [World Community Grid] Temporarily failed upload of faah0583_d280cb330_x1hpv_00_2_1: http error
2006-06-07 07:07:51 [World Community Grid] Backing off 1 minutes and 0 seconds on upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:08:04 [World Community Grid] Scheduler request failed: couldn't resolve host name
2006-06-07 07:08:04 [World Community Grid] Deferring scheduler requests for 1 minutes and 0 seconds
2006-06-07 07:08:51 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:08:51 [World Community Grid] Started upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:08:58 [World Community Grid] Finished upload of file faah0583_d280cb330_x1hpv_00_2_0
2006-06-07 07:08:58 [World Community Grid] Throughput 7938 bytes/sec
2006-06-07 07:09:06 [World Community Grid] Finished upload of file faah0583_d280cb330_x1hpv_00_2_1
2006-06-07 07:09:06 [World Community Grid] Throughput 26565 bytes/sec
2006-06-07 07:09:09 [World Community Grid] Sending scheduler request to https://secure.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2006-06-07 07:09:09 [World Community Grid] Reason: To fetch work
2006-06-07 07:09:09 [World Community Grid] Requesting 5910 seconds of new work, and reporting 1 completed tasks
2006-06-07 07:09:15 [World Community Grid] Scheduler request succeeded
2006-06-07 07:09:17 [World Community Grid] Started download of file ex442_10_ex442.fasta
2006-06-07 07:09:17 [World Community Grid] Started download of file ex442_10_ex442.psipred
2006-06-07 07:09:19 [World Community Grid] Finished download of file ex442_10_ex442.fasta
2006-06-07 07:09:19 [World Community Grid] Throughput 123 bytes/sec
2006-06-07 07:09:19 [World Community Grid] Finished download of file ex442_10_ex442.psipred
2006-06-07 07:09:19 [World Community Grid] Throughput 418 bytes/sec
2006-06-07 07:09:19 [World Community Grid] Started download of file ex442_10_ex442.psipred_ss2
2006-06-07 07:09:19 [World Community Grid] Started download of file ex442_10_aaex44203_05.075_v1_3
2006-06-07 07:09:21 [World Community Grid] Finished download of file ex442_10_ex442.psipred_ss2
2006-06-07 07:09:21 [World Community Grid] Throughput 2369 bytes/sec
2006-06-07 07:09:21 [World Community Grid] Started download of file ex442_10_aaex44209_05.075_v1_3
2006-06-07 07:09:28 [World Community Grid] Finished download of file ex442_10_aaex44203_05.075_v1_3
2006-06-07 07:09:28 [World Community Grid] Throughput 157961 bytes/sec
2006-06-07 07:09:38 [World Community Grid] Finished download of file ex442_10_aaex44209_05.075_v1_3
2006-06-07 07:09:38 [World Community Grid] Throughput 243640 bytes/sec
2006-06-07 07:09:39 [---] Rescheduling CPU: files downloaded

Okay, the 4.5 hours was really 4 hours and 10 minutes of uninterrupted, minute-by-minute logging of work-fetch attempts, joined from about 3:10 by result-sending attempts, until 7:06. In the 4 wall-clock hours the WU crunched, it only progressed 2 hours 46 minutes of CPU time, ergo 1.5 hours went somewhere south.
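The size of the gap depends on which endpoints are taken from the log. As a rough cross-check, the sketch below assumes the new task started at 03:15:00 (the "Starting task" line) and takes the 07:06:32 entry as the end of the interval; both endpoints, and the use of the 93.6% figure as BOINC's expected share of wall-clock time, are assumptions for illustration.

```python
from datetime import datetime

task_started = datetime(2006, 6, 7, 3, 15, 0)    # "Starting task ex328_1B_0"
outage_noticed = datetime(2006, 6, 7, 7, 6, 32)  # first entry quoted after the router check
boinc_share = 0.936                              # CPU share quoted in the post
observed_cpu_seconds = 2 * 3600 + 46 * 60        # 2 h 46 min of CPU time reported

wall_seconds = (outage_noticed - task_started).total_seconds()
expected_cpu_seconds = boinc_share * wall_seconds
shortfall_seconds = expected_cpu_seconds - observed_cpu_seconds

print(f"wall clock:   {wall_seconds / 3600:.2f} h")
print(f"expected CPU: {expected_cpu_seconds / 3600:.2f} h")
print(f"shortfall:    {shortfall_seconds / 60:.1f} min")
```

With these particular endpoints the shortfall comes out at roughly 50 minutes. Measuring the raw wall-clock gap instead, without the 93.6% factor, gives a larger figure.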

There can only be 3 conclusions with which one can do anything:

1. The log shows perpetual attempts to fetch work, at 1-minute intervals, with an ever-increasing number of seconds of work requested (whatever that means).
2. Substantial time was lost keeping the failing comms request loop going. The "Rescheduling CPU" message has been observed many times.
3. Watching the science project's entry in Task Manager, which counts in sync with BOINC, it's very visible that CPU time stalls whilst comms between BOINC and the WCG servers take place.

Had the fetching / sending attempts backed off for longer periods, WHICH I UNDERSTAND IT IDEALLY SHOULD HAVE DONE, the 1.5 hours lost would have been maybe 5 minutes...I've got a real 5 Mb download and 320k upload ADSL, so sending and fetching normally take only fractions of a minute.

I'd be happy to zip the log and send it to an address of support's choice....meantime, I think you gave the answer...the problem was local, thus BOINC keeps on trying (without backing off exponentially???), whilst if it were server-side it would extend the interval between requests exponentially.

I think there's something to fix in BOINC 5.4.9 client!

PS Curiously, since the introduction of the 1-week deadline I've seen switch messages which go like:

2006-06-07 13:14:42 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-06-07 13:14:45 [---] Suspending work fetch because computer is overcommitted.

It would stop the WU in progress dead in its tracks and go to a different WU. Then, when that prevailing WU is finished, it would give the following:

2006-06-07 21:39:20 [---] Resuming round-robin CPU scheduling.
2006-06-07 21:39:20 [---] Allowing work fetch again.

Why would it persevere with work-fetch attempts most of the night when in fact, as you put it, sufficient work is in the cache (keep 2 days)?
[Jun 7, 2006 8:54:10 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: BOINC Scheduler deferring science processing

To answer your last question first: BOINC tries to fill its cache completely. This means that as soon as it has a tiny gap in the cache (and the network policy allows), it will send a scheduler request for that amount. This is why the scheduler request is usually just for a few seconds of work. However, the scheduler response gives BOINC a whole work unit, and BOINC doesn't need to phone home again for quite a while.
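The gap-filling described above can be pictured with a very simplified model: the request is whatever part of the cache is not yet covered by queued work. This is a sketch only, not BOINC's real work-fetch calculation; the function name is made up, and the 2-day target is the "keep 2 days" setting mentioned earlier in the thread.

```python
def work_request_seconds(cache_target_seconds, queued_work_seconds):
    """Seconds of work to ask for: the unfilled part of the cache, never negative."""
    return max(0, cache_target_seconds - queued_work_seconds)

two_days = 2 * 24 * 3600  # "keep 2 days" cache setting

# As queued work is crunched down and cannot be replaced, the gap (and the request) grows.
for queued in (two_days - 119, two_days - 5080, two_days + 600):
    print(work_request_seconds(two_days, queued))
```

In this model a request that climbs from 119 to several thousand seconds simply reflects queued work being used up while no new work can be fetched.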

Now, the lost time: BOINC's activity isn't going to affect CPU time normally. However, network activity can tie up CPU, particularly if you have a cheap network card or integrated networking. If the router is fried, your computer could easily sit there bouncing messages at it all day. There's not a lot BOINC can do about this. Should it happen again, check task manager and see which process is tying up your CPU.
[Jun 7, 2006 9:36:26 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: BOINC Scheduler deferring science processing

That piece I did not provide, but yesterday morning Task Manager showed 1 hour 19 minutes 29 seconds allocated to BOINC.EXE. This morning, the machine never having been off since the last boot 5 days 14 hours ago (Systeminfo | Find "Tempo"), it showed 1:19:37. That's a whole 8 seconds in 24 hours when not futilely trying to get work.

No, my network/comms hardware is fine....I can run streaming radio from Pandora, Seattle, with it barely taking 1 or 2% CPU time. Not the close to 30% that BOINC.exe took trying its luck.

My suggestion thus remains to put forward a mod request for the BOINC.exe client to go through the same backoff extension routine as it does for server-side failures. This perpetual top-up-to-the-brim attempting is....(various words from Dizionario Brittannica omitted).

This tiny-gap filling seems to be responsible for the Overcommitting / Round Robin switching routines as well? The estimated times of WUs, not being based on past client experience, are utterly wacky, as it literally does 99% in 9 clock hours and not the proposed 17:18:01 for each WU presently in cache.

have a nice day
ciao
[Jun 8, 2006 5:27:41 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: BOINC Scheduler deferring science processing

You're missing the point.

The network hardware will put more strain on the CPU when it isn't working properly.

If you don't like 5.4.9, I suggest you talk to the BOINC developers, or downgrade.
[Jun 8, 2006 6:31:29 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: BOINC Scheduler deferring science processing

Hmmm, 5.4.9 as opposed to a previous version has nothing to do with it; any BOINC version has way-off-the-mark time estimates.

Fact remains that BOINC.exe ate 1:19 (hh:mm) in a 4-hour wall-clock stretch, taking time away from the science crunch. My proposal was to make the BOINC agent back off over longer periods, i.e. 1 minute, 10 minutes, 100 minutes, since it has 2 days of work in cache anyway. I'm sure WCG has more clout bringing this up in the BOINC development forum than a private individual has. So let me rephrase: because BOINC kept trying ad infinitum to contact the server, the network hardware was kept busy.....but my good network hardware barely burdens the CPU!
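To put numbers on the proposal, the sketch below counts how many retries fit into an outage of about 4 hours 10 minutes with a fixed 1-minute wait versus a wait that grows tenfold each time (1, 10, 100 minutes). It is a simplified model: it ignores the time each failed attempt itself takes, and the function name is hypothetical.

```python
def attempts_during_outage(outage_seconds, first_delay_seconds, growth_factor):
    """Count retries that fit in an outage when each wait is growth_factor times the last."""
    elapsed, delay, attempts = 0, first_delay_seconds, 0
    while elapsed + delay <= outage_seconds:
        elapsed += delay
        attempts += 1
        delay *= growth_factor
    return attempts

outage = 4 * 3600 + 10 * 60  # roughly 4 h 10 min without a connection

fixed = attempts_during_outage(outage, 60, 1)     # retry every minute
tenfold = attempts_during_outage(outage, 60, 10)  # waits of 1, 10, 100 minutes
print(fixed, tenfold)
```

Under these assumptions the fixed schedule makes 250 attempts and the tenfold schedule makes 3, at the cost of reconnecting up to 100 minutes later once the link is back.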

No, I'm not missing the point. I'll seek out an answer elsewhere and report back.

Sorry to have bothered you....have a nice day anyway rose
[Jun 8, 2006 7:42:14 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: BOINC Scheduler deferring science processing

(>.<)

Obviously I wasn't clear. The removal of the exponential backoff is a new feature in 5.4.9. This is by design. If you don't like it, by all means tell the BOINC programmers.
[Jun 8, 2006 7:56:01 AM]