World Community Grid Forums
Thread Status: Active | Total posts in this thread: 8
schepers
Advanced Cruncher, Canada | Joined: Oct 11, 2006 | Post Count: 85
I've decided to repost this in a new thread, as it might be important to those affected by the BOINC server outage this morning.

I went and checked all my machines, including some of the logs. I noticed that all of the machines that were up and attempting to communicate with the servers during the outage now have an unusually high number of work units. (All of my machines are running BOINC 5.8.0 or 5.8.1, and all are set to cache 1 day's worth of jobs, work_buf_min_days=1.)

System 1: P4 2.16 dual core - 67 tasks waiting to run, 2 running
System 2: P4 3.0 GHz HT, running as 2 processors - 36 tasks waiting to run, 2 running
System 3: P4 3.0 GHz HT, running as 2 processors - 38 tasks waiting to run, 2 running
System 4: P4 2.8 GHz HT, running as 1 processor - 22 tasks waiting to run, 1 running

Question: why do I now have 67 tasks on my dual core?

This is only a sample of a few of the machines I'm running. I check these machines every few days and they don't normally have more than 8-16 tasks waiting, with the dual core usually having a few more. This appears to be a side effect of this morning's outage. Even the single-core System 4 has 22 tasks waiting; those might be done within the week. I'm worried that the machine with 67 waiting can't finish them in the one-week timeframe. Even the ones with 36 and 38 might be hard pressed to finish on time, and those run 24/7 with parallel tasks.

Is anyone else seeing this problem? Please check the Tasks tab in the BOINC manager to see what's queued up.

I'm also still seeing MD5 hash failures, only for FCG tasks.

[Edit 1 times, last edit by schepers at Jan 17, 2007 7:19:45 PM]
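For a sense of why 67 queued tasks looks wrong for a 1-day cache, here is a rough back-of-the-envelope estimate; the per-task runtime is an assumed figure for illustration, not a measured value.

    // Rough estimate of how many tasks a work_buf_min_days cache should hold.
    // The per-task runtime below is an assumption for illustration only.
    #include <iostream>

    int main() {
        double work_buf_min_days = 1.0;  // cache setting from the post above
        int cores = 2;                   // the dual-core System 1
        double hours_per_task = 6.0;     // assumed average task runtime

        double expected_tasks = work_buf_min_days * 24.0 * cores / hours_per_task;
        std::cout << "Expected queued tasks: ~" << expected_tasks << "\n";  // ~8
        return 0;
    }

With those assumptions a 1-day buffer works out to single digits of queued tasks per machine, which lines up with the 8-16 normally seen and makes 67 stand out.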
Ingleside
Veteran Cruncher, Norway | Joined: Nov 19, 2005 | Post Count: 974
> Question: Why do I now have 67 tasks on my dual core?

Short answer: this is one of the reasons alpha builds include "May be unstable - use only for testing"...

Longer answer: when detection of "stalled" downloads was introduced in v5.8.0, a > and a < were switched, and because of this bug the client keeps asking for more and more work if the download server is unreachable... The bug is fixed in v5.8.2.

> I'm also still seeing MD5 hash failures only for FCG tasks.

This is because WCG is reporting wrong information to the BOINC client, and the BOINC client doesn't handle it correctly... As for a possible client-side fix, abort all downloads larger than nbytes; if that fix gets added, all FCG1 work units would be aborted until WCG fixes their work units.

Edit: this is only a problem if a download is started but for any reason stopped before finishing; when the client tries to resume the download it shows an MD5 error and keeps retrying.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

[Edit 1 times, last edit by Ingleside at Jan 18, 2007 12:04:20 AM]
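To make the "switched a > and a <" explanation concrete, here is a minimal sketch of how a flipped comparison in a stalled-download check misbehaves. The names, threshold, and structure are hypothetical illustrations, not the actual BOINC 5.8.0 source.

    // Illustrative sketch only - names and values are assumptions, not BOINC code.
    #include <ctime>
    #include <iostream>

    struct Download {
        time_t last_byte_received;   // when data last arrived for this file
    };

    const int STALL_TIMEOUT = 600;   // assumed stall threshold, in seconds

    // Intended check: the download has stalled if nothing arrived for a while.
    bool is_stalled_intended(const Download& d, time_t now) {
        return now - d.last_byte_received > STALL_TIMEOUT;
    }

    // With the operator flipped the check answers the opposite question, which,
    // per the post above, is roughly what led the 5.8.0/5.8.1 client to keep
    // asking the scheduler for more work while the download server was down.
    bool is_stalled_buggy(const Download& d, time_t now) {
        return now - d.last_byte_received < STALL_TIMEOUT;   // wrong operator
    }

    int main() {
        time_t now = std::time(nullptr);
        Download d{now - 3600};  // no data received for an hour
        std::cout << "intended: " << is_stalled_intended(d, now)
                  << "  buggy: " << is_stalled_buggy(d, now) << "\n";
        return 0;
    }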
Ingleside
Veteran Cruncher, Norway | Joined: Nov 19, 2005 | Post Count: 974
For the MD5 errors, a possible bug fix was just added, so it should show up in the next client, likely v5.8.4.

Still, WCG should fix their fcg1 work units and stop claiming they're zero bytes long.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
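As a rough illustration of the kind of client-side fix discussed above (abort a transfer once it has received more data than the advertised size), here is a hypothetical sketch; the structure and field names are assumptions, not the actual BOINC patch.

    // Hypothetical sketch of an "abort oversized downloads" check; field names
    // are assumptions, not the real BOINC change.
    #include <cstdint>
    #include <iostream>

    struct FileTransfer {
        int64_t nbytes;           // size advertised by the project (0 for FCG1 files)
        int64_t bytes_received;   // data received so far
    };

    // Abort a transfer once it has received more data than the advertised size.
    // As noted above, a work unit advertised as 0 bytes (like FCG1) would then
    // be aborted as soon as any data arrives.
    bool should_abort(const FileTransfer& ft) {
        return ft.bytes_received > ft.nbytes;
    }

    int main() {
        FileTransfer fcg1{0, 4096};         // advertised as 0 bytes, some data in
        FileTransfer normal{100000, 4096};  // still within the advertised size
        std::cout << "FCG1 aborted: " << should_abort(fcg1)
                  << ", normal aborted: " << should_abort(normal) << "\n";
        return 0;
    }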
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043
That's how they arrive in the Transfers tab. All non-compressed projects report the size of the file downloads, whereas for FCG1 it's unknown, therefore the transfer starts at zero.

WCG

Please help to make the Forums an enjoyable experience for All!

[Edit 1 times, last edit by Sekerob at Jan 18, 2007 7:53:46 AM]
schepers
Advanced Cruncher, Canada | Joined: Oct 11, 2006 | Post Count: 85
> Longer answer: when detection of "stalled" downloads was introduced in v5.8.0, a > and a < were switched, and because of this bug the client keeps asking for more and more work if the download server is unreachable...

I was afraid you would say this about running a beta. However, I think the server situation was a little more specific. Most of my machines were up and attempting communications during this unstable time. Since they managed to download pieces of work units (a few percent of many files, but rarely a complete file), they continued to ask for more. If the server had been truly down and nothing had been downloaded, would this state still have happened? Would my machines have downloaded a whole mess of jobs at once when the servers came back?

This is the first bug I've seen in 5.8.0 and 5.8.1. I await the stable 5.8.x/5.9.x. Is there a list maintained somewhere that documents the bugs fixed between beta versions?

[Edit 1 times, last edit by schepers at Jan 18, 2007 2:31:41 PM]
Ingleside
Veteran Cruncher, Norway | Joined: Nov 19, 2005 | Post Count: 974
> If the server had been truly down and nothing had been downloaded, would this state still have happened? Would my machines have downloaded a whole mess of jobs at once when the servers came back?

If the scheduling server is up but the download server is down, yes, you'll get assigned more and more work. For many projects these are the same server, so normally both are unreachable at the same time and the bug isn't a problem.

> This is the first bug I've seen in 5.8.0 and 5.8.1. I await the stable 5.8.x/5.9.x. Is there a list maintained somewhere that documents the bugs fixed between beta versions?

Odd-numbered versions are development builds; even-numbered ones like v5.8.x are possible release builds. Normally you won't run into any problems even if you run an alpha build, but occasionally something unexpected happens...

As for release notes, the download page normally only mentions new features, but the checkin_notes include everything.
http://setiathome.berkeley.edu/cgi-bin/cvsweb.cgi/boinc/checkin_notes

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
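A quick way to see the odd/even convention described above applied to the version strings mentioned in this thread; a minimal sketch, not part of any BOINC tooling.

    // Minimal sketch of the odd/even minor-version convention described above.
    #include <cstdio>

    int main() {
        int major = 0, minor = 0, release = 0;
        const char* versions[] = {"5.8.0", "5.8.1", "5.8.2", "5.9.0"};
        for (const char* v : versions) {
            if (std::sscanf(v, "%d.%d.%d", &major, &minor, &release) == 3) {
                std::printf("%s -> %s\n", v,
                            (minor % 2 == 0) ? "possible release build"
                                             : "development build");
            }
        }
        return 0;
    }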
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043
This is a note suggesting there is a fix underway for the 5.8 release to stop work-request inflation:

> David, 23 Jan 2007 - core client: added <work_request_factor> configuration option. Multiplies work requests. Use values > 1 if your computer often runs out of work while disconnected.
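To illustrate what a multiplier like <work_request_factor> would do, here is a hypothetical sketch of a work request being scaled; the variable names, and the assumption that the factor simply multiplies the requested seconds of work, are based only on the checkin note above, not on the actual client code.

    // Hypothetical sketch based only on the checkin note ("Multiplies work
    // requests"); not the real BOINC client code.
    #include <iostream>

    int main() {
        double work_buf_min_days = 1.0;       // the user's cache setting
        double shortfall_secs = work_buf_min_days * 86400.0;  // work still needed
        double work_request_factor = 2.0;     // example value > 1

        double request_secs = shortfall_secs * work_request_factor;
        std::cout << "Requesting " << request_secs
                  << " seconds of work from the scheduler\n";
        return 0;
    }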
WCG

Please help to make the Forums an enjoyable experience for All!
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504
> I'm also still seeing MD5 hash failures only for FCG tasks.
>
> This is because WCG is reporting wrong information to the BOINC client, and the BOINC client doesn't handle it correctly... As for a possible client-side fix, abort all downloads larger than nbytes; if that fix gets added, all FCG1 work units would be aborted until WCG fixes their work units.
>
> Edit: this is only a problem if a download is started but for any reason stopped before finishing; when the client tries to resume the download it shows an MD5 error and keeps retrying.

The printout of the md5sum problem was always misleading (since a partially downloaded file will always have an md5sum that differs from the full file's). That message is no longer displayed in the new build. The md5sums provided for the Genome Comparison files are the correct md5sums for the uncompressed files.

As far as reporting 0 bytes for the Genome Comparison download files: this is being done because of a problem with the BOINC 5.4 client. Specifically, if you provide a byte count for a downloaded file and the download is interrupted, the 5.4 client will attempt to download only the remaining portion of the file rather than re-downloading the entire file. This works great with uncompressed files. However, with Genome Comparison we are using the built-in BOINC compression. BOINC uses the libcurl package and expands the file as it is downloaded from the site. The problem that arises is that if the transfer is interrupted, then when the 5.4 client attempts to resume the download it uses the size of the uncompressed portion of the downloaded file, not the amount of the compressed file downloaded so far. This caused an incorrect request to be sent to the server for the remaining amount of the download.

The only way around this problem was to not send the file size for files that are using the built-in BOINC compression.

[Edit 2 times, last edit by knreed at Jan 24, 2007 7:51:41 PM]
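To make the resume problem above concrete, here is a hypothetical sketch of how using the decompressed bytes already on disk as the resume offset goes wrong when the data on the wire is compressed. The numbers and names are illustrative assumptions, not the actual 5.4 client code.

    // Illustrative sketch of the resume-offset problem described above;
    // numbers and names are assumptions, not the actual BOINC 5.4 client code.
    #include <cstdint>
    #include <iostream>

    int main() {
        int64_t compressed_received  = 100000;  // compressed bytes actually fetched
        int64_t decompressed_on_disk = 250000;  // size of the expanded partial file

        // Correct resume request: continue from the compressed byte count.
        std::cout << "Correct resume offset: bytes=" << compressed_received << "-\n";

        // What the 5.4 client effectively did: use the decompressed size on disk,
        // an offset that does not correspond to any position in the compressed
        // stream the server is sending, so the request for the remainder is wrong.
        std::cout << "Buggy resume offset:   bytes=" << decompressed_on_disk << "-\n";
        return 0;
    }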