World Community Grid Forums
Thread Status: Active | Total posts in this thread: 8
schepers
Advanced Cruncher, Canada | Joined: Oct 11, 2006 | Post Count: 85
I've decided to repost this in a new thread, as it might be important to those affected by the BOINC server outage this morning.

I went and checked all my machines, including some of the logs. I noticed that all of the machines that were up and attempting to communicate with the servers during the outage now have an unusually high number of work units. (All of my machines are running BOINC 5.8.0 or 5.8.1, and all are set to cache 1 day's worth of jobs, work_buf_min_days=1.)

System 1: P4 2.16 dual core - 67 tasks waiting to run, 2 running
System 2: P4 3.0 GHz HT, running as 2 processors - 36 tasks waiting to run, 2 running
System 3: P4 3.0 GHz HT, running as 2 processors - 38 tasks waiting to run, 2 running
System 4: P4 2.8 GHz HT, running as 1 processor - 22 tasks waiting to run, 1 running

Question: why do I now have 67 tasks on my dual core?

This is only a sample of a few of the machines I'm running. I check these machines every few days and they don't normally have more than 8-16 tasks waiting, with the dual core usually having a few more. This appears to be a side effect of this morning's outage. Even the single-core System 4 has 22 tasks waiting; those might be done within the week. I'm worried that the machine with 67 waiting can't finish them in the one-week timeframe. Even the ones with 36 and 38 might be hard pressed to finish on time, and those run 24/7 with parallel tasks.

Is anyone else seeing this problem? Please check the Tasks tab in the BOINC manager to see what's queued up.

I'm also still seeing MD5 hash failures, only for FCG tasks.

[Edit 1 times, last edit by schepers at Jan 17, 2007 7:19:45 PM]
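For a sense of why 67 queued tasks looks wrong for a 1-day cache, here is a rough back-of-the-envelope estimate; the per-task runtime is an assumed figure for illustration, not a measured value.

    // Rough estimate of how many tasks a work_buf_min_days cache should hold.
    // The per-task runtime below is an assumption for illustration only.
    #include <iostream>

    int main() {
        double work_buf_min_days = 1.0;  // cache setting from the post above
        int cores = 2;                   // the dual-core System 1
        double hours_per_task = 6.0;     // assumed average task runtime

        double expected_tasks = work_buf_min_days * 24.0 * cores / hours_per_task;
        std::cout << "Expected queued tasks: ~" << expected_tasks << "\n";  // ~8
        return 0;
    }

With those assumptions a 1-day buffer works out to single digits of queued tasks per machine, which lines up with the 8-16 normally seen and makes 67 stand out.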
Ingleside
Veteran Cruncher, Norway | Joined: Nov 19, 2005 | Post Count: 974
> Question: Why do I now have 67 tasks on my dual core?

Short answer: this is one of the reasons alpha builds include "May be unstable - use only for testing"...

Longer answer: when detection of "stalled" downloads was introduced in v5.8.0, a > and a < were switched, and because of this bug the client keeps asking for more and more work if the download server is unreachable... The bug is fixed in v5.8.2.

> I'm also still seeing MD5 hash failures only for FCG tasks.

This is because WCG is reporting wrong information to the BOINC client, and the BOINC client doesn't handle it correctly... As for a possible client-side fix, abort all downloads larger than nbytes; if that fix gets added, all FCG1 work units would be aborted until WCG fixes their work units.

Edit: this is only a problem if a download is started but for any reason stopped before finishing; when the client tries to resume the download it shows an MD5 error and keeps retrying.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

[Edit 1 times, last edit by Ingleside at Jan 18, 2007 12:04:20 AM]
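To make the "switched a > and a <" explanation concrete, here is a minimal sketch of how a flipped comparison in a stalled-download check misbehaves. The names, threshold, and structure are hypothetical illustrations, not the actual BOINC 5.8.0 source.

    // Illustrative sketch only - names and values are assumptions, not BOINC code.
    #include <ctime>
    #include <iostream>

    struct Download {
        time_t last_byte_received;   // when data last arrived for this file
    };

    const int STALL_TIMEOUT = 600;   // assumed stall threshold, in seconds

    // Intended check: the download has stalled if nothing arrived for a while.
    bool is_stalled_intended(const Download& d, time_t now) {
        return now - d.last_byte_received > STALL_TIMEOUT;
    }

    // With the operator flipped the check answers the opposite question, which,
    // per the post above, is roughly what led the 5.8.0/5.8.1 client to keep
    // asking the scheduler for more work while the download server was down.
    bool is_stalled_buggy(const Download& d, time_t now) {
        return now - d.last_byte_received < STALL_TIMEOUT;   // wrong operator
    }

    int main() {
        time_t now = std::time(nullptr);
        Download d{now - 3600};  // no data received for an hour
        std::cout << "intended: " << is_stalled_intended(d, now)
                  << "  buggy: " << is_stalled_buggy(d, now) << "\n";
        return 0;
    }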
Ingleside
Veteran Cruncher, Norway | Joined: Nov 19, 2005 | Post Count: 974
For the MD5 errors, a possible bug fix was just added, so it should show up in the next client, likely v5.8.4.

Still, WCG should fix their fcg1 work units and stop claiming they're zero bytes long.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
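As a rough illustration of the kind of client-side fix discussed above (abort a transfer once it has received more data than the advertised size), here is a hypothetical sketch; the structure and field names are assumptions, not the actual BOINC patch.

    // Hypothetical sketch of an "abort oversized downloads" check; field names
    // are assumptions, not the real BOINC change.
    #include <cstdint>
    #include <iostream>

    struct FileTransfer {
        int64_t nbytes;           // size advertised by the project (0 for FCG1 files)
        int64_t bytes_received;   // data received so far
    };

    // Abort a transfer once it has received more data than the advertised size.
    // As noted above, a work unit advertised as 0 bytes (like FCG1) would then
    // be aborted as soon as any data arrives.
    bool should_abort(const FileTransfer& ft) {
        return ft.bytes_received > ft.nbytes;
    }

    int main() {
        FileTransfer fcg1{0, 4096};         // advertised as 0 bytes, some data in
        FileTransfer normal{100000, 4096};  // still within the advertised size
        std::cout << "FCG1 aborted: " << should_abort(fcg1)
                  << ", normal aborted: " << should_abort(normal) << "\n";
        return 0;
    }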
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043
That's how they arrive in the Transfers tab. All non-compressed projects report the size of the file downloads, whereas for FCG1 it's unknown, therefore the transfer starts at zero.

WCG

Please help to make the Forums an enjoyable experience for All!

[Edit 1 times, last edit by Sekerob at Jan 18, 2007 7:53:46 AM]
schepers
Advanced Cruncher, Canada | Joined: Oct 11, 2006 | Post Count: 85
> Longer answer: when detection of "stalled" downloads was introduced in v5.8.0, a > and a < were switched, and because of this bug the client keeps asking for more and more work if the download server is unreachable...

I was afraid you would say this about running a beta. However, I think the server situation was a little more specific. Most of my machines were up and attempting communications during this unstable time. Since they managed to download pieces of work units (a few percent of many files, but rarely a complete file), they continued to ask for more. If the server had been truly down and nothing had been downloaded, would this state still have happened? Would my machines have downloaded a whole mess of jobs at once when the servers came back?

This is the first bug I've seen in 5.8.0 and 5.8.1. I await the stable 5.8.x/5.9.x. Is there a list maintained somewhere that documents the bugs fixed between beta versions?

[Edit 1 times, last edit by schepers at Jan 18, 2007 2:31:41 PM]
Ingleside
Veteran Cruncher, Norway | Joined: Nov 19, 2005 | Post Count: 974
> If the server had been truly down and nothing had been downloaded, would this state still have happened? Would my machines have downloaded a whole mess of jobs at once when the servers came back?

If the scheduling server is up but the download server is down, yes, you'll get assigned more and more work. For many projects these are the same server, so normally both are unreachable at the same time and the bug isn't a problem.

> This is the first bug I've seen in 5.8.0 and 5.8.1. I await the stable 5.8.x/5.9.x. Is there a list maintained somewhere that documents the bugs fixed between beta versions?

Odd-numbered versions are development builds; even-numbered ones like v5.8.x are possible release builds. Normally you won't run into any problems even if you run an alpha build, but occasionally something unexpected happens...

As for release notes, the download page normally only mentions new features, but the checkin_notes include everything.
http://setiathome.berkeley.edu/cgi-bin/cvsweb.cgi/boinc/checkin_notes

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
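A quick way to see the odd/even convention described above applied to the version strings mentioned in this thread; a minimal sketch, not part of any BOINC tooling.

    // Minimal sketch of the odd/even minor-version convention described above.
    #include <cstdio>

    int main() {
        int major = 0, minor = 0, release = 0;
        const char* versions[] = {"5.8.0", "5.8.1", "5.8.2", "5.9.0"};
        for (const char* v : versions) {
            if (std::sscanf(v, "%d.%d.%d", &major, &minor, &release) == 3) {
                std::printf("%s -> %s\n", v,
                            (minor % 2 == 0) ? "possible release build"
                                             : "development build");
            }
        }
        return 0;
    }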
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043
This is a note suggesting there is a fix underway for the 5.8 release to stop work-request inflation:

> David, 23 Jan 2007 - core client: added <work_request_factor> configuration option. Multiplies work requests. Use values > 1 if your computer often runs out of work while disconnected.
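To illustrate what a multiplier like <work_request_factor> would do, here is a hypothetical sketch of a work request being scaled; the variable names, and the assumption that the factor simply multiplies the requested seconds of work, are based only on the checkin note above, not on the actual client code.

    // Hypothetical sketch based only on the checkin note ("Multiplies work
    // requests"); not the real BOINC client code.
    #include <iostream>

    int main() {
        double work_buf_min_days = 1.0;       // the user's cache setting
        double shortfall_secs = work_buf_min_days * 86400.0;  // work still needed
        double work_request_factor = 2.0;     // example value > 1

        double request_secs = shortfall_secs * work_request_factor;
        std::cout << "Requesting " << request_secs
                  << " seconds of work from the scheduler\n";
        return 0;
    }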
WCG

Please help to make the Forums an enjoyable experience for All!
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504
> I'm also still seeing MD5 hash failures only for FCG tasks.
>
> This is because WCG is reporting wrong information to the BOINC client, and the BOINC client doesn't handle it correctly... As for a possible client-side fix, abort all downloads larger than nbytes; if that fix gets added, all FCG1 work units would be aborted until WCG fixes their work units.
>
> Edit: this is only a problem if a download is started but for any reason stopped before finishing; when the client tries to resume the download it shows an MD5 error and keeps retrying.

The printout of the md5sum problem was always misleading (since a partially downloaded file will always have an md5sum that differs from the full file's). That message is no longer displayed in the new build. The md5sums provided for the Genome Comparison files are the correct md5sums for the uncompressed files.

As far as reporting 0 bytes for the Genome Comparison download files: this is being done because of a problem with the BOINC 5.4 client. Specifically, if you provide a byte count for a downloaded file and the download is interrupted, the 5.4 client will attempt to download only the remaining portion of the file rather than re-downloading the entire file. This works great with uncompressed files. However, with Genome Comparison we are using the built-in BOINC compression. BOINC uses the libcurl package and expands the file as it is downloaded from the site. The problem that arises is that if the transfer is interrupted, then when the 5.4 client attempts to resume the download it uses the size of the uncompressed portion of the downloaded file, not the amount of the compressed file downloaded so far. This caused an incorrect request to be sent to the server for the remaining amount of the download.

The only way around this problem was to not send the file size for files that are using the built-in BOINC compression.

[Edit 2 times, last edit by knreed at Jan 24, 2007 7:51:41 PM]
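To make the resume problem above concrete, here is a hypothetical sketch of how using the decompressed bytes already on disk as the resume offset goes wrong when the data on the wire is compressed. The numbers and names are illustrative assumptions, not the actual 5.4 client code.

    // Illustrative sketch of the resume-offset problem described above;
    // numbers and names are assumptions, not the actual BOINC 5.4 client code.
    #include <cstdint>
    #include <iostream>

    int main() {
        int64_t compressed_received  = 100000;  // compressed bytes actually fetched
        int64_t decompressed_on_disk = 250000;  // size of the expanded partial file

        // Correct resume request: continue from the compressed byte count.
        std::cout << "Correct resume offset: bytes=" << compressed_received << "-\n";

        // What the 5.4 client effectively did: use the decompressed size on disk,
        // an offset that does not correspond to any position in the compressed
        // stream the server is sending, so the request for the remainder is wrong.
        std::cout << "Buggy resume offset:   bytes=" << decompressed_on_disk << "-\n";
        return 0;
    }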