World Community Grid - View Thread - 2022-11-04 Update (ARP units & Device Manager issues)

World Community Grid Forums

Category: Active Research

Forum: Africa Rainfall Project

Thread: 2022-11-04 Update (ARP units & Device Manager issues)


Thread Status: Active
Total posts in this thread: 30


This topic has been viewed 8372 times and has 29 replies

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 448
Status: Offline

Re: 2022-11-04 Update (ARP units & Device Manager issues)

@Cyclops,
Based on how many WUs I currently show “in progress” and how many I TYPICALLY show “in progress,” I would speculate (since I am at work, not at home, and unable to verify visually) that I have about 40 - 60 WUs stalled with potential http errors.
That comment was based on information I emailed to you earlier today.

Bruce

ETA: Downloads were in worse shape than I suspected. Virtually all of the indicated WUs were incomplete downloads with http errors. They have been expedited, but I have seen additional http errors on newer downloads. What I thought were stalled WUs were, in fact, WUs not yet scheduled to be sent. Will continue to monitor.

----------------------------------------
[Edit 1 times, last edit by bfmorse at Nov 5, 2022 7:58:39 AM]

[Nov 4, 2022 11:24:27 PM]

Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 279
Status: Offline

Re: 2022-11-04 Update (ARP units & Device Manager issues)

After running smoothly for a week, the download issues are back. The system was running perfectly with the mix of MCM, OPN, and a steady flow of OPNG. Reintroduce ARP, and the issue returns. I'd call that a correlation.
Given the larger file size of ARP, if the download speed isn't fast enough, it may be resulting in too many simultaneous connections, or there could be some other factor involved. Definitely worth looking into.

----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)

[Nov 5, 2022 12:55:36 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1337
Status: Offline

Re: 2022-11-04 Update (ARP units & Device Manager issues)

Amen to this! I'd add that the flow of MCM needs to be steady enough that folks don't run out and need to re-download that 100+MB file.

By the way, adriverhoef speculated on this in a post in another News thread a couple of weeks ago - don't know whether the tech team saw that or not :-)

Cheers - Al.

[Nov 5, 2022 3:17:46 AM]

Pete Broad
Senior Cruncher
Wales
Joined: Jan 3, 2007
Post Count: 169
Status: Offline

Re: 2022-11-04 Update (ARP units & Device Manager issues)

I'm one of the people with device manager issues. New machines are getting work but are not shown in the device manager. Also, name changes that I made on some machines are not showing up.

Pete

[Nov 5, 2022 10:07:35 AM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: 2022-11-04 Update (ARP units & Device Manager issues)

Yep, lots of failed downloads.
In the next BOINC programmers' gathering, I'd suggest adding a user-tunable option to the client to let it be more resilient when the download servers are overloaded and are giving lots of these http errors: do more retries, and not do "project backoff"s so readily.
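For illustration only, here is a rough sketch of the kind of knob being suggested: an exponential backoff whose starting delay and ceiling are user-tunable, with random jitter so that many clients do not all retry at the same moment. This is not BOINC's actual algorithm; the function name and the `base` and `cap` parameters are hypothetical.

```python
import random


def next_retry_delay(attempt, base=60.0, cap=3600.0, rng=random.random):
    """Return the seconds to wait before retry number `attempt` (1-based).

    Hypothetical sketch, not BOINC code. The ceiling doubles after each
    failed attempt until it reaches `cap`; the actual delay is drawn
    uniformly from [0, ceiling] ("full jitter") to spread retries out.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng() * ceiling


if __name__ == "__main__":
    # A lower `cap` means more frequent retries; a higher one is gentler
    # on the server. Both values here are made-up defaults.
    for attempt in range(1, 9):
        print(attempt, round(next_retry_delay(attempt), 1))
```

Lowering `cap` is the "more resilient" direction asked for here; the jitter is what keeps a crowd of clients from hammering the server in lockstep.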

Oh, and apparently the "How to run WCG 1.01" manual that IBM handed to Krembil had a section about how you should only try new adventurous things with the project, like vastly increasing the flow of outgoing files, when the weekend is imminent. That way, these new things will create such chaos that you won't so easily forget that they don't work too well.
OTOH, please delete that section of the manual ;-)

[Nov 5, 2022 10:24:39 AM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2360
Status: Recently Active

Re: 2022-11-04 Update (ARP units & Device Manager issues)

Rickjb wrote: "In the next BOINC programmers' gathering, I'd suggest adding a user-tunable option to the client to let it be more resilient when the download servers are overloaded"

"when the download servers are overloaded"
So, how are you gonna tell that the server is overloaded(*)? And why is the server overloaded? Because too many clients are overloading the server? So let's be more resilient and overload the server even more? >:-)

* The HyperText Transfer Protocol (HTTP) 503 Service Unavailable server error response code indicates that the server is not ready to handle the request. Common causes are a server that is down for maintenance or that is overloaded.

In other words, there is no distinction between 'down for maintenance' and 'overloaded'. 8-)
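A side note on the 503 point: the status code alone does not say why the server is unavailable, but HTTP does let a server attach an optional Retry-After header to a 503, giving either a number of seconds or an HTTP date. Whether the WCG download servers send it, or whether the BOINC client honors it, is not stated in this thread. The sketch below only shows how a client could interpret such a header; the function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Interpret a Retry-After header value.

    Returns the number of seconds to wait, or None when the header is
    missing or unparseable (the caller then falls back to its own backoff).
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in whole seconds, e.g. "120".
        return float(value)
    try:
        # Form 2: an HTTP date, e.g. "Sat, 05 Nov 2022 10:05:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means "retry immediately".
    return max(0.0, (when - now).total_seconds())
```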

[Nov 5, 2022 10:59:24 AM]

Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1314
Status: Offline

Re: 2022-11-04 Update (ARP units & Device Manager issues)

Thank you for the ARP update, and thank you for the ARP WUs. If I have a choice between download errors or no ARPs, I'll take the ARPs, errors and all. I get them eventually.

It will be interesting to see how long this batch of ARPs takes to send out; after that, the errors should stop. The resends are usually a smaller group and more spread out, so they don't cause problems.

[Nov 5, 2022 2:46:41 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline


Re: 2022-11-04 Update (ARP units & Device Manager issues)

So adriverhoef implies that the current client retry and project backoff algorithms are already optimal.
I came to my desktop this morning and checked the "farm". The 2 machines with GPUs were processing no GPU work, but had OPNG work "Downloading". There were only about 20 files in Project Backoff, some of which would have waited more than another 5 hours until their next try, blocking all work. I did some manual "Retry now" clicks. About 1 in 3 files downloaded rapidly on each try, which is a big improvement over the previous day. There must be plenty of little windows of time during which the servers can accept download requests, but also lots of little windows where the servers are busy. There seems to be plenty of bandwidth to transfer the downloads that actually start. I then retried the files that had failed on the previous try. Some devices went into Project Backoff on only the second try, which freezes all retries from that device.

It seems to me to be a far from optimal use of the available server capacity.
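As a back-of-the-envelope check on the "about 1 in 3" figure above, the sketch below treats each try as an independent event with a 1/3 chance of success. That independence is an assumption, not something measured in this thread, and the function names are hypothetical.

```python
def p_success_within(tries, p=1 / 3):
    """Probability that a file downloads within `tries` attempts,
    assuming each attempt succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** tries


def expected_tries(p=1 / 3):
    """Mean number of attempts until the first success (geometric distribution)."""
    return 1.0 / p


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"within {n} tries: {p_success_within(n):.0%}")
    print("expected tries per file:", expected_tries())
```

Under that assumption, a device that backs off after only two tries would still be missing roughly 4 in 9 of its files at that point, while the average file needs about three tries.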

[Nov 6, 2022 5:41:30 AM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2360
Status: Recently Active


Re: 2022-11-04 Update (ARP units & Device Manager issues)

Rickjb wrote: "So adriverhoef implies that the current client retry and project backoff algorithms are already optimal."

In a normal situation, yes, that's what I'm suggesting.

This situation, where you seem to have to 'fight' for a successful connection with slow speed, is not normal.

So, "adding a user-tunable option" is surely constructive thinking and I like that, but I think your suggestion is - as seen in the light of my arguments why it wouldn't work - not doable, or rather not advisable.

Sorry to hear about your computer farm. :-(

----------------------------------------
[Edit 1 times, last edit by adriverhoef at Nov 6, 2022 11:27:33 AM]

[Nov 6, 2022 11:23:35 AM]

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 448
Status: Offline


Re: 2022-11-04 Update (ARP units & Device Manager issues)

@Cyclops

Http errors and dismal transfer rates seem to be the norm since ARP was released to join the other active research WUs.

Although I look forward to processing those WUs as well, I cringe at the trending performance of WCG’s web site whenever WUs for additional current (but on hold) research are released.

Current file transfer throughput hovers around 33KBps to 41KBps for an 18.28 MB file. Download speed to my gateway was just tested and is over 800 Mbps. Is this low transfer rate normal and expected at my end?
[ETA: download data rate unit value corrected to read 800 Mbps, i.e., 800,000 Kbps. UPLOADS of ARP data files were around 1,000KBps]
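To put those numbers side by side, here is a small sketch of the arithmetic. It assumes decimal units (1 MB = 1000 KB, 1 Mbps = 1000 Kbps) and 8 bits per byte; the function names are hypothetical.

```python
def transfer_seconds(size_mb, rate_kbytes_per_s):
    """Seconds needed to move `size_mb` megabytes at `rate_kbytes_per_s` KB/s."""
    return size_mb * 1000.0 / rate_kbytes_per_s


def mbps_to_kbytes_per_s(mbps):
    """Convert megabits per second to kilobytes per second."""
    return mbps * 1000.0 / 8.0


if __name__ == "__main__":
    file_mb = 18.28  # file size quoted in the post
    for rate in (33.0, 41.0, mbps_to_kbytes_per_s(800.0)):
        secs = transfer_seconds(file_mb, rate)
        print(f"{rate:>9.0f} KB/s -> {secs:8.2f} s ({secs / 60:.2f} min)")
```

At the observed 33 to 41 KBps the file takes roughly 7.4 to 9.2 minutes, whereas a link running at a full 800 Mbps (100,000 KBps) would move it in under a fifth of a second, so the bottleneck is clearly not the local connection.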

I REALLY HOPE that the appropriate troubleshooting and resolution steps will be taken to eliminate these errors!

Please advise.

----------------------------------------
[Edit 1 times, last edit by bfmorse at Nov 7, 2022 10:21:44 PM]

[Nov 6, 2022 5:36:02 PM]
