Thread Status: Active
Total posts in this thread: 13
This topic has been viewed 5596 times and has 12 replies.
Eric Pohlke
Cruncher
Canada
Joined: Feb 4, 2006
Post Count: 15
Status: Offline
Work Units keep getting Deferred

Starting in the early afternoon of Oct 1, completed Work Units stopped uploading and were deferred instead. New Work Units then stopped downloading right away and began to defer at 15-minute intervals. Today, the deferral intervals have grown to 3 hours 15 minutes. My EPYC AMD 7700 now only gets work for a couple of hours per day. GPU Work Units are pretty much non-existent, as the system takes less than 10 seconds to do one. My connection to the World Community Grid website crawls at times. Access anywhere else is not an issue on the 12-pipes x 1 Gbit connections it has to the internet.

Now you claim everything is complete and you’re back up and running. Yet the connections and traffic from your server appear to be limited.
[Oct 3, 2022 11:41:06 PM]
DennyInDurham
Cruncher
USA
Joined: Aug 4, 2020
Post Count: 23
Status: Offline
Re: Work Units keep getting Deferred

I have now aborted all work units that were due today (10/4) and had not yet started.

Apparently Krembil does not have the commitment, or staff and infrastructure, to run World Community Grid.
[Oct 4, 2022 5:05:46 AM]
BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 598
Status: Offline
Re: Work Units keep getting Deferred

Maybe you are storing too many WUs in your cache. How many?

In BOINC Manager, go to Options/Computing Preferences; near the bottom there is a section called Other.

There are 4 boxes of interest. I use 0.2, 0.2, 30, and 600 respectively.
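
For anyone who prefers to set these outside the GUI, here is a minimal sketch in Python that writes the four values into a `global_preferences` XML fragment of the kind used in `global_prefs_override.xml`. The mapping of the four boxes to the tag names below is an assumption about which boxes are meant, so check it against your own client before relying on it. The function name `build_override_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of the four "Other" boxes to BOINC preference tags.
# The values are the ones quoted in the post; the mapping is a guess.
PREFS = {
    "work_buf_min_days": 0.2,              # store at least N days of work
    "work_buf_additional_days": 0.2,       # store up to N additional days
    "cpu_scheduling_period_minutes": 30,   # switch between tasks every N minutes
    "disk_interval": 600,                  # checkpoint at most every N seconds
}


def build_override_xml(prefs):
    """Return a global_preferences XML string for the given settings."""
    root = ET.Element("global_preferences")
    for tag, value in prefs.items():
        ET.SubElement(root, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = build_override_xml(PREFS)
print(xml_text)
```

A small cache like this mainly limits how many tasks the client asks for at once; it would not by itself fix downloads that fail on the server side.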

I don't doubt Krembil's commitment. I have no knowledge of the other two so no way to judge. Maybe the eyes were bigger than the stomach and they bit off too big a chunk.
[Oct 4, 2022 2:42:51 PM]
DennyInDurham
Cruncher
USA
Joined: Aug 4, 2020
Post Count: 23
Status: Offline
Re: Work Units keep getting Deferred

No, it wasn't configured for "too many" WUs. To be specific, the problem is with OPN1 and OPNG (i.e., OpenPandemics - COVID-19) WUs in the US.

When it queues a batch of WUs, you see them get added to the Transfers list, and it tries to download each one, sometimes reaching 10-30%. It then gives up and goes on to the next. So a lot of time and bandwidth is consumed, but nothing is actually downloaded completely, and you end up with a long list of partially downloaded WUs that frequently fail due to "project backoff". You can repeatedly retry transfers and get one or two through each time, but it inevitably stops with "project backoff" and goes through the same sequence again: trying for a bit, stopping partway, and moving on to the next. If you do nothing, the WUs just sit there as "downloading" tasks, and occasionally get redispatched only to fail again.
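
To illustrate why the waits between attempts keep growing, here is a rough sketch of a generic randomized exponential backoff in Python. This is not BOINC's actual formula; the function name `backoff_delay` and the constants (60 seconds minimum, 4 hours maximum) are assumptions chosen only to show the shape of the behaviour.

```python
import random


def backoff_delay(failures, min_s=60.0, max_s=4 * 3600.0, rng=random.random):
    """Delay in seconds before the next retry after `failures` consecutive failures.

    The ceiling doubles on every failure and is capped at max_s. The actual
    delay is a random point in the upper half of that ceiling, so that many
    clients do not all retry at the same moment.
    """
    ceiling = min(max_s, min_s * (2 ** failures))
    return ceiling * (0.5 + 0.5 * rng())


for n in range(1, 9):
    print(f"failure {n}: wait about {backoff_delay(n) / 60:.1f} min")
```

With a scheme like this, a handful of consecutive failures is enough to push the wait from minutes to hours, which matches the pattern of ever longer deferrals described earlier in the thread.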

It acts like there's a giant bottleneck, either in fetching the data to download or in actually sending it. So it could be the DB, or it could be anywhere in the Internet path (I have 1G fiber, so it is not likely me). Given the responsiveness (or lack thereof) of the website, I would guess the problem is at Krembil.
----------------------------------------
[Edit 1 times, last edit by DennyInDurham at Oct 6, 2022 12:07:01 AM]
[Oct 5, 2022 11:58:43 PM]
settantta@gmail.com
Cruncher
Joined: Nov 28, 2005
Post Count: 2
Status: Offline
Re: Work Units keep getting Deferred

Apologies in advance for a rather long reply...

As you say, it is not a question of "too many WUs"—quite the opposite in fact.

It is actually a problem at the server end, as shown by the following entries from the log:

Thu 06 Oct 2022 11:19:38 AEST | World Community Grid | update requested by user
Thu 06 Oct 2022 11:19:40 AEST | World Community Grid | Sending scheduler request: Requested by user.
Thu 06 Oct 2022 11:19:40 AEST | World Community Grid | Requesting new tasks for CPU
Thu 06 Oct 2022 11:19:46 AEST | World Community Grid | Scheduler request completed: got 6 new tasks
Thu 06 Oct 2022 11:19:48 AEST | World Community Grid | Started download of MCM1_0191589_2032_MCM1_0191589_2032.txt
Thu 06 Oct 2022 11:19:48 AEST | World Community Grid | Started download of MCM1_0191596_2948_MCM1_0191596_2948.txt
Thu 06 Oct 2022 11:19:59 AEST | World Community Grid | Temporarily failed download of MCM1_0191596_2948_MCM1_0191596_2948.txt: transient HTTP error
Thu 06 Oct 2022 11:19:59 AEST | World Community Grid | Backing off 00:03:29 on download of MCM1_0191596_2948_MCM1_0191596_2948.txt
Thu 06 Oct 2022 11:19:59 AEST | World Community Grid | Started download of MCM1_0191589_2138_MCM1_0191589_2138.txt
Thu 06 Oct 2022 11:20:01 AEST | World Community Grid | Temporarily failed download of MCM1_0191589_2032_MCM1_0191589_2032.txt: transient HTTP error
Thu 06 Oct 2022 11:20:01 AEST | World Community Grid | Backing off 00:02:04 on download of MCM1_0191589_2032_MCM1_0191589_2032.txt
Thu 06 Oct 2022 11:20:01 AEST | World Community Grid | Started download of MCM1_0191591_2394_MCM1_0191591_2394.txt
Thu 06 Oct 2022 11:20:02 AEST | World Community Grid | Temporarily failed download of MCM1_0191589_2138_MCM1_0191589_2138.txt: transient HTTP error
Thu 06 Oct 2022 11:20:02 AEST | World Community Grid | Backing off 00:02:35 on download of MCM1_0191589_2138_MCM1_0191589_2138.txt
Thu 06 Oct 2022 11:20:02 AEST | World Community Grid | Started download of MCM1_0191596_2942_MCM1_0191596_2942.txt
Thu 06 Oct 2022 11:20:04 AEST | World Community Grid | Temporarily failed download of MCM1_0191591_2394_MCM1_0191591_2394.txt: transient HTTP error
Thu 06 Oct 2022 11:20:04 AEST | World Community Grid | Backing off 00:03:52 on download of MCM1_0191591_2394_MCM1_0191591_2394.txt
Thu 06 Oct 2022 11:20:04 AEST | World Community Grid | Started download of MCM1_0191589_2155_MCM1_0191589_2155.txt
Thu 06 Oct 2022 11:20:06 AEST | World Community Grid | Temporarily failed download of MCM1_0191596_2942_MCM1_0191596_2942.txt: transient HTTP error
Thu 06 Oct 2022 11:20:06 AEST | World Community Grid | Backing off 00:03:26 on download of MCM1_0191596_2942_MCM1_0191596_2942.txt
Thu 06 Oct 2022 11:20:08 AEST | World Community Grid | Temporarily failed download of MCM1_0191589_2155_MCM1_0191589_2155.txt: transient HTTP error
Thu 06 Oct 2022 11:20:08 AEST | World Community Grid | Backing off 00:02:27 on download of MCM1_0191589_2155_MCM1_0191589_2155.txt


Notice the last part of those notifications, which reads "transient HTTP error"? That is what is causing the problem. (I believe Windows uses a different error code, something like "HTTP error 0".)
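
For anyone who wants to tally these entries rather than read them by eye, here is a small Python sketch that counts the failure reasons and collects the backoff time per file. It assumes the log lines look exactly like the ones pasted above; the function name `summarize` and the two regular expressions are my own and not part of BOINC.

```python
import re
from collections import Counter

# Matches "Backing off HH:MM:SS on download of <file>" in a client log line.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")
# Matches "Temporarily failed download of <file>: <reason>".
FAILED_RE = re.compile(r"Temporarily failed download of (\S+): (.+)$")


def summarize(log_text):
    """Count failure reasons and record the backoff (in seconds) per file."""
    reasons = Counter()
    backoffs = {}
    for line in log_text.splitlines():
        failed = FAILED_RE.search(line)
        if failed:
            reasons[failed.group(2).strip()] += 1
        backoff = BACKOFF_RE.search(line)
        if backoff:
            h, m, s = (int(x) for x in backoff.group(1, 2, 3))
            backoffs[backoff.group(4)] = h * 3600 + m * 60 + s
    return reasons, backoffs


sample = """\
Thu 06 Oct 2022 11:19:59 AEST | World Community Grid | Temporarily failed download of MCM1_0191596_2948_MCM1_0191596_2948.txt: transient HTTP error
Thu 06 Oct 2022 11:19:59 AEST | World Community Grid | Backing off 00:03:29 on download of MCM1_0191596_2948_MCM1_0191596_2948.txt
"""
reasons, backoffs = summarize(sample)
print(reasons)
print(backoffs)
```

Running it over a full day of log output would show whether every failure carries the same reason and how the backoff times develop.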

Yes, I've looked at the post on the FAQ forum which supposedly gives a workaround. IT DOESN'T! Not a single suggestion is even relevant. FTR, I'm running Linux on a Core i7, direct connection to the internet via a high-speed fibre connection (FTTC from nbnco for the benefit of Australians).

All of which tells me the problem is at the server end, and probably due to either lack of bandwidth, throttling from their end, or a problem with one or more of any proxy connections used.

You'll note I have been doing this since around 2005, mainly on Linux, and have NEVER had any similar problems prior to the shift over to the new infrastructure. I am very close to saying "Stuff this, I'm not doing it any more." I doubt I'm the only one. But it seems to be impossible to report the problems...
----------------------------------------
Alex
[Oct 6, 2022 1:48:06 AM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1928
Status: Offline
Re: Work Units keep getting Deferred

Quote (from the previous post):
Notice that last part of the notifications that reads "transient HTTP error"? That is what is causing the problem (I believe Windows uses a different error code, something like "HTTP error 0")

No, we went through this weeks ago. The "transient HTTP error" shown in the BOINC logs is more or less a generic error message; in most cases, these errors are in fact all "transient" (they come, and they go away by themselves, just passing through).

Here, in most cases, the actual error from the server is an HTTP 503 error, which is returned in those 107 bytes that you will see in the "Size" column of the Transfers tab in the BOINC Manager.
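
As a rough illustration of why a 503 ends up reported as "transient", here is a short Python sketch that maps a status code to its standard reason phrase and to a retry decision. The `RETRYABLE` set and the function name `classify` are illustrative choices, not taken from the BOINC source.

```python
from http import HTTPStatus

# Status codes commonly treated as temporary and worth retrying later.
# This set is an illustrative choice, not taken from the BOINC source.
RETRYABLE = {
    HTTPStatus.REQUEST_TIMEOUT,        # 408
    HTTPStatus.TOO_MANY_REQUESTS,      # 429
    HTTPStatus.BAD_GATEWAY,            # 502
    HTTPStatus.SERVICE_UNAVAILABLE,    # 503
    HTTPStatus.GATEWAY_TIMEOUT,        # 504
}


def classify(code):
    """Return (reason phrase, retry later?) for an HTTP status code."""
    status = HTTPStatus(code)
    return status.phrase, status in RETRYABLE


for code in (200, 404, 503):
    phrase, retry = classify(code)
    print(code, phrase, "retry later" if retry else "do not retry")
```

A 503 says the server could not handle the request at that moment, so retrying later is the reasonable reaction on the client side; it does not say which backend service was the one that failed to answer.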

This basically means that a service the HTTP server relies on is not answering (fast enough). A couple of weeks ago, we got the one and only real technical response from a WCG tech (Christian, "cubes"), who mentioned that they had tracked this down to an issue with their load balancer, running HAProxy. My suspicion has always been that it is a "number of concurrent connections" problem with the database server(s).

Shortly after Christian's post, all the download errors disappeared and the system seemed to run smoothly for the first time since the restart in June. That lasted a couple of days, until sometime on Friday afternoon (PDT). First the download issues reared their ugly little heads again, then sometime early Saturday morning (PDT) the website and forum took a nosedive too. That part got fixed late Tuesday, while the download problems still persist...

Unfortunately, communication from Krembil about what the issue is has been kind of abysmal, so we are back to the old waiting and guessing game...

Ralf
----------------------------------------

[Oct 6, 2022 6:13:16 AM]
cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Re: Work Units keep getting Deferred

I have a page full of "download pending" OPN WUs (including 1 ARP WU) and the log shows "transient HTTP" errors. It was working fine 2 days ago. Do I abort them, or is this a fat bit stuck on Krembil's side?
----------------------------------------
I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team


[Oct 6, 2022 9:27:07 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1928
Status: Offline
Re: Work Units keep getting Deferred

Quote (from the previous post):
I have a page full of "download pending" OPN WUs (including 1 ARP WU) and the log shows "transient HTTP" errors. Was working fine 2 days ago. Do I abort them or is this a fat bit stuck on Krembil's side?

I doubt that this was working 2 days ago. The download issue has been going on since sometime late Friday 9/30, which makes it 6 days by now...
But there is no need to abort them. You should be able to retry them a few times (once the BOINC client has gone through the complete list of files to download, rather than just blindly hitting [Retry now])...

Ralf
----------------------------------------

[Oct 6, 2022 10:08:10 PM]
davethebuilder
Cruncher
Australia
Joined: Dec 6, 2013
Post Count: 15
Status: Offline
Re: Work Units keep getting Deferred

A couple of points...

Toronto is not Los Angeles and is not a major international internet hub, so the resources available to Krembil are, by comparison, limited.

Krembil appears to be unprepared for the size and scale of this project, especially the required bandwidth and the IT infrastructure to support it. As a result, bottlenecks are being created by simultaneous internet traffic demands, except from 29 Sept. to 2 Oct., when Work Unit transfers appeared to normalise.

It may still take some time for this matter to be resolved, especially if the hardware needed to cope with the requirements of WCG is not physically in place.

Hopefully, these matters will be resolved in time and everyone can get back to donating their resources for this worthwhile global initiative.
[Oct 7, 2022 2:41:58 AM]
nasher
Veteran Cruncher
USA
Joined: Dec 2, 2005
Post Count: 1422
Status: Offline
Re: Work Units keep getting Deferred

I also keep running into this issue. I have reactivated some other BOINC projects that I don't care for as much, to run while I try to get work units, and set their resource share so that they fall into the background when I do get work units. It is not optimal, but it works for me. Still, it is very frustrating.
----------------------------------------

[Oct 7, 2022 5:10:00 PM]