Total posts in this thread: 38
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Recently Active
Re: All tasks failing on linux host

After further review, 108 returned: 104 as valid or pending verification/validation and 4 as "too late", out of 368 total WUs. Some from the weekend may have already disappeared from the results list but I have no evidence of that. Another 15 will upload in the next 3 to 4 hours.

EDIT: BoincTasks now shows 57 ARP1 WUs pending upload. Can't seem to get anything uploaded or downloaded.

Out of interest, how many concurrent connections do you have configured for upload and download? If you've increased it beyond the default (which, I believe, is 2) it won't be helping any, especially if the client tries to initiate all the contacts at once! (And there may also be issues if it fails too many successive connections as the "recovery" could also kill off a viable connection that is doing something [unless they've fixed that in a recent client...].)

I'm seriously considering cutting mine from 2 to 1 to see if it helps :-) I can't be sitting around mashing the retry button, and running a script to do the same can cause problems of its own (I've seen it result in a big download in progress [the older (100+MB) MCM1 base data file] being killed off...)

Cheers - Al.

P.S. I also noted the Boca Raton Community HS surprise at the 150 results returned, and I too am surprised (if those were all ARP1)! How many systems...? -- I normally process about 600 MCM1 WUs a day using 4 systems, and would expect to process 500 or so along with [say] 20 or so ARP1 tasks a day when ARP1 is available, yet I've only managed a dozen ARP1 over three days and my MCM1 counts are down in the 300s now! All because of download issues caused by the saturation of connections to the download servers...

[Edited the P.S. and restored the end of my comment about losing a download to a click-script, which seemed to have got lost somewhere :-)]
----------------------------------------
[Edit 3 times, last edit by alanb1951 at Nov 6, 2024 10:12:00 PM]
[Nov 6, 2024 9:43:01 PM]
gj82854
Advanced Cruncher
Joined: Sep 26, 2022
Post Count: 122
Status: Offline
Re: All tasks failing on linux host

I have 10 xfers per project and a max of 16 per host. This configuration has been set this way for at least 10 years and works well most of the time. Due to the issues related to the ARP1 project it never uses all those connections at one time. The most I've been able to get today has been 2 per host for downloads. About noon UTC the uploads were VERY slow but seemed to clear about an hour or so later. This afternoon, the uploads have been using 6 simultaneous connections and working well. The queue of uploads was cleared before 1500 UTC. I don't sit and "babysit" the workload. I do check in once in a while and try to help the WUs with short deadlines get started but the others are on their own. It seems this same "network issue" happened when IBM was controlling the project and I thought Keith/Kevin had isolated it to the load balancers. However, I don't think the environment is the same anymore.
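
For anyone who wants to experiment with these limits, here is a minimal sketch of generating the relevant cc_config.xml fragment. It assumes the standard BOINC client options `<max_file_xfers>` (per host) and `<max_file_xfers_per_project>`; the function name `build_cc_config` is made up for illustration, and the values 16 and 10 are simply the ones quoted above, not a recommendation.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_xfers: int, max_xfers_per_project: int) -> str:
    """Return cc_config.xml text that sets the two file-transfer limits.

    build_cc_config is a hypothetical helper, not part of BOINC.
    """
    if max_xfers < 1 or max_xfers_per_project < 1:
        raise ValueError("transfer limits must be at least 1")
    text = (
        "<cc_config>\n"
        "  <options>\n"
        f"    <max_file_xfers>{max_xfers}</max_file_xfers>\n"
        f"    <max_file_xfers_per_project>{max_xfers_per_project}"
        "</max_file_xfers_per_project>\n"
        "  </options>\n"
        "</cc_config>\n"
    )
    ET.fromstring(text)  # sanity check: raises ParseError if not well-formed
    return text


# Values from the post above: 16 transfers per host, 10 per project.
print(build_cc_config(16, 10))
```

The output would go into cc_config.xml in the BOINC data directory, and the client should pick it up after "Read config files" in the Manager or `boinccmd --read_cc_config`. If an existing cc_config.xml already has an `<options>` section, merge the two elements into it rather than overwriting the file.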

I have ARP1 running on 7 systems currently but I can't get enough work fast enough to keep the threads busy (no queue) so there is also a mix of Rosetta, NFS, and Gaia running. As of this writing there are about 130 ARP1 workunits running concurrently across the seven systems with about 100 waiting for download. The faster systems are doing them in about 10 hours give or take depending on load.

The problem I run into most of the time is the hardcoded 1000 limit per client. I run a one day queue to be sure I have enough work to get past most project issues. On almost all my systems I hit the 1000 limit before I hit the one day queue. For most projects, the 1000 limit gets me about a half day. The slower systems will get about 18 to 20 hours of work.
[Nov 6, 2024 10:59:05 PM]
Boca Raton Community HS
Senior Cruncher
Joined: Aug 27, 2021
Post Count: 209
Status: Offline
Re: All tasks failing on linux host

alanb1951:
P.S. I also noted the Boca Raton Community HS surprise at the 150 results returned, and I too am surprised (if those were all ARP1)! How many systems...? -- I normally process about 600 MCM1 WUs a day using 4 systems, and would expect to process 500 or so along with [say] 20 or so ARP1 tasks a day when ARP1 is available, yet I've only managed a dozen ARP1 over three days and my MCM1 counts are down in the 300s now! All because of download issues caused by the saturation of connections to the download servers...

You are just as surprised as I was about this. I am actually looking for a solution or suggestions. The first round of ARP1 work units that came out on Nov 1st (I think...), which was a trial run of work, ran flawlessly. Then, when the log jam started, we ran into issues. Out of all of the work you see that we "returned", only a few of them were valid; almost all of them (about 95%) were "WU download error: couldn't get input files:". Then, I think our systems asked for more, which compounded the issue. I have the systems running "default", maximum output. I thought that this was still 1 ARP1 task running at a time.

Our systems do ask for a lot of work though. Although we don't have a massive number of systems (9 running), many of them are high core count Threadrippers and Xeons. We typically return ~5,000 MCM1 WUs per day, without issue. Now, we are returning a fraction of that because the download queue on the systems is moving so slowly. We only run 2 concurrent connections as well, with 1 day of stored work.

I know that some will say that "we are partly to blame for the log jam". With 9 systems (and no autoclickers, scripts, etc) I can't say that we are not contributing to the demand, but all of us are to some degree.

We would be happy to take a step back from ARP1 until this clears up if you all think it would help; we are here to help the body of research as a whole.

I would love to hear valid recommendations!
[Nov 7, 2024 1:04:56 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Recently Active
Re: All tasks failing on linux host

gj82854 - thanks for that insight into how your systems are set up. I've not got that much "power" available and would never be troubled by the 1000 limit, as I set my systems up to return results as quickly as possible (so I don't have a WCG reserve if there's a longer outage!)

I have four systems regularly looking for work (and 2 connections/system/server), but at the moment I am [still] lucky to get more than one or two open connections altogether, so by the time I do clear a backlog I'm nearly out of work again and I end up with another excess of files to download :-( -- "rinse and repeat"...

If you haven't already seen it, check out this post by savas in the News forum; it touches on load balancers and other things :-)

Cheers - Al.

P.S. I suspect you may also have better bandwidth than I have; I have access to a total of about 300Mb/second for download and around 100Mb/s upload (so sending back ARP1 and OPNG isn't always fun even when things are working as they should!) -- ah, well...
[Nov 7, 2024 1:05:11 AM]
Boca Raton Community HS
Senior Cruncher
Joined: Aug 27, 2021
Post Count: 209
Status: Offline
Re: All tasks failing on linux host

Replied to your question above your last post, in case you missed it (you posted within seconds after I posted!)
[Nov 7, 2024 1:08:34 AM]
gj82854
Advanced Cruncher
Joined: Sep 26, 2022
Post Count: 122
Status: Offline
Re: All tasks failing on linux host

alanb1951:
P.S. I suspect you may also have better bandwidth than I have; I have access to a total of about 300Mb/second for download and around 100Mb/s upload (so sending back ARP1 and OPNG isn't always fun even when things are working as they should!) -- ah, well...

I'm on a 1 Gb/s fiber connection. The connection is symmetrical, so it is 1 Gb/s up and down.
EDIT: I apologize for hijacking this thread. When I posted to this thread I just wanted to convey that I haven't seen any of the errors described on any of my Linux hosts.
----------------------------------------
[Edit 1 times, last edit by gj82854 at Nov 7, 2024 1:30:37 AM]
[Nov 7, 2024 1:26:49 AM]
[AF>Le_Pommier] Jerome_C2005
Cruncher
Joined: Aug 17, 2006
Post Count: 29
Status: Offline
Re: All tasks failing on linux host

I have very few tasks (nothing compared to your numbers). After 4 failed 2 days ago, one completed OK overnight after 7 hours of calculation, and now it is marked "too late" on the site and granted 0 credit. I have only 2 or 3 others in progress, but if I understand what you explain above, they will probably be treated as "too late" as well?

And downloading the very few other tasks that I have is just crazy and literally takes days to succeed, unless you constantly retry and retry and retry and retry and retry and retry and...

The only positive point for me is that the fraud cannot be massive.

It only confirms what I already knew: Krembil is just a bunch of amateurs regarding maintenance of a BOINC project. Maybe they are good in some fields, but they should definitely not have accepted the challenge of taking over such a great project that *was* WCG. All projects stopped except MCM, trying to revive ARP and "wow this is impressive".

A dignified death would have been preferable to such a miserable life.
[Nov 7, 2024 9:23:18 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Offline
Re: All tasks failing on linux host

alanb1951:
I have access to a total of about 300Mb/second for download and around 100Mb/s upload (so sending back ARP1 and OPNG isn't always fun even when things are working as they should!)
That's really a lot more than I have. My maximum internet speed is 4.4 MB (megabyte) per second downloading; when I do a speedtest: 36.5 Mbps down and 8.9 Mbps up. However, that should be enough to download 130 MB (the size of one ARP1-task) in half a minute.[*1]

Adri
(*1) Almost 2 years ago, it took "about 6 minutes, including retries, to download one ARP1-task".
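
As a rough check of the arithmetic above, here is a small sketch that converts the quoted link speeds into ideal transfer times for one 130 MB ARP1 download. It ignores protocol overhead, retries and server-side throttling, so these are best-case numbers; the function names are made up for illustration.

```python
def mbps_to_mb_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def seconds_to_download(size_mb: float, rate_mb_per_s: float) -> float:
    """Ideal transfer time in seconds at a constant rate."""
    return size_mb / rate_mb_per_s


ARP1_TASK_MB = 130  # size of one ARP1 task, as quoted above

# 4.4 MB/s observed maximum
print(round(seconds_to_download(ARP1_TASK_MB, 4.4), 1))  # 29.5

# 36.5 Mbps speedtest result
print(round(seconds_to_download(ARP1_TASK_MB, mbps_to_mb_per_s(36.5)), 1))  # 28.5

# 300 Mb/s link, as quoted from alanb1951
print(round(seconds_to_download(ARP1_TASK_MB, mbps_to_mb_per_s(300)), 1))  # 3.5

# Effective rate if the same download takes about 6 minutes
print(round(ARP1_TASK_MB / (6 * 60), 2))  # 0.36 (MB/s)
```

So at either of the quoted local speeds a single task should arrive in about half a minute, and a 6-minute download corresponds to well under a tenth of the available line rate.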
[Nov 8, 2024 4:15:41 PM]
[AF>Le_Pommier] Jerome_C2005
Cruncher
Joined: Aug 17, 2006
Post Count: 29
Status: Offline
Re: All tasks failing on linux host

I was able to finish approximately another 15 tasks, and I discovered that:

- there appear to be no checkpoints: I had to reboot with all tasks at 20 hours of calculation and most of them restarted from 0 (but a few restarted with a few hours of calculation, very strange)

- once credited, you are punished for having a slow machine: most of the time I was quite a bit slower than the wingman, and on almost all these tasks my granted credit was reduced compared to claimed credit, while the wingman got more credit than claimed... what a reward and what an incentive. In a very few cases it was the opposite and I got more credit for being slower, so it's like a lottery??

- at the end of the day, after 36 hours of calculation for each task (on average), I was given a maximum of 1000 credits (and sometimes much less) per task. One could call this "a joke" but it's not funny

This combined with the previous issues (status too late, download terror) = I'm done for now with ARP.
----------------------------------------
[Edit 2 times, last edit by [AF>Le_Pommier] Jerome_C2005 at Nov 10, 2024 6:08:55 PM]
[Nov 10, 2024 6:06:05 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline
Re: All tasks failing on linux host

This combined with the previous issues (status too late, download terror) = I'm done for now with ARP.

ARP has steep requirements for a reason, i.e., don't try to run it on a machine with inadequate resources. I think you are right in giving up on ARP. 36 hours for an ARP unit indicates that although your machine can process ARP units, it is probably marginal at best. Hopefully, you can stay fully stocked with MCM units even though their downloads are also adversely affected.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 10, 2024 7:32:03 PM]