World Community Grid Forums
Category: Official Messages | Forum: News | Thread: Workunits are being sent out
Thread Status: Active | Total posts in this thread: 450
MarkH
Advanced Cruncher | United States of America | Joined: May 16, 2020 | Post Count: 50 | Status: Offline
I regret to report that MCM work units are crashing on start after waiting several hours for downloads to become available. I also had to abort an ARP job that tried to complete its downloads all night. I was able to run OPN1, MCM, and ARP jobs a few days ago without all these delays, and then it went really bad again. I just aborted more MCM downloads after 8 hours of waiting in an endless fallback.

I know Krembil is working on this entire mess, and I will try downloading again in a few days. I want to continue being in the fight against these diseases and conditions, and I want the Krembil people to know that I and others know you're working hard to get everything right. You've been working for months trying your best to get things to 100% capacity and 99.99% reliability. All of us who care get it, and wish we could give you hugs, handshakes, and pats on the back, buy you a beer/wine/coffee/tea, and thank you personally for all your work. But at some point Krembil's leadership needs to realize they need a lot more help to fix the issues throughout the WCG systems than in-house staff could possibly handle alone. They've got worse luck than I do, and that's not a good thing.
"That science of the people, by the people, for the people, shall not perish from the Earth."
erich56
Senior Cruncher | Austria | Joined: Feb 24, 2007 | Post Count: 294 | Status: Offline
> Hang in there. While there are occasional networking problems and droughts where WUs aren't available, overall, things are - at least on the work side - more or less working. I have around five devices and (with few exceptions) have received a steady flow of WUs since about 10 days ago - no manual intervention required.

Hm, that's interesting to read. My box with which I crunch MCM does not need manual intervention. The three other computers with which I crunch OPNG need massive manual intervention.
spRocket
Senior Cruncher | Joined: Mar 25, 2020 | Post Count: 251 | Status: Offline
> My box with which I crunch MCM does not need manual intervention. The three other computers with which I crunch OPNG need massive manual intervention.

My main box (Ryzen 7 1700 + GTX 960, running all projects) also needs a lot of babysitting. Raspberry Pis (OPN1 only) and dual/quad-core x86 systems without a usable GPU (all projects except OPNG) need a lot less, but still need an occasional kick in the pants.

Packing up the OPNG units so they don't require zillions of file requests would go a long way towards giving us breathing room, since they have lots of relatively small files inside. ARP is tougher, though, since those work units are already quite porky; if bundled up, they'd still have a huge bandwidth demand and likely a worse disk footprint while unpacking. Of course, that also means beta-testing any such changes, so there's no short-term fix other than eliminating the bottleneck(s) in the infrastructure, wherever they may be.

All I can do at this point is report. I get the feeling that we are bumping up against both bandwidth and request-rate constraints.
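To illustrate the bundling idea in the post above: serving many tiny input files as one compressed archive replaces thousands of per-file HTTP requests with a single download. This is only a sketch of the general technique, not WCG's actual packaging; the file names and helper functions here are hypothetical.

```python
import io
import tarfile

def bundle_inputs(file_blobs, bundle_name="opng_inputs.tar.gz"):
    """Pack many small input files (filename -> bytes) into one
    compressed archive, so a client fetches them in one request
    instead of one request per file."""
    with tarfile.open(bundle_name, "w:gz") as tar:
        for name, data in file_blobs.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return bundle_name

def unpack_bundle(bundle_name, dest_dir="."):
    """Client side: one download, then purely local extraction."""
    with tarfile.open(bundle_name, "r:gz") as tar:
        tar.extractall(path=dest_dir)
```

As the post notes, the trade-off is client-side: the unpacked files still occupy the same disk space, and the archive itself briefly doubles the footprint while extracting, which is why this helps small-file projects like OPNG more than already-large ARP work units.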
erich56
Senior Cruncher | Austria | Joined: Feb 24, 2007 | Post Count: 294 | Status: Offline
It seems to me that they are re-sending "old" OPNG tasks from last year.

Why do I think so? At the very beginning of OPNG, the tasks yielded credit between 200 and 400 points; later on, this was changed to a figure between ~800 and ~1000. Since last night, the tasks again only show some 200-400 points. So my suspicion is that, just to get tasks out at all, they are using old ones that were already sent out last year. I question how much sense that makes. Just to keep us fed, do we get old stuff that was processed a long time ago?
wildhagen
Veteran Cruncher | The Netherlands | Joined: Jun 5, 2009 | Post Count: 728 | Status: Offline
Well, given the untrustworthiness and unreliability of Krembil (given their lies, broken promises and lack of communication), it wouldn't surprise me too much if all the 'work' (not only OPNG) we are doing is pure fake work, in the name of testing.

[Edit 1 times, last edit by wildhagen at Sep 9, 2022 4:43:54 AM]
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 859 | Status: Offline
@erich56

When OPNG came back on stream, the work units had a new target (receptor), whilst OPN1 was still working on the same targets as before the hiatus. The sudden change suggests that there was a fairly urgent need to see the results for this target. This new target seems to have required more iterations per docking, with corresponding increases in credit awarded per task.

Now it appears that the "rush job" for the new target has finished (or, at least, paused), so OPNG is back processing the same target it was working on before the hiatus! The task names will have lower batch numbers because they are part of a substantial set of batches of available data that was already in the pipeline, though they are higher than pre-hiatus. The highest OPNG batch numbers I recorded pre-hiatus were around the 0149280 region, and these new batches for the same receptor seem to be above 0150200; there are huge numbers of possible ligands to try to dock, so it is not too surprising if there is a backlog to clear now that the "rush job" seems to be over.

And there's no benefit in sending out unwanted work, so unless WCG informs us otherwise I'd tend to assume it's "live" data. It's just tough luck that the dockings seem to be a lot easier to achieve, so lower credit :-) And in the meantime, OPN1 is chugging along looking at the same target that OPNG has now gone back to.

Cheers - Al.

[Edited to point out that I had not seen wildhagen's post when I posted this. It doesn't alter what I've said, as I tend to believe folks until they are proven really untrustworthy.]

[Edit 1 times, last edit by alanb1951 at Sep 9, 2022 5:25:32 AM]
wildhagen
Veteran Cruncher | The Netherlands | Joined: Jun 5, 2009 | Post Count: 728 | Status: Offline
> @erich56 [Edited to point out that I had not seen wildhagen's post when I posted this. It doesn't alter what I've said, as I tend to believe folks until they are proven really untrustworthy.]

Normally I do the same, but they lied and broke promises a few times too many for my liking, let alone the lack of communication. In my book, and with the frequency they are doing this, that makes a party unreliable. You can't trust anything they say, because they won't deliver on what they say. Repeatedly. A classic case of 'the boy who cried wolf' syndrome.
erich56
Senior Cruncher | Austria | Joined: Feb 24, 2007 | Post Count: 294 | Status: Offline
After about one day's outage, WCG is back; however, the network problems are even worse: whereas the problems so far were only with downloads, uploads are now strongly affected as well.

So all in all, it is getting worse instead of better. I wonder how that can be, after all these months they took for the transition from IBM to Krembil.
wildhagen
Veteran Cruncher | The Netherlands | Joined: Jun 5, 2009 | Post Count: 728 | Status: Offline
It's nearly impossible to download anything at the moment. Even with several retries, it continues to fail.

The situation seems to have worsened a lot over the last few days, and that is not even counting the expired certificate.
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 859 | Status: Offline
> After about 1 day's outage, WCG is back; however, the network problems are even worse; whereas the problems so far were only with downloads, now uploads are also affected strongly.

A moment's consideration might explain why difficulties with uploading would probably appear when the system came back from a lengthy period off-line with an active user base out there. I suspect a lot of us expected that to happen - been there before, had the experience elsewhere...

> So all in all: it gets worse instead of better. I am questioning how come, after all these months they took for the transition from IBM to Krembil.

Think of all those client systems waiting to upload a whole day's worth of work, putting in requests at a much higher rate than would normally be expected... And the same magnification effect applies to downloads, as not only were there lots of systems out there that got cut off whilst downloading, but there would now be lots more "empty" systems wanting a top-up once they had finished uploading... It looks bad now, but start saying it has got worse if it's still as bad in 24 hours' time :-)

The upload/download speeds during the first couple of hours after the system came back gave a good indication of the sheer volume of traffic there must have been -- I normally see upload and download speeds 10 times better than I was getting at 08:00 UTC today! And now that I've got my normal half a day's worth of work, things seem to have settled down as far as I'm concerned (especially as I've turned off OPN1/OPNG for now!...)

It probably isn't realistic to expect any BOINC system to add capacity just to handle the aftermath of major unplanned outages (though some folks may think otherwise), so when a BOINC site has an "accident" the end users will see upload/download problems; I've experienced this at several other sites, and I seem to recall one or two occasions when the old WCG had issues of this type (sometimes without actual down-time!)...

Cheers - Al.

P.S. It's probably a blessing that the BOINC client won't request work if the upload queue is too big, else there would have been even more initial traffic!
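The "magnification effect" described above is the classic thundering-herd problem: thousands of clients retrying in lock-step after an outage. The standard mitigation is exponential backoff with random jitter, which BOINC clients apply in their own way to transfer retries; the sketch below is a generic, hypothetical illustration of the technique, not BOINC's actual implementation (the function names are made up).

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts=8, base_delay=1.0, cap=300.0):
    """Retry `fetch` (a zero-argument callable) with exponential backoff
    plus full jitter. Randomizing each client's retry time prevents a
    recovering server from being hammered by synchronized retries."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to `cap`;
            # the actual sleep is a random point in [0, delay] so that
            # clients de-synchronize instead of retrying in waves.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

With full jitter, the aggregate retry load on the server decays smoothly after an outage instead of arriving in synchronized spikes, which is exactly the failure mode the post describes at 08:00 UTC.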