| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 88
|
|
| Author |
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
Likewise, my "validated and not purged" total has also continued to grow - currently above 42000.
You have more horsepower than I do; I only have 15000 in "valid not purged."
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
Are we headed towards some epic server outage with over a month (and approaching two months) of backlogged work units? Someone should kick the MCM1 assimilators.
|
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Regarding "kicking the MCM1 assimilators":
Quite a while ago, the techs reported that the assimilators would stop working (unclear whether they hung or exited) and if restarted they'd soon stop again. There was a brief period in November where they stayed up for over a day and that cleared out a fraction of what was waiting then.
More recently, someone posted that they seemed to recall that this happened to IBM/WCG at some point in the past. I went looking at Kevin Reed's posts for anything about work not being purged, and found one or two references in 2013 :-) -- apparently, the assimilator was deadlocking with one of the other processes because both were trying to access the same data (which he seemed to think shouldn't have been able to happen...) It wasn't clear [to me] what the solution to that specific problem was, as there were other issues around the same time, ending in a MySQL database upgrade and a few days offline to clear up the mess.
In the current situation, what I find a little disturbing is that we've not even heard a "We know what causes it but we don't yet know how to fix it!" statement. I'd hope they've reached that point by now, as there is quite a lot of diagnostic information that can be logged if required, although it is possible that the issue is something that doesn't trigger a logging event so it might be a case of zeroing in on a possible cause (which won't necessarily be easy for folks who've not had years working with the BOINC code [*1], especially if they have other duties too!)
If the issue is data-related, allowing more data to accumulate is unlikely to be helping. Unfortunately, as the only CPU project with work at present is MCM1 and that appears to be the only project having problems, any resolution is likely to irritate a lot of users!
I sometimes think some of the users here and on other projects think that all problems can be solved with a reboot [because it works for Windows?] :-)
Instinct tells me that the BOINC side of the system needs a few days with no user activity so that they don't have to try to run assimilators, validators, file deleters and the purge mechanism concurrently most of the time -- IBM used to shut off one or more of the subsystems at stress times (e.g. the statistics runs) anyway. That might well make doing diagnostics and trying solutions a lot easier! If kicking things was going to work, it would have worked by now :-)
Cheers - Al.
[*1] There's quite a steep learning curve :-) -- I did quite a lot of code-diving when Milkyway was having problems, as I reckoned anything I could find out might help (I hope it did); I've also spent a while trying to convince myself I understand the data flows involved in validating, assimilating and purging work units (I'm not 100% convinced!) -- I certainly wouldn't claim to be competent to run a BOINC service and I don't envy them the debugging task! I just wish I could help somehow... |
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
If it's a problem that happened similarly in the past, hopefully past techs followed best practices and documented in great detail in the ticketing system that Krembil WCG now owns and controls. My gut says that no such ticketing system or documentation exists.
I certainly wish WCG had a full-time sysadmin and a full-time DBA working on this, honestly.
[Edit 1 times, last edit by hchc at Dec 6, 2023 4:36:52 PM] |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
Maybe they could reach out to Kevin Reed, Uplinger, or Al Seippel (?), if they could find them and pick their brains (if they are willing) or maybe try to retain their services as consultants. It might cost a few bucks but it would probably be money well spent in not having to reinvent the wheel.
Cheers
Sgt. Joe
*Minnesota Crunchers*
[Edit 1 times, last edit by Sgt.Joe at Dec 6, 2023 7:22:02 PM] |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
In reply to hchc and Sgt. Joe,
"If it's a problem that happened similarly in the past, hopefully past techs followed best practices and documented in great detail in the ticketing system that Krembil WCG now owns and controls. My gut says that no such ticketing system or documentation exists."
Yes! -- I had assumed that if there was some sort of record of past failures and fixes it didn't actually have anything that helped or, more likely, that such wasn't available to the present WCG support.
"I certainly wish WCG had a full-time sysadmin and a full-time DBA working on this, honestly."
Agreed! -- if WCG is depending on non-WCG technical support (e.g. data centre staff) to resolve this, the engagement with the problem has too many layers (and may not be met with enthusiasm by some of those involved!) However, it seems that funding issues prohibit extra staff [at present?] :-(
"Maybe they could reach out to Kevin Reed, Uplinger, or Al Seippel (?), if they could find them and pick their brains (if they are willing) or maybe try to retain their services as consultants. It might cost a few bucks but it would probably be money well spent in not having to reinvent the wheel."
I wondered about that but we have no way of knowing what they've tried already (including this option!)
Thanks for the extra input on stuff I considered but didn't mention in my other post (which was probably a bit long anyway :-)
Cheers - Al. |
||
|
|
Blount
Veteran Cruncher Joined: Aug 19, 2005 Post Count: 590 Status: Offline Project Badges:
|
TigerLily,
This problem is still with us. I now have over 100,000 MCM tasks in valid state. They are not purging. This means it can take minutes for the My Contribution -> Results query to run. Many times it just hangs. This must be slowing down the backend as well.
Can you tell us if the problem has been accepted by the team?
[Edit 1 times, last edit by Blount at Dec 13, 2023 3:07:48 PM] |
||
|
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline Project Badges:
|
TigerLily,
I now have over 20,000 more VALID MCM WUs than I described in my Nov 10th post in this thread. I have also noticed the delay in loading the results information that Blount described above. Any information you can share will be appreciated. Thanks! |
||
|
|
TigerLily
Senior Cruncher Joined: May 26, 2023 Post Count: 280 Status: Offline Project Badges:
|
Hello Blount and bfmorse,
The tech team is aware of the issue and is working on resolving it. The issue is that some host IDs are reading 0 in the results table, causing MCM1 assimilators to quit at times. However, this has only affected about 10,000 results, and new results are accumulating that are unaffected by this issue. They confirmed that we still have the time and space on the server to allow these results to accumulate while they work on finding a fix for this problem. Thanks for your patience as they work on resolving this issue. |
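For readers trying to picture the failure mode described above, here is a rough sketch of how a batch job might set aside records with a host ID of 0 instead of quitting on them. This is not BOINC or WCG code and says nothing about how the MCM1 assimilators are actually written; the `Result` type, its fields, and `assimilate_batch` are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Result:
    # Hypothetical stand-in for a row in a results table.
    result_id: int
    host_id: int


def assimilate_batch(results: List[Result]) -> Tuple[List[int], List[int]]:
    """Return (assimilated_ids, quarantined_ids) for one batch.

    Rows whose host_id is 0 (or negative) are set aside for later
    inspection rather than aborting the whole run.
    """
    assimilated: List[int] = []
    quarantined: List[int] = []
    for r in results:
        if r.host_id <= 0:
            quarantined.append(r.result_id)
            continue
        assimilated.append(r.result_id)
    return assimilated, quarantined


batch = [Result(1, 101), Result(2, 0), Result(3, 202)]
done, held = assimilate_batch(batch)
print(done, held)
```

Whether skipping such rows is safe depends on what the real assimilator does with the host ID, which nobody outside the tech team can know from this thread.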
||
|
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline Project Badges:
|
TigerLily,
That sounds like a similar problem we had with an "invalid value" causing WUs to error out at the volunteer's computer - I seem to recall it was similar to a FIELD problem when defining the location and length of data being extracted from an ASCII file.
Found an email I sent - See: Re: TigerLilly’s comment in the forum: Thread: 2023-08-17 Update (Weekend work unit shortage and OPNG issue) “SCC1 ATOM Syntax incorrect "62 " is not a valid atom number errors- but NOT on all WU's” and also “ATOM syntax incorrect: "62 " is not a valid atom number”
Notice the "space" after "62" - apparently that was causing the fault. (I had a similar experience when I was doing programming a number of years ago, when I did not define the FIELD statement accurately.)
Looking forward to resuming full use of the CPUs & GPUs on my systems again. Please continue to keep us informed.
Thanks, bfmorse |
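A minimal sketch of the kind of fixed-width FIELD problem described above: a field sliced from an ASCII line keeps its padding, and a strict check then rejects "62 " because of the trailing space. The record layout, column positions, and function names here are assumptions for illustration only; the real SCC1 input format and parser are not shown anywhere in this thread.

```python
def slice_field(line: str, start: int, width: int) -> str:
    """Extract a fixed-width field by position, padding included."""
    return line[start:start + width]


def parse_atom_number_strict(field: str) -> int:
    """Accept only digits, so any padding in the field is an error."""
    if not field.isdigit():
        raise ValueError(f'"{field}" is not a valid atom number')
    return int(field)


def parse_atom_number_tolerant(field: str) -> int:
    """Trim padding first, then apply the same strict check."""
    return parse_atom_number_strict(field.strip())


# Hypothetical layout: record name in columns 0-3, atom number in columns 5-7.
line = "ATOM 62  C"
raw = slice_field(line, 5, 3)  # the field is "62 " with a trailing space

try:
    parse_atom_number_strict(raw)
except ValueError as err:
    print(err)

print(parse_atom_number_tolerant(raw))
```

Trimming is only one possible remedy; if the field width itself is defined wrongly, the slice can pick up part of a neighbouring column, and stripping whitespace would hide that rather than fix it.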
||
|
|
|