Thread Status: Active
Total posts in this thread: 88
This topic has been viewed 15890 times and has 87 replies
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Re: Validated but not purged

Likewise, my "validated and not purged" total has also continued to grow - currently above 42000.


You have more horsepower than I do; I only have 15000 in "valid not purged." [laughing]

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 28, 2023 2:14:02 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Re: Validated but not purged

Are we headed towards some epic server outage with over a month (and approaching two months) of backlogged work units? Someone should kick the MCM1 assimilators.
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Dec 6, 2023 2:53:37 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: Validated but not purged

Regarding "kicking the MCM1 assimilators":

Quite a while ago, the techs reported that the assimilators would stop working (unclear whether they hung or exited) and if restarted they'd soon stop again. There was a brief period in November where they stayed up for over a day and that cleared out a fraction of what was waiting then.

More recently, someone posted that they seemed to recall that this happened to IBM/WCG at some point in the past. I went looking at Kevin Reed's posts for anything about work not being purged, and found one or two references in 2013 :-) -- apparently, the assimilator was deadlocking with one of the other processes because both were trying to access the same data (which he seemed to think shouldn't have been able to happen...) It wasn't clear [to me] what the solution to that specific problem was, as there were other issues around the same time, ending in a MySQL database upgrade and a few days offline to clear up the mess.

In the current situation, what I find a little disturbing is that we've not even heard a "We know what causes it but we don't yet know how to fix it!" statement. I'd hope they've reached that point by now, as there is quite a lot of diagnostic information that can be logged if required, although it is possible that the issue is something that doesn't trigger a logging event so it might be a case of zeroing in on a possible cause (which won't necessarily be easy for folks who've not had years working with the BOINC code![*1], especially if they have other duties too!)

If the issue is data-related, allowing more data to accumulate is unlikely to be helping. Unfortunately, as the only CPU project with work at present is MCM1 and that appears to be the only project having problems, any resolution is likely to irritate a lot of users! I sometimes think some of the users here and on other projects think that all problems can be solved with a reboot [because it works for Windows?] :-)

Instinct tells me that the BOINC side of the system needs a few days with no user activity so that they don't have to try to run assimilators, validators, file deleters and the purge mechanism concurrently most of the time -- IBM used to shut off one or more of the subsystems at stress times (e.g. the statistics runs) anyway. That might well make doing diagnostics and trying solutions a lot easier!

If kicking things was going to work, it would have worked by now :-)

Cheers - Al.

*1 There's quite a steep learning curve :-) -- I did quite a lot of code-diving when Milkyway was having problems, as I reckoned anything I could find out might help (I hope it did); I've also spent a while trying to convince myself I understand the data flows involved in validating, assimilating and purging work units (I'm not 100% convinced!) -- I certainly wouldn't claim to be competent to run a BOINC service and I don't envy them the debugging task! I just wish I could help somehow...
[Dec 6, 2023 4:15:45 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Re: Validated but not purged

If it's a problem that happened similarly in the past, hopefully past techs followed best practices and documented in great detail in the ticketing system that Krembil WCG now owns and controls. My gut says that no such ticketing system or documentation exists.

I certainly wish WCG had a full-time sysadmin and a full-time DBA working on this, honestly.
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

----------------------------------------
[Edit 1 times, last edit by hchc at Dec 6, 2023 4:36:52 PM]
[Dec 6, 2023 4:36:34 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Re: Validated but not purged

Maybe they could reach out to Kevin Reed, Uplinger, or Al Seippel (?), if they could find them and pick their brains (if they are willing) or maybe try to retain their services as consultants. It might cost a few bucks but it would probably be money well spent in not having to reinvent the wheel.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at Dec 6, 2023 7:22:02 PM]
[Dec 6, 2023 7:21:08 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: Validated but not purged

In reply to hchc and Sgt. Joe,

hchc wrote: "If it's a problem that happened similarly in the past, hopefully past techs followed best practices and documented in great detail in the ticketing system that Krembil WCG now owns and controls. My gut says that no such ticketing system or documentation exists."

Yes! -- I had assumed that if there was some sort of record of past failures and fixes it didn't actually have anything that helped or, more likely, that such wasn't available to the present WCG support.

hchc wrote: "I certainly wish WCG had a full-time sysadmin and a full-time DBA working on this, honestly."

Agreed! -- if WCG is depending on non-WCG technical support (e.g. data centre staff) to resolve this, the engagement with the problem has too many layers (and may not be met with enthusiasm by some of those involved!) However, it seems that funding issues prohibit extra staff [at present?] :-(

Sgt.Joe wrote: "Maybe they could reach out to Kevin Reed, Uplinger, or Al Seippel (?), if they could find them and pick their brains (if they are willing) or maybe try to retain their services as consultants. It might cost a few bucks but it would probably be money well spent in not having to reinvent the wheel."

I wondered about that but we have no way of knowing what they've tried already (including this option!)

Thanks for the extra input on stuff I considered but didn't mention in my other post (which was probably a bit long anyway :-)

Cheers - Al.
[Dec 7, 2023 3:09:15 AM]
Blount
Veteran Cruncher
Joined: Aug 19, 2005
Post Count: 590
Re: Validated but not purged

TigerLily, this problem is still with us. I now have over 100,000 MCM tasks in valid state. They are not purging. This means it can take minutes for the My Contribution -> Results query to run. Often it just hangs. This must be slowing down the backend as well.

Can you tell us if the problem has been accepted by the team?
----------------------------------------
[Edit 1 times, last edit by Blount at Dec 13, 2023 3:07:48 PM]
[Dec 12, 2023 11:39:31 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 442
Re: Validated but not purged

TigerLily,

I have over 20,000 more VALID MCM WUs than I described in my Nov 10th post in this thread.

I have also noticed the delay in loading the results information that Blount described above.

Any information you can share will be appreciated.
Thanks!
[Dec 13, 2023 3:50:28 AM]
TigerLily
Senior Cruncher
Joined: May 26, 2023
Post Count: 280
Re: Validated but not purged

Hello Blount and bfmorse,

The tech team is aware of the issue and is working on resolving it. The issue is that some host IDs are reading 0 in the results table, causing MCM1 assimilators to quit at times. However, this has only affected about 10,000 results and new results are accumulating that are unaffected by this issue. They confirmed that we still have the time and space on the server to allow these results to accumulate while they work on finding a fix for this problem. Thanks for your patience as they work on resolving this issue.
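
For illustration only, and not the tech team's actual query: a minimal sketch of how rows with a host ID of 0 might be picked out of a results table. The table name `result` and the columns `id`, `hostid` and `validated` are assumptions made up for this example, and an in-memory SQLite database is used only so the sketch runs on its own; the real schema and database will differ.

```python
import sqlite3


def find_zero_host_results(conn):
    """Return the ids of validated results whose host id is 0.

    The table and column names are hypothetical, chosen for this sketch.
    """
    rows = conn.execute(
        "SELECT id FROM result WHERE hostid = 0 AND validated = 1 ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    """Build a tiny in-memory table and list the suspect rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result (id INTEGER PRIMARY KEY, hostid INTEGER, validated INTEGER)"
    )
    conn.executemany(
        "INSERT INTO result VALUES (?, ?, ?)",
        [(1, 501, 1), (2, 0, 1), (3, 502, 1), (4, 0, 0), (5, 0, 1)],
    )
    suspects = find_zero_host_results(conn)
    conn.close()
    return suspects


if __name__ == "__main__":
    print(demo())
```

Listing the affected rows first, rather than changing them, keeps the check read-only, so it could be run without touching the waiting results.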
[Dec 14, 2023 3:54:59 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 442
Re: Validated but not purged

TigerLily,

That sounds similar to a problem we had with an "invalid value" causing WUs to error out on the volunteer's computer - I seem to recall it was similar to a FIELD problem when defining the location and length of data being extracted from an ASCII file.

Found an email I sent - See:
Re: TigerLilly’s comment in the forum:
Thread: 2023-08-17 Update (Weekend work unit shortage and OPNG issue)

“SCC1 ATOM Syntax incorrect "62 " is not a valid atom number errors- but NOT on all WU's” and also
“ATOM syntax incorrect: "62 " is not a valid atom number”

Notice the space after "62" - apparently that was causing the fault.

(I had a similar experience when I was programming a number of years ago and did not define the FIELD statement accurately.)
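
A minimal sketch of that kind of fault, assuming a made-up fixed-width record layout (this is not the SCC1 application's actual parser, and all function names here are hypothetical): a field defined one character too wide picks up a trailing space, and a strict parser then rejects the value.

```python
def extract_field(line: str, start: int, length: int) -> str:
    """Cut a fixed-width field out of a line of ASCII text."""
    return line[start:start + length]


def parse_atom_number_strict(field: str) -> int:
    """Accept digits only; any padding makes the field invalid."""
    if not field.isdigit():
        raise ValueError(f'"{field}" is not a valid atom number')
    return int(field)


def parse_atom_number(field: str) -> int:
    """Tolerant variant: drop surrounding whitespace before validating."""
    return parse_atom_number_strict(field.strip())


if __name__ == "__main__":
    line = "ATOM 62 C"  # hypothetical record: the number starts at column 5
    too_wide = extract_field(line, 5, 3)  # field length defined one too long
    try:
        parse_atom_number_strict(too_wide)
    except ValueError as err:
        print(err)
    print(parse_atom_number(too_wide))
```

Either correcting the field length or trimming the value before validating it avoids the error; which one is appropriate depends on how the file format is defined.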

Looking forward to resuming full CPUs & GPUs on my systems again. Please continue to keep us informed.
Thanks,
bfmorse
[Dec 14, 2023 5:39:47 PM]