World Community Grid - View Thread - Validated but not purged

World Community Grid Forums

Category: Active Research

Forum: Mapping Cancer Markers Forum

Thread: Validated but not purged

Thread Status: Active
Total posts in this thread: 88

This topic has been viewed 16121 times and has 87 replies

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7850
Status: Offline

Re: Validated but not purged

Likewise, my "validated and not purged" total has also continued to grow - currently above 42000.

You have more horsepower than I do; I only have 15000 in "valid not purged." (laughing)

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Nov 28, 2023 2:14:02 PM]

hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline

Re: Validated but not purged

Are we headed towards some epic server outage with over a month (and approaching two months) of backlogged work units? Someone should kick the MCM1 assimilators.

----------------------------------------

i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Dec 6, 2023 2:53:37 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1327
Status: Offline

Re: Validated but not purged

Regarding "kicking the MCM1 assimilators":

Quite a while ago, the techs reported that the assimilators would stop working (unclear whether they hung or exited) and if restarted they'd soon stop again. There was a brief period in November where they stayed up for over a day and that cleared out a fraction of what was waiting then.

More recently, someone posted that they seemed to recall that this happened to IBM/WCG at some point in the past. I went looking at Kevin Reed's posts for anything about work not being purged, and found one or two references in 2013 :-) -- apparently, the assimilator was deadlocking with one of the other processes because both were trying to access the same data (which he seemed to think shouldn't have been able to happen...) It wasn't clear [to me] what the solution to that specific problem was, as there were other issues around the same time, ending in a MySQL database upgrade and a few days offline to clear up the mess.

In the current situation, what I find a little disturbing is that we've not even heard a "We know what causes it but we don't yet know how to fix it!" statement. I'd hope they've reached that point by now, as there is quite a lot of diagnostic information that can be logged if required, although it is possible that the issue is something that doesn't trigger a logging event so it might be a case of zeroing in on a possible cause (which won't necessarily be easy for folks who've not had years working with the BOINC code![*1], especially if they have other duties too!)

If the issue is data-related, allowing more data to accumulate is unlikely to be helping. Unfortunately, as the only CPU project with work at present is MCM1 and that appears to be the only project having problems, any resolution is likely to irritate a lot of users! I sometimes think some of the users here and on other projects think that all problems can be solved with a reboot [because it works for Windows?] :-)

Instinct tells me that the BOINC side of the system needs a few days with no user activity so that they don't have to try to run assimilators, validators, file deleters and the purge mechanism concurrently most of the time -- IBM used to shut off one or more of the subsystems at stress times (e.g. the statistics runs) anyway. That might well make doing diagnostics and trying solutions a lot easier!
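For readers less familiar with the back end, here is a minimal Python sketch of why a stalled assimilator leaves work "validated but not purged". It assumes the usual BOINC ordering (validate, then assimilate, then delete files, then purge). The class, field and function names are made up for illustration; this is not WCG's or BOINC's actual code.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    # Hypothetical, simplified stand-in for a work unit's server-side state.
    name: str
    validated: bool = False
    assimilated: bool = False
    files_deleted: bool = False
    purged: bool = False


def run_pipeline(wus, assimilator_running):
    """One pass of each (hypothetical) daemon, in dependency order."""
    for wu in wus:
        # Assimilator: only handles validated work, and only if it is up.
        if assimilator_running and wu.validated and not wu.assimilated:
            wu.assimilated = True
        # File deleter: waits for assimilation to finish.
        if wu.assimilated and not wu.files_deleted:
            wu.files_deleted = True
        # Purge: waits for file deletion to finish.
        if wu.files_deleted and not wu.purged:
            wu.purged = True


def count_valid_not_purged(wus):
    return sum(1 for wu in wus if wu.validated and not wu.purged)


backlog = [WorkUnit(f"MCM1_{i}", validated=True) for i in range(5)]

run_pipeline(backlog, assimilator_running=False)
stalled_count = count_valid_not_purged(backlog)  # nothing moves downstream

run_pipeline(backlog, assimilator_running=True)
cleared_count = count_valid_not_purged(backlog)  # backlog drains once it runs
```

In this toy model the validator has done its job, but every later stage waits on the one before it, so the "valid" count only drops once the assimilator stays up.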

If kicking things was going to work, it would have worked by now :-)

Cheers - Al.

*1 There's quite a steep learning curve :-) -- I did quite a lot of code-diving when Milkyway was having problems, as I reckoned anything I could find out might help (I hope it did); I've also spent a while trying to convince myself I understand the data flows involved in validating, assimilating and purging work units (I'm not 100% convinced!) -- I certainly wouldn't claim to be competent to run a BOINC service and I don't envy them the debugging task! I just wish I could help somehow...

[Dec 6, 2023 4:15:45 PM]

hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline

Re: Validated but not purged

If it's a problem that happened similarly in the past, hopefully past techs followed best practices and documented in great detail in the ticketing system that Krembil WCG now owns and controls. My gut says that no such ticketing system or documentation exists.

I certainly wish WCG had a full-time sysadmin and a full-time DBA working on this, honestly.

----------------------------------------

i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

----------------------------------------
[Edit 1 times, last edit by hchc at Dec 6, 2023 4:36:52 PM]

[Dec 6, 2023 4:36:34 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7850
Status: Offline

Re: Validated but not purged

Maybe they could reach out to Kevin Reed, Uplinger, or Al Seippel (?), if they could find them and pick their brains (if they are willing) or maybe try to retain their services as consultants. It might cost a few bucks but it would probably be money well spent in not having to reinvent the wheel.

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at Dec 6, 2023 7:22:02 PM]

[Dec 6, 2023 7:21:08 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1327
Status: Offline

Re: Validated but not purged

In reply to hchc and Sgt. Joe,

Yes! -- I had assumed that if there was some sort of record of past failures and fixes it didn't actually have anything that helped or, more likely, that such wasn't available to the present WCG support.

I certainly wish WCG had a full-time sysadmin and a full-time DBA working on this, honestly.

Agreed! -- if WCG is depending on non-WCG technical support (e.g. data centre staff) to resolve this, the engagement with the problem has too many layers (and may not be met with enthusiasm by some of those involved!) However, it seems that funding issues prohibit extra staff [at present?] :-(

I wondered about that but we have no way of knowing what they've tried already (including this option!)

Thanks for the extra input on stuff I considered but didn't mention in my other post (which was probably a bit long anyway :-)

Cheers - Al.

[Dec 7, 2023 3:09:15 AM]

Blount
Veteran Cruncher
Joined: Aug 19, 2005
Post Count: 598
Status: Offline

Re: Validated but not purged

TigerLily, this problem is still with us. I now have over 100,000 MCM tasks in valid state. They are not purging. This means it can take minutes for the My contribution -> Results query to run. Many times it just hangs. This must be slowing down the backend as well.

Can you tell us if the problem has been accepted by the team?

----------------------------------------
[Edit 1 times, last edit by Blount at Dec 13, 2023 3:07:48 PM]

[Dec 12, 2023 11:39:31 PM]

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 448
Status: Offline

Re: Validated but not purged

TigerLily,

I have over 20,000 more VALID MCM WUs than I described in my Nov 10th post in this thread.

I have also noticed the delay in loading the results information that Blount described above.

Any information you can share will be appreciated.
Thanks!

[Dec 13, 2023 3:50:28 AM]

TigerLily
Senior Cruncher
Joined: May 26, 2023
Post Count: 280
Status: Offline

Re: Validated but not purged

Hello Blount and bfmorse,

The tech team is aware of the issue and is working on resolving it. The issue is that some host IDs are reading 0 in the results table, causing MCM1 assimilators to quit at times. However, this has only affected about 10,000 results and new results are accumulating that are unaffected by this issue. They confirmed that we still have the time and space on the server to allow these results to accumulate while they work on finding a fix for this problem. Thanks for your patience as they work on resolving this issue.
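As a rough illustration of the kind of check this implies, the sketch below builds a toy in-memory table and looks for validated results whose host ID is 0. The table and column names (`result`, `hostid`, `validate_state`) and the value used for "valid" are assumptions made for this example, loosely modelled on a BOINC-style schema; this is not the query the tech team runs.

```python
import sqlite3

VALID = 1  # assumed code for "validated"; illustrative only

# Toy stand-in for the results table, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result ("
    "id INTEGER PRIMARY KEY, hostid INTEGER, validate_state INTEGER)"
)
conn.executemany(
    "INSERT INTO result (id, hostid, validate_state) VALUES (?, ?, ?)",
    [
        (1, 1001, VALID),
        (2, 0, VALID),   # validated, but host ID reads 0
        (3, 1002, VALID),
        (4, 0, VALID),   # validated, but host ID reads 0
        (5, 0, 0),       # host ID 0 but not validated, so not selected
    ],
)


def find_suspect_results(db):
    """Return IDs of validated results whose host ID is 0."""
    rows = db.execute(
        "SELECT id FROM result "
        "WHERE hostid = 0 AND validate_state = ? ORDER BY id",
        (VALID,),
    ).fetchall()
    return [row[0] for row in rows]


suspect_ids = find_suspect_results(conn)
```

Listing the affected rows separately like this is one way a small set of bad results could be set aside so the rest of the backlog can proceed; whether that fits the real system is not something this thread tells us.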

[Dec 14, 2023 3:54:59 PM]

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 448
Status: Offline

Re: Validated but not purged

TigerLily,

That sounds like a similar problem we had with an "invalid value" causing WU's to error out at the volunteer's computer - I seem to recall it was similar to a FIELD problem when defining the location and length of data being extracted from an ASCII file.

Found an email I sent - See:

Re: TigerLilly’s comment in the forum:
Thread: 2023-08-17 Update (Weekend work unit shortage and OPNG issue)

“SCC1 ATOM Syntax incorrect "62 " is not a valid atom number errors- but NOT on all WU's” and also
“ATOM syntax incorrect: "62 " is not a valid atom number”

Notice "space" after "62" - apparently that was causing the fault.

(I had similar experience when I was doing programming a number of years ago when I did not define the FIELD statement accurately.)
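A minimal Python sketch of that kind of field-definition slip follows. The record layout, column positions and function name are invented for illustration; it only shows how a field defined one character too wide picks up a trailing space and fails a strict number check, not what the SCC1 application actually does.

```python
def parse_atom_number(field: str) -> int:
    """Strict parser: leading padding is allowed, anything else must be digits."""
    stripped = field.lstrip(" ")
    if not stripped.isdigit():
        raise ValueError(f'"{stripped}" is not a valid atom number')
    return int(stripped)


# Made-up fixed-width record: the atom number is right-justified in columns 4-9.
record = "ATOM    62  CA"

# Field defined correctly (columns 4-9): parses cleanly.
good_value = parse_atom_number(record[4:10])

# Field defined one column too wide (columns 4-10): picks up a trailing space.
try:
    parse_atom_number(record[4:11])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Python's own `int()` would quietly accept "62 ", so the explicit digit check here is deliberate: it mimics a parser that rejects any stray character in the field.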

Looking forward to resuming full CPU's & GPU's on my systems again. Please continue to keep us informed.
Thanks,
bfmorse

[Dec 14, 2023 5:39:47 PM]
