World Community Grid - View Thread - OPN1 WU memory leaks?

World Community Grid Forums

Category: Active Research

Forum: OpenPandemics - COVID-19 Project

Thread Status: Active
Total posts in this thread: 41


This topic has been viewed 12047 times and has 40 replies

Biscotto
Cruncher
Italy
Joined: Apr 11, 2020
Post Count: 27

Re: OPN1 WU memory leaks?

Can confirm this on Debian stable: OPN1 tasks exceed the memory limit and I'm forced to restart the client to clear it. Hopefully there will be a solution soon.

----------------------------------------

Papa Ryzen 5 3600 / Mama Radeon RX 560

[Oct 5, 2021 7:08:59 AM]

TonyEllis
Senior Cruncher
Australia
Joined: Jul 9, 2008
Post Count: 286

Re: OPN1 WU memory leaks?

Same here on Fedora 34 and Raspberry Pi OS...

----------------------------------------

Run Time Stats https://grassmere-productions.no-ip.biz/

[Oct 5, 2021 7:51:53 AM]

Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403

Re: OPN1 WU memory leaks?

Another task crashed after more than 9 hours of runtime with an 'out of memory' error, on a Windows 10 laptop with 2 GB RAM.

OPN1_0066989_00106_0 -> https://www.worldcommunitygrid.org/contribution/results/1948291824/log

That makes 7 errored tasks in total now, across 3 laptops with Windows 10.

----------------------------------------
[Edit 1 times, last edit by Crystal Pellet at Oct 5, 2021 8:26:50 AM]

[Oct 5, 2021 8:23:47 AM]

ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 328

Re: OPN1 WU memory leaks?

I am also seeing this on Windows 10 with 16 GB of memory, running 12 BOINC threads.
Usually OPN1 tasks contain 2 to 4 jobs and OPNG over 100. These problem tasks contain over 100 jobs, so I assume they should have been OPNG tasks. Does this explain the lack of OPNG tasks recently?

[Oct 5, 2021 8:34:19 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317

Re: OPN1 WU memory leaks?

The problem here seems to be that there are lots of jobs out there at the moment which are docking mostly small(ish) ligands with few branches - the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!

[Edit: I see another comment about high job counts came in while I was compiling this post!]

On most of my systems I only run one OPN1 task at a time (concentrating on MCM1 and ARP1...) so I don't seem to get badly bitten by these larger tasks. However, my Raspberry Pi can only run OPN1 and it has just had a task earn Error status after processing Job #185 (which was its last job); the reported total number of dockings to be attempted was 9260, which would produce a lot of result output! The Pi has successfully returned four other tasks that had around 170..180 jobs, but I suspect this was the first time it had three of them running at the same time!!!

The tail of the stderr report contained the following...

INFO:[06:25:55] Finished Docking number 9
INFO:[06:25:55] End AutoDock...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort called
Stack trace (2 frames):
[0x6baa8]
[0x138250]

As far as I can tell, the error happened in the harness code rather than in AutoDock itself (which might explain the short stack trace!); Jobs and Dockings are numbered from 0, and the last job wanted to do 10 dockings (so completing number 9 meant it had finished!)
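
For anyone wanting to check a stderr tail like the one above without reading it by eye, here is a minimal Python sketch. It assumes only the line formats visible in the excerpt ("Finished Docking number N" and "std::bad_alloc"); the function name summarize_stderr_tail is made up for this example and is not part of BOINC or the WCG harness.

```python
import re

# Pattern based only on the "Finished Docking number N" lines quoted above.
DOCKING_RE = re.compile(r"Finished Docking number (\d+)")


def summarize_stderr_tail(text):
    """Return (last_docking_number_or_None, saw_bad_alloc) for a stderr excerpt."""
    last_docking = None
    saw_bad_alloc = False
    for line in text.splitlines():
        match = DOCKING_RE.search(line)
        if match:
            last_docking = int(match.group(1))
        if "std::bad_alloc" in line:
            saw_bad_alloc = True
    return last_docking, saw_bad_alloc


# The sample is copied from the stderr tail quoted in this post.
SAMPLE = """INFO:[06:25:55] Finished Docking number 9
INFO:[06:25:55] End AutoDock...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort called
"""

if __name__ == "__main__":
    print(summarize_stderr_tail(SAMPLE))
```

A result where the last docking number is one less than the job's docking count, together with the bad_alloc flag, would match the pattern described here: the docking itself finished and the allocation failure came afterwards.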

For what it's worth, the workunit name is OPN1_0066657_00034, but it might have happened to any of the other over-packed work-units instead :-(

I think I'm going to cut my (8GB!) Pi back from three OPN1 to two at a time in the hope I won't waste another 6+ hours on a task that aborts when nearly finished!

Cheers - Al.

----------------------------------------
[Edit 1 times, last edit by alanb1951 at Oct 5, 2021 8:48:01 AM]

[Oct 5, 2021 8:45:07 AM]

Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403

Re: OPN1 WU memory leaks?

... the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!

As far as I can see, the jobs are numbered up to > #205, but they don't start with 0 (zero).
The successful tasks have done 26 or 27 jobs.

----------------------------------------
[Edit 2 times, last edit by Crystal Pellet at Oct 5, 2021 9:12:56 AM]

[Oct 5, 2021 9:09:47 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317

Re: OPN1 WU memory leaks?

As far as I can see, the jobs are numbered up to > #205, but they don't start with 0 (zero).
The successful tasks have done 26 or 27 jobs.

Are you referring to what you can see for returned results on the WCG web site? For these tasks that'll only be the tail of the actual stderr.txt file (I think it's restricted to about 64K). To really see what's going on for one of these, you need to look at stderr.txt in the slots directory of a running task!

As an example, a job I've been watching recently was on Job #69 when I started typing this (having started at Job #0!) and had already written nearly 4000 lines to stderr.txt. Its command parameters included wcgdpf=75 (number of jobs) and wcgruns=3689 (total number of docking attempts), so it had 75 jobs (0..74) and nearly all those jobs were going to do 50 dockings. It has now finished, and the web site shows part of the dockings list for one job (#48), then complete output for Job #49 up to Job #74 - a mere 1500 lines of output.
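
To make the arithmetic behind those two parameters explicit, here is a small Python sketch. It assumes the parameters appear as whitespace-separated key=value tokens (as written above, which may not be exactly how they look on the real command line) and uses the 50-dockings-per-job ceiling mentioned in this thread; the function names are hypothetical.

```python
def parse_wcg_params(command_line):
    """Collect integer key=value tokens such as wcgdpf=75 from a parameter string.

    Assumption: tokens are whitespace-separated and written as key=value.
    """
    params = {}
    for token in command_line.split():
        key, sep, value = token.partition("=")
        if sep and value.isdigit():
            params[key] = int(value)
    return params


def docking_summary(params, max_dockings_per_job=50):
    """Summarize job count and docking totals from wcgdpf and wcgruns."""
    jobs = params["wcgdpf"]   # number of jobs in the task
    runs = params["wcgruns"]  # total number of docking attempts
    return {
        "jobs": jobs,
        "last_job_index": jobs - 1,  # jobs are numbered from 0
        "average_dockings_per_job": runs / jobs,
        "shortfall_from_maximum": jobs * max_dockings_per_job - runs,
    }


if __name__ == "__main__":
    summary = docking_summary(parse_wcg_params("wcgdpf=75 wcgruns=3689"))
    for name, value in summary.items():
        print(name, value)
```

With the values quoted above, the shortfall from 75 x 50 is small, which is what "nearly all those jobs were going to do 50 dockings" amounts to.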

I have a Python script running on each of my systems as a daemon to look for new BOINC workunits arriving so I can capture [some of] their parameters and/or data - I use this on OPN1/OPNG to find out details such as numbers of jobs, the ligand structures of each job and the numbers of dockings attempted per ligand (fixed at 50 for OPNG, variable up to 50 for OPN1). My remarks are/were based on what I glean from that data source, and digging into the slots directory to watch file sizes and stderr output...
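
This is not the script described above, just a minimal sketch of the "watch file sizes in the slots directory" idea. It assumes each running task has its own subdirectory under a slots directory and writes a stderr.txt there; the location of that directory depends on the BOINC installation, so it is passed in as a parameter. The function names are made up for the example.

```python
from pathlib import Path


def stderr_sizes(slots_dir):
    """Map each slot directory name to the size in bytes of its stderr.txt, if present."""
    sizes = {}
    for slot in sorted(Path(slots_dir).iterdir()):
        stderr_file = slot / "stderr.txt"
        if slot.is_dir() and stderr_file.is_file():
            sizes[slot.name] = stderr_file.stat().st_size
    return sizes


def growth(previous, current):
    """Bytes added to each stderr.txt between two snapshots (new slots count from zero)."""
    return {name: size - previous.get(name, 0) for name, size in current.items()}


if __name__ == "__main__":
    import sys
    import time

    # Usage: python watch_slots.py /path/to/boinc/slots  (path is installation-specific)
    slots = sys.argv[1]
    before = stderr_sizes(slots)
    time.sleep(60)
    after = stderr_sizes(slots)
    for name, delta in growth(before, after).items():
        print(f"slot {name}: stderr.txt grew by {delta} bytes (now {after[name]})")
```

Polling like this only reads file metadata, so it should not disturb the running tasks; a fast-growing stderr.txt is one quick hint that a task has a large number of jobs.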

The highest numbers of jobs I've seen so far this week have been 201, 205, 208 and 209 (one of each). The rest of the "more jobs" tasks seem fairly evenly divided between 151..200 and 61..85. Last week almost every task had 5 or fewer jobs, and the most jobs I saw was 9; this week fewer than 10 of the 75 tasks I've seen at the time of writing this had single-figure numbers of jobs...

These are a nuisance (and I can understand why folks don't want to run them, especially if they want to do multiple tasks); I really hope they don't have to build more batches with such lopsided job mixes!

Cheers - Al.

[Oct 5, 2021 11:02:11 AM]

Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403

Re: OPN1 WU memory leaks?

Thanks Al,
You're right. The result is truncated to the last ~26 jobs.
I only had a script like that in PowerShell, to watch the OPNG number of jobs and progress.

Now I have a running task with 187 jobs, which had already consumed over 1600 MB of RAM,
so I've suspended it once (LAIM off) and resumed it to reduce its memory use.

[Oct 5, 2021 1:00:51 PM]

Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2494

Re: OPN1 WU memory leaks?

There doesn't seem to be much of a response from the project team when it comes to this issue...

[Oct 5, 2021 4:15:20 PM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504

Re: OPN1 WU memory leaks?

CPU work for OPN1 has been stopped while we investigate.

[Oct 5, 2021 4:20:40 PM]
