Biscotto
Cruncher
Italy
Joined: Apr 11, 2020
Post Count: 27
Re: OPN1 WU memory leaks?

Can confirm this on Debian stable: OPN1 tasks exceed the memory limit and I'm forced to restart the client to clear it. Hopefully there will be a solution soon.
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560

[Oct 5, 2021 7:08:59 AM]
TonyEllis
Senior Cruncher
Australia
Joined: Jul 9, 2008
Post Count: 261
Re: OPN1 WU memory leaks?

Same here on Fedora 34 and Raspberry Pi OS...
----------------------------------------
[Oct 5, 2021 7:51:53 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1324
Re: OPN1 WU memory leaks?

Another task crashed after more than 9 hours of runtime because of 'out of memory', on a Win10 laptop with 2 GB RAM.

OPN1_0066989_00106_0 -> https://www.worldcommunitygrid.org/contribution/results/1948291824/log

That makes 7 error tasks in total now, across 3 laptops with Windows 10.
----------------------------------------
[Edited 1 time, last edit by Crystal Pellet at Oct 5, 2021 8:26:50 AM]
[Oct 5, 2021 8:23:47 AM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 328
Re: OPN1 WU memory leaks?

I am also seeing this on Windows 10 with 16 GB of memory, running 12 BOINC threads.
Usually OPN1 tasks contain 2 to 4 jobs and OPNG over 100. These problem tasks contain over 100 jobs, so I assume they should have been OPNG tasks. Does this explain the lack of OPNG tasks recently?
[Oct 5, 2021 8:34:19 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 988
Re: OPN1 WU memory leaks?

The problem here seems to be that there are lots of jobs out there at the moment which are docking mostly small(ish) ligands with few branches - the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!

[Edit: I see another comment about high job counts came in while I was compiling this post!]

On most of my systems I only run one OPN1 task at a time (concentrating on MCM1 and ARP1...) so I don't seem to get badly bitten by these larger tasks. However, my Raspberry Pi can only run OPN1 and it has just had a task earn Error status after processing Job #185 (which was its last job); the reported total number of dockings to be attempted was 9260, which would produce a lot of result output! The Pi has successfully returned four other tasks that had around 170..180 jobs, but I suspect this was the first time it had three of them running at the same time!!!

The tail of the stderr report contained the following...

INFO:[06:25:55] Finished Docking number 9
INFO:[06:25:55] End AutoDock...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort called
Stack trace (2 frames):
[0x6baa8]
[0x138250]

As far as I can tell, the error happened in the harness code rather than in AutoDock itself (which might explain the short stack trace!); Jobs and Dockings are numbered from 0, and the last job wanted to do 10 dockings (so completing number 9 meant it had finished!)

For what it's worth, the workunit name is OPN1_0066657_00034, but it might have happened to any of the other over-packed work-units instead :-(

I think I'm going to cut my (8GB!) Pi back from three OPN1 to two at a time in the hope I won't waste another 6+ hours on a task that aborts when nearly finished!

Cheers - Al.
----------------------------------------
[Edited 1 time, last edit by alanb1951 at Oct 5, 2021 8:48:01 AM]
[Oct 5, 2021 8:45:07 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1324
Re: OPN1 WU memory leaks?

... the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!

As far as I can see, the jobs are numbered beyond #205, but they don't start at 0 (zero).
The successful tasks have done 26 or 27 jobs.
----------------------------------------
[Edited 2 times, last edit by Crystal Pellet at Oct 5, 2021 9:12:56 AM]
[Oct 5, 2021 9:09:47 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 988
Re: OPN1 WU memory leaks?

... the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!

As far as I can see, the jobs are numbered beyond #205, but they don't start at 0 (zero).
The successful tasks have done 26 or 27 jobs.

Are you referring to what you can see for returned results on the WCG web site? For these tasks that'll only be the tail of the actual stderr.txt file (I think it's restricted to about 64K). To really see what's going on for one of these, you need to look at stderr.txt in the slots directory of the running task!

As an example, a task I've been watching recently was on Job #69 when I started typing this (having started at Job #0!) and had already written nearly 4000 lines to stderr.txt. Its command parameters included wcgdpf=75 (number of jobs) and wcgruns=3689 (total number of docking attempts), so it had 75 jobs (0..74) and nearly all of those jobs were going to do 50 dockings. It has now finished, and the web site shows part of the dockings list for one job (#48) then complete output for Job #49 up to Job #74 - a mere 1500 lines of output.
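
A minimal sketch of that sort of slot-peeking, for anyone who wants to watch the same thing - it assumes the default Linux BOINC data directory /var/lib/boinc-client and the "Finished Docking number" stderr lines quoted earlier in the thread, so adjust the path (and the pattern) for your own setup:

#!/usr/bin/env python3
# Sketch only: report progress of running tasks by peeking at each slot's stderr.txt.
# Assumes the default Linux BOINC data directory; change BOINC_DIR for your install.
import glob
import os

BOINC_DIR = "/var/lib/boinc-client"

for path in sorted(glob.glob(os.path.join(BOINC_DIR, "slots", "*", "stderr.txt"))):
    with open(path, errors="replace") as f:
        lines = f.read().splitlines()
    # OPN1 writes lines like "INFO:[06:25:55] Finished Docking number 9" as it goes
    dockings = [ln for ln in lines if "Finished Docking number" in ln]
    last = dockings[-1] if dockings else "(no dockings finished yet)"
    print(f"{path}: {len(lines)} lines; last: {last}")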

I have a Python script running on each of my systems as a daemon to look for new BOINC workunits arriving so I can capture [some of] their parameters and/or data - I use this on OPN1/OPNG to find out details such as numbers of jobs, the ligand structures of each job and the numbers of dockings attempted per ligand (fixed at 50 for OPNG, variable up to 50 for OPN1). My remarks are/were based on what I glean from that data source, and digging into the slots directory to watch file sizes and stderr output...
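
Not the actual daemon, but a rough sketch of pulling those two values out for OPN1 workunits - it assumes the wcgdpf=/wcgruns= values show up in the <command_line> element of the workunit entries in the client's client_state.xml; if they don't on your host, adjust accordingly:

#!/usr/bin/env python3
# Sketch only: list OPN1 workunits with their job count (wcgdpf=) and total
# docking count (wcgruns=). Assumes those values appear in the workunit's
# <command_line> in client_state.xml (default Linux path shown below).
import re

CLIENT_STATE = "/var/lib/boinc-client/client_state.xml"

with open(CLIENT_STATE, errors="replace") as f:
    text = f.read()

# Plain regex rather than an XML parser, since BOINC's state file isn't always strict XML.
for block in re.findall(r"<workunit>(.*?)</workunit>", text, re.S):
    name = re.search(r"<name>(.*?)</name>", block)
    if not name or not name.group(1).startswith("OPN1_"):
        continue
    cmd = re.search(r"<command_line>(.*?)</command_line>", block, re.S)
    jobs = re.search(r"wcgdpf=(\d+)", cmd.group(1)) if cmd else None
    runs = re.search(r"wcgruns=(\d+)", cmd.group(1)) if cmd else None
    print(name.group(1),
          "jobs:", jobs.group(1) if jobs else "?",
          "dockings:", runs.group(1) if runs else "?")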

The highest numbers of jobs I've seen so far this week have been 201, 205, 208 and 209 (one of each). The rest of the "more jobs" tasks seem fairly evenly divided between 151..200 and 61..85. Last week almost every task had 5 or fewer jobs, and the most jobs I saw was 9; this week, fewer than 10 of the 75 tasks I've seen at the time of writing had single-figure numbers of jobs...

These are a nuisance (and I can understand why folks don't want to run them, especially if they want to do multiple tasks); I really hope they don't have to build more batches with such lopsided job mixes!

Cheers - Al.
[Oct 5, 2021 11:02:11 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1324
Re: OPN1 WU memory leaks?

Thanks Al,
You're right. The result is truncated to the last ~26 jobs.
I only had such a script in PowerShell, to watch the number of jobs and progress for OPNG.

Now I have a running task with 187 jobs, which had already consumed over 1600 MB of RAM,
so I suspended it once (with LAIM off) and resumed it to reduce its memory footprint.
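
For anyone who wants to script that suspend/resume trick, a minimal sketch using boinccmd - the project URL is a placeholder, the task name is just an example result from earlier in the thread, and it only frees the memory if "Leave applications in memory" (LAIM) is off:

#!/usr/bin/env python3
# Sketch only: suspend a named task, wait a moment, then resume it.
# With LAIM off, the suspend unloads the science app and releases its memory.
import subprocess
import time

PROJECT_URL = "http://www.worldcommunitygrid.org/"   # placeholder - use your attached project URL
TASK_NAME = "OPN1_0066989_00106_0"                    # example result name from this thread

def boinccmd(*args):
    # boinccmd ships with the BOINC client; may need --host/--passwd depending on your setup
    subprocess.run(["boinccmd", *args], check=True)

boinccmd("--task", PROJECT_URL, TASK_NAME, "suspend")
time.sleep(10)    # give the client a few seconds to unload the app
boinccmd("--task", PROJECT_URL, TASK_NAME, "resume")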
[Oct 5, 2021 1:00:51 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2220
Re: OPN1 WU memory leaks?

Doesn't seem to be much of a response from the project team when it comes to this issue...
[Oct 5, 2021 4:15:20 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: OPN1 WU memory leaks?

CPU work for OPN1 has been stopped while we investigate.
[Oct 5, 2021 4:20:40 PM]