World Community Grid Forums
Thread Status: Active Total posts in this thread: 41
Biscotto
Cruncher Italy Joined: Apr 11, 2020 Post Count: 27 Status: Offline |
Can confirm this on Debian stable: OPN1 tasks exceed the memory limit and I'm forced to restart the client to clear it. Hopefully there will be a solution soon.
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560 |
||
|
TonyEllis
Senior Cruncher Australia Joined: Jul 9, 2008 Post Count: 261 Status: Offline |
Same here on Fedora 34 and Raspberry Pi OS...
----------------------------------------
Run Time Stats https://grassmere-productions.no-ip.biz/
|
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1324 Status: Offline |
Another task crashed after more than 9 hours of runtime because of 'out of memory' on a Win10 laptop with 2GB RAM.
OPN1_0066989_00106_0 -> https://www.worldcommunitygrid.org/contribution/results/1948291824/log
That makes 7 error tasks in total now, on 3 laptops with Windows 10.
----------------------------------------
[Edit 1 times, last edit by Crystal Pellet at Oct 5, 2021 8:26:50 AM] |
||
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 328 Status: Offline |
I am also seeing this on Windows 10 with 16 GB of memory, running 12 BOINC threads.
Usually OPN1 tasks contain 2 to 4 jobs and OPNG over 100. These problem tasks contain over 100 jobs, so I assume they should have been OPNG tasks. Does this explain the lack of OPNG tasks recently? |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 988 Status: Offline |
The problem here seems to be that there are lots of jobs out there at the moment which are docking mostly small(ish) ligands with few branches - the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!

[Edit: I see another comment about high job counts came in while I was compiling this post!]

On most of my systems I only run one OPN1 task at a time (concentrating on MCM1 and ARP1...) so I don't seem to get badly bitten by these larger tasks. However, my Raspberry Pi can only run OPN1 and it has just had a task earn Error status after processing Job #185 (which was its last job); the reported total number of dockings to be attempted was 9260, which would produce a lot of result output! The Pi has successfully returned four other tasks that had around 170..180 jobs, but I suspect this was the first time it had three of them running at the same time!!!

The tail of the stderr report contained the following...

INFO:[06:25:55] Finished Docking number 9

As far as I can tell, the error happened in the harness code rather than in AutoDock itself (which might explain the short stack trace!); Jobs and Dockings are numbered from 0, and the last job wanted to do 10 dockings (so completing number 9 meant it had finished!)

For what it's worth, the workunit name is OPN1_0066657_00034, but it might have happened to any of the other over-packed work-units instead :-(

I think I'm going to cut my (8GB!) Pi back from three OPN1 to two at a time in the hope I won't waste another 6+ hours on a task that aborts when nearly finished!

Cheers - Al.
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Oct 5, 2021 8:48:01 AM] |
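The zero-based numbering argument above can be checked mechanically. The following is only a minimal sketch, assuming the stderr tail uses exactly the "Finished Docking number N" wording quoted in the post; the function names are hypothetical and are not part of BOINC or the WCG harness.

```python
import re

# Matches the log line quoted in the post, e.g.
# "INFO:[06:25:55] Finished Docking number 9"
# (assumption: every completed docking is logged with this exact wording).
FINISHED_RE = re.compile(r"Finished Docking number (\d+)")


def last_finished_docking(stderr_text):
    """Return N from the last 'Finished Docking number N' line, or None."""
    matches = FINISHED_RE.findall(stderr_text)
    return int(matches[-1]) if matches else None


def last_job_completed(stderr_text, dockings_in_last_job):
    """True if the final docking of the last job was logged.

    Dockings are numbered from 0, so a job that wants 10 dockings
    is finished once docking number 9 has been reported.
    """
    last = last_finished_docking(stderr_text)
    return last is not None and last == dockings_in_last_job - 1


tail = "INFO:[06:25:55] Finished Docking number 9\n"
print(last_job_completed(tail, 10))
```

If this prints True for a task that still ended in Error status, the docking work itself had finished, which is consistent with the failure happening afterwards in the harness.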
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1324 Status: Offline |
"... the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!"

As far as I can see the jobs are numbered > #205, but they don't start with 0 (zero). The successful tasks have done 26 or 27 jobs.
----------------------------------------
[Edit 2 times, last edit by Crystal Pellet at Oct 5, 2021 9:12:56 AM] |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 988 Status: Offline |
"... the workunit sizing algorithm is putting together workunits with over 150 jobs and huge numbers of docking attempts. With that many jobs there are lots of checkpoints, and they are quite close together in time!"

"As far as I can see the jobs are numbered > #205, but they don't start with 0 (zero). The successful tasks have done 26 or 27 jobs"

Are you referring to what you can see for returned results on the WCG web site? For these tasks that'll only be the tail of the actual stderr.txt file (I think it's restricted to about 64K). To really see what's going on for one of these, one needs to look at stderr.txt in the slots directory of a running task!

As an example, a job I've been watching recently was on Job #69 when I started typing this (having started at Job #0!) and had already written nearly 4000 lines to stderr.txt. Its command parameters included wcgdpf=75 (number of jobs) and wcgruns=3689 (total number of docking attempts) so it had 75 jobs (0..74) and nearly all those jobs were going to do 50 dockings. It has now finished and the web site shows part of the dockings list for one job (#48) then has complete output for Job #49 up to Job #74 - a mere 1500 lines of output.

I have a Python script running on each of my systems as a daemon to look for new BOINC workunits arriving so I can capture [some of] their parameters and/or data - I use this on OPN1/OPNG to find out details such as numbers of jobs, the ligand structures of each job and the numbers of dockings attempted per ligand (fixed at 50 for OPNG, variable up to 50 for OPN1). My remarks are/were based on what I glean from that data source, and digging into the slots directory to watch file sizes and stderr output...

The highest numbers of jobs I've seen so far this week have been 201, 205, 208 and 209 (one of each). The rest of the "more jobs" tasks seem fairly evenly divided between 151..200 and 61..85.

Last week almost every task had 5 or fewer jobs, and the most jobs I saw was 9; this week fewer than 10 of the 75 tasks I've seen at the time of writing this had single-figure numbers of jobs...

These are a nuisance (and I can understand why folks don't want to run them, especially if they want to do multiple tasks); I really hope they don't have to build more batches with such lopsided job mixes!

Cheers - Al. |
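For anyone wanting to do the same arithmetic on their own tasks, here is a small illustrative sketch, not the script described above. It assumes the two parameters appear in the task's command line as `name=value` tokens, as quoted in the post (wcgdpf=75, wcgruns=3689); the function names and the example string are hypothetical.

```python
import re


def parse_wcg_params(command_line):
    """Extract wcgdpf (number of jobs) and wcgruns (total dockings).

    Assumption: both appear as 'name=value' tokens somewhere in the
    command line; anything else on the line is ignored.
    """
    params = {}
    for name, value in re.findall(r"\b(wcgdpf|wcgruns)=(\d+)", command_line):
        params[name] = int(value)
    return params


def average_dockings_per_job(params):
    """Total docking attempts divided by the number of jobs."""
    return params["wcgruns"] / params["wcgdpf"]


# Values quoted in the post; the rest of the real command line is unknown.
example = "wcgdpf=75 wcgruns=3689"
params = parse_wcg_params(example)

# Jobs are numbered from 0, so the last job number is wcgdpf - 1.
print("last job number:", params["wcgdpf"] - 1)
print("average dockings per job:", round(average_dockings_per_job(params), 1))
```

With the quoted values the average comes out just under the OPN1 maximum of 50 dockings per ligand, which matches the remark that nearly all of the 75 jobs were going to do 50 dockings.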
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1324 Status: Offline |
Thanks Al,
You're right. The result is truncated to the last ~26 jobs. I only had a PowerShell script like that, to watch the number of jobs and the progress of OPNG tasks. I now have a running task with 187 jobs which had already consumed over 1600MB of RAM, so I suspended it once (with LAIM off) and resumed it to reduce its memory use. |
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2220 Status: Offline |
Doesn't seem to be much of a response from the project team, when it comes to this issue....
|
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
CPU work for OPN1 has been stopped while we investigate.
|
||
|