World Community Grid Forums
Category: Active Research | Forum: OpenPandemics - COVID-19 Project | Thread: OPN1 tasks don't respect write to disk interval
Thread Status: Active | Total posts in this thread: 10
biscotto
Cruncher | Italy | Joined: Apr 11, 2020 | Post Count: 27 | Status: Offline
Hello,
OPN1 tasks seem to write to disk too much: checking with iotop shows up to 1 GB of data written over 30 minutes for each task, which is too high. It seems OPN1 tasks don't respect the write-to-disk interval, while MCM tasks do. Is there an official reason for this?
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560
[Edit 2 times, last edit by Biscotto at Oct 9, 2021 2:18:51 PM]
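To put that rate in perspective, a quick back-of-the-envelope calculation (the 12 concurrent tasks are an assumption for illustration, matching the thread count of a Ryzen 5 3600, not a figure from the post):

```latex
\frac{1\ \text{GB}}{30\ \text{min}} = 2\ \text{GB/h} \approx 48\ \text{GB/day per task},
\qquad 12\ \text{tasks} \;\Rightarrow\; \approx 576\ \text{GB/day written}
```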
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The longest currently running OPN job on my device is 2 hours and has written 13 checkpoints, or one every 9.2 minutes. My setting is 'at most every 600 seconds', so ballpark that's close to what I want it to be. A second one has 90 minutes of CPU time with 12 checkpoints done, and at 7.5 minutes each that's not abiding by the preference. Concur, something is not right.
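For reference, the spacing quoted above is just CPU time divided by checkpoints written, compared against the 600-second preference:

```latex
\text{interval} = \frac{\text{CPU time}}{\text{checkpoints}}:\qquad
\frac{120\ \text{min}}{13} \approx 9.2\ \text{min},\qquad
\frac{90\ \text{min}}{12} = 7.5\ \text{min},\qquad
\text{preference} = 600\ \text{s} = 10\ \text{min}.
```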
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
BTW, notice that the completion times are significantly shorter than before. Normally my device completed 25-30 a day; it's now at over 100, and they're validating.
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12149 | Status: Offline
This is probably related to the reason for the recent outage.
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 875 | Status: Offline
OPN1 writes a single checkpoint for each job within a work unit, and I don't think it ever writes "timed" checkpoints within a job. So, if many of the individual jobs are short, as at present, with relatively simple ligands (fewer atoms, few branches), the checkpoints will be close together; but if the jobs are long, there may be many minutes between them.
So yes, in a way it is related to the recent issues and outage; there were work units with huge numbers of (mostly very small) jobs! It is what it is - I doubt there's much that can be done about it...
Cheers - Al.
P.S. OPNG is also a "one checkpoint per job" application, and work units with large, multi-branch ligands are much kinder to one's disks there too!
[Edit to relate to previous issues (and a typo!)...]
[Edit 2 times, last edit by alanb1951 at Oct 9, 2021 10:03:28 PM]
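A minimal sketch of that point, using made-up job lengths (not numbers from any real work unit): with one checkpoint per job, the spacing between checkpoints is simply the job length, so a 600-second "write at most" preference has no effect either way.

```python
# Illustration only: checkpoint cadence when an app checkpoints once per job,
# regardless of BOINC's "write to disk at most every N seconds" preference.
# The job durations below are invented for the example.

def checkpoint_times(job_durations_s):
    """Elapsed times (in seconds) at which checkpoints would be written."""
    times, elapsed = [], 0
    for duration in job_durations_s:
        elapsed += duration    # a job runs to completion...
        times.append(elapsed)  # ...then its single checkpoint is written
    return times

short_jobs = [120] * 15   # many small ligands: one job roughly every 2 minutes
long_jobs = [1800] * 2    # large ligands: one job roughly every 30 minutes

for label, jobs in (("short jobs", short_jobs), ("long jobs", long_jobs)):
    times = checkpoint_times(jobs)
    spacing_min = (times[1] - times[0]) / 60
    print(f"{label}: {len(times)} checkpoints, one every {spacing_min:.0f} minutes")
```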
TonyEllis
Senior Cruncher | Australia | Joined: Jul 9, 2008 | Post Count: 259 | Status: Recently Active
An example of OPN disk activity over the last several days:
Pi 3A+ running OPN - the change in SSD reads/writes is readily apparent. (It survived some monster WUs without error, using 2 GB of swap.)
----------------------------------------
Run Time Stats: https://grassmere-productions.no-ip.biz/
biscotto
Cruncher | Italy | Joined: Apr 11, 2020 | Post Count: 27 | Status: Offline
My problem was really with the amount of data written. Can somebody else check how much data OPN1 tasks write over a span of 30 minutes to 1 hour? If you are on GNU/Linux, a good tool is iotop.
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560
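For anyone who wants totals rather than a live iotop view, here is a small sketch that samples the per-process write counters in /proc/<pid>/io (Linux only; reading another user's /proc entries generally needs root, so use sudo if BOINC runs under its own account). The "opn1" substring is only a guess at the task's process name - adjust it to whatever your OPN1 executables are actually called.

```python
#!/usr/bin/env python3
# Sketch: measure how many bytes matching processes write over an interval,
# using the per-process counters in /proc/<pid>/io (Linux only).
# Run with sudo if the BOINC tasks belong to another user.
import glob
import sys
import time

PATTERN = "opn1"   # assumed substring of the OPN1 process name; adjust as needed
INTERVAL = 1800    # seconds to watch (30 minutes)

def write_bytes_by_pid(pattern):
    """Return {pid: write_bytes} for processes whose command line contains pattern."""
    result = {}
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        pid = path.split("/")[2]
        try:
            cmdline = open(path, "rb").read().replace(b"\0", b" ").decode(errors="replace")
            if pattern not in cmdline:
                continue
            for line in open(f"/proc/{pid}/io"):
                if line.startswith("write_bytes:"):
                    result[pid] = int(line.split()[1])
        except (FileNotFoundError, PermissionError):
            continue  # process exited, or its /proc files are not readable
    return result

before = write_bytes_by_pid(PATTERN)
if not before:
    sys.exit(f"No readable processes matching '{PATTERN}' found.")
time.sleep(INTERVAL)
after = write_bytes_by_pid(PATTERN)

for pid, start in before.items():
    if pid in after:
        delta = after[pid] - start
        print(f"PID {pid}: {delta / 1e6:.1f} MB written in {INTERVAL / 60:.0f} minutes")
```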
Dayle Diamond
Senior Cruncher | Joined: Jan 31, 2013 | Post Count: 450 | Status: Offline
We've been complaining about this since OPN1 was in beta!
At best, it's a known issue.
biscotto
Cruncher | Italy | Joined: Apr 11, 2020 | Post Count: 27 | Status: Offline
> We've been complaining about this since OPN1 was in beta!
Oh, bummer. Has this been addressed by the maintainers?
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 875 | Status: Offline
On why there's lots of I/O, and whether a reduction is possible - a programmer's viewpoint...
If you look at the slots directory for a running OPN1 (or OPNG) task you'll find a lot of files with names of the form wcg_checkpoint_??.ckp. Unsurprisingly, those files are the key to the amount of disk I/O that happens.
The majority of those files start life as copies of the receptor.??.map files generated by AutoGrid. One of the files (usually wcg_checkpoint_13.dat) appears to be an accumulation of the individual AutoDock job dialog files. The correspondence between files can be found in wcg_checkpoint.dat.
The size of the results file(s) depends on the sizes of the ligands, the size of the flexres part of the receptor and the number of jobs in a work unit. The dominant part is, of course, the number of jobs! Hopefully, the accumulated dialogs file and the results file(s) are grown by write-append rather than copy-append!
As for the copied .map files: it appears that these files get copied each time a checkpoint is taken, so that can be quite a lot of I/O activity. The sizes of the larger .map files will be about 6 to 8 times the grid size, so for the current receptor that's 1.0 to 1.4 MB per file. I am not sure, but I don't think AutoDock actually alters any of those files; if that is indeed the case, and the code doing the copying could be persuaded to hook up the actual .map files instead (using links or whatever...), that would considerably reduce the amount of I/O per job (and hence, per task).
Of course, even if those files are only used as input, the changes would probably require coding that doesn't follow normal BOINC practice (which might explain why it hasn't happened already, if it would mitigate the problem!) And the likelihood of a fix would also depend on whether the relevant code is in the wrapper or embedded in the actual science code.
Cheers - Al.
P.S. I'd love to have access to the wrapper code for OPN1/OPNG to see how it actually works -- I used to "tune" software as a part of my work, and this sort of puzzle was within my remit...
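To make the suggested mitigation concrete, here is a rough sketch of the idea. This is not the actual OPN1 wrapper code (which isn't public, as far as I know); the file names, numbering and checkpoint layout are illustrative assumptions based on the observations above.

```python
# Rough sketch of the copy-vs-link idea from the post above. This is NOT the
# OPN1 wrapper; file names, numbering and layout are illustrative assumptions.
import os
import shutil

def checkpoint_maps_by_copy(map_files, ckpt_dir):
    """What the wrapper appears to do: rewrite every .map file at each checkpoint."""
    for i, src in enumerate(map_files):
        shutil.copy(src, os.path.join(ckpt_dir, f"wcg_checkpoint_{i:02d}.ckp"))
    # I/O cost per checkpoint ~ number of map files x roughly 1.0-1.4 MB each

def checkpoint_maps_by_link(map_files, ckpt_dir):
    """The suggested alternative, valid only if AutoDock never modifies the maps:
    hard-link the originals, so a checkpoint writes (almost) no map data."""
    for i, src in enumerate(map_files):
        dst = os.path.join(ckpt_dir, f"wcg_checkpoint_{i:02d}.ckp")
        if not os.path.exists(dst):
            os.link(src, dst)  # same inode, no file data copied
    # I/O cost per checkpoint ~ a few directory/metadata updates
```

Whether links are even an option would depend on where the copying happens (wrapper vs. science app) and on portability to platforms such as Windows, which fits the "doesn't follow normal BOINC practice" caveat above.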