World Community Grid Forums
Category: Active Research | Forum: OpenPandemics - COVID-19 Project | Thread: OPN1 tasks don't respect write to disk interval
Thread Status: Active | Total posts in this thread: 10
biscotto
Cruncher | Italy | Joined: Apr 11, 2020 | Post Count: 27 | Status: Offline
Hello,
OPN1 tasks seem to write to disk too much: checking with iotop shows up to 1 GB of data written over 30 minutes for each task, which is too high. It seems OPN1 tasks don't respect the write-to-disk interval, while MCM tasks do. Is there an official reason for this?
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560
[Edit 2 times, last edit by Biscotto at Oct 9, 2021 2:18:51 PM]
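To put that rate in perspective, a quick back-of-the-envelope calculation (the 12 concurrent tasks are an assumption for illustration, matching the thread count of a Ryzen 5 3600, not a figure from the post):

```latex
\frac{1\ \text{GB}}{30\ \text{min}} = 2\ \text{GB/h} \approx 48\ \text{GB/day per task},
\qquad 12\ \text{tasks} \;\Rightarrow\; \approx 576\ \text{GB/day written}
```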
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The longest currently running OPN job on my device is 2 hours and has written 13 checkpoints, or one every 9.2 minutes. My setting is 'at most every 600 seconds', so ballpark that's close to what I want it to be. A second one has 90 minutes of CPU time with 12 checkpoints done, and at 7.5 minutes each that's not abiding by the preference. Concur, something is not right.
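For reference, the spacing quoted above is just CPU time divided by checkpoints written, compared against the 600-second preference:

```latex
\text{interval} = \frac{\text{CPU time}}{\text{checkpoints}}:\qquad
\frac{120\ \text{min}}{13} \approx 9.2\ \text{min},\qquad
\frac{90\ \text{min}}{12} = 7.5\ \text{min},\qquad
\text{preference} = 600\ \text{s} = 10\ \text{min}.
```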
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
BTW, notice that the completion times are significantly shorter than before. Normally my device completed 25-30 a day; it's now at over 100, and they're validating.
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12149 | Status: Offline
This is probably related to the reason for the recent outage.
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 875 | Status: Offline
OPN1 writes a single checkpoint for each job within a work unit, and I don't think it ever writes "timed" checkpoints within a job. So, if many of the individual jobs are short, as at present, with relatively simple ligands (fewer atoms, few branches), the checkpoints will be close together; but if the jobs are long, there may be many minutes between them.
So yes, in a way it is related to the recent issues and outage; there were work units with huge numbers of (mostly very small) jobs! It is what it is - I doubt there's much that can be done about it...
Cheers - Al.
P.S. OPNG is also a "one checkpoint per job" application, and work units with large, multi-branch ligands are much kinder to one's disks there too!
[Edit to relate to previous issues (and a typo!)...]
[Edit 2 times, last edit by alanb1951 at Oct 9, 2021 10:03:28 PM]
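A minimal sketch of that point, using made-up job lengths (not numbers from any real work unit): with one checkpoint per job, the spacing between checkpoints is simply the job length, so a 600-second "write at most" preference has no effect either way.

```python
# Illustration only: checkpoint cadence when an app checkpoints once per job,
# regardless of BOINC's "write to disk at most every N seconds" preference.
# The job durations below are invented for the example.

def checkpoint_times(job_durations_s):
    """Elapsed times (in seconds) at which checkpoints would be written."""
    times, elapsed = [], 0
    for duration in job_durations_s:
        elapsed += duration    # a job runs to completion...
        times.append(elapsed)  # ...then its single checkpoint is written
    return times

short_jobs = [120] * 15   # many small ligands: one job roughly every 2 minutes
long_jobs = [1800] * 2    # large ligands: one job roughly every 30 minutes

for label, jobs in (("short jobs", short_jobs), ("long jobs", long_jobs)):
    times = checkpoint_times(jobs)
    spacing_min = (times[1] - times[0]) / 60
    print(f"{label}: {len(times)} checkpoints, one every {spacing_min:.0f} minutes")
```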
TonyEllis
Senior Cruncher | Australia | Joined: Jul 9, 2008 | Post Count: 259 | Status: Recently Active
An example of OPN disk activity over the last several days:
Pi 3A+ running OPN - the change in SSD reads/writes is readily apparent. (It survived some monster WUs without error, using 2 GB of swap.)
----------------------------------------
Run Time Stats: https://grassmere-productions.no-ip.biz/
biscotto
Cruncher | Italy | Joined: Apr 11, 2020 | Post Count: 27 | Status: Offline
My problem was really with the amount of data written. Can somebody else check how much data OPN1 tasks write over a span of 30 minutes to 1 hour? If you are on GNU/Linux, a good tool is iotop.
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560
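For anyone who wants totals rather than a live iotop view, here is a small sketch that samples the per-process write counters in /proc/<pid>/io (Linux only; reading another user's /proc entries generally needs root, so use sudo if BOINC runs under its own account). The "opn1" substring is only a guess at the task's process name - adjust it to whatever your OPN1 executables are actually called.

```python
#!/usr/bin/env python3
# Sketch: measure how many bytes matching processes write over an interval,
# using the per-process counters in /proc/<pid>/io (Linux only).
# Run with sudo if the BOINC tasks belong to another user.
import glob
import sys
import time

PATTERN = "opn1"   # assumed substring of the OPN1 process name; adjust as needed
INTERVAL = 1800    # seconds to watch (30 minutes)

def write_bytes_by_pid(pattern):
    """Return {pid: write_bytes} for processes whose command line contains pattern."""
    result = {}
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        pid = path.split("/")[2]
        try:
            cmdline = open(path, "rb").read().replace(b"\0", b" ").decode(errors="replace")
            if pattern not in cmdline:
                continue
            for line in open(f"/proc/{pid}/io"):
                if line.startswith("write_bytes:"):
                    result[pid] = int(line.split()[1])
        except (FileNotFoundError, PermissionError):
            continue  # process exited, or its /proc files are not readable
    return result

before = write_bytes_by_pid(PATTERN)
if not before:
    sys.exit(f"No readable processes matching '{PATTERN}' found.")
time.sleep(INTERVAL)
after = write_bytes_by_pid(PATTERN)

for pid, start in before.items():
    if pid in after:
        delta = after[pid] - start
        print(f"PID {pid}: {delta / 1e6:.1f} MB written in {INTERVAL / 60:.0f} minutes")
```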
Dayle Diamond
Senior Cruncher | Joined: Jan 31, 2013 | Post Count: 450 | Status: Offline
We've been complaining about this since OPN1 was in beta!
At best, it's a known issue.
biscotto
Cruncher | Italy | Joined: Apr 11, 2020 | Post Count: 27 | Status: Offline
> We've been complaining about this since OPN1 was in beta!
Oh, bummer. Has this been addressed by the maintainers?
----------------------------------------
Papa Ryzen 5 3600 / Mama Radeon RX 560
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 875 | Status: Offline
On why there's lots of I/O, and whether a reduction is possible - a programmer's viewpoint...
If you look at the slots directory for a running OPN1 (or OPNG) task you'll find a lot of files with names of the form wcg_checkpoint_??.ckp. Unsurprisingly, those files are the key to the amount of disk I/O that happens.
The majority of those files start life as copies of the receptor.??.map files generated by AutoGrid. One of the files (usually wcg_checkpoint_13.dat) appears to be an accumulation of the individual AutoDock job dialog files. The correspondence between files can be found in wcg_checkpoint.dat.
The size of the results file(s) depends on the sizes of the ligands, the size of the flexres part of the receptor and the number of jobs in a work unit. The dominant part is, of course, the number of jobs! Hopefully, the accumulated dialogs file and the results file(s) are grown by write-append rather than copy-append!
As for the copied .map files: it appears that these files get copied each time a checkpoint is taken, so that can be quite a lot of I/O activity. The sizes of the larger .map files will be about 6 to 8 times the grid size, so for the current receptor that's 1.0 to 1.4 MB per file. I am not sure, but I don't think AutoDock actually alters any of those files; if that is indeed the case, and the code doing the copying could be persuaded to hook up the actual .map files instead (using links or whatever...), that would considerably reduce the amount of I/O per job (and hence, per task).
Of course, even if those files are only used as input, the changes would probably require coding that doesn't follow normal BOINC practice (which might explain why it hasn't happened already, if it would mitigate the problem!) And the likelihood of a fix would also depend on whether the relevant code is in the wrapper or embedded in the actual science code.
Cheers - Al.
P.S. I'd love to have access to the wrapper code for OPN1/OPNG to see how it actually works -- I used to "tune" software as a part of my work, and this sort of puzzle was within my remit...
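To make the suggested mitigation concrete, here is a rough sketch of the idea. This is not the actual OPN1 wrapper code (which isn't public, as far as I know); the file names, numbering and checkpoint layout are illustrative assumptions based on the observations above.

```python
# Rough sketch of the copy-vs-link idea from the post above. This is NOT the
# OPN1 wrapper; file names, numbering and layout are illustrative assumptions.
import os
import shutil

def checkpoint_maps_by_copy(map_files, ckpt_dir):
    """What the wrapper appears to do: rewrite every .map file at each checkpoint."""
    for i, src in enumerate(map_files):
        shutil.copy(src, os.path.join(ckpt_dir, f"wcg_checkpoint_{i:02d}.ckp"))
    # I/O cost per checkpoint ~ number of map files x roughly 1.0-1.4 MB each

def checkpoint_maps_by_link(map_files, ckpt_dir):
    """The suggested alternative, valid only if AutoDock never modifies the maps:
    hard-link the originals, so a checkpoint writes (almost) no map data."""
    for i, src in enumerate(map_files):
        dst = os.path.join(ckpt_dir, f"wcg_checkpoint_{i:02d}.ckp")
        if not os.path.exists(dst):
            os.link(src, dst)  # same inode, no file data copied
    # I/O cost per checkpoint ~ a few directory/metadata updates
```

Whether links are even an option would depend on where the copying happens (wrapper vs. science app) and on portability to platforms such as Windows, which fits the "doesn't follow normal BOINC practice" caveat above.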