World Community Grid Forums
Thread Status: Active | Total posts in this thread: 35
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
Dear WCG developers,
forgive me if this has been asked and discussed before; I tried a quick search but found nothing of value. The reason I'm posting is that, as you probably know, WCG Clean Energy Project Phase 2 WUs perform a lot of disk operations. And I mean a really massive amount of them: I measured the transferred data with SSDReady (free version), and running one of those tasks continuously would yield several hundred GB per day. If I used all 8 logical cores of my hyperthreaded 4.0 GHz i7, those numbers could easily pass 1 TB/day (if the SSD could keep up). Generally, write endurance is no big issue for SSDs, but consumer drives are usually rated in the double-digit TBs written. They will very likely take more, but that's not guaranteed, and wearing out an SSD in less than a quarter of a year would come as a very nasty surprise to your users. I seriously doubt those disk transfers are necessary in their current form. Could you shed some light on the issue?

A few more thoughts from my side:

- Using simple zip compression, I reduced the data in a CEP2 slot by a factor of 2.1. If you performed this compression in the app (.zip libraries should be readily available), you'd save your users half of the file writes and might even see faster processing, because physically writing less data is always faster.
- I observed almost constant write access via SSDReady, regardless of my "preferred checkpoint interval" setting. Couldn't you cache those file changes in main memory and only push them out once the user's checkpoint interval has passed (see the sketch after this post)? That way users could directly control how much safety against data loss they want to trade off against SSD wear. And again, you'd probably see a significant speedup if the app is single-threaded and has to wait for those frequent disk transfers to finish before it can continue processing.
- Keeping the calculation results / write queue in memory would of course increase memory usage, which can be a factor on some systems. In my case, however, I usually have 4 of 8 GB free even with my CPUs and GPUs fully loaded by BOINC. I could easily afford to spend a few hundred MB more, even GBs. And you could check this occasionally: if the machine is running out of memory, flush the results to disk earlier than planned.
- I suppose memory caching could greatly reduce the number of file accesses, because the slot directory only holds ~700 MB, whereas we're talking about many GBs transferred per WU. This suggests that a lot of the results being written now are overwritten later on, i.e. those disk writes could be avoided entirely by waiting long enough.
- "Simply dedicate an old HDD to BOINC and those WUs" - one could do that, at the expense of additional power, noise and HDD wear. However, if there's a fairly simple software solution that would benefit the project and all users, I'd much prefer that.

Best regards,
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
[Edit 1 times, last edit by ExtraTerrestrial Apes at Aug 23, 2015 1:42:26 PM]
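
To make the caching-plus-compression suggestion above concrete, here is a minimal Python sketch (not WCG's or Q-Chem's actual code) of a checkpoint writer that keeps the newest state in RAM, compresses it with zlib, and only writes to disk once per flush interval; how the science state gets serialized to bytes is assumed.

[code]
import os
import time
import zlib

class BufferedCheckpointWriter:
    """Keep the latest checkpoint state in RAM and write a compressed copy
    to disk at most once per flush interval."""

    def __init__(self, path, flush_interval_s=600, compression_level=6):
        self.path = path                          # final checkpoint file
        self.flush_interval_s = flush_interval_s  # user-chosen interval
        self.compression_level = compression_level
        self._pending = None                      # newest state, not yet on disk
        self._last_flush = time.monotonic()

    def update(self, state_bytes):
        """Record the newest state; only hit the disk if the interval elapsed."""
        self._pending = state_bytes
        if time.monotonic() - self._last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        """Compress the pending state and replace the checkpoint atomically."""
        if self._pending is None:
            return
        data = zlib.compress(self._pending, self.compression_level)
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())         # ensure the bytes really reached the disk
        os.replace(tmp_path, self.path)  # atomic: old checkpoint stays valid until now
        self._pending = None
        self._last_flush = time.monotonic()

if __name__ == "__main__":
    # Toy usage: 1 MB of (random, hence incompressible) stand-in state,
    # updated every second but flushed to disk only every 10 seconds.
    writer = BufferedCheckpointWriter("checkpoint.bin", flush_interval_s=10)
    for _ in range(30):
        writer.update(os.urandom(1024 * 1024))
        time.sleep(1)
    writer.flush()
[/code]

With roughly the 2x compression ratio measured above and, say, a 10-minute flush interval, this approach would cut both the number and the size of writes, at the cost of losing at most one interval's worth of progress on a crash.
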
Eric_Kaiser
Veteran Cruncher | Germany (Hessen) | Joined: May 7, 2013 | Post Count: 1047 | Status: Offline
MrS,
this has already been known for ages. Every once in a while there is a discussion about this issue. Some users use a RAM disk to keep these extensive reads/writes off their HDD or SSD.
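
For anyone trying the RAM disk route on Linux, here is a quick sketch for verifying that the BOINC data (or slot) directory really sits on a RAM-backed filesystem; the path below is only an example, adjust it to your installation.

[code]
import os

def filesystem_type(path):
    """Return the fs type of the longest-matching mount point for `path`."""
    path = os.path.realpath(path)
    best_mount, best_fs = "", "unknown"
    with open("/proc/mounts") as mounts:     # Linux only
        for line in mounts:
            _device, mount_point, fs_type = line.split()[:3]
            if path.startswith(mount_point) and len(mount_point) > len(best_mount):
                best_mount, best_fs = mount_point, fs_type
    return best_fs

# Example path (Debian/Ubuntu package default) -- adjust to your setup.
boinc_data_dir = "/var/lib/boinc-client"
print(boinc_data_dir, "->", filesystem_type(boinc_data_dir))
# Prints "tmpfs" (or "ramfs") when the directory is actually RAM-backed.
[/code]
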
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
I almost expected this (but I have never followed WCG closely). Do you know the developers' stance on this?
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
Having been around on this project from the beginning, I know that they were well aware of it and worked to limit it as much as possible. That is possible to some extent through how the work is partitioned between the distributed PCs and their own computers. It also affects the amount of data you have to transfer, and it has something to do with why we upload our results (very large files) directly to Harvard rather than sending them back to IBM as all the other projects do.
But if they limit the writes too much, at some point it apparently no longer makes sense to use distributed computing. I think if you go back to the original posts, you will see the discussions on this.
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
Could you point me to those posts?
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
CEP2 is big in terms of RAM and virtual memory use, not to mention the checkpoint-related file writes, whose number is purposely limited to a maximum of 8 per result because more could otherwise lead to a very substantial degradation of the user experience (a rough GB/day check of these numbers follows after this post). The less RAM is available, the more virtual-memory exchanging [disk swapping] will happen, and that -will- generate excessive disk I/O. This is why, at the point of download, the Feeder/Distributor checks whether there is enough RAM [750 MB, I think] and enough free-and-allowed-to-use disk space [2.5 GB].

The project server only checks this for a single job, which is right, since the default for CEP2 is to assign only one task at a time. If users then choose to fetch more [overrides were allowed after vehement protestation], run several concurrently and problems show up, those problems are of the user's own making. This is why I don't allow more than one to run at a time, even though the i7s could handle 8 if left alone and dedicated: it affects all concurrent work, not just CEP2, and it affects me as a user, with the device also running as a LAN file/mail server and more.
[Edit 1 times, last edit by Former Member at Aug 23, 2015 4:50:30 PM]
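
A quick back-of-envelope check using the numbers from this post and the opening post; the tasks-per-day figure is an assumption purely for illustration.

[code]
checkpoints_per_result = 8      # hard cap mentioned above
slot_size_gb = 0.7              # ~700 MB observed in a CEP2 slot directory
results_per_core_per_day = 2.0  # assumed value, depends on CPU and WU length

checkpoint_gb_per_day = checkpoints_per_result * slot_size_gb * results_per_core_per_day
print(f"Checkpoint writes alone: ~{checkpoint_gb_per_day:.0f} GB/day per core")
# ~11 GB/day -- far below the several hundred GB/day measured with SSDReady,
# so the bulk of the traffic must come from elsewhere (scratch I/O and/or
# swapping when RAM is tight, as described above).
[/code]
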
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
"Could you point me to those posts?" - I don't know how, except to look at the oldest ones in this forum. That issue was the reason why CEP2 is the only project that lets you limit the number of jobs you have downloaded at any one time. Originally that limit was "1" and could not be changed; then people asked for it to be made a user-selectable option, so it was changed. It should be searchable on that basis, or just go back to the beginning and it won't be far away.
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
Thanks, Rob! I should have known or expected that the developers of such a big project had thought carefully about this issue. Using 8 checkpoints per WU definitely doesn't sound like too many. What about file compression? Has that already been considered?
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
OldChap
Veteran Cruncher | UK | Joined: Jun 5, 2009 | Post Count: 978 | Status: Offline
I run another project entirely (not BOINC) that uses compression solely to reduce upload file sizes, and it is incredibly CPU-intensive - we are talking 3+ minutes per 10K of 120 KB files on a 2.4 GHz Celeron core - so I can only imagine that trying something similar with CEP would take CPU time away from getting the job done if it were used to reduce write volume. I would also think it could potentially add an additional source of error.

I suggest over-provisioning by 25% or more on any SSD running more than a few of these CEP WUs at a time. This should give garbage collection plenty of room to run concurrently, extending the drive's life. You could also try something designed for write-intensive work: something enterprise, or derived from the same, such as the Seagate 600 Pro from a year or two back, which was designed for a TBW in excess of a petabyte.

[Edit 1 times, last edit by OldChap at Aug 23, 2015 10:48:16 PM]
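
To put the endurance concern into numbers, a small sketch; the 72 TBW rating and the per-task write rates are illustrative figures based on this thread, not the specification of any particular drive.

[code]
def days_to_rated_tbw(tbw_terabytes, gb_written_per_day):
    """Days until a drive's rated Terabytes-Written figure is reached."""
    return tbw_terabytes * 1000.0 / gb_written_per_day

scenarios = [
    ("consumer SSD, 72 TBW, one CEP2 task (~300 GB/day)",        72,  300),
    ("consumer SSD, 72 TBW, eight tasks (~1 TB/day)",            72, 1000),
    ("write-oriented SSD, ~1 PB TBW, eight tasks (~1 TB/day)", 1000, 1000),
]
for label, tbw, daily_gb in scenarios:
    print(f"{label}: ~{days_to_rated_tbw(tbw, daily_gb):.0f} days to rated endurance")
# Roughly 240, 72 and 1000 days -- which is why over-provisioning or a
# write-oriented drive makes sense for heavy CEP2 crunching.
[/code]
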
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
CEP2 is extremely I/O-intense during setup. When a task is finishing and uploading and a new one is started [I control this with app_config to get a seamless one-finishes, one-starts handover - a minimal example follows after this post], there is heavy competition for CPU time, the storage subsystem, AND the network. And if the data is complex enough that any form of pre-transmission compression and packaging gives only a fractional file-size reduction, you're not going to bother and will just transmit as-is.

As with a few other projects at WCG, I would love to see symlinked static files for CEP2 [~6,700 files, yes you read that right], so that when running more than one task there is only one copy on the system. The copying and setup of a task is very time-consuming, so much so that actual computing may not start until minutes later, even on moderately powerful systems. And if you run with a ramdrive to speed things up and run multiple tasks, that eats space fast. Anyway, that's the last from me on this. There is endless repetitive discussion on the intricacies of CEP2 and how to tweak it [there's a CEP2 settings sheet somewhere]. The science runs on commercial software [Q-Chem], so they can figure it out [I think WCG/IBM told them a few things about how to improve it, having had the source code for security vetting and grid-enabling].
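
For reference, the app_config mechanism mentioned above: a file named app_config.xml placed in the project directory (typically projects/www.worldcommunitygrid.org inside the BOINC data directory) limits how many CEP2 tasks run concurrently. A minimal sketch follows - the short application name "cep2" is my assumption, so check the <app> entries in your client_state.xml for the exact spelling.

[code]
<app_config>
    <app>
        <name>cep2</name>                   <!-- verify the short name in client_state.xml -->
        <max_concurrent>1</max_concurrent>  <!-- run at most one CEP2 task at a time -->
    </app>
</app_config>
[/code]

The client picks the file up after re-reading config files from the BOINC Manager or after a client restart.
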