World Community Grid Forums
Thread Status: Active | Total posts in this thread: 35
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
Dear WCG developers,
forgive me if this has been asked and discussed before; I tried a quick search but found nothing of value. The reason I'm posting is that, as you probably know, WCG Clean Energy Project Phase 2 WUs perform a lot of disk operations. And I mean a really massive amount of them: I measured the transferred data with SSDReady (free version), and running one of those tasks continuously would yield several hundred GB per day. If I used all 8 logical cores of my hyperthreaded 4.0 GHz i7, those numbers could easily pass 1 TB/day (if the SSD could keep up). Generally, write endurance is no big issue for SSDs, but consumer drives are usually rated in the double-digit TBs written. They will very likely take more, but that's not guaranteed, and wearing out an SSD in less than a quarter of a year would come as a very nasty surprise to your users. I seriously doubt those disk transfers are necessary in their current form. Could you shed some light on the issue?

A few more thoughts from my side:

- Using simple zip compression, I reduced the data in a CEP2 slot by a factor of 2.1. If you performed this compression in the app (.zip libraries should be readily available), you'd save your users half of the file writes and might even see faster processing, because physically writing less data is always faster.
- I observed almost constant write access via SSDReady, regardless of my "preferred checkpoint interval" setting. Couldn't you cache those file changes in main memory and only push them out once the user's checkpoint interval has passed (see the sketch after this post)? That way users could directly control how much safety against data loss they want to trade off against SSD wear. And again, you'd probably see a significant speedup if the app is single-threaded and has to wait for those frequent disk transfers to finish before it can continue processing.
- Keeping the calculation results / write queue in memory would of course increase memory usage, which can be a factor on some systems. In my case, however, I usually have 4 of 8 GB free even with my CPUs and GPUs fully loaded by BOINC. I could easily afford to spend a few hundred MB more, even GBs. And you could check this occasionally: if the machine is running out of memory, flush the results to disk earlier than planned.
- I suppose memory caching could greatly reduce the number of file accesses, because the slot directory only holds ~700 MB, whereas we're talking about many GBs transferred per WU. This suggests that a lot of the results being written now are overwritten later on, i.e. those disk writes could be avoided entirely by waiting long enough.
- "Simply dedicate an old HDD to BOINC and those WUs" - one could do that, at the expense of additional power, noise and HDD wear. However, if there's a fairly simple software solution that would benefit the project and all users, I'd much prefer that.

Best regards,
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
[Edit 1 times, last edit by ExtraTerrestrial Apes at Aug 23, 2015 1:42:26 PM]
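
To make the caching-plus-compression suggestion above concrete, here is a minimal Python sketch (not WCG's or Q-Chem's actual code) of a checkpoint writer that keeps the newest state in RAM, compresses it with zlib, and only writes to disk once per flush interval; how the science state gets serialized to bytes is assumed.

[code]
import os
import time
import zlib

class BufferedCheckpointWriter:
    """Keep the latest checkpoint state in RAM and write a compressed copy
    to disk at most once per flush interval."""

    def __init__(self, path, flush_interval_s=600, compression_level=6):
        self.path = path                          # final checkpoint file
        self.flush_interval_s = flush_interval_s  # user-chosen interval
        self.compression_level = compression_level
        self._pending = None                      # newest state, not yet on disk
        self._last_flush = time.monotonic()

    def update(self, state_bytes):
        """Record the newest state; only hit the disk if the interval elapsed."""
        self._pending = state_bytes
        if time.monotonic() - self._last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        """Compress the pending state and replace the checkpoint atomically."""
        if self._pending is None:
            return
        data = zlib.compress(self._pending, self.compression_level)
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())         # ensure the bytes really reached the disk
        os.replace(tmp_path, self.path)  # atomic: old checkpoint stays valid until now
        self._pending = None
        self._last_flush = time.monotonic()

if __name__ == "__main__":
    # Toy usage: 1 MB of (random, hence incompressible) stand-in state,
    # updated every second but flushed to disk only every 10 seconds.
    writer = BufferedCheckpointWriter("checkpoint.bin", flush_interval_s=10)
    for _ in range(30):
        writer.update(os.urandom(1024 * 1024))
        time.sleep(1)
    writer.flush()
[/code]

With roughly the 2x compression ratio measured above and, say, a 10-minute flush interval, this approach would cut both the number and the size of writes, at the cost of losing at most one interval's worth of progress on a crash.
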
Eric_Kaiser
Veteran Cruncher | Germany (Hessen) | Joined: May 7, 2013 | Post Count: 1047 | Status: Offline
MrS,
this has already been known for ages. Every once in a while there is a discussion about this issue. Some users use a RAM disk to keep these extensive reads/writes off their HDD or SSD.
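
For anyone trying the RAM disk route on Linux, here is a quick sketch for verifying that the BOINC data (or slot) directory really sits on a RAM-backed filesystem; the path below is only an example, adjust it to your installation.

[code]
import os

def filesystem_type(path):
    """Return the fs type of the longest-matching mount point for `path`."""
    path = os.path.realpath(path)
    best_mount, best_fs = "", "unknown"
    with open("/proc/mounts") as mounts:     # Linux only
        for line in mounts:
            _device, mount_point, fs_type = line.split()[:3]
            if path.startswith(mount_point) and len(mount_point) > len(best_mount):
                best_mount, best_fs = mount_point, fs_type
    return best_fs

# Example path (Debian/Ubuntu package default) -- adjust to your setup.
boinc_data_dir = "/var/lib/boinc-client"
print(boinc_data_dir, "->", filesystem_type(boinc_data_dir))
# Prints "tmpfs" (or "ramfs") when the directory is actually RAM-backed.
[/code]
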
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
I almost expected this (but I have never followed WCG closely). Do you know the developers' stance on this?
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
Having been around on this project from the beginning, I know that they were well aware of it and worked to limit it as much as possible. That is possible to some extent through how the work is partitioned between the distributed PCs and their own computers. It also affects the amount of data you have to transfer, and it has something to do with why we upload our results (very large files) directly to Harvard rather than sending them back to IBM as all the other projects do.
But if they limit the writes too much, at some point it apparently no longer makes sense to use distributed computing. I think if you go back to the original posts, you will see the discussions on this.
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
Could you point me to those posts?
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
CEP2 is big in terms of RAM and virtual memory use, not to mention the checkpoint-related file writes, whose number is purposely limited to a maximum of 8 per result because more could otherwise lead to a very substantial degradation of the user experience (a rough GB/day check of these numbers follows after this post). The less RAM is available, the more virtual-memory exchanging [disk swapping] will happen, and that -will- generate excessive disk I/O. This is why, at the point of download, the Feeder/Distributor checks whether there is enough RAM [750 MB, I think] and enough free-and-allowed-to-use disk space [2.5 GB].

The project server only checks this for a single job, which is right, since the default for CEP2 is to assign only one task at a time. If users then choose to fetch more [overrides were allowed after vehement protestation], run several concurrently and problems show up, those problems are of the user's own making. This is why I don't allow more than one to run at a time, even though the i7s could handle 8 if left alone and dedicated: it affects all concurrent work, not just CEP2, and it affects me as a user, with the device also running as a LAN file/mail server and more.
[Edit 1 times, last edit by Former Member at Aug 23, 2015 4:50:30 PM]
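
A quick back-of-envelope check using the numbers from this post and the opening post; the tasks-per-day figure is an assumption purely for illustration.

[code]
checkpoints_per_result = 8      # hard cap mentioned above
slot_size_gb = 0.7              # ~700 MB observed in a CEP2 slot directory
results_per_core_per_day = 2.0  # assumed value, depends on CPU and WU length

checkpoint_gb_per_day = checkpoints_per_result * slot_size_gb * results_per_core_per_day
print(f"Checkpoint writes alone: ~{checkpoint_gb_per_day:.0f} GB/day per core")
# ~11 GB/day -- far below the several hundred GB/day measured with SSDReady,
# so the bulk of the traffic must come from elsewhere (scratch I/O and/or
# swapping when RAM is tight, as described above).
[/code]
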
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
"Could you point me to those posts?" - I don't know how, except to look at the oldest ones in this forum. That issue was the reason why CEP2 is the only project that lets you limit the number of jobs you have downloaded at any one time. Originally that limit was "1" and could not be changed; then people asked for it to be made a user-selectable option, so it was changed. It should be searchable on that basis, or just go back to the beginning and it won't be far away.
ExtraTerrestrial Apes
Cruncher | Joined: Nov 7, 2009 | Post Count: 12 | Status: Offline
Thanks, Rob! I should have known or expected that the developers of such a big project had thought carefully about this issue. Using 8 checkpoints per WU definitely doesn't sound like too many. What about file compression? Has that already been considered?
MrS
ExtraTerrestrial Apes - Scanning for our furry friends since Jan 2002
OldChap
Veteran Cruncher | UK | Joined: Jun 5, 2009 | Post Count: 978 | Status: Offline
I run another project entirely (not BOINC) that uses compression solely to reduce upload file sizes, and it is incredibly CPU-intensive - we are talking 3+ minutes per 10K of 120 KB files on a 2.4 GHz Celeron core - so I can only imagine that trying something similar with CEP would take CPU time away from getting the job done if it were used to reduce write volume. I would also think it could potentially add an additional source of error.

I suggest over-provisioning by 25% or more on any SSD running more than a few of these CEP WUs at a time. This should give garbage collection plenty of room to run concurrently, extending the drive's life. You could also try something designed for write-intensive work: something enterprise, or derived from the same, such as the Seagate 600 Pro from a year or two back, which was designed for a TBW in excess of a petabyte.

[Edit 1 times, last edit by OldChap at Aug 23, 2015 10:48:16 PM]
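
To put the endurance concern into numbers, a small sketch; the 72 TBW rating and the per-task write rates are illustrative figures based on this thread, not the specification of any particular drive.

[code]
def days_to_rated_tbw(tbw_terabytes, gb_written_per_day):
    """Days until a drive's rated Terabytes-Written figure is reached."""
    return tbw_terabytes * 1000.0 / gb_written_per_day

scenarios = [
    ("consumer SSD, 72 TBW, one CEP2 task (~300 GB/day)",        72,  300),
    ("consumer SSD, 72 TBW, eight tasks (~1 TB/day)",            72, 1000),
    ("write-oriented SSD, ~1 PB TBW, eight tasks (~1 TB/day)", 1000, 1000),
]
for label, tbw, daily_gb in scenarios:
    print(f"{label}: ~{days_to_rated_tbw(tbw, daily_gb):.0f} days to rated endurance")
# Roughly 240, 72 and 1000 days -- which is why over-provisioning or a
# write-oriented drive makes sense for heavy CEP2 crunching.
[/code]
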
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
CEP2 is extremely I/O-intense during setup. When a task is finishing and uploading and a new one is started [I control this with app_config to get a seamless one-finishes, one-starts handover - a minimal example follows after this post], there is heavy competition for CPU time, the storage subsystem, AND the network. And if the data is complex enough that any form of pre-transmission compression and packaging gives only a fractional file-size reduction, you're not going to bother and will just transmit as-is.

As with a few other projects at WCG, I would love to see symlinked static files for CEP2 [~6,700 files, yes you read that right], so that when running more than one task there is only one copy on the system. The copying and setup of a task is very time-consuming, so much so that actual computing may not start until minutes later, even on moderately powerful systems. And if you run with a ramdrive to speed things up and run multiple tasks, that eats space fast. Anyway, that's the last from me on this. There is endless repetitive discussion on the intricacies of CEP2 and how to tweak it [there's a CEP2 settings sheet somewhere]. The science runs on commercial software [Q-Chem], so they can figure it out [I think WCG/IBM told them a few things about how to improve it, having had the source code for security vetting and grid-enabling].
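
For reference, the app_config mechanism mentioned above: a file named app_config.xml placed in the project directory (typically projects/www.worldcommunitygrid.org inside the BOINC data directory) limits how many CEP2 tasks run concurrently. A minimal sketch follows - the short application name "cep2" is my assumption, so check the <app> entries in your client_state.xml for the exact spelling.

[code]
<app_config>
    <app>
        <name>cep2</name>                   <!-- verify the short name in client_state.xml -->
        <max_concurrent>1</max_concurrent>  <!-- run at most one CEP2 task at a time -->
    </app>
</app_config>
[/code]

The client picks the file up after re-reading config files from the BOINC Manager or after a client restart.
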