Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Request to zip large datasets

Hi :)

I had a quick look through this forum and didn't see this question posted before, so here goes!

Today a rather large file was transferred to me, which is a concern because of the limited/shared bandwidth where I stay.

I have set a conservative daily limit on maximum transfer size, but it would be better to receive a smaller file than to wait longer (e.g. a few days) before I can process it.

Would it be possible to transfer the file in zipped format and have the BOINC client unzip it?

Size comparison below as an example.

File: mcm1.dataset-17_72_SDG_v1.txt
Transfer size: 32 714 954 bytes
Zipped size: 10 018 609 bytes

This would obviously reduce server bandwidth requirements as well, and might benefit people using a mobile client (where operators charge per MB).

Thanks for taking the time to answer my question :)
----------------------------------------
[Edited 1 time; last edit by Former Member at Mar 31, 2014 7:45:29 AM]
[Mar 31, 2014 7:27:53 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Request to zip large datasets

As far as can be seen, the large 30 MB txt file is a one-time transfer. It seems to serve as a reference database for all the following mcm1 tasks. Why the file is not transmitted in pre-compressed form is unknown, but given that files are already being compressed and decompressed in real time in transit over the internet, it is not really a bandwidth concern. The compressed files arrive here at a few hundred KB/s, while the txt file comes in at well over 2 MB/s, which points to the real-time decompression happening on the agent side.

While writing this: the lack of pre-compression may have something to do with the file's multi-use. All mcm1 tasks symlink to the file, and nothing that is transmitted in zip form gets unzipped while in the project folder; such files are stored as-is. The agent may lack the functionality to do that, it perhaps being a responsibility of the science application? Just a stab at the unknown why, but either way the file appears to be transmitted only once, not with every new task.
[Mar 31, 2014 8:11:13 AM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Re: Request to zip large datasets

bjorntfn,

BOINC automatically handles compression for transferred files so the actual amount transferred for that file should be far less than 32M. Also, as lavaflow pointed out, for MCM1 the same dataset file is used for many different work units, and BOINC is smart enough to only transfer it once to save some bandwidth there as well.

Seippel
[Mar 31, 2014 8:43:33 PM]
BobCat13
Senior Cruncher
Joined: Oct 29, 2005
Post Count: 295
Re: Request to zip large datasets

seippel wrote:
    bjorntfn,

    BOINC automatically handles compression for transferred files so the actual amount transferred for that file should be far less than 32M. Also, as lavaflow pointed out, for MCM1 the same dataset file is used for many different work units, and BOINC is smart enough to only transfer it once to save some bandwidth there as well.

    Seippel

The bolded part ("BOINC is smart enough to only transfer it once") is not correct, at least for me. If a machine is running multiple BOINC projects, or even multiple applications here at WCG, and it finishes its last MCM1 task and reports it without receiving a new MCM1 task, the dataset txt file is deleted from the project directory and is downloaded again once a new MCM1 task is received.

Here are some messages taken from the client log since the first of March:

02-Mar-2014 01:54:05 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
02-Mar-2014 08:46:59 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
06-Mar-2014 15:03:35 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
08-Mar-2014 02:37:01 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
08-Mar-2014 20:33:25 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
10-Mar-2014 07:27:03 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
10-Mar-2014 20:49:54 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
11-Mar-2014 07:45:35 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
12-Mar-2014 17:14:06 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
14-Mar-2014 05:17:48 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
15-Mar-2014 04:49:55 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
15-Mar-2014 09:35:21 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
21-Mar-2014 15:54:50 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
23-Mar-2014 03:46:18 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
23-Mar-2014 17:03:50 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
24-Mar-2014 06:11:49 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
26-Mar-2014 12:02:52 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
27-Mar-2014 22:25:33 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
29-Mar-2014 17:00:18 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
31-Mar-2014 07:42:28 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt

The client just reported an MCM1 task about 3 hours ago, doesn't currently have one in its queue, and the dataset txt file was deleted.

Edit:
Here are messages from a Linux machine that runs multiple projects, so there may be times when it has no WCG tasks. After it reported the one WCG MCM1 task below, it was empty of WCG tasks; it later requested more work and had to download the dataset txt file again.

31-Mar-2014 05:03:56 [WCG] Sending scheduler request: To report completed tasks.
31-Mar-2014 05:03:56 [WCG] Reporting 1 completed tasks
31-Mar-2014 05:03:56 [WCG] Not requesting tasks: don't need
31-Mar-2014 05:03:58 [WCG] Scheduler request completed

31-Mar-2014 07:38:30 [WCG] Sending scheduler request: To fetch work.
31-Mar-2014 07:38:30 [WCG] Requesting new tasks for CPU
31-Mar-2014 07:38:33 [WCG] Scheduler request completed: got 3 new tasks

31-Mar-2014 07:39:05 [WCG] Started download of mcm1.dataset-17_72_SDG_v1.txt
31-Mar-2014 07:39:16 [WCG] Finished download of mcm1.dataset-17_72_SDG_v1.txt
----------------------------------------
[Edited 1 time; last edit by BobCat13 at Apr 1, 2014 3:25:36 AM]
[Apr 1, 2014 3:07:30 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Request to zip large datasets

That's the brainy part of the version 7 agents; volunteers had been complaining of crud collecting over time. Can you imagine clients that get on and off CEP2 with those 200 MB sets? Maybe this cleaning process would best run only at a restart?

My client has been fighting to get new copies of both the 7.16 and 7.20 fahv since the new logo launch, and failing due to a signature / MD5 mismatch, so it's SIMAPping in the meantime.
[Apr 1, 2014 9:22:34 AM]
BobCat13
Senior Cruncher
Joined: Oct 29, 2005
Post Count: 295
Re: Request to zip large datasets

CEP2 never deleted its 200 MB set when there was no CEP2 task present on the machine, and neither did HPF2. Rosetta also doesn't delete its large minirosetta database if you have no Rosetta tasks present. This behavior is specific to the MCM1 dataset txt file.

Just a guess here, but since MCM1 will use different datasets over its run, the files are not referenced under the <app_version> section in client_state.xml, so they get deleted if no MCM1 tasks are present in the queue. If I remember correctly, there was an option to mark files <static/> in client_state.xml, which would keep them on the system even if no tasks that used them were present. That option may have been removed in version 7 clients, but I'm not sure. The <static/> option would also require the WCG techs to remove the flag once a dataset is no longer needed, so if the techs forgot, you could end up with a bunch of unneeded files taking up disk space.
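
For illustration, here is roughly what such an entry might look like in client_state.xml. This is a sketch from memory, not taken from an actual client: recent BOINC clients have a <sticky/> flag on <file_info> entries, which may or may not be the same mechanism as the <static/> option recalled above.

    <file_info>
        <name>mcm1.dataset-17_72_SDG_v1.txt</name>
        <nbytes>32714954.000000</nbytes>
        <sticky/>    <!-- hypothetical here: keep the file even when no task references it -->
    </file_info>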
----------------------------------------
[Edited 1 time; last edit by BobCat13 at Apr 1, 2014 2:35:49 PM]
[Apr 1, 2014 2:35:12 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Request to zip large datasets

Let's extend the bolded line: "BOINC is smart enough to only transfer it once to save some bandwidth there as well", as long as it's deemed needed for the current batches being processed on the node. The file not being replicated time and again into the slots, but symlinked for each task, makes the whole processing / storage arrangement substantially more efficient than what's seen with CEP2 in particular.
[Apr 1, 2014 2:54:29 PM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Re: Request to zip large datasets

My previous comment was a bit of an oversimplification. Work unit files are categorized into three different types:
1) Files which are specific to a single work unit. These are sent when the work unit is downloaded and deleted after it's done.
2) Files which span a large number of work units but not the life of the project. These are downloaded once, but deleted from the client machine once all work units that need them are finished. The MCM1 dataset files fall into this category.
3) Files which are needed by all (or nearly all) work units for a project. These are kept even if no work units that need them are currently downloaded. The qcaux.zip files for CEP2 fall into this last category.
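
As a rough sketch of how that distinction might be expressed (assuming BOINC's standard work unit input-template format; the <sticky/> flag name comes from the BOINC documentation, and WCG's actual templates are not shown here):

    <file_info>
        <number>0</number>
        <!-- no <sticky/> flag: the client deletes this file once the last
             work unit referencing it is finished (categories 1 and 2) -->
    </file_info>
    <file_info>
        <number>1</number>
        <sticky/>    <!-- kept on the client even with no tasks present (category 3) -->
    </file_info>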

Seippel
[Apr 1, 2014 8:12:37 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Reply to this Post  Reply with Quote 
Re: Request to zip large datasets

Thanks all for the responses.

I am still not convinced...

The issue I'm asking about is not whether the file is reused by multiple work units, although that does sound like a good idea; it saves bandwidth for people like me :)
Thanks for the good explanation of this mechanic.
In theory, BOINC could hang onto the file until the storage is needed for something else, instead of deleting it, if there is potential for the file to be used again soon.
You might find that some users have set conservative values for allowed storage, and that this causes the file to be deleted earlier.

Anyway... back to my question :)
I'm concerned about the actual size transferred during the initial download - this will always happen at least once.

As lavaflow explains, BOINC does allow files to be zipped and unzipped on the fly, but my setup seems to indicate that this is not happening.

It seems a bit strange that my agent reports the transfer size as the full 31 MB, and presumably counts that towards my daily transfer limit, rather than the compressed size.

Another observation: according to the BOINC documentation I could find, if the file is compressed and decompressed by BOINC, then resuming the download is not supported.
In my situation, however, when the download stalled after reaching my daily bandwidth limit, it resumed the following day (it did not restart).

This leads me to think that the file is not being compressed and decompressed behind the scenes.

Is there any way to confirm this beyond reasonable doubt?

My client's only indication of the file size is that the total for the uncompressed file is reported, and the download pauses at the corresponding number of MB when it reaches my daily limit.

I suppose that being able to resume the download is preferable to zipping it, as resuming ensures the file always reaches the client eventually (whereas zipping plus a daily limit might prevent the file from ever arriving).

Perhaps if the compression were performed explicitly by the science application, then the download could be treated as a normal download by BOINC and resumed as usual.
Could we request this?

Thanks!
----------------------------------------
[Edited 2 times; last edit by Former Member at Apr 2, 2014 7:46:00 AM]
[Apr 2, 2014 7:31:43 AM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Re: Request to zip large datasets

bjorntfn,

BOINC uses libcurl to handle the compression and transfer, so the file size BOINC shows as transferred is the full file size, not the compressed size. Since BOINC doesn't display the compressed size anywhere, testing this requires network monitoring outside of BOINC while the transfer is running. For what it's worth, I confirmed it last night using iftop while transferring the data for an MCM1 work unit; the actual amount of data transferred was just over 10 MB (the dataset file was 32 MB). One thing to keep in mind if you test that way is that resumed transfers are not compressed, so if you interrupt the transfer you may see something different.

Another option is to set the http_debug and http_xfer_debug log flags in a cc_config.xml file (information about setting up a cc_config.xml file can be found in Sekerob's "The Start Here Forum Frequently Asked Questions Index" post). This will show that BOINC requests the transfer be compressed:

[http_debug][ID#47] Sent header to server: Accept-Encoding: deflate, gzip

Again, ignore the "HTTP: wrote X bytes" messages, because those report what BOINC gets back from libcurl, which is uncompressed.
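
For reference, a minimal cc_config.xml enabling those two flags would look something like the sketch below (these log_flags names are standard BOINC options; the file goes in the BOINC data directory, after which you re-read the config files or restart the client):

    <cc_config>
        <log_flags>
            <http_debug>1</http_debug>
            <http_xfer_debug>1</http_xfer_debug>
        </log_flags>
    </cc_config>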

Seippel
[Apr 4, 2014 3:47:48 PM]