World Community Grid - View Thread - Problems with Restart and Hibernation

World Community Grid Forums

Category: Support

Forum: BOINC Agent Support

Thread: Problems with Restart and Hibernation

Thread Status: Active
Total posts in this thread: 9

This topic has been viewed 1204 times and has 8 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Problems with Restart and Hibernation

Are there any known problems with restarting BOINC? I get several errors and Inconclusive results after restarting and after coming back from hibernation.

Any hints?

And why is this happening? I have 1 GB set as the maximum.

05/07/2006 10:06:07|World Community Grid|Resuming task faah0680_d475cb052_x1hpv_01_2 using faah version 509
05/07/2006 10:11:08|World Community Grid|Aborting task faah0680_d475cb052_x1hpv_01_2: exceeded disk limit: 51245484.000000 > 50000000.000000
05/07/2006 10:11:08|World Community Grid|Unrecoverable error for result faah0680_d475cb052_x1hpv_01_2 (Maximum disk usage exceeded)
05/07/2006 10:11:08|World Community Grid|Deferring scheduler requests for 1 minutes and 0 seconds
05/07/2006 10:11:14||Rescheduling CPU: application exited
05/07/2006 10:11:14|World Community Grid|Computation for task faah0680_d475cb052_x1hpv_01_2 finished

Thanks in advance

Siegfried

[Jul 5, 2006 8:28:08 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Problems with Restart and Hibernation

I had exactly the same 50 MB error a while back. The response was that the particular WU had been compiled with too low an allowance for disk space usage, nothing to do with your HD! The advice was just to get on with the next one.

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[Jul 5, 2006 8:53:42 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Problems with Restart and Hibernation

That's a slightly more unusual error.

What you're seeing is that each work unit has a built-in disk limit, in addition to (and smaller than) the overall BOINC limit. This stops one work unit from eating all the space, and (in this case) the work unit isn't supposed to create such large temporary files.

It is perfectly possible that the hibernation has conflicted with BOINC and corrupted the data, leading to a large, out-of-control error log using up all the space. This would be consistent with the error occurring a few minutes after the work unit was resumed.

Take a look at the work unit folder, and see what files are there, and note any very large files. Of course, the work unit folder may have been cleaned up by now.

Personally, I don't find hibernation useful. It is faster to shut down and restart properly, without running any risk of corruption. However, if you can reproduce this, then it should definitely be looked into further by the WCG techs. Keep an eye on it, and if it happens again, have a look at the actual file that grew too large.

[Jul 5, 2006 9:02:54 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Problems with Restart and Hibernation

The disk space problem occurred when BOINC was restarted after a system restart.

Thanks for all information.

I see nothing special in the workunit folder:

Volume in drive C is Local Disk
Volume Serial Number is 5CC0-9DEE

Directory of C:\Program Files\BOINC\projects\www.worldcommunitygrid.org

05.07.2006 11:12 <DIR> .
05.07.2006 11:12 <DIR> ..
09.06.2006 16:19 39.057 avgE_from_pdb
09.06.2006 16:20 35.280 bbind00.Nov.lib
09.06.2006 16:19 561 bb_hbW
09.06.2006 16:20 479.180 disulf_jumps.dat
09.06.2006 16:20 7.722 dunsd
05.07.2006 10:16 1.489 faah0680_d475cb056_x1hpv_01_0_0
05.07.2006 10:16 75.671 faah0680_d475cb056_x1hpv_01_0_1
04.07.2006 10:22 3.878 faah0680_d475cb056_x1hpv_01_AD4_parameters.dat
04.07.2006 10:22 3.450 faah0680_d475cb056_x1hpv_01_d475cb056.pdbqt
04.07.2006 10:22 1.105 faah0680_d475cb056_x1hpv_01_d475cb056_x1hpv_01.gpf
04.07.2006 10:22 3.191 faah0680_d475cb056_x1hpv_01_faah0680_d475cb056_x1hpv_01.dpf
04.07.2006 10:22 184.555 faah0680_d475cb056_x1hpv_01_x1hpv.pdbqt
05.07.2006 07:41 3.878 faah0680_d475cb897_x1hpv_00_AD4_parameters.dat
05.07.2006 07:41 3.754 faah0680_d475cb897_x1hpv_00_d475cb897.pdbqt
05.07.2006 07:41 1.243 faah0680_d475cb897_x1hpv_00_d475cb897_x1hpv_00.gpf
05.07.2006 07:41 3.327 faah0680_d475cb897_x1hpv_00_faah0680_d475cb897_x1hpv_00.dpf
05.07.2006 07:41 184.555 faah0680_d475cb897_x1hpv_00_x1hpv.pdbqt
27.06.2006 07:34 11.994 hpf2.avgE_from_pdb.gz
27.06.2006 07:34 5.873.640 hpf2.bbdep02.May.sortlib.gz
27.06.2006 07:35 165 hpf2.Paa.gz
27.06.2006 07:34 1.831 hpf2.Paa_n.gz
27.06.2006 07:34 117.362 hpf2.Paa_pp.gz
27.06.2006 07:35 2.406 hpf2.paircutoffs.gz
27.06.2006 07:35 69.283 hpf2.pdbpairstats_fine.gz
27.06.2006 07:35 19.292 hpf2.phi.theta.36.HS.resmooth.gz
27.06.2006 07:35 11.718 hpf2.phi.theta.36.SS.resmooth.gz
27.06.2006 07:34 129.450 hpf2.plane_data_table_1015.dat.gz
27.06.2006 07:35 389.787 hpf2.Rama_smooth_dyn.dat_ss_6.4.gz
27.06.2006 07:35 1.113 hpf2.SASA-angles.dat.gz
27.06.2006 07:35 64.917 hpf2.SASA-masks.dat.gz
27.06.2006 07:35 2.475 hpf2.sasa_offsets.txt.gz
27.06.2006 07:35 34.441 hpf2.sasa_prob_cdf.txt.gz
28.06.2006 22:22 1.243 hpf2_5.07_win_paths.txt
09.06.2006 16:20 1.608 jump_templates.dat
09.06.2006 16:20 364 Paa
09.06.2006 16:19 6.272 Paa_n
09.06.2006 16:19 984.960 Paa_pp
09.06.2006 16:20 18.034 paircutoffs
09.06.2006 16:20 280.000 pdbpairstats_fine
09.06.2006 16:20 62.208 phi.theta.36.HS.resmooth
09.06.2006 16:20 41.472 phi.theta.36.SS.resmooth
09.06.2006 16:19 907.200 plane_data_table_1015.dat
09.06.2006 16:20 4.432.320 Rama_smooth_dyn.dat_ss_6.4
09.06.2006 16:20 5.382.144 rosetta_4.22_windows_intelx86
09.06.2006 16:20 13.613 SASA-angles.dat
09.06.2006 16:20 1.074.560 SASA-masks.dat
09.06.2006 16:20 6.731 sasa_offsets.txt
09.06.2006 16:20 13.260 sc_hbW
09.06.2006 16:20 17.838 template.pdb
05.07.2006 11:12 0 tree.dat
08.06.2006 15:44 1.146.880 wcg_faah_autodock_5.09_windows_intelx86
28.06.2006 22:52 11.730.944 wcg_hpf2_rosetta_5.07_windows_intelx86
09.06.2006 16:19 1.243 win_paths.txt
05.07.2006 07:41 232.261 za094_00322_aaza09403_05.075_v1_3.gz
05.07.2006 07:41 514.623 za094_00322_aaza09409_05.075_v1_3.gz
05.07.2006 07:41 222 za094_00322_za094.fasta.gz
05.07.2006 07:41 209 za094_00322_za094.psipred.gz
05.07.2006 07:41 716 za094_00322_za094.psipred_ss2.gz
05.07.2006 07:41 232.261 za094_00333_aaza09403_05.075_v1_3.gz
05.07.2006 07:41 514.623 za094_00333_aaza09409_05.075_v1_3.gz
05.07.2006 07:41 222 za094_00333_za094.fasta.gz
05.07.2006 07:41 209 za094_00333_za094.psipred.gz
05.07.2006 07:41 716 za094_00333_za094.psipred_ss2.gz
63 File(s) 35.380.726 bytes
2 Dir(s) 2.375.688.192 bytes free

Thanks in advance.

Siegfried

[Jul 5, 2006 9:27:17 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Problems with Restart and Hibernation

Hibernation... here in the land of unstable electricity, the Standby option is experienced as 'the' solution, not hibernating. It takes about 5 seconds to shut down and 15 to get back to where I was. The PSU provides sufficient standby juice to maintain that RAM state for a long, long time.

[Jul 5, 2006 9:31:04 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Problems with Restart and Hibernation

For reference, running work units keep their files in the slot folders, e.g. C:\Program Files\BOINC\slots\0

It will be cleaned out for the next work unit by now, but that's where the interesting stuff will be should it happen again.
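
If it does happen again, a quick way to spot the file that grew too large is to list the biggest files in the slot folder. The following is only a minimal sketch: the function name `largest_files` is made up for illustration, and the path is the one quoted above, which may differ on your installation.

```python
import os

def largest_files(folder, top=5):
    """Return (size_in_bytes, path) pairs for the biggest files under folder."""
    sizes = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # A running task may delete or lock a file mid-scan; skip it.
                continue
    sizes.sort(reverse=True)  # biggest first
    return sizes[:top]

if __name__ == "__main__":
    # Path taken from the post above; adjust it for your own installation.
    for size, path in largest_files(r"C:\Program Files\BOINC\slots\0"):
        print(f"{size:>12,d}  {path}")
```

Anything approaching the 50000000 figure from the log would be the file worth reporting.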

[Jul 5, 2006 9:33:28 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Problems with Restart and Hibernation

Thanks for the information. Nothing special in the C:\Program Files\BOINC\slots\0 folder.

Regarding hibernation, it seems that the "file not found" errors often happen after coming back from hibernation:

2006-07-05 05:33:37 [World Community Grid] Starting task faah0680_d475cb052_x1hpv_01_2 using faah version 509
2006-07-05 07:32:44 [World Community Grid] Task faah0680_d475cb052_x1hpv_01_2 exited with zero status but no 'finished' file
2006-07-05 07:32:44 [World Community Grid] If this happens repeatedly you may need to reset the project.
2006-07-05 07:32:44 [---] Rescheduling CPU: application exited

Errors in the status information:

Checkpoint complete
call_glss(): pop_size: 200 num_evals: 10000000 start: [09:42:39]
call_glss(): end: [09:57:30]
wcg_checkpoint() called
Starting to checkpoint ...
Failed to open wcg_checkpoint.dat for reading. rc: 2. File doesn't exist?
INFO: CPU Idle Factor is 0.000000
World Community Grid AutoDock (projects/www.worldcommunitygrid.org/wcg_faah_autodock_5.09_windows_intelx86) version Failed to get VersionInfo size: 1812

Failed to open receptor.maps.fld for reading. rc: 2. File doesn't exist?
INFO:[10:06:12] Start AutoGrid...

autogrid: autogrid4: Successful Completion.
wcg_checkpoint() called
Starting to checkpoint ...
Checkpoint complete
INFO:[10:09:02] End AutoGrid...
Beginning AutoDock...
INFO: Setting num_generations: 27000
Setting maxGen to 6750
Failed to open wcg_faah.state for reading. rc: 2. File doesn't exist?

Thanks in advance.

Siegfried

[Jul 5, 2006 9:43:39 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Problems with Restart and Hibernation

No, that's perfectly normal. It's just looking for files to see if it needs to restart from a checkpoint, or whether it should start at the beginning. All your FAAH work units will log something similar to this.
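
To illustrate the general pattern (this is not the actual FAAH code, just a sketch under assumptions), a program that supports checkpoints typically tries to open its checkpoint file at startup and simply starts from the beginning when the file is missing. The file name `demo_checkpoint.dat` and both function names below are hypothetical. The standard "no such file" error code is 2, which may be what the "rc: 2" in those log lines refers to.

```python
import os
import tempfile

def read_checkpoint(path):
    """Return the saved step number, or None when no checkpoint file exists."""
    try:
        with open(path) as f:
            return int(f.read())
    except FileNotFoundError as exc:
        # A missing file is expected on a fresh start, so log it and carry on.
        print(f"Failed to open {path} for reading. rc: {exc.errno}. Starting fresh.")
        return None

def write_checkpoint(path, step):
    """Save the current step number so a later run can resume from it."""
    with open(path, "w") as f:
        f.write(str(step))

workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "demo_checkpoint.dat")  # hypothetical file name

first = read_checkpoint(ckpt)    # no file yet: logs a message, returns None
write_checkpoint(ckpt, 42)
second = read_checkpoint(ckpt)   # file exists now: returns the saved step
```

The first call logs a "Failed to open" message much like the ones quoted above, yet nothing has gone wrong; it only means there was no earlier progress to resume.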

[Jul 5, 2006 10:10:27 AM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline


Re: Problems with Restart and Hibernation

BOINC allows the project to set a limit on each workunit that causes the workunit to abort if the amount of disk used by the workunit exceeds this threshold. It prevents an application from running amok and dumping out tons of data if something goes wrong.

You got this message because we set the disk limit for FightAIDS@Home at 50000000 bytes. However, there have been about 30 workunits (out of the tens of thousands that we have run) that needed more disk space than this. All new FightAIDS@Home batches on BOINC are now set to 75000000 (there are still a few old ones going through, though).

We apologize for the problem.
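
For anyone who wants to check their own message log against these limits, here is a small sketch that pulls the numbers out of an "exceeded disk limit" line. The regular expression is an assumption based only on the single log line quoted at the top of this thread, and `parse_disk_abort` is a made-up helper name, not part of BOINC.

```python
import re

# Matches the abort message format seen in the log at the top of this thread.
ABORT_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+) > (?P<limit>[\d.]+)"
)

def parse_disk_abort(line):
    """Return task name, bytes used, limit and overshoot, or None if no match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    used = float(m.group("used"))
    limit = float(m.group("limit"))
    return {
        "task": m.group("task"),
        "used": used,
        "limit": limit,
        "over_by": used - limit,
    }

line = (
    "05/07/2006 10:11:08|World Community Grid|Aborting task "
    "faah0680_d475cb052_x1hpv_01_2: exceeded disk limit: "
    "51245484.000000 > 50000000.000000"
)
info = parse_disk_abort(line)
print(info["task"], "was over the limit by", info["over_by"], "bytes")
print("fits under the new 75000000 limit:", info["used"] <= 75_000_000)
```

For the task in this thread, the overshoot works out to 1,245,484 bytes, so the same workunit would have fit comfortably under the new 75000000 limit.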

----------------------------------------
[Edit 2 times, last edit by knreed at Jul 9, 2006 2:44:17 AM]

[Jul 9, 2006 2:39:25 AM]
