Thread Status: Active
Total posts in this thread: 15
This topic has been viewed 3730 times and has 14 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Lost over 8 hours of calculating time

I am running BETA - The Clean Energy Project - Phase 2 6.25 on my laptop. Beta_E200499_593_A.24.C18H12N2S2SEsI.33.2.set1d06_2

It was running all day at work on high priority, and I had to shut down and go home. When I got there it started all over again at 0%; all the work done during the day was gone. I've never seen that happen before, so I thought I'd post a heads-up for everyone.

I am running version 6.2.28. I am concerned that it will not complete by morning and the due date is 14 hours away. It will be the first time I didn't make a due date, other than a computer crash.
[Sep 24, 2010 3:41:11 AM]
anhhai
Veteran Cruncher
Joined: Mar 22, 2005
Post Count: 839
Status: Offline
Re: Lost over 8 hours of calculating time

drscott, if you let it run all night, it will complete by morning. CEP2 WUs have a 12 hr maximum run time (except in a few error situations). The WUs for this project have been known to give headaches to a lot of crunchers, because the time between checkpoints is long, very long. The only thing I can recommend is that if you decide to run CEP2 (which you have to manually add under preferences), then when you go home, just put your laptop into suspend mode instead of shutting it down.
I hope that the WCG techs programmed in some sort of fail-safe so that if a WU restarts too many times, it just errors out, because some crunchers may not know that they may be losing a lot of work if they shut down their system. And the truth is that most crunchers only run their system a few hours a day (of course these crunchers will probably not manually add this project to their list), so there may be a lot of lost work when this project starts.
----------------------------------------

[Sep 24, 2010 4:21:40 AM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Re: Lost over 8 hours of calculating time

I hope that the WCG techs programmed in some sort of fail-safe so that if a WU restarts too many times, it just errors out, because some crunchers may not know that they may be losing a lot of work if they shut down their system.

A standard BOINC client feature is to error out any task that's restarted 100 times in a row from the same checkpoint.
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Sep 24, 2010 5:01:56 AM]
anhhai
Veteran Cruncher
Joined: Mar 22, 2005
Post Count: 839
Status: Offline
Re: Lost over 8 hours of calculating time

100 times at 3 or 4 hrs before the 1st checkpoint means 300-400 hrs wasted. Man, that's more than the 10 day time limit for a WU. I really hope we don't end up losing a bunch of crunch time. Of course, I am sure the techs know what they are doing. But we will find out soon enough.
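
The arithmetic in this post can be checked with a short sketch. It assumes only the figures quoted in the thread (100 restarts, 3 or 4 hours to the first checkpoint, a 10 day deadline); the function and constant names are made up for illustration and do not come from BOINC.

```python
# Worst-case lost time if a task restarts from the same point every time.
# All figures come from the posts above; the names are illustrative only.
MAX_RESTARTS = 100   # restart limit mentioned for the BOINC client
DEADLINE_DAYS = 10   # time limit for a WU mentioned in the post


def worst_case_hours(hours_to_first_checkpoint, restarts=MAX_RESTARTS):
    """Hours wasted if every restart loses the full run-up to the checkpoint."""
    return hours_to_first_checkpoint * restarts


deadline_hours = DEADLINE_DAYS * 24

for hours in (3, 4):
    lost = worst_case_hours(hours)
    print(f"{hours} h per attempt: {lost} h lost, deadline is {deadline_hours} h")
```

Under these assumptions the worst case (300 to 400 hours) does exceed the 240 hour deadline, though that requires every one of the 100 restarts to happen just before the first checkpoint.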
----------------------------------------

[Sep 24, 2010 5:13:43 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Lost over 8 hours of calculating time

drscott, if you let it run all night, it will complete by morning. CEP2 WUs have a 12 hr maximum run time (except in a few error situations).

No, this was just incomplete reporting. That was inflated Elapsed time! There's yet to be a single confirmed report that the CEP2 jobs have not cut off at 12:00 CPU hours (real CPU time).
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Sep 24, 2010 7:29:51 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Lost over 8 hours of calculating time

100 times at 3 or 4 hrs before the 1st checkpoint means 300-400 hrs wasted. Man, that's more than the 10 day time limit for a WU. I really hope we don't end up losing a bunch of crunch time. Of course, I am sure the techs know what they are doing. But we will find out soon enough.

That was for dramatic effect. No-one shuts down their computer 100 times over the course of a task. Usually such restarts are driven by events on the device (mostly security software interfering), and then that 100 is reached very quickly. Checkpoints are highly variable at that. The first ones can be rather quick.

REMEMBER: CEP2 is expressly opt-in, and is best run only on machines that crunch long hours or 24/7.
----------------------------------------
[Edit 1 times, last edit by Sekerob at Sep 24, 2010 7:37:23 AM]
[Sep 24, 2010 7:35:25 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Lost over 8 hours of calculating time

I am running BETA - The Clean Energy Project - Phase 2 6.25 on my laptop. Beta_E200499_593_A.24.C18H12N2S2SEsI.33.2.set1d06_2

It was running all day at work on high priority, and I had to shut down and go home. When I got there it started all over again at 0%; all the work done during the day was gone. I've never seen that happen before, so I thought I'd post a heads-up for everyone.

I am running version 6.2.28. I am concerned that it will not complete by morning and the due date is 14 hours away. It will be the first time I didn't make a due date, other than a computer crash.

Hi drscott53,

A properly hibernated/standby'd laptop should not lose progress on tasks. If doing a full shutdown, it is best to exit BOINC Manager and tell it to shut down the service too. That way BOINC has time to save the task and is then able to resume from the last checkpoint.

In production it is not recommended to select CEP2 on short-running devices. As I noted, checkpoints are variable, but there can be a number of hours between them, which is why the project has a warning plus the advice to switch on LAIM (leave application in memory when preempted).

cheers and thanks for contributing to WCG.
----------------------------------------
[Sep 24, 2010 7:47:35 AM]
rilian
Veteran Cruncher
Ukraine - we rule!
Joined: Jun 17, 2007
Post Count: 1452
Status: Offline
Re: Lost over 8 hours of calculating time

This WU lost several hours of CPU time as well (usually that computer did a WU in 12.00 hours).

It was crunched in a virtual machine on a laptop (MacBook), and I had to shut it down with "power off" to save battery time.

[19:55:26] is the time when I turned the VM on yesterday.



Project Name: Beta - The Clean Energy Project - Phase 2
Created: 9/21/10
Name: BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06
Minimum Quorum: 2
Replication: 2

-------------------------

Result Name: BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2--
<core_client_version>6.5.0</core_client_version>
<![CDATA[
<stderr_txt>
ERROR: could not initialize graphics pointer in shared memory.
INFO: No state to restore. Start from the beginning.
ERROR: could not initialize graphics pointer in shared memory.
INFO: No state to restore. Start from the beginning.
[21:25:14] Number of jobs = 16
[21:25:14] Starting job 0,CPU time has been restored to 0.000000.
[21:38:25] Finished Job #0
[21:38:25] Starting job 1,CPU time has been restored to 445.421875.
[22:05:35] Finished Job #1
[22:05:35] Starting job 2,CPU time has been restored to 1338.765625.
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
ERROR: could not initialize graphics pointer in shared memory.
[10:55:11] Number of jobs = 16
[10:55:11] Starting job 2,CPU time has been restored to 1338.765625.
ERROR: could not initialize graphics pointer in shared memory.
[19:55:26] Number of jobs = 16
[19:55:26] Starting job 2,CPU time has been restored to 1338.765625.
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
ERROR: could not initialize graphics pointer in shared memory.
[22:16:31] Number of jobs = 16
[22:16:31] Starting job 2,CPU time has been restored to 1338.765625.
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
ERROR: could not initialize graphics pointer in shared memory.
[10:53:14] Number of jobs = 16
[10:53:14] Starting job 2,CPU time has been restored to 1338.765625.
Application exited with RC = 0xc0000005
[16:00:23] Finished Job #2
[16:00:23] Starting job 3,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #3
[16:00:23] Starting job 4,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #4
[16:00:23] Starting job 5,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #5
[16:00:23] Starting job 6,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #6
[16:00:23] Starting job 7,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #7
[16:00:23] Starting job 8,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #8
[16:00:23] Starting job 9,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #9
[16:00:23] Starting job 10,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #10
[16:00:23] Starting job 11,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #11
[16:00:23] Starting job 12,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #12
[16:00:23] Starting job 13,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #13
[16:00:23] Starting job 14,CPU time has been restored to 10132.140625.
[16:00:24] Skipping Job #14
[16:00:24] Starting job 15,CPU time has been restored to 10132.140625.
[16:00:24] Skipping Job #15
called boinc_finish

</stderr_txt>
]]>

------------------------

23-Sep-2010 19:55:23 [---] Preferences limit disk usage to 3.20GB
23-Sep-2010 19:55:23 [World Community Grid] Restarting task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 using beta11 version 625
23-Sep-2010 22:16:26 [World Community Grid] Task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 exited with zero status but no 'finished' file
23-Sep-2010 22:16:26 [World Community Grid] If this happens repeatedly you may need to reset the project.
23-Sep-2010 22:16:29 [World Community Grid] Restarting task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 using beta11 version 625
24-Sep-2010 10:52:29 [World Community Grid] Task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 exited with zero status but no 'finished' file
24-Sep-2010 10:52:31 [World Community Grid] If this happens repeatedly you may need to reset the project.
24-Sep-2010 10:52:46 [World Community Grid] Restarting task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 using beta11 version 625
24-Sep-2010 16:00:54 [World Community Grid] Computation for task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 finished
24-Sep-2010 16:00:58 [World Community Grid] Started upload of BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2_0
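
In the stderr output above, job 2 is started several times with the same restored CPU time (1338.765625, presumably seconds), which is what a restart from the same checkpoint looks like. A minimal sketch for counting such repeats is shown here. The sample lines are copied from the log; the regular expression and the `count_starts` helper are illustrative assumptions, not part of BOINC or the CEP2 application.

```python
import re
from collections import Counter

# A few "Starting job" lines copied from the stderr output above.
STDERR_SAMPLE = """\
[21:25:14] Starting job 0,CPU time has been restored to 0.000000.
[21:38:25] Starting job 1,CPU time has been restored to 445.421875.
[22:05:35] Starting job 2,CPU time has been restored to 1338.765625.
[10:55:11] Starting job 2,CPU time has been restored to 1338.765625.
[19:55:26] Starting job 2,CPU time has been restored to 1338.765625.
[22:16:31] Starting job 2,CPU time has been restored to 1338.765625.
[10:53:14] Starting job 2,CPU time has been restored to 1338.765625.
[16:00:23] Starting job 3,CPU time has been restored to 10132.140625.
"""

# Matches the job number and the restored CPU time on each "Starting job" line.
START_RE = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)\."
)


def count_starts(stderr_text):
    """Count how often each (job, restored CPU time) pair appears in the log."""
    starts = Counter()
    for match in START_RE.finditer(stderr_text):
        job = int(match.group(1))
        cpu_time = float(match.group(2))
        starts[(job, cpu_time)] += 1
    return starts


for (job, cpu_time), n in sorted(count_starts(STDERR_SAMPLE).items()):
    note = f"  <- {n - 1} restart(s) from the same point" if n > 1 else ""
    print(f"job {job} at CPU time {cpu_time}: started {n} time(s){note}")
```

Any pair that shows up more than once marks work that was redone after a restart. In this log only job 2 repeats.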
----------------------------------------
----------------------------------------
[Edit 3 times, last edit by rilian at Sep 24, 2010 2:26:57 PM]
[Sep 24, 2010 2:18:21 PM]
rainforest1155
Cruncher
Joined: Mar 28, 2007
Post Count: 6
Status: Offline
Re: Lost over 8 hours of calculating time

A properly hibernated/standby'd laptop should not lose progress on tasks. If doing a full shutdown, it is best to exit BOINC Manager and tell it to shut down the service too. That way BOINC has time to save the task and is then able to resume from the last checkpoint.

Hibernating still does pause WUs for me. BOINC detects that the system suspends and stops all WUs. Once the system is turned back on, CEP2 WUs start at the last checkpoint (which may result in hours of work lost).

At least it's that way for me on my i7-860 desktop machine running Win7 64bit.

Sebastian
[Sep 25, 2010 12:04:40 AM]
anhhai
Veteran Cruncher
Joined: Mar 22, 2005
Post Count: 839
Status: Offline
Re: Lost over 8 hours of calculating time

rainforest, you probably don't have "Leave application in memory while suspended" set. It is under Advanced -> Preferences -> "disk and memory usage".
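
For anyone who prefers to check the setting outside the Manager, here is a rough sketch. It assumes local preferences are stored as XML in a `global_prefs_override.xml` file with a `<leave_apps_in_memory>` element, which is how BOINC clients usually keep local overrides; verify this against your own client version. The sample XML and the `override_sets_laim` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of a local preferences override file.
SAMPLE_OVERRIDE = """\
<global_preferences>
    <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
"""


def override_sets_laim(prefs_xml):
    """Return True if the XML text turns on leave_apps_in_memory."""
    root = ET.fromstring(prefs_xml)
    value = root.findtext("leave_apps_in_memory")
    if value is None:
        # Element absent: this file does not enable the option.
        return False
    return float(value) != 0


print(override_sets_laim(SAMPLE_OVERRIDE))
```

To check a real installation, read the file from the BOINC data directory and pass its text to the helper; the location of that directory depends on the operating system and install options.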
----------------------------------------

[Sep 25, 2010 12:29:05 AM]