World Community Grid Forums
Category: Beta Testing | Forum: Beta Test Support Forum | Thread: Lost over 8 hours of calculating time
Thread Status: Active Total posts in this thread: 15
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I am running BETA - The Clean Energy Project - Phase 2 6.25 on my laptop. Beta_E200499_593_A.24.C18H12N2S2SEsI.33.2.set1d06_2
It ran all day at work on high priority, then I had to shut down and go home. When I got home, it started over at 0%; all the work done during the day was gone. I've never seen that happen before, so I thought I'd post a heads-up for everyone. I am running version 6.2.28. I am concerned that it will not complete by morning, and the due date is 14 hours away. It would be the first time I missed a due date for any reason other than a computer crash.
||
|
anhhai
Veteran Cruncher Joined: Mar 22, 2005 Post Count: 839 Status: Offline Project Badges: |
drscott, if you let it run all night, it will complete by morning. CEP2 WUs have a 12 hr maximum run time (except in a few error situations). The WUs for this project have been known to give a lot of crunchers headaches because the time between checkpoints is long, very long. The only thing I can recommend is that if you decide to run CEP2 (which you have to add manually under preferences), then when you go home, just put your laptop into suspend mode instead of shutting it down.
I hope that the WCG techs programmed in some sort of fail-safe so that a WU that restarts too many times simply errors out, because some crunchers may not know that they may be losing a lot of work if they shut down their system. And the truth is that most crunchers only run their system a few hours a day (of course, these crunchers will probably not manually add this project to their list), so there may be a lot of lost work when this project starts.
||
|
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline Project Badges: |
Quoting anhhai: "I hope that the WCG techs programmed in some sort of fail-safe so that a WU that restarts too many times simply errors out, because some crunchers may not know that they may be losing a lot of work if they shut down their system."
A standard BOINC client feature is to error out any task that is restarted 100 times in a row from the same checkpoint.
----------------------------------------
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
||
|
anhhai
Veteran Cruncher Joined: Mar 22, 2005 Post Count: 839 Status: Offline Project Badges: |
100 restarts at 3 or 4 hrs before the 1st checkpoint means 300-400 hrs wasted. Man, that's more than the 10-day time limit for a WU. I really hope we don't end up losing a bunch of crunch time. Of course, I am sure the techs know what they are doing. But we will find out soon enough.
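Editor's note: the worst-case arithmetic in this post can be checked with a short sketch. The figures are the thread's own estimates (100 restarts, 3 to 4 hours before the first checkpoint, a 10-day limit); the function name `wasted_hours` is hypothetical and is not part of BOINC.

```python
# Worst-case estimate only: assumes every restart happens just before
# the first checkpoint, which is the scenario described in the post.
MAX_RESTARTS = 100        # restarts from the same checkpoint before the task errors out
DEADLINE_HOURS = 10 * 24  # the 10-day time limit mentioned in the post


def wasted_hours(hours_before_checkpoint, restarts=MAX_RESTARTS):
    """Hours lost if each restart discards all work since the last checkpoint."""
    return hours_before_checkpoint * restarts


for hours in (3, 4):
    lost = wasted_hours(hours)
    print(f"{hours} h per attempt: {lost} h lost, "
          f"more than the {DEADLINE_HOURS} h limit: {lost > DEADLINE_HOURS}")
```

Under these assumptions both cases exceed the 240-hour limit, which is the point the post makes.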
---------------------------------------- |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Quoting anhhai: "drscott, if you let it run all night, it will complete by morning. CEP2 WUs have a 12 hr maximum run time (except in a few error situations)."
No, this was just incomplete reporting. That was inflated elapsed time! There has yet to be a single confirmed report of CEP2 jobs not cutting off at 12:00 CPU hours (real CPU time).
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All! |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Quoting anhhai: "100 restarts at 3 or 4 hrs before the 1st checkpoint means 300-400 hrs wasted. Man, that's more than the 10-day time limit for a WU. I really hope we don't end up losing a bunch of crunch time. Of course, I am sure the techs know what they are doing. But we will find out soon enough."
That was for dramatic effect. No one shuts down their computer 100 times over the course of a task. Usually such restarts are driven by events on the device (mostly security software interfering), and then that 100 is reached very quickly. Checkpoint intervals are highly variable at that; the first ones can come rather quickly. REMEMBER: CEP2 is expressly opt-in and is best run only on machines that crunch long hours or 24/7.
[Edit 1 times, last edit by Sekerob at Sep 24, 2010 7:37:23 AM]
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Quoting the original post: "I am running BETA - The Clean Energy Project - Phase 2 6.25 on my laptop. Beta_E200499_593_A.24.C18H12N2S2SEsI.33.2.set1d06_2 It was running all day at work on high priority, I had to shutdown and go home. When I got there it started all over again at 0%, all the work done during the day was gone. I've never seen that happen before so I thought I'd post a heads up for everyone. I am running version 6.2.28. I am concerned that it will not complete by morning and the due date is 14 hours away. It will be the first time I didn't make a due date, other than a computer crash."
Hi drscott53,
A properly hibernated or suspended laptop should not lose progress on tasks. If you are doing a full shutdown, it is best to exit BOINC Manager and tell it to shut down the service too. That way BOINC has time to save the task and is then able to resume from the last checkpoint. In production, it is not recommended to select CEP2 on devices that run only for short periods. As I noted, checkpoint intervals are variable but can be a number of hours, which is why the project carries a warning plus the advice to switch on LAIM (leave application in memory when preempted).
Cheers, and thanks for contributing to WCG.
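Editor's note: a minimal sketch of the behaviour described in this post, assuming a fixed checkpoint interval (the thread says real CEP2 intervals vary). The function and the 9-hour interval are hypothetical illustrations, not BOINC code or measured values.

```python
def surviving_hours(worked_hours, checkpoint_interval_hours, kept_in_memory):
    """Return how many hours of work survive an interruption.

    kept_in_memory=True models a suspend with LAIM on, as described above:
    the task stays in memory and nothing is lost.
    kept_in_memory=False models a shutdown: the task resumes from the last
    completed checkpoint, so work done since then is discarded.
    """
    if kept_in_memory:
        return worked_hours
    completed = int(worked_hours // checkpoint_interval_hours)
    return completed * checkpoint_interval_hours


# Hypothetical case: 8 hours of work, first checkpoint not reached until hour 9.
print(surviving_hours(8, 9, kept_in_memory=False))  # shutdown
print(surviving_hours(8, 9, kept_in_memory=True))   # suspend with LAIM
```

With an interval longer than the working day, a shutdown keeps nothing, which would match a task restarting at 0%.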
||
|
rilian
Veteran Cruncher Ukraine - we rule! Joined: Jun 17, 2007 Post Count: 1452 Status: Offline Project Badges: |
This WU lost several hours of CPU time as well (usually that computer does a WU in 12.00 hours).
It was crunched in a virtual machine on a laptop (MacBook), and I had to shut it down with "power off" to save battery time. [19:55:26] is the time when I turned the VM on yesterday.
----------------------------------------
Project Name: Beta - The Clean Energy Project - Phase 2
Created: 9/21/10
Name: BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06
Minimum Quorum: 2
Replication: 2
-------------------------
Result Name: BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2--
<core_client_version>6.5.0</core_client_version>
<![CDATA[
<stderr_txt>
ERROR: could not initialize graphics pointer in shared memory.
INFO: No state to restore. Start from the beginning.
ERROR: could not initialize graphics pointer in shared memory.
INFO: No state to restore. Start from the beginning.
[21:25:14] Number of jobs = 16
[21:25:14] Starting job 0,CPU time has been restored to 0.000000.
[21:38:25] Finished Job #0
[21:38:25] Starting job 1,CPU time has been restored to 445.421875.
[22:05:35] Finished Job #1
[22:05:35] Starting job 2,CPU time has been restored to 1338.765625.
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
ERROR: could not initialize graphics pointer in shared memory.
[10:55:11] Number of jobs = 16
[10:55:11] Starting job 2,CPU time has been restored to 1338.765625.
ERROR: could not initialize graphics pointer in shared memory.
[19:55:26] Number of jobs = 16
[19:55:26] Starting job 2,CPU time has been restored to 1338.765625.
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
ERROR: could not initialize graphics pointer in shared memory.
[22:16:31] Number of jobs = 16
[22:16:31] Starting job 2,CPU time has been restored to 1338.765625.
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
ERROR: could not initialize graphics pointer in shared memory.
[10:53:14] Number of jobs = 16
[10:53:14] Starting job 2,CPU time has been restored to 1338.765625.
Application exited with RC = 0xc0000005
[16:00:23] Finished Job #2
[16:00:23] Starting job 3,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #3
[16:00:23] Starting job 4,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #4
[16:00:23] Starting job 5,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #5
[16:00:23] Starting job 6,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #6
[16:00:23] Starting job 7,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #7
[16:00:23] Starting job 8,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #8
[16:00:23] Starting job 9,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #9
[16:00:23] Starting job 10,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #10
[16:00:23] Starting job 11,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #11
[16:00:23] Starting job 12,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #12
[16:00:23] Starting job 13,CPU time has been restored to 10132.140625.
[16:00:23] Skipping Job #13
[16:00:23] Starting job 14,CPU time has been restored to 10132.140625.
[16:00:24] Skipping Job #14
[16:00:24] Starting job 15,CPU time has been restored to 10132.140625.
[16:00:24] Skipping Job #15
called boinc_finish
</stderr_txt>
]]>
------------------------
23-Sep-2010 19:55:23 [---] Preferences limit disk usage to 3.20GB
23-Sep-2010 19:55:23 [World Community Grid] Restarting task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 using beta11 version 625
23-Sep-2010 22:16:26 [World Community Grid] Task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 exited with zero status but no 'finished' file
23-Sep-2010 22:16:26 [World Community Grid] If this happens repeatedly you may need to reset the project.
23-Sep-2010 22:16:29 [World Community Grid] Restarting task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 using beta11 version 625
24-Sep-2010 10:52:29 [World Community Grid] Task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 exited with zero status but no 'finished' file
24-Sep-2010 10:52:31 [World Community Grid] If this happens repeatedly you may need to reset the project.
24-Sep-2010 10:52:46 [World Community Grid] Restarting task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 using beta11 version 625
24-Sep-2010 16:00:54 [World Community Grid] Computation for task BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2 finished
24-Sep-2010 16:00:58 [World Community Grid] Started upload of BETA_E200499_922_A.24.C18H12N2S2SeSi.133.1.set1d06_2_0
----------------------------------------
[Edit 3 times, last edit by rilian at Sep 24, 2010 2:26:57 PM]
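Editor's note: one way to read a stderr log like the one in this post is to count how often each job is started, since a job that starts more than once was restarted from the same point. The sketch below assumes the "Starting job N,CPU time has been restored to X." line format shown in the post; the names `START_RE`, `count_job_starts`, and `SAMPLE` are hypothetical, and the sample is a three-line excerpt, not the full log.

```python
import re
from collections import Counter

# Matches lines such as:
# [22:05:35] Starting job 2,CPU time has been restored to 1338.765625.
START_RE = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)\."
)

# Three lines excerpted from the log in the post.
SAMPLE = """\
[22:05:35] Starting job 2,CPU time has been restored to 1338.765625.
[10:55:11] Starting job 2,CPU time has been restored to 1338.765625.
[16:00:23] Starting job 3,CPU time has been restored to 10132.140625.
"""


def count_job_starts(stderr_text):
    """Count how many times each job number appears in a 'Starting job' line."""
    return Counter(int(job) for job, _cpu in START_RE.findall(stderr_text))


starts = count_job_starts(SAMPLE)
for job, count in sorted(starts.items()):
    note = "restarted" if count > 1 else "started once"
    print(f"job {job}: {count} start(s), {note}")
```

In the full log, job 2 is started five times with the same restored CPU time of 1338.765625, so every attempt before the last one began from the same saved state.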
||
|
rainforest1155
Cruncher Joined: Mar 28, 2007 Post Count: 6 Status: Offline Project Badges: |
Quoting Sekerob: "A properly hibernated or suspended laptop should not lose progress on tasks. If you are doing a full shutdown, it is best to exit BOINC Manager and tell it to shut down the service too. That way BOINC has time to save the task and is then able to resume from the last checkpoint."
Hibernating still stops WUs for me. BOINC detects that the system is suspending and stops all WUs. Once the system is turned back on, CEP2 WUs restart from the last checkpoint (which may result in hours of lost work). At least that's how it works for me on my i7-860 desktop machine running Win7 64-bit.
Sebastian
||
|
anhhai
Veteran Cruncher Joined: Mar 22, 2005 Post Count: 839 Status: Offline Project Badges: |
rainforest, you probably don't have "Leave application in memory while suspended" set. It is under Advanced -> Preferences -> "disk and memory usage".
---------------------------------------- |
||
|
|