Thread Status: Active
Total posts in this thread: 17
This topic has been viewed 2391 times and has 16 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Always Goes Back to About 68% [RESOLVED]

When I'm running The Clean Energy Project - Phase 2 6.40, the percent done and CPU time go back to about 68% and 9 hours every time it gets suspended and resumed. Is this a bug, or do I just have to make sure nothing interferes with it until I'm done?

I'm running a MacBook Pro (5,4) with Mac OS X 10.7.3 and BOINC v7.0.25.

Here's the Event Log:

Wed May 9 09:46:44 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 suspended by user
Wed May 9 09:46:50 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 resumed by user
Wed May 9 09:46:51 2012 | World Community Grid | Restarting task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 using cep2 version 640 in slot 7

And yes, I have "Leave application in memory while suspended" checked.

Thanks!
----------------------------------------
[Edit 1 times, last edit by Former Member at May 10, 2012 7:56:01 PM]
[May 9, 2012 6:32:53 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

Dear ZachZiggster,
as long as you have engaged LAIM and don't shut down your computer, you should not lose progress. The %-progress gauge is a rather crude tool, and maybe it just displays an incorrect value after the job resumes.
Best wishes from
Your Harvard CEP team
[May 9, 2012 7:06:57 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

Hello ZachZiggster,
With LAIM set in BOINC, cleanenergy is correct. If it really is going back to the checkpoint, then it becomes a puzzle for a Mac expert.

confused
Lawrence
[May 9, 2012 7:33:41 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

I checked my WCG host preferences, my Bam! preferences, and my BOINC preferences, and all of them have LAIM set. confused
----------------------------------------
[Edit 1 times, last edit by Former Member at May 9, 2012 10:32:11 PM]
[May 9, 2012 10:31:24 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

Hello ZachZiggster,
That means either Mac OS X is not doing something it should (perhaps because of some hardware fault), or all our other Mac users have failed to report (or notice) this problem when running CEP2 on their machines.

These unique failures that affect only one user are always embarrassing to diagnose for Support. It sounds like the standard bureaucratic "Not this department. Try down the hall." Even so, all I can suggest is avoiding CEP2. It does not sound like a problem that standard Mac diagnostics will catch.

Lawrence
[May 10, 2012 2:09:33 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

Hi,

Let's do the following test (for the sake of convincing us doubting Thomases, if that is how we appear): pull some other science work from WCG, it does not matter which, and let it run concurrently with one CEP2 task. Then, after a while, suspend the machine (hibernate or sleep) and power it up again. When I do that, the message log records "resume" for all tasks. If it's a machine problem [or the client not having accepted the LAIM activation], all running tasks would show "restart". If it's a CEP2 problem, only that task would show a "restart" and the others a "resume". Here's a sample of how this appears in the event log when I tested this scenario:
612 WCG 10-5-2012 6:57:14 [checkpoint] result GFAM_x1rr6_hPNP_0019736_0062_1 checkpointed
613 WCG 10-5-2012 6:58:50 [checkpoint] result GFAM_x1rr6_hPNP_0019728_0171_1 checkpointed
614 WCG 10-5-2012 6:58:54 [checkpoint] result GFAM_x1rr6_hPNP_0019741_0026_1 checkpointed
615 WCG 10-5-2012 6:59:59 [checkpoint] result GFAM_x1rr6_hPNP_0019728_0143_1 checkpointed
616 WCG 10-5-2012 7:02:07 [checkpoint] result E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 checkpointed
617 WCG 10-5-2012 7:02:18 [checkpoint] result GFAM_x1rr6_hPNP_0019741_0260_0 checkpointed
618 WCG 10-5-2012 7:02:23 [checkpoint] result GFAM_x1rr6_hPNP_0019727_0149_0 checkpointed
619 WCG 10-5-2012 7:03:26 task GFAM_x1rr6_hPNP_0019773_0224_0 suspended by user
620 WCG 10-5-2012 7:03:31 task GFAM_x1rr6_hPNP_0019773_0142_1 suspended by user
621 10-5-2012 7:04:10 Windows is suspending operations
622 10-5-2012 7:04:11 Suspending computation - requested by operating system
623 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0143_1 (left in memory)
624 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0171_1 (left in memory)
625 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019727_0149_0 (left in memory)
626 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019733_0175_1 (left in memory)
627 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019736_0062_1 (left in memory)
628 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0026_1 (left in memory)
629 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0260_0 (left in memory)
630 WCG 10-5-2012 7:04:11 [cpu_sched] Preempting E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 (left in memory)
631 10-5-2012 7:04:11 Suspending network activity - requested by operating system
632 10-5-2012 7:04:22 Resuming after OS suspension
633 10-5-2012 7:05:28 Resuming computation
634 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0143_1
635 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0171_1
636 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019727_0149_0
637 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019733_0175_1
638 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019736_0062_1
639 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0026_1
640 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0260_0
641 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0
642 10-5-2012 7:05:28 Resuming network activity
643 10-5-2012 7:05:31 Windows is resuming operations


As can be seen, all tasks checkpointed prior to hibernating, and "resume" is logged for all tasks, meaning a lossless pickup. As can also be seen, the client detects that the system is going down (it does not matter into what state) and logs this. The system then stores the memory state to disk or, in the case of sleep mode, keeps everything in memory while using a little power, so a power-up gives an instant resume, whereas coming out of hibernate can take a little while.
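For anyone who would rather not read the log by eye, the "restart" versus "resume" check can be scripted. The following is only a rough sketch: the function names are made up for illustration, and the two message shapes it looks for are simply the ones quoted in this thread, so other client versions may word things differently.

```python
import re

# Message shapes taken from the logs quoted in this thread (assumption: other
# client versions may differ):
#   "[cpu_sched] Resuming <task>"
#   "Resuming task <task> using ..." / "Restarting task <task> using ..."
# "[cpu_sched] Restarting" does not appear in the thread; the pattern merely allows it.
PATTERNS = [
    re.compile(r"\[cpu_sched\] (Resuming|Restarting) (\S+)"),
    re.compile(r"\b(Resuming|Restarting) task (\S+)"),
]

def classify_tasks(log_lines):
    """Map each task name to the set of verbs ("Resuming"/"Restarting") logged for it."""
    seen = {}
    for line in log_lines:
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                verb, task = match.groups()
                seen.setdefault(task, set()).add(verb)
                break  # one classification per line is enough
    return seen

def restarted_tasks(log_lines):
    """Return the sorted names of tasks that logged at least one "Restarting"."""
    return sorted(t for t, verbs in classify_tasks(log_lines).items()
                  if "Restarting" in verbs)

# Lines copied from the event logs posted in this thread.
sample = [
    "Wed May 9 09:46:51 2012 | World Community Grid | Restarting task "
    "E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 using cep2 version 640 in slot 7",
    "632 10-5-2012 7:04:22 Resuming after OS suspension",
    "634 WCG 10-5-2012 7:05:28 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0143_1",
    "658 WCG 10-5-2012 7:19:56 Resuming task GFAM_x1rr6_hPNP_0019773_0224_0 "
    "using gfam version 611 in slot 1",
]

for task in restarted_tasks(sample):
    print("restarted:", task)
```

Client-level lines such as "Resuming after OS suspension" are deliberately ignored, since they say nothing about an individual task.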

When individual running tasks are suspended manually [with LAIM on], the same is recorded for other WCG sciences: not a restart but a resume. Here is an example cycling through these steps:

654 WCG 10-5-2012 7:19:39 task GFAM_x1rr6_hPNP_0019773_0142_1 suspended by user
655 WCG 10-5-2012 7:19:55 task GFAM_x1rr6_hPNP_0019741_0260_0 suspended by user
656 WCG 10-5-2012 7:19:56 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0260_0 (left in memory)
657 WCG 10-5-2012 7:19:56 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019773_0224_0
658 WCG 10-5-2012 7:19:56 Resuming task GFAM_x1rr6_hPNP_0019773_0224_0 using gfam version 611 in slot 1
659 WCG 10-5-2012 7:20:02 task GFAM_x1rr6_hPNP_0019741_0260_0 resumed by user
660 WCG 10-5-2012 7:20:13 task GFAM_x1rr6_hPNP_0019773_0142_1 resumed by user
661 WCG 10-5-2012 7:20:26 task GFAM_x1rr6_hPNP_0019741_0260_0 suspended by user
662 WCG 10-5-2012 7:20:35 task GFAM_x1rr6_hPNP_0019741_0260_0 resumed by user
663 WCG 10-5-2012 7:20:45 task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 suspended by user
664 WCG 10-5-2012 7:20:46 [cpu_sched] Preempting E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 (left in memory)
665 WCG 10-5-2012 7:20:46 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0260_0
666 WCG 10-5-2012 7:20:46 Resuming task GFAM_x1rr6_hPNP_0019741_0260_0 using gfam version 611 in slot 7
667 WCG 10-5-2012 7:20:50 task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 resumed by user
668 WCG 10-5-2012 7:21:03 task GFAM_x1rr6_hPNP_0019773_0224_0 suspended by user
669 WCG 10-5-2012 7:21:04 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019773_0224_0 (left in memory)
670 WCG 10-5-2012 7:21:04 [cpu_sched] Resuming E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0
671 WCG 10-5-2012 7:21:04 Resuming task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 using cep2 version 640 in slot 8

As can be seen, all are lossless resumes. Like the other respondents, I would not know why that would be failing for CEP2 alone.

If you are willing to delve in a little, some log debug flags placed in cc_config.xml might reveal more, such as:

<heartbeat_debug>1</heartbeat_debug>
<mem_usage_debug>1</mem_usage_debug>
<cpu_sched>1</cpu_sched>

The last flag is a permanent part of my log setup, so I can see what the client scheduler is doing. The config manual is at http://boinc.berkeley.edu/wiki/Cc_config.xml; note that heartbeat_debug is new to the latest client, the one you're running, so it's not in there yet. Read in the config through the GUI menu, then hibernate / suspend and resume, see whether any hiccup is recorded, and post copies of the event logs.
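A small sketch of how those three flags could be wrapped into a complete, well-formed file. This assumes the usual cc_config.xml layout from the manual linked above, with the flags sitting inside a log_flags element under a cc_config root; the helper name build_cc_config is made up for illustration, and where you save the result depends on your BOINC data directory.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag switched on (set to 1)."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

# The three flags suggested in this post.
xml_text = build_cc_config(["heartbeat_debug", "mem_usage_debug", "cpu_sched"])

# Parsing the text back raises xml.etree.ElementTree.ParseError if a tag is
# left unclosed, which catches the most common hand-editing mistake.
ET.fromstring(xml_text)
print(xml_text)
```

Generating the file rather than typing it avoids unbalanced tags. I am not certain how each client version reacts to a malformed file, so treat the round-trip parse as a cheap sanity check, nothing more.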

In normal operation of a client in round robin, alternating computing time between WCG and other active projects on the client, any task would "restart" after being preempted when LAIM is off (the default), and "resume" when LAIM is on.

--//--

P.S. I'm also interested in the Result Log of the CEP2 task when it completes, to see whether there is a heartbeat issue, but I doubt it.
[May 10, 2012 5:48:38 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

Thought of another test. I suspended all projects except WCG so no work would be fetched while doing this test, then suspended WCG in the project tab, and after a few seconds activated WCG again. The event log shows all tasks were "resumed". No retreats to the last checkpoints:
1178 WCG 10-5-2012 9:51:28 project suspended by user
1179 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0143_1 (left in memory)
1180 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0171_1 (left in memory)
1181 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019727_0149_0 (left in memory)
1182 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019733_0175_1 (left in memory)
1183 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019736_0062_1 (left in memory)
1184 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0026_1 (left in memory)
1185 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0260_0 (left in memory)
1186 WCG 10-5-2012 9:51:28 [cpu_sched] Preempting E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 (left in memory)
1187 WCG 10-5-2012 9:51:34 project resumed by user
1188 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0143_1
1189 WCG 10-5-2012 9:51:34 Resuming task GFAM_x1rr6_hPNP_0019728_0143_1 using gfam version 611 in slot 3
1190 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0171_1
1191 WCG 10-5-2012 9:51:34 Resuming task GFAM_x1rr6_hPNP_0019728_0171_1 using gfam version 611 in slot 0
1192 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019727_0149_0
1193 WCG 10-5-2012 9:51:34 Resuming task GFAM_x1rr6_hPNP_0019727_0149_0 using gfam version 611 in slot 6
1194 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019733_0175_1
1195 WCG 10-5-2012 9:51:34 Resuming task GFAM_x1rr6_hPNP_0019733_0175_1 using gfam version 611 in slot 2
1196 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019736_0062_1
1197 WCG 10-5-2012 9:51:34 Resuming task GFAM_x1rr6_hPNP_0019736_0062_1 using gfam version 611 in slot 5
1198 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0026_1
1199 WCG 10-5-2012 9:51:34 Resuming task GFAM_x1rr6_hPNP_0019741_0026_1 using gfam version 611 in slot 4
1200 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0260_0
1201 WCG 10-5-2012 9:51:34 Resuming task GFAM_x1rr6_hPNP_0019741_0260_0 using gfam version 611 in slot 7
1202 WCG 10-5-2012 9:51:34 [cpu_sched] Resuming E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0
1203 WCG 10-5-2012 9:51:34 Resuming task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 using cep2 version 640 in slot 8


Running test client 7.0.26 on this host. Development indicates that 7.0.27 [which I've got running on another host] or higher will soon be promoted to "Recommended", as 7.0.25 is not exactly bug free [a little embarrassing so shortly after heralding this version to the production world of BOINC volunteers, so I'm skeptical]. The tests as proposed would prove whether LAIM is operating properly for WCG sciences other than CEP2, or whether a client bug on Mac is to be considered. Appreciate that if there's a bug, we'd want to have it documented without sabers being drawn.

ttyl

--//--
[May 10, 2012 8:02:44 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

I tried both tests, and everything said "Restarting" not "Resuming." However, only CEP2 went back to the last checkpoint.
[May 10, 2012 3:59:24 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

Please post the message logs so we can see all the system responses. From what you say, this seems to indicate that LAIM is not activated on your 7.0.25 install for OS X. CEP2 visibly goes back because it has only 16 checkpoints at most, whilst other sciences sometimes have hundreds and store progress every minute if they can... by the time the client gets to recompute progress from the restart checkpoint, it's already nearing the next.
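To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the checkpoints are evenly spaced over the run, which is purely an assumption for illustration; the actual CEP2 checkpoint positions are not given in this thread. The 72% progress figure and the 200-checkpoint comparison are likewise hypothetical.

```python
import math

def last_checkpoint_percent(progress_percent, checkpoints=16):
    """Progress a task falls back to on restart, ASSUMING evenly spaced checkpoints."""
    step = 100.0 / checkpoints  # 6.25 percentage points for 16 checkpoints
    return math.floor(progress_percent / step) * step

# Hypothetical task at 72% when it gets restarted.
print(last_checkpoint_percent(72.0))        # 16 checkpoints: falls back to 68.75
print(last_checkpoint_percent(72.0, 200))   # 200 checkpoints: barely moves
print(100.0 / 16)                           # worst-case loss per restart, in percentage points
```

Under that assumption a restart can cost up to 6.25 percentage points with 16 checkpoints, against a fraction of a point for a science that checkpoints hundreds of times. The 11th of 16 evenly spaced checkpoints would sit at 68.75%, which is near the "about 68%" reported at the top of the thread, though that could well be coincidence.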

Please post the full content of the global_prefs_override.xml file and your full post-boot BOINC startup message log, some 35 lines from the top, as well. If global_prefs_override.xml is not present, post the global_prefs.xml content.

TTYL

--//--

edit: typo
----------------------------------------
[Edit 1 times, last edit by Former Member at May 10, 2012 4:22:03 PM]
[May 10, 2012 4:16:44 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Always Goes Back to About 68%

Here's my global_prefs_override.xml file:

<global_preferences>
<run_on_batteries>1</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>1</run_gpu_if_user_active>
<suspend_cpu_usage>70.000000</suspend_cpu_usage>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>0</leave_apps_in_memory>
<confirm_before_connecting>0</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.100000</work_buf_min_days>
<work_buf_additional_days>0.250000</work_buf_additional_days>
<max_ncpus_pct>100.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>30.000000</cpu_scheduling_period_minutes>
<disk_interval>60.000000</disk_interval>
<disk_max_used_gb>0.000000</disk_max_used_gb>
<disk_max_used_pct>50.000000</disk_max_used_pct>
<disk_min_free_gb>0.100000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
<max_bytes_sec_up>0.000000</max_bytes_sec_up>
<max_bytes_sec_down>0.000000</max_bytes_sec_down>
<cpu_usage_limit>70.000000</cpu_usage_limit>
<daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
<daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>

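Note that the listing above carries <leave_apps_in_memory>0</leave_apps_in_memory>. A quick way to check that one setting without scanning the whole file is sketched below; the function name laim_enabled is made up for illustration, and the excerpt is cut down to two elements from the file as posted.

```python
import xml.etree.ElementTree as ET

def laim_enabled(prefs_xml):
    """Return True/False for the leave_apps_in_memory setting, or None if it is absent."""
    root = ET.fromstring(prefs_xml)
    node = root.find("leave_apps_in_memory")
    if node is None or node.text is None:
        return None
    # Values appear as "0"/"1" in the posted file; float() also tolerates "1.000000".
    return float(node.text.strip()) != 0.0

# Shortened excerpt of the global_prefs_override.xml posted above.
excerpt = """<global_preferences>
<run_on_batteries>1</run_on_batteries>
<leave_apps_in_memory>0</leave_apps_in_memory>
</global_preferences>"""

print(laim_enabled(excerpt))  # False
```

To run it against a real file, read the file's text first and pass that in; the location of global_prefs_override.xml depends on the install, so no path is assumed here.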
Here's my post-boot BOINC message:


Thu May 10 08:40:49 2012 | | Starting BOINC client version 7.0.25 for x86_64-apple-darwin
Thu May 10 08:40:49 2012 | | log flags: file_xfer, sched_ops, task
Thu May 10 08:40:49 2012 | | Libraries: libcurl/7.21.7 OpenSSL/0.9.7l zlib/1.2.5 c-ares/1.7.4
Thu May 10 08:40:49 2012 | | Data directory: /Library/Application Support/BOINC Data
Thu May 10 08:40:49 2012 | | Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz [x86 Family 6 Model 23 Stepping 10]
Thu May 10 08:40:49 2012 | | Processor features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 XSAVE
Thu May 10 08:40:49 2012 | | OS: Mac OS X 10.7.3 (Darwin 11.3.0)
Thu May 10 08:40:49 2012 | | Memory: 8.00 GB physical, 41.64 GB virtual
Thu May 10 08:40:49 2012 | | Disk: 77.47 GB total, 41.40 GB free
Thu May 10 08:40:49 2012 | | Local time is UTC -7 hours
Thu May 10 08:40:49 2012 | | VirtualBox version: 4.1.14
Thu May 10 08:40:49 2012 | | NVIDIA GPU 0: GeForce 9400M (driver version 4.2.7, CUDA version 4.20, compute capability 1.1, 254MB, 179MB available, 53 GFLOPS peak)
Thu May 10 08:40:49 2012 | | OpenCL: NVIDIA GPU 0: GeForce 9400M (driver version CLH 1.0, device version OpenCL 1.0, 256MB, 179MB available)
[May 10, 2012 4:50:42 PM]