World Community Grid - View Thread - Always Goes Back to About 68% [RESOLVED]

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: Always Goes Back to About 68% [RESOLVED]

Thread Status: Active
Total posts in this thread: 17

This topic has been viewed 2391 times and has 16 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Always Goes Back to About 68% [RESOLVED]

When I'm running The Clean Energy Project - Phase 2 6.40, the percent done and CPU time go back to about 68% and 9 hours every time it gets suspended and resumed. Is this a bug, or do I just have to make sure nothing interferes with it until I'm done?

I'm running a Macbook Pro (5,4) with 10.7.3 and BOINC v7.0.25.

Here's the Event Log:

Wed May 9 09:46:44 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 suspended by user
Wed May 9 09:46:50 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 resumed by user
Wed May 9 09:46:51 2012 | World Community Grid | Restarting task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 using cep2 version 640 in slot 7

And yes, I have "Leave application in memory while suspended" checked.

Thanks!

----------------------------------------
[Edit 1 times, last edit by Former Member at May 10, 2012 7:56:01 PM]

[May 9, 2012 6:32:53 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

Dear ZachZiggster,
As long as you have LAIM (Leave Applications In Memory) engaged and don't shut down your computer, you should not lose progress. The % progress gauge is a rather crude tool, and maybe it just displays an incorrect value after the job resumes.
Best wishes from
Your Harvard CEP team

[May 9, 2012 7:06:57 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

Hello ZachZiggster,
With LAIM set in BOINC, cleanenergy is correct. If it really is going back to the checkpoint, then it becomes a puzzle for a Mac expert.

Lawrence

[May 9, 2012 7:33:41 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

I checked my WCG host preferences, my Bam! preferences, and my BOINC preferences, and all of them have LAIM set.

----------------------------------------
[Edit 1 times, last edit by Former Member at May 9, 2012 10:32:11 PM]

[May 9, 2012 10:31:24 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

Hello ZachZiggster,
Which means that either Mac OS X is not doing something it should (perhaps because of some hardware fault), or all our other Mac users have failed to notice or report this problem when running CEP2 on their machines.

These unique failures that affect only one user are always embarrassing for Support to diagnose. It sounds like the standard bureaucratic "Not this department. Try down the hall." Even so, all I can suggest is avoiding CEP2. It does not sound like a problem that standard Mac diagnostics will catch.

Lawrence

[May 10, 2012 2:09:33 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

Hi,

Let's do the following test (for the sake of convincing us doubting Thomases, as we may appear to be): pull some other WCG science work, it does not matter which, and let it run concurrently with one CEP2 task. After a while, suspend the machine (hibernate or sleep) and then power it up again. When I do that, the message log records "resume" for all tasks. If it's a machine problem [or the client has not accepted the LAIM activation], all running tasks would show "restart". If it's a CEP2 problem, only that task would show a "restart" and the others a "resume". Here's a sample of how this appears in the event log when I tested this scenario:

612	WCG	10-5-2012 6:57:14	[checkpoint] result GFAM_x1rr6_hPNP_0019736_0062_1 checkpointed	
613	WCG	10-5-2012 6:58:50	[checkpoint] result GFAM_x1rr6_hPNP_0019728_0171_1 checkpointed	
614	WCG	10-5-2012 6:58:54	[checkpoint] result GFAM_x1rr6_hPNP_0019741_0026_1 checkpointed	
615	WCG	10-5-2012 6:59:59	[checkpoint] result GFAM_x1rr6_hPNP_0019728_0143_1 checkpointed	
616	WCG	10-5-2012 7:02:07	[checkpoint] result E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 checkpointed	
617	WCG	10-5-2012 7:02:18	[checkpoint] result GFAM_x1rr6_hPNP_0019741_0260_0 checkpointed	
618	WCG	10-5-2012 7:02:23	[checkpoint] result GFAM_x1rr6_hPNP_0019727_0149_0 checkpointed	
619	WCG	10-5-2012 7:03:26	task GFAM_x1rr6_hPNP_0019773_0224_0 suspended by user	
620	WCG	10-5-2012 7:03:31	task GFAM_x1rr6_hPNP_0019773_0142_1 suspended by user	
621			10-5-2012 7:04:10	Windows is suspending operations	
622			10-5-2012 7:04:11	Suspending computation - requested by operating system	
623	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0143_1 (left in memory)	
624	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0171_1 (left in memory)	
625	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019727_0149_0 (left in memory)	
626	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019733_0175_1 (left in memory)	
627	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019736_0062_1 (left in memory)	
628	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0026_1 (left in memory)	
629	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0260_0 (left in memory)	
630	WCG	10-5-2012 7:04:11	[cpu_sched] Preempting E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 (left in memory)	
631			10-5-2012 7:04:11	Suspending network activity - requested by operating system	
632			10-5-2012 7:04:22	Resuming after OS suspension	
633			10-5-2012 7:05:28	Resuming computation	
634	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0143_1	
635	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0171_1	
636	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019727_0149_0	
637	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019733_0175_1	
638	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019736_0062_1	
639	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0026_1	
640	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0260_0	
641	WCG	10-5-2012 7:05:28	[cpu_sched] Resuming E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0	
642			10-5-2012 7:05:28	Resuming network activity	
643			10-5-2012 7:05:31	Windows is resuming operations

As can be seen, all tasks checkpointed before hibernating, and "resume" is logged for every task, meaning a lossless pickup. The client also logs that it has detected the system going down (it does not matter into which state). The system then either stores the memory state to disk (hibernate) or, in sleep mode, keeps everything in memory on a little power, so a power-up resumes instantly, whereas coming back from hibernate can take a little while.

When individual running tasks are suspended manually [with LAIM on], the same would be recorded for other WCG sciences: not a restart but a resume. Here is an example cycling through these steps:

654	WCG	10-5-2012 7:19:39	task GFAM_x1rr6_hPNP_0019773_0142_1 suspended by user	
655	WCG	10-5-2012 7:19:55	task GFAM_x1rr6_hPNP_0019741_0260_0 suspended by user	
656	WCG	10-5-2012 7:19:56	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0260_0 (left in memory)	
657	WCG	10-5-2012 7:19:56	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019773_0224_0	
658	WCG	10-5-2012 7:19:56	Resuming task GFAM_x1rr6_hPNP_0019773_0224_0 using gfam version 611 in slot 1	
659	WCG	10-5-2012 7:20:02	task GFAM_x1rr6_hPNP_0019741_0260_0 resumed by user	
660	WCG	10-5-2012 7:20:13	task GFAM_x1rr6_hPNP_0019773_0142_1 resumed by user	
661	WCG	10-5-2012 7:20:26	task GFAM_x1rr6_hPNP_0019741_0260_0 suspended by user	
662	WCG	10-5-2012 7:20:35	task GFAM_x1rr6_hPNP_0019741_0260_0 resumed by user	
663	WCG	10-5-2012 7:20:45	task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 suspended by user	
664	WCG	10-5-2012 7:20:46	[cpu_sched] Preempting E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 (left in memory)	
665	WCG	10-5-2012 7:20:46	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0260_0	
666	WCG	10-5-2012 7:20:46	Resuming task GFAM_x1rr6_hPNP_0019741_0260_0 using gfam version 611 in slot 7	
667	WCG	10-5-2012 7:20:50	task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 resumed by user	
668	WCG	10-5-2012 7:21:03	task GFAM_x1rr6_hPNP_0019773_0224_0 suspended by user	
669	WCG	10-5-2012 7:21:04	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019773_0224_0 (left in memory)	
670	WCG	10-5-2012 7:21:04	[cpu_sched] Resuming E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0	
671	WCG	10-5-2012 7:21:04	Resuming task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 using cep2 version 640 in slot 8

As can be seen, these are all lossless resumes. Like the other respondents, I would not know why that would fail for CEP2 alone.

If you are willing to delve in a little, some log debug flags placed in cc_config.xml might reveal more, such as:
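
The "restart" versus "resume" test described above can also be checked mechanically. The following is a minimal sketch, assuming the event log has been saved as plain text lines and that the client words its messages as in the excerpts in this thread ("Restarting task ..." or "Resuming task ... using <app> version <n>"). The function names `classify_starts` and `diagnose` are made up for illustration, and the sample deliberately mixes one line from each of the two hosts quoted in this thread.

```python
import re
from collections import defaultdict

# Matches client messages such as
#   "Restarting task <name> using cep2 version 640 in slot 7"
#   "Resuming task <name> using gfam version 611 in slot 1"
TASK_START = re.compile(
    r"\b(Restarting|Resuming) task (\S+) using (\S+) version (\d+)"
)

def classify_starts(log_lines):
    """Return {app_name: set of 'Restarting'/'Resuming'} seen in the log."""
    seen = defaultdict(set)
    for line in log_lines:
        match = TASK_START.search(line)
        if match:
            verb, _task, app, _version = match.groups()
            seen[app].add(verb)
    return dict(seen)

def diagnose(seen):
    """Apply the rule of thumb from the test described in this post."""
    restarted = {app for app, verbs in seen.items() if "Restarting" in verbs}
    if not restarted:
        return "all resumed: tasks stayed in memory"
    if restarted == set(seen):
        return "all restarted: suspect the client or the LAIM setting"
    return "only some restarted: suspect " + ", ".join(sorted(restarted))

# Illustrative sample only: one line from each host quoted in this thread.
sample = [
    "Wed May 9 09:46:51 2012 | World Community Grid | Restarting task "
    "E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 using cep2 version 640 in slot 7",
    "658\tWCG\t10-5-2012 7:19:56\tResuming task "
    "GFAM_x1rr6_hPNP_0019773_0224_0 using gfam version 611 in slot 1",
]
print(diagnose(classify_starts(sample)))
```

Lines such as "[cpu_sched] Resuming ..." or "task ... resumed by user" are ignored on purpose, since only the "Restarting task" and "Resuming task" messages name the application.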

<heartbeat_debug>1</heartbeat_debug>
<mem_usage_debug>1</mem_usage_debug>
<cpu_sched>1</cpu_sched>

The last flag is a permanent part of my log setup, so I can see what the client scheduler is doing. The config manual is at http://boinc.berkeley.edu/wiki/Cc_config.xml; note that heartbeat_debug is new to the latest client, the one you're running, so it's not in there yet. Read the config in through the GUI menu, see whether hibernating or suspending and then resuming records any hiccup, and post copies of the event logs.

In normal operation of a client in round robin, alternating computing time between WCG and other active projects on the client, any task would "restart" after being preempted when LAIM is off (the default), and "resume" when LAIM is on.
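
For anyone unsure where those flags go, here is a minimal sketch that generates a complete file. It assumes the usual layout in which log flags sit inside a `<log_flags>` element under `<cc_config>`; please check that against the manual linked above. The helper name `build_cc_config` is made up for illustration, and the script only prints the XML rather than writing into the BOINC data directory.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Build a cc_config.xml document with the given log flags set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")

xml_text = build_cc_config(["heartbeat_debug", "mem_usage_debug", "cpu_sched"])
print(xml_text)
```

Save the printed text as cc_config.xml in the BOINC data directory, then have the client re-read its config files from the GUI menu.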

--//--

P.S. I'm also interested in the Result Log of the CEP2 task when it completes. Is there a heartbeat issue? I doubt it.

[May 10, 2012 5:48:38 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

Thought of another test. I suspended all projects except WCG so that no work would be fetched during the test, then suspended WCG in the Projects tab and, after a few seconds, activated WCG again. The event log shows that all tasks were "resumed", with no retreats to the last checkpoints:

1178	WCG	10-5-2012 9:51:28	project suspended by user	
1179	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0143_1 (left in memory)	
1180	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019728_0171_1 (left in memory)	
1181	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019727_0149_0 (left in memory)	
1182	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019733_0175_1 (left in memory)	
1183	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019736_0062_1 (left in memory)	
1184	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0026_1 (left in memory)	
1185	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting GFAM_x1rr6_hPNP_0019741_0260_0 (left in memory)	
1186	WCG	10-5-2012 9:51:28	[cpu_sched] Preempting E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 (left in memory)	
1187	WCG	10-5-2012 9:51:34	project resumed by user	
1188	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0143_1	
1189	WCG	10-5-2012 9:51:34	Resuming task GFAM_x1rr6_hPNP_0019728_0143_1 using gfam version 611 in slot 3	
1190	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019728_0171_1	
1191	WCG	10-5-2012 9:51:34	Resuming task GFAM_x1rr6_hPNP_0019728_0171_1 using gfam version 611 in slot 0	
1192	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019727_0149_0	
1193	WCG	10-5-2012 9:51:34	Resuming task GFAM_x1rr6_hPNP_0019727_0149_0 using gfam version 611 in slot 6	
1194	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019733_0175_1	
1195	WCG	10-5-2012 9:51:34	Resuming task GFAM_x1rr6_hPNP_0019733_0175_1 using gfam version 611 in slot 2	
1196	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019736_0062_1	
1197	WCG	10-5-2012 9:51:34	Resuming task GFAM_x1rr6_hPNP_0019736_0062_1 using gfam version 611 in slot 5	
1198	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0026_1	
1199	WCG	10-5-2012 9:51:34	Resuming task GFAM_x1rr6_hPNP_0019741_0026_1 using gfam version 611 in slot 4	
1200	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming GFAM_x1rr6_hPNP_0019741_0260_0	
1201	WCG	10-5-2012 9:51:34	Resuming task GFAM_x1rr6_hPNP_0019741_0260_0 using gfam version 611 in slot 7	
1202	WCG	10-5-2012 9:51:34	[cpu_sched] Resuming E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0	
1203	WCG	10-5-2012 9:51:34	Resuming task E207594_184_C.27.C21H10N2S3Se.01535202.0.set1d06_0 using cep2 version 640 in slot 8

I'm running test client 7.0.26 on this host. Development indicates that 7.0.27 [which I've got running on another host] or higher will soon be promoted to "Recommended", as 7.0.25 is not exactly bug free [a little embarrassing so shortly after heralding this version to the production world of BOINC volunteers, so I'm skeptical]. The tests as proposed would prove whether LAIM is operating properly for WCG sciences other than CEP2, or whether a client bug on Mac should be considered. Please appreciate that if there's a bug, we'd want to have it documented without sabers being drawn.

ttyl

--//--

[May 10, 2012 8:02:44 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

I tried both tests, and everything said "Restarting" not "Resuming." However, only CEP2 went back to the last checkpoint.

[May 10, 2012 3:59:24 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

Please post the message logs so we can see all the system responses. From what you say, this seems to indicate that LAIM is not activated on your 7.0.25 install for OS X. CEP2 visibly goes back because it has only 16 checkpoints at most, whilst other sciences sometimes have hundreds and store progress every minute if they can, so by the time the client has recomputed progress from the restart checkpoint, the task is already nearing the next one.

Please post the full content of the global_prefs_override.xml file and your full post-boot BOINC startup message log (some 35 lines from the top) as well. If global_prefs_override.xml is not present, post the global_prefs.xml content.
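
To put rough numbers on why a restart is so visible with few checkpoints, here is a small sketch. It assumes checkpoints are evenly spaced across the work unit, which is a simplification for illustration and not something confirmed for CEP2 or any other science; the function name `progress_after_restart` and the figures of 72% and 200 checkpoints are likewise made up.

```python
import math

def progress_after_restart(progress_pct, n_checkpoints):
    """Progress (in %) a task falls back to when it restarts from its last checkpoint.

    Assumes evenly spaced checkpoints, which is an illustrative
    simplification and not a documented property of any WCG application.
    """
    step = 100.0 / n_checkpoints
    return math.floor(progress_pct / step) * step

# A task interrupted at 72% progress, with few versus many checkpoints.
for n in (16, 200):
    print(n, progress_after_restart(72.0, n))
```

With 16 evenly spaced checkpoints, each one covers 6.25% of the work, so a task interrupted at 72% would fall back to 68.75%, whereas with 200 checkpoints it would lose nothing visible. That is at least consistent with the "about 68%" reported at the top of this thread, though where CEP2 really places its checkpoints is not stated here.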

TTYL

--//--

edit: typo

----------------------------------------
[Edit 1 times, last edit by Former Member at May 10, 2012 4:22:03 PM]

[May 10, 2012 4:16:44 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Always Goes Back to About 68%

Here's my global_prefs_override.xml file:

<global_preferences>
<run_on_batteries>1</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>1</run_gpu_if_user_active>
<suspend_cpu_usage>70.000000</suspend_cpu_usage>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>0</leave_apps_in_memory>
<confirm_before_connecting>0</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.100000</work_buf_min_days>
<work_buf_additional_days>0.250000</work_buf_additional_days>
<max_ncpus_pct>100.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>30.000000</cpu_scheduling_period_minutes>
<disk_interval>60.000000</disk_interval>
<disk_max_used_gb>0.000000</disk_max_used_gb>
<disk_max_used_pct>50.000000</disk_max_used_pct>
<disk_min_free_gb>0.100000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
<max_bytes_sec_up>0.000000</max_bytes_sec_up>
<max_bytes_sec_down>0.000000</max_bytes_sec_down>
<cpu_usage_limit>70.000000</cpu_usage_limit>
<daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
<daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>
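
A quick way to check the LAIM value in a file like this without reading it by eye is sketched below. It is only an illustration: the function name `laim_enabled` is made up, the excerpt is a shortened copy of the file above, and the code treats the literal value "1" as enabled, which is an assumption about how the client writes the setting.

```python
import xml.etree.ElementTree as ET

def laim_enabled(prefs_xml):
    """Return True/False for <leave_apps_in_memory>, or None if it is absent."""
    root = ET.fromstring(prefs_xml)
    node = root.find("leave_apps_in_memory")
    if node is None or node.text is None:
        return None
    # Assumption: the client writes "1" for enabled and "0" for disabled.
    return node.text.strip() == "1"

# Shortened copy of the override file posted above.
excerpt = """<global_preferences>
<leave_apps_in_memory>0</leave_apps_in_memory>
<cpu_usage_limit>70.000000</cpu_usage_limit>
</global_preferences>"""
print(laim_enabled(excerpt))
```

In the file as posted, leave_apps_in_memory is 0, which is worth comparing against what the preferences dialog shows.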

Here's my post-boot BOINC message:

Thu May 10 08:40:49 2012 | | Starting BOINC client version 7.0.25 for x86_64-apple-darwin
Thu May 10 08:40:49 2012 | | log flags: file_xfer, sched_ops, task
Thu May 10 08:40:49 2012 | | Libraries: libcurl/7.21.7 OpenSSL/0.9.7l zlib/1.2.5 c-ares/1.7.4
Thu May 10 08:40:49 2012 | | Data directory: /Library/Application Support/BOINC Data
Thu May 10 08:40:49 2012 | | Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz [x86 Family 6 Model 23 Stepping 10]
Thu May 10 08:40:49 2012 | | Processor features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 XSAVE
Thu May 10 08:40:49 2012 | | OS: Mac OS X 10.7.3 (Darwin 11.3.0)
Thu May 10 08:40:49 2012 | | Memory: 8.00 GB physical, 41.64 GB virtual
Thu May 10 08:40:49 2012 | | Disk: 77.47 GB total, 41.40 GB free
Thu May 10 08:40:49 2012 | | Local time is UTC -7 hours
Thu May 10 08:40:49 2012 | | VirtualBox version: 4.1.14
Thu May 10 08:40:49 2012 | | NVIDIA GPU 0: GeForce 9400M (driver version 4.2.7, CUDA version 4.20, compute capability 1.1, 254MB, 179MB available, 53 GFLOPS peak)
Thu May 10 08:40:49 2012 | | OpenCL: NVIDIA GPU 0: GeForce 9400M (driver version CLH 1.0, device version OpenCL 1.0, 256MB, 179MB available)

[May 10, 2012 4:50:42 PM]