| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 17
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
When I'm running The Clean Energy Project - Phase 2 6.40, the percent done and CPU time go back to about 68% and 9 hours every time it gets suspended and resumed. Is this a bug, or do I just have to make sure nothing interferes with it until I'm done?
I'm running a MacBook Pro (5,4) with 10.7.3 and BOINC v7.0.25. Here's the Event Log:
Wed May 9 09:46:44 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 suspended by user
Wed May 9 09:46:50 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 resumed by user
Wed May 9 09:46:51 2012 | World Community Grid | Restarting task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 using cep2 version 640 in slot 7
And yes, I have "Leave application in memory while suspended" checked. Thanks! |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Dear ZachZiggster,
as long as you have engaged LAIM and don't shut down your computer, you should not lose progress. The %-progress gauge is a rather crude tool, and maybe it just displays an incorrect value after the job resumes. Best wishes from Your Harvard CEP team |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello ZachZiggster,
With LAIM set in BOINC, cleanenergy is correct. If it really is going back to the checkpoint, then it becomes a puzzle for a Mac expert. Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I checked my WCG host preferences, my Bam! preferences, and my BOINC preferences, and all of them have LAIM set.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello ZachZiggster,
Which means that Mac OS X is not doing something it should (perhaps because of some hardware fault) or that all our Mac users have failed to report (or notice) this problem running CEP2 on their machines. These unique failures that affect only one user are always embarrassing to diagnose for Support. It sounds like the standard bureaucratic "Not this department. Try down the hall." Even so, all I can suggest is avoiding CEP2. It does not sound like a problem that standard Mac diagnostics will catch. Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi,
Let's do the following test (for the sake of convincing us doubting Thomases, as we may appear to be): Pull some other science work from WCG, it does not matter which, and let it run concurrently with 1 CEP2 task. Then, after a while, suspend the machine (hibernate or sleep) and power it up again. When I do that, the message log records "resume" for all tasks. If it's a machine problem [or the client not having accepted the LAIM activation], all running tasks would show "restart". If it's a CEP2 problem, only that task would show a "restart" and the others a "resume".
Here's a sample of how this appears in the event log when I tested this scenario:
612 WCG 10-5-2012 6:57:14 [checkpoint] result GFAM_x1rr6_hPNP_0019736_0062_1 checkpointed
As can be seen, prior to hibernating all tasks checkpointed, and "resume" is logged for all tasks, meaning a lossless pickup. As can also be seen, the client detects that the system is going down (it does not matter into what state) and logs this. The system then stores the memory state to disk or, in the case of sleep mode, keeps everything in memory while using a little power, so a power-up gives an instant resume, whereas waking from hibernation can take a little while.
When individual running tasks are suspended manually [with LAIM on], the same would be recorded for other WCG sciences: not a restart but a resume. Here is an example cycling through these steps:
654 WCG 10-5-2012 7:19:39 task GFAM_x1rr6_hPNP_0019773_0142_1 suspended by user
As can be seen, all lossless resumes. Like the other respondents, I would not know why that would fail for CEP2 alone.
If you're willing to delve in a little, some log debug flags placed in cc_config.xml might reveal more, such as:
<heartbeat_debug>1</heartbeat_debug>
<mem_usage_debug>1</mem_usage_debug>
<cpu_sched>1</cpu_sched>
The last flag is a permanent part of my log setup, so I can see what the client scheduler is doing.
The config manual is at http://boinc.berkeley.edu/wiki/Cc_config.xml; note that heartbeat debug is new to the latest client you're running, so it's not in there yet. Read the config in through the GUI menu, then see whether hibernating or suspending and resuming records any hiccup, and post copies of the event logs. In normal round-robin operation of a client, alternating computing time between WCG and other active projects, any task would "restart" after being preempted when LAIM is off (the default), and "resume" when LAIM is on. --//-- P.S. I'm also interested in the Result Log of the CEP2 task when completed, in case there is a heartbeat issue, but I doubt it. |
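For anyone who would rather generate the file than type it by hand, here is a minimal sketch that builds a cc_config.xml containing the three log flags mentioned above. It assumes the usual layout of a `<log_flags>` element inside a `<cc_config>` root; the function name `build_cc_config` is made up for this example, and the snippet only returns the XML text rather than writing into the BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1.

    Hypothetical helper for illustration; it is not part of BOINC.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag becomes <name>1</name> inside <log_flags>.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# The three flags suggested in the post above.
xml_text = build_cc_config(["heartbeat_debug", "mem_usage_debug", "cpu_sched"])
print(xml_text)
```

Generating the file this way avoids the easy mistake of leaving a closing tag without its slash, which would stop the client from parsing the file.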
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Thought of another test. Suspended all projects except WCG so no work would be fetched while doing this test, then suspended WCG in the project tab, then after a few seconds activated WCG again. The event log shows that all tasks were "resumed". No retreats to last checkpoints:
1178 WCG 10-5-2012 9:51:28 project suspended by user
Running test client 7.0.26 on this host. Development indicates that 7.0.27 [which I've got running on another host] or higher will soon be promoted to "Recommended", as 7.0.25 is not exactly bug-free [a little embarrassing so soon after heralding this version to the production world of BOINC volunteers, so I'm skeptical]. The proposed tests would prove whether LAIM is operating properly for WCG sciences other than CEP2, or whether a client bug on Mac should be considered. If there is a bug, we'd want it documented without sabers being drawn. ttyl --//-- |
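The "restart" versus "resume" check can also be done by script rather than by eye. The following is only a sketch: it uses the pipe-separated Mac log format and the "Restarting task" wording from the first post, and it assumes that a lossless pickup would be logged with the matching wording "Resuming task", which is not confirmed by any log in this thread. The function name `classify` is made up for this example.

```python
import re
from collections import defaultdict

# Lines copied from the event log in the first post of this thread.
SAMPLE_LOG = """\
Wed May 9 09:46:44 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 suspended by user
Wed May 9 09:46:50 2012 | World Community Grid | task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 resumed by user
Wed May 9 09:46:51 2012 | World Community Grid | Restarting task E207501_566_C.28.C19H11N7SSi.01691539.3.set1d06_0 using cep2 version 640 in slot 7
"""

# "Restarting task" appears in the log above; "Resuming task" is an assumed wording.
START_RE = re.compile(r"(Restarting|Resuming) task (\S+)")


def classify(log_text):
    """Map each task name to the list of restart/resume events logged for it."""
    outcomes = defaultdict(list)
    for line in log_text.splitlines():
        # The message is the last pipe-separated field of each log line.
        message = line.split("|")[-1].strip()
        match = START_RE.match(message)
        if match:
            outcomes[match.group(2)].append(match.group(1).lower())
    return dict(outcomes)


for task, events in classify(SAMPLE_LOG).items():
    print(task, events)
```

If every task shows "restarting", the machine or client setting is the suspect; if only the CEP2 task does, the application is.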
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I tried both tests, and everything said "Restarting" not "Resuming." However, only CEP2 went back to the last checkpoint.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Please post the message logs so we can see all the system responses. From what you say, this seems to indicate that LAIM is not activated on your 7.0.25 install for OS X. CEP2 visibly goes back because it has at most 16 checkpoints, whilst other sciences sometimes have hundreds and store progress every minute if they can; by the time the client has recomputed progress from the restart checkpoint, it is already nearing the next one.
Please also post the full content of the global_prefs_override.xml file and your full post-boot BOINC startup message log, some 35 lines from the top. If global_prefs_override.xml is not present, post the global_prefs.xml content. TTYL --//-- |
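To put rough numbers on why a restart is so visible with CEP2, here is a small worked calculation. It assumes the 16 checkpoints are evenly spaced and uses a hypothetical 12-hour task; neither assumption comes from the project, and the function name `checkpoint_loss` is made up for this example.

```python
def checkpoint_loss(total_hours, n_checkpoints):
    """Return (hours between checkpoints, average hours lost per restart).

    Assumes evenly spaced checkpoints and a restart at a uniformly
    random moment, so the average loss is half of one interval.
    """
    interval = total_hours / n_checkpoints
    return interval, interval / 2


# Hypothetical 12-hour task with the 16 checkpoints mentioned above.
interval, average_loss = checkpoint_loss(12.0, 16)
print(f"{interval:.2f} h between checkpoints, {average_loss:.3f} h lost on average")
print(f"each checkpoint covers {100 / 16:.2f}% of the task")
```

Under the same assumptions, a science that checkpoints every minute would lose about half a minute per restart, which is too little to notice on the progress gauge.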
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Here's my global_prefs_override.xml file:
<global_preferences>
<run_on_batteries>1</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>1</run_gpu_if_user_active>
<suspend_cpu_usage>70.000000</suspend_cpu_usage>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>0</leave_apps_in_memory>
<confirm_before_connecting>0</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.100000</work_buf_min_days>
<work_buf_additional_days>0.250000</work_buf_additional_days>
<max_ncpus_pct>100.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>30.000000</cpu_scheduling_period_minutes>
<disk_interval>60.000000</disk_interval>
<disk_max_used_gb>0.000000</disk_max_used_gb>
<disk_max_used_pct>50.000000</disk_max_used_pct>
<disk_min_free_gb>0.100000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
<max_bytes_sec_up>0.000000</max_bytes_sec_up>
<max_bytes_sec_down>0.000000</max_bytes_sec_down>
<cpu_usage_limit>70.000000</cpu_usage_limit>
<daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
<daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>
Here's my post-boot BOINC message:
Thu May 10 08:40:49 2012 | | Starting BOINC client version 7.0.25 for x86_64-apple-darwin
Thu May 10 08:40:49 2012 | | log flags: file_xfer, sched_ops, task
Thu May 10 08:40:49 2012 | | Libraries: libcurl/7.21.7 OpenSSL/0.9.7l zlib/1.2.5 c-ares/1.7.4
Thu May 10 08:40:49 2012 | | Data directory: /Library/Application Support/BOINC Data
Thu May 10 08:40:49 2012 | | Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz [x86 Family 6 Model 23 Stepping 10]
Thu May 10 08:40:49 2012 | | Processor features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 XSAVE
Thu May 10 08:40:49 2012 | | OS: Mac OS X 10.7.3 (Darwin 11.3.0)
Thu May 10 08:40:49 2012 | | Memory: 8.00 GB physical, 41.64 GB virtual
Thu May 10 08:40:49 2012 | | Disk: 77.47 GB total, 41.40 GB free
Thu May 10 08:40:49 2012 | | Local time is UTC -7 hours
Thu May 10 08:40:49 2012 | | VirtualBox version: 4.1.14
Thu May 10 08:40:49 2012 | | NVIDIA GPU 0: GeForce 9400M (driver version 4.2.7, CUDA version 4.20, compute capability 1.1, 254MB, 179MB available, 53 GFLOPS peak)
Thu May 10 08:40:49 2012 | | OpenCL: NVIDIA GPU 0: GeForce 9400M (driver version CLH 1.0, device version OpenCL 1.0, 256MB, 179MB available) |
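Since the request was to check whether LAIM is really active in the local override file, here is a minimal sketch that reads the `leave_apps_in_memory` element from preferences XML. The excerpt keeps only three of the elements from the file posted above, and the function name `laim_enabled` is made up for this example. To check the real file, read its text from the data directory shown in the startup log and pass it in.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the global_prefs_override.xml posted above.
POSTED_EXCERPT = """<global_preferences>
<run_on_batteries>1</run_on_batteries>
<leave_apps_in_memory>0</leave_apps_in_memory>
<cpu_usage_limit>70.000000</cpu_usage_limit>
</global_preferences>"""


def laim_enabled(xml_text):
    """Return True/False for the LAIM setting, or None if the element is absent."""
    root = ET.fromstring(xml_text)
    value = root.findtext("leave_apps_in_memory")
    if value is None:
        return None
    # Treat any non-zero value ("1" or "1.000000") as enabled.
    return float(value) != 0


print("LAIM enabled:", laim_enabled(POSTED_EXCERPT))
```

In the file as posted, the element reads 0, so this check reports LAIM as off in the local override.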
||
|
|
|