| World Community Grid Forums
Thread Status: Active Total posts in this thread: 16
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi,
I have a really annoying problem. I have a laptop with an i7-720QM processor running Gentoo Linux amd64, with BOINC installed from the Gentoo package. I configured BOINC to run 4 tasks at a time (50% of the processors) at 100% load. I work on the laptop while the computations run. Once every few days I get a short freeze while working, and afterwards all the tasks that were running are aborted with a computation error. Most often this happens when the tasks have each run for more than 10 hours and only a few hours are left; last time I lost about 30 hours of runtime this way. It drives me mad to lose that much computation time. I have no idea what causes the freezes, but is there any way to prevent the tasks from being aborted when this occurs? It's really stupid to lose 30 hours just because of a few seconds' freeze. Thanks for any hints. |
||
|
|
sk..
Master Cruncher Joined: Mar 22, 2007 Post Count: 2324 Status: Offline
|
I can only suggest you check your BOINC settings. Perhaps setting it to save to the hard drive rather than RAM would help? If you see this happen, do a hard system restart (hit the restart button ASAP) and you might be able to pick up where you left off.
|
||
|
|
Coleslaw
Veteran Cruncher USA Joined: Mar 29, 2007 Post Count: 1343 Status: Offline
|
How much RAM is in your system? It could be a case of not enough memory. I don't know much about Linux, but I know that with Windows the hard drive reads/writes can cause the system to lag to the point of erroring out sometimes. My dual-core laptop had that problem when running memory-hungry apps alongside other memory-hog programs. Combine that with something like a BitTorrent program and your hard drive may not keep up. Also, your graphics may require a good chunk of your memory, so depending on what you are doing, you may see that lag.
----------------------------------------
Edit: Also, some apps use more memory towards the end of the job than at the beginning, which may also be why they are at 10+ hours when it happens. [Edit 1 times, last edit by Coleslaw at Oct 4, 2010 10:55:06 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
It would be very helpful to see error logs of those tasks.
|
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
"I have no idea what causes the freezes, but is there any way to prevent the tasks from being aborted when this occurs? It's really stupid to lose 30 hours just because of a few seconds' freeze."
Freezes are bad. If they last longer than 30 seconds and BOINC cannot reset the tasks to the last checkpoint or to the beginning, the tasks are declared broken. I've found that when running an indexer such as recoll, it eats massive amounts of CPU time, to the point of tasks resetting with the infamous "zero status" or heartbeat message, but as yet not killing them. Sufficient RAM and swap space is the first order of business.
BTW, I've yet to find out how to set checkpoints to save to memory and not to disk. My minimum interval is set to 300 seconds. If checkpoints are not allowed to write, they are skipped. 300 seconds is just enough to not see frequent writes in the message log, so a typical list such as the one below gives confidence that the timing of reboots for CEP2 jobs (2 of 4 cores, under manual control) can be selected reasonably well (and believe it or not, I've never rebooted as frequently as with Linux).
And bono_vox is right. The relevant portions of the Result Log and the messages from BOINC's stdoutdae.txt give us a reference (nothing has been logged in stderrdae.txt here as yet)... for now it's extended guessing.
Edit: A message log sample of checkpoints:
Tue 05 Oct 2010 07:35:17 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:39:28 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:40:05 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1MHN_A.clustersOccur_0_1 checkpointed
Tue 05 Oct 2010 07:40:33 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:40:34 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1MHN_A.clustersOccur_0_1 checkpointed
Tue 05 Oct 2010 07:42:40 AM CEST [checkpoint_debug] result E200397_771_A.25.C18H11N5OS.6.0.set1d06_2 checkpointed
Tue 05 Oct 2010 07:44:42 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:45:46 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:47:22 AM CEST [checkpoint_debug] result E200397_771_A.25.C18H11N5OS.6.0.set1d06_2 checkpointed
Tue 05 Oct 2010 07:49:59 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:51:06 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:55:12 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:56:17 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
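To check how a checkpoint log like this compares with the configured minimum interval, one could compute the gap between consecutive checkpoints of each task. The following is only a rough sketch: the function name `checkpoint_gaps` and the `SAMPLE` excerpt are made up for illustration, and the regular expression assumes the exact `[checkpoint_debug]` line layout shown in this post (other locales or client versions may format the timestamp differently).

```python
import re
from collections import defaultdict
from datetime import datetime

# A few lines copied from the message log sample in this post.
SAMPLE = """\
Tue 05 Oct 2010 07:35:17 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:40:33 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:42:40 AM CEST [checkpoint_debug] result E200397_771_A.25.C18H11N5OS.6.0.set1d06_2 checkpointed
Tue 05 Oct 2010 07:45:46 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:47:22 AM CEST [checkpoint_debug] result E200397_771_A.25.C18H11N5OS.6.0.set1d06_2 checkpointed
"""

# Month names are mapped by hand so parsing does not depend on the system locale.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

LINE_RE = re.compile(
    r"^\w{3} (\d{2}) (\w{3}) (\d{4}) (\d{2}):(\d{2}):(\d{2}) (AM|PM) \S+ "
    r"\[checkpoint_debug\] result (\S+) checkpointed$")


def checkpoint_gaps(log_text):
    """Return {task name: [seconds between consecutive checkpoints]}."""
    times = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a checkpoint line
        day, mon, year, hh, mm, ss, ampm, task = m.groups()
        hour = int(hh) % 12 + (12 if ampm == "PM" else 0)  # 12-hour to 24-hour
        times[task].append(
            datetime(int(year), MONTHS[mon], int(day), hour, int(mm), int(ss)))
    return {task: [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
            for task, ts in times.items()}


if __name__ == "__main__":
    for task, gaps in checkpoint_gaps(SAMPLE).items():
        print(task, gaps)
```

Gaps well above the configured interval would suggest checkpoints are being skipped or delayed; the sketch only reports the numbers and does not try to judge them.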
WCG
----------------------------------------Please help to make the Forums an enjoyable experience for All! [Edit 1 times, last edit by Sekerob at Oct 5, 2010 6:08:35 AM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
First, thank you for all the responses. I will try to answer everything in this post.
"Perhaps setting it to save to the hard drive rather than RAM would help?"
This is my configuration:
<global_preferences>
  <run_on_batteries>0</run_on_batteries>
  <run_if_user_active>1</run_if_user_active>
  <run_gpu_if_user_active>0</run_gpu_if_user_active>
  <idle_time_to_run>3.000000</idle_time_to_run>
  <suspend_cpu_usage>25.000000</suspend_cpu_usage>
  <start_hour>0.000000</start_hour>
  <end_hour>0.000000</end_hour>
  <net_start_hour>0.000000</net_start_hour>
  <net_end_hour>0.000000</net_end_hour>
  <leave_apps_in_memory>0</leave_apps_in_memory>
  <confirm_before_connecting>0</confirm_before_connecting>
  <hangup_if_dialed>0</hangup_if_dialed>
  <dont_verify_images>0</dont_verify_images>
  <work_buf_min_days>0.100000</work_buf_min_days>
  <work_buf_additional_days>0.250000</work_buf_additional_days>
  <max_ncpus_pct>50.000000</max_ncpus_pct>
  <cpu_scheduling_period_minutes>120.000000</cpu_scheduling_period_minutes>
  <disk_interval>60.000000</disk_interval>
  <disk_max_used_gb>10.000000</disk_max_used_gb>
  <disk_max_used_pct>75.000000</disk_max_used_pct>
  <disk_min_free_gb>0.100000</disk_min_free_gb>
  <vm_max_used_pct>75.000000</vm_max_used_pct>
  <ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
  <ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
  <max_bytes_sec_up>0.000000</max_bytes_sec_up>
  <max_bytes_sec_down>0.000000</max_bytes_sec_down>
  <cpu_usage_limit>100.000000</cpu_usage_limit>
  <daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
  <daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>
I can't see where I would specify where the work should be saved, but in the GUI the description says that checkpoints are saved to disk every 60 seconds.
"If you see this happen, do a hard system restart (hit the restart button ASAP) and you might be able to pick up where you left off."
This is not an option for me. I have other work on the laptop too, which I do not want to lose, nor do I want a corrupted filesystem etc.
"How much RAM is in your system? It could be a case of not enough memory."
I have 8GB of RAM + 8GB of swap ... I have never even filled the RAM yet, so memory should not be the cause.
"It would be very helpful to see error logs of those tasks."
Here is the log that I found. It ends at the point where I decided to stop the whole project for a while.
04-Oct-2010 21:48:29 [World Community Grid] Computation for task E200409_713_A.26.C18H6N4OS3.30.set1d06_1 finished
04-Oct-2010 21:48:29 [World Community Grid] Starting E200413_609_A.21.C17H12N2OS.47.1.set1d06_1
04-Oct-2010 21:48:29 [World Community Grid] Starting task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 using cep2 version 619
04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
SIGPIPE: write on a pipe with no reader
04-Oct-2010 21:48:33 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:33 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:35 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
04-Oct-2010 21:48:35 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4
04-Oct-2010 21:49:17 [World Community Grid] Computation for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 finished
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_0 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_1 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_2 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Starting X0000038831348200410061550_1
04-Oct-2010 21:49:17 [World Community Grid] Starting task X0000038831348200410061550_1 using hcc1 version 608
04-Oct-2010 21:49:17 [---] Suspending computation - CPU usage is too high
04-Oct-2010 21:49:19 [World Community Grid] Computation for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 finished
04-Oct-2010 21:49:19 [World Community Grid] Output file E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_0 for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 absent
04-Oct-2010 21:49:19 [World Community Grid] Output file E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_1 for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 absent
04-Oct-2010 21:49:19 [World Community Grid] Output file E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_2 for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 absent
04-Oct-2010 21:49:20 [World Community Grid] Started upload of E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_3
04-Oct-2010 21:49:20 [World Community Grid] Computation for task faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1 finished
04-Oct-2010 21:49:22 [World Community Grid] Finished upload of E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_3
04-Oct-2010 21:49:22 [World Community Grid] Started upload of E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_3
04-Oct-2010 21:49:22 [World Community Grid] Computation for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 finished
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_0 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_1 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_2 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_3 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:23 [World Community Grid] Finished upload of E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_3
04-Oct-2010 21:49:23 [World Community Grid] Started upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_0
04-Oct-2010 21:49:24 [World Community Grid] Finished upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_0
04-Oct-2010 21:49:24 [World Community Grid] Started upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_1
04-Oct-2010 21:49:26 [World Community Grid] Finished upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_1
04-Oct-2010 21:49:26 [World Community Grid] Started upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_2
04-Oct-2010 21:49:28 [World Community Grid] Finished upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_2
04-Oct-2010 21:49:28 [---] Resuming computation
04-Oct-2010 21:49:28 [World Community Grid] Starting HFCC_n1_02121177_n1_0001_1
04-Oct-2010 21:49:28 [World Community Grid] Starting task HFCC_n1_02121177_n1_0001_1 using hfcc version 611
04-Oct-2010 21:49:28 [World Community Grid] Starting E200414_100_A.21.C17H12N2S2.78.4.set1d06_1
04-Oct-2010 21:49:28 [World Community Grid] Starting task E200414_100_A.21.C17H12N2S2.78.4.set1d06_1 using cep2 version 619
04-Oct-2010 21:49:28 [World Community Grid] Starting E200413_524_A.22.C16H10N4O2.6.0.set1d06_1
04-Oct-2010 21:49:28 [World Community Grid] Starting task E200413_524_A.22.C16H10N4O2.6.0.set1d06_1 using cep2 version 619
04-Oct-2010 21:50:21 [World Community Grid] Temporarily failed upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4: HTTP error
04-Oct-2010 21:50:21 [World Community Grid] Backing off 1 min 0 sec on upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4
04-Oct-2010 21:50:28 [World Community Grid] Sending scheduler request: To fetch work.
04-Oct-2010 21:50:28 [World Community Grid] Reporting 4 completed tasks, requesting new tasks
04-Oct-2010 21:50:31 [World Community Grid] Scheduler request completed: got 1 new tasks
04-Oct-2010 21:50:34 [World Community Grid] Started download of X0000038850105200409151340_X0000038850105200409151340.jp2
04-Oct-2010 21:50:35 [World Community Grid] Finished download of X0000038850105200409151340_X0000038850105200409151340.jp2
Are there any other logs that might be useful? Where will I find them?
"Freezes are bad. If they last longer than 30 seconds and BOINC cannot reset the tasks to the last checkpoint or beginning, the tasks are declared broken."
I think my freezes are not that long, max 10 seconds probably. |
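A long client log like this is easier to read once the suspicious entries are pulled out. Below is a minimal sketch that lists the "Suspending computation" events and the tasks that finished with absent output files. The names `summarize`, `ENTRY_RE`, `ABSENT_RE`, and `SAMPLE` are invented for this example, and the patterns assume exactly the message wording seen in the log above.

```python
import re

# A short excerpt of the log above, used only to demonstrate the function.
SAMPLE = """\
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_0 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_1 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [---] Suspending computation - CPU usage is too high
SIGPIPE: write on a pipe with no reader
04-Oct-2010 21:49:28 [---] Resuming computation
"""

# "<date> <time> [<project>] <message>", as in the log shown above.
ENTRY_RE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")
ABSENT_RE = re.compile(r"^Output file \S+ for task (\S+) absent$")


def summarize(log_text):
    """Return (suspend events, tasks with absent output files)."""
    suspends = []  # (timestamp, message) pairs
    failed = []    # task names, in order of first appearance
    for line in log_text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if not m:
            continue  # lines without a timestamp, e.g. the SIGPIPE notice
        stamp, _project, message = m.groups()
        if message.startswith("Suspending computation"):
            suspends.append((stamp, message))
            continue
        absent = ABSENT_RE.match(message)
        if absent and absent.group(1) not in failed:
            failed.append(absent.group(1))
    return suspends, failed


if __name__ == "__main__":
    suspends, failed = summarize(SAMPLE)
    print("suspend events:", suspends)
    print("tasks with absent output:", failed)
```

Seeing a suspend event at the same timestamp as a batch of "absent" output files does not prove cause and effect, but it narrows down where to look.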
||
|
|
sk..
Master Cruncher Joined: Mar 22, 2007 Post Count: 2324 Status: Offline
|
This looks like the culprit:
04-Oct-2010 21:49:17 [---] Suspending computation - CPU usage is too high
<suspend_cpu_usage>25.000000</suspend_cpu_usage>
Change <suspend_cpu_usage> to 0. BOINC is suspending all tasks to disk when you use over 25% of the CPU, and making a mess of it. |
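For anyone who prefers to script the change, here is a hedged sketch of rewriting one value in a preferences fragment with Python's standard `xml.etree.ElementTree`. The helper name `set_pref` and the shortened `SAMPLE` string are hypothetical. The thread does not say which file the client reads these preferences from, so the sketch works on an XML string and leaves file handling (and getting the client to re-read its preferences) to the reader.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the <global_preferences> block posted earlier in the thread.
SAMPLE = (
    "<global_preferences>"
    "<suspend_cpu_usage>25.000000</suspend_cpu_usage>"
    "<leave_apps_in_memory>0</leave_apps_in_memory>"
    "</global_preferences>"
)


def set_pref(prefs_xml, name, value):
    """Return prefs_xml with the child element <name> set to value.

    The element is created if it does not exist yet.
    """
    root = ET.fromstring(prefs_xml)
    node = root.find(name)
    if node is None:
        node = ET.SubElement(root, name)
    node.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 0 is the value suggested in this post.
    print(set_pref(SAMPLE, "suspend_cpu_usage", "0.000000"))
```

Changing the same setting through the BOINC Manager or the website preferences should be equivalent; the script form is only convenient when several machines need the same edit.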
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Cool, thanks a lot. I just switched to that value and will report back if it doesn't help :-)
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
<leave_apps_in_memory>0</leave_apps_in_memory>
I'd also recommend that "Leave App in memory when suspended" be checked. |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
... but the situation already goes wonky with the SIGPIPE line, which is discussed in this BOINC developers forum thread: http://boinc.berkeley.edu/dev/forum_thread.php?id=2427 which, by the read of it, is exactly that: something timing out due to a freeze.
----------------------------------------
04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
SIGPIPE: write on a pipe with no reader
04-Oct-2010 21:48:33 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:33 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:35 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
04-Oct-2010 21:48:35 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4
04-Oct-2010 21:49:17 [World Community Grid] Computation for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 finished
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_0 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_1 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_2 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Starting X0000038831348200410061550_1
04-Oct-2010 21:49:17 [World Community Grid] Starting task X0000038831348200410061550_1 using hcc1 version 608
You can find the task Result logs by going to the Result Status page and, in the Status column, following links such as Invalid, Error, or User Aborted.
Setting LAIM (leave applications in memory) to *ON* and *While processor usage...* to zero (0) definitely helps, the former preventing unloading by the second (fixed in 6.12). When there are truly high loads that could interfere with interrupts, it is probably better to set the latter to, for instance, 90%, so tasks don't break during freezes. That's where I've had it set for quite some time now. Edit: To add, if a system really gets into a state where BOINC doesn't even get a chance to suspend, then there's potential trouble.
----------------------------------------[Edit 1 times, last edit by Sekerob at Oct 5, 2010 10:36:13 AM] |
||
|
|
|