Total posts in this thread: 16
This topic has been viewed 2458 times and has 15 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Short freeze and then computation errors and tasks aborted

Hi,

I have a really annoying problem. I have a laptop with an i7-720QM processor running Gentoo Linux (amd64), with BOINC installed from the Gentoo package. I configured BOINC to run 4 tasks at a time (50% of processors) at 100% load. I work on the laptop while the computations run. Once every few days I get a short freeze while working, and afterwards all tasks that were running are aborted with a computation error. Most often this happens when the tasks have each run for more than 10 hours and only a few hours are left, like last time, when I lost about 30 hours of runtime. It drives me mad to lose that much computation time because of this issue.

I have no idea what causes the freezes, but is there any way to prevent the tasks from being aborted when this occurs? It's really stupid to lose 30 hours just because of a freeze lasting a few seconds.

Thanks for any hints.
[Oct 4, 2010 8:33:12 PM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Re: Short freeze and then computation errors and tasks aborted

I can only suggest you check your BOINC settings. Perhaps setting it to save to the hard drive rather than RAM would help? If you see this, do a hard system restart (hit the restart button ASAP) and you might be able to pick up where you left off.
[Oct 4, 2010 9:45:34 PM]
Coleslaw
Veteran Cruncher
USA
Joined: Mar 29, 2007
Post Count: 1343
Re: Short freeze and then computation errors and tasks aborted

How much RAM is in your system? It could be a case of not enough memory. I don't know much about Linux, but I know that with Windows the hard drive reads/writes can cause the system to lag to the point of erroring out sometimes. My dual-core laptop had that problem when running memory-hungry apps alongside other memory-hog programs. Combine that with something like a BitTorrent program and your hard drive may not keep up. Also, your graphics may require a good chunk of your memory, so depending on what you are doing, you may see that lag.

Edit: Also, some apps use more memory towards the end of the job than at the beginning, which may also be why they are at 10+ hours when it happens.
[Edit 1 times, last edit by Coleslaw at Oct 4, 2010 10:55:06 PM]
[Oct 4, 2010 10:52:07 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Short freeze and then computation errors and tasks aborted

It would be very helpful to see error logs of those tasks.
[Oct 4, 2010 11:23:14 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Short freeze and then computation errors and tasks aborted

"I have no idea what causes the freezes, but is there any way to prevent the tasks from being aborted when this occurs? It's really stupid to lose 30 hours just because of few seconds freeze"

Freezes are bad. If they last longer than 30 seconds and BOINC cannot reset the tasks to the last checkpoint or the beginning, the tasks are declared broken. I've found that an indexer such as recoll eats massive amounts of CPU time, to the point of tasks resetting with the infamous "zero status" or heartbeat message, but so far not killing them. Sufficient RAM and swap space come first.

BTW, I have yet to find out how to set checkpoints to save to memory rather than to disk. My minimum interval is set to 300 seconds. If tasks are not allowed to write, the checkpoints are skipped. 300 seconds is just long enough not to see frequent writes in the message log, so a typical list like the one below gives confidence that reboots can be timed reasonably well for CEP2 jobs (2 of 4 cores, manual control). Believe it or not, I have never rebooted as frequently as with Linux.

And bono_vox is right. The relevant portions of the Result Log and the messages from BOINC's stdoutdae.txt give us a reference (I have yet to see anything logged in stderrdae.txt here). Until then it's extended guessing.

Edit: A message log sample of checkpoints:

Tue 05 Oct 2010 07:35:17 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:39:28 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:40:05 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1MHN_A.clustersOccur_0_1 checkpointed
Tue 05 Oct 2010 07:40:33 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:40:34 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1MHN_A.clustersOccur_0_1 checkpointed
Tue 05 Oct 2010 07:42:40 AM CEST [checkpoint_debug] result E200397_771_A.25.C18H11N5OS.6.0.set1d06_2 checkpointed
Tue 05 Oct 2010 07:44:42 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:45:46 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:47:22 AM CEST [checkpoint_debug] result E200397_771_A.25.C18H11N5OS.6.0.set1d06_2 checkpointed
Tue 05 Oct 2010 07:49:59 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:51:06 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:55:12 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:56:17 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
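
For anyone who wants to confirm that checkpoints really respect the configured minimum interval, here is a minimal Python sketch that measures the gap between consecutive checkpoints of each result. It assumes the log lines look exactly like the [checkpoint_debug] sample above (other BOINC versions or locales may format them differently), it ignores the timezone field, and the function name checkpoint_intervals is made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# A few lines copied from the checkpoint sample above.
SAMPLE_LOG = """\
Tue 05 Oct 2010 07:35:17 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:39:28 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:40:33 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
Tue 05 Oct 2010 07:44:42 AM CEST [checkpoint_debug] result CMD2_0852-2JPQ_A.clustersOccur-KIF11A.clustersOccur_0_0 checkpointed
Tue 05 Oct 2010 07:45:46 AM CEST [checkpoint_debug] result CMD2_0852-1J2J_B.clustersOccur-1NZW_A.clustersOccur_7_0 checkpointed
"""

# Assumed line layout: weekday, date, 12-hour time, timezone, tag, result name.
LINE_RE = re.compile(
    r"^\w{3} (\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [AP]M) \S+ "
    r"\[checkpoint_debug\] result (\S+) checkpointed$"
)


def checkpoint_intervals(text):
    """Map each result name to the seconds between its consecutive checkpoints."""
    last_seen = {}
    intervals = defaultdict(list)
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a checkpoint line
        # Month names and AM/PM are parsed with the default (English) locale.
        stamp = datetime.strptime(m.group(1), "%d %b %Y %I:%M:%S %p")
        name = m.group(2)
        if name in last_seen:
            gap = stamp - last_seen[name]
            intervals[name].append(int(gap.total_seconds()))
        last_seen[name] = stamp
    return dict(intervals)


if __name__ == "__main__":
    for name, gaps in checkpoint_intervals(SAMPLE_LOG).items():
        print(name, gaps)
```

With a 300-second minimum, every gap should come out at 300 seconds or a little more. Much longer gaps would suggest checkpoints being skipped.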
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Oct 5, 2010 6:08:35 AM]
[Oct 5, 2010 6:05:59 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Short freeze and then computation errors and tasks aborted

first, thank you for all the responses. i will try to answer all in this post.

"Perhaps set to save to hard drive rather than RAM would help?"


this is my configuration:

<global_preferences>
<run_on_batteries>0</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>0</run_gpu_if_user_active>
<idle_time_to_run>3.000000</idle_time_to_run>
<suspend_cpu_usage>25.000000</suspend_cpu_usage>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>0</leave_apps_in_memory>
<confirm_before_connecting>0</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.100000</work_buf_min_days>
<work_buf_additional_days>0.250000</work_buf_additional_days>
<max_ncpus_pct>50.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>120.000000</cpu_scheduling_period_minutes>
<disk_interval>60.000000</disk_interval>
<disk_max_used_gb>10.000000</disk_max_used_gb>
<disk_max_used_pct>75.000000</disk_max_used_pct>
<disk_min_free_gb>0.100000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
<max_bytes_sec_up>0.000000</max_bytes_sec_up>
<max_bytes_sec_down>0.000000</max_bytes_sec_down>
<cpu_usage_limit>100.000000</cpu_usage_limit>
<daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
<daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>

i can't see where i would specify where the work should be saved, but in the gui the description says that checkpoints are saved to disk every 60 seconds.

"If you see this do a hard system restart (hit the restart button ASAP) and you might be able to pick up where you left off."


this is not an option for me, i have other work on the laptop too which i do not want to lose, nor do i want a corrupted filesystem etc.

"How much RAM is in your system? It could be a case of not enough memory."


i have 8GB of ram + 8GB of swap ... i have never even filled the ram yet, so memory should not be the cause.

"It would be very helpful to see error logs of those tasks."


here is the log that i found. it ends at the point where i decided to stop the whole project for a while.

04-Oct-2010 21:48:29 [World Community Grid] Computation for task E200409_713_A.26.C18H6N4OS3.30.set1d06_1 finished
04-Oct-2010 21:48:29 [World Community Grid] Starting E200413_609_A.21.C17H12N2OS.47.1.set1d06_1
04-Oct-2010 21:48:29 [World Community Grid] Starting task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 using cep2 version 619
04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
SIGPIPE: write on a pipe with no reader
04-Oct-2010 21:48:33 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:33 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:35 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
04-Oct-2010 21:48:35 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4
04-Oct-2010 21:49:17 [World Community Grid] Computation for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 finished
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_0 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_1 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_2 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Starting X0000038831348200410061550_1
04-Oct-2010 21:49:17 [World Community Grid] Starting task X0000038831348200410061550_1 using hcc1 version 608
04-Oct-2010 21:49:17 [---] Suspending computation - CPU usage is too high
04-Oct-2010 21:49:19 [World Community Grid] Computation for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 finished
04-Oct-2010 21:49:19 [World Community Grid] Output file E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_0 for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 absent
04-Oct-2010 21:49:19 [World Community Grid] Output file E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_1 for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 absent
04-Oct-2010 21:49:19 [World Community Grid] Output file E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_2 for task E200409_648_A.26.C18H10N6OS.5.3.set1d06_0 absent
04-Oct-2010 21:49:20 [World Community Grid] Started upload of E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_3
04-Oct-2010 21:49:20 [World Community Grid] Computation for task faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1 finished
04-Oct-2010 21:49:22 [World Community Grid] Finished upload of E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_3
04-Oct-2010 21:49:22 [World Community Grid] Started upload of E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_3
04-Oct-2010 21:49:22 [World Community Grid] Computation for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 finished
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_0 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_1 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_2 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:22 [World Community Grid] Output file E200413_609_A.21.C17H12N2OS.47.1.set1d06_1_3 for task E200413_609_A.21.C17H12N2OS.47.1.set1d06_1 absent
04-Oct-2010 21:49:23 [World Community Grid] Finished upload of E200409_648_A.26.C18H10N6OS.5.3.set1d06_0_3
04-Oct-2010 21:49:23 [World Community Grid] Started upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_0
04-Oct-2010 21:49:24 [World Community Grid] Finished upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_0
04-Oct-2010 21:49:24 [World Community Grid] Started upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_1
04-Oct-2010 21:49:26 [World Community Grid] Finished upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_1
04-Oct-2010 21:49:26 [World Community Grid] Started upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_2
04-Oct-2010 21:49:28 [World Community Grid] Finished upload of faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1_2
04-Oct-2010 21:49:28 [---] Resuming computation
04-Oct-2010 21:49:28 [World Community Grid] Starting HFCC_n1_02121177_n1_0001_1
04-Oct-2010 21:49:28 [World Community Grid] Starting task HFCC_n1_02121177_n1_0001_1 using hfcc version 611
04-Oct-2010 21:49:28 [World Community Grid] Starting E200414_100_A.21.C17H12N2S2.78.4.set1d06_1
04-Oct-2010 21:49:28 [World Community Grid] Starting task E200414_100_A.21.C17H12N2S2.78.4.set1d06_1 using cep2 version 619
04-Oct-2010 21:49:28 [World Community Grid] Starting E200413_524_A.22.C16H10N4O2.6.0.set1d06_1
04-Oct-2010 21:49:28 [World Community Grid] Starting task E200413_524_A.22.C16H10N4O2.6.0.set1d06_1 using cep2 version 619
04-Oct-2010 21:50:21 [World Community Grid] Temporarily failed upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4: HTTP error
04-Oct-2010 21:50:21 [World Community Grid] Backing off 1 min 0 sec on upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4
04-Oct-2010 21:50:28 [World Community Grid] Sending scheduler request: To fetch work.
04-Oct-2010 21:50:28 [World Community Grid] Reporting 4 completed tasks, requesting new tasks
04-Oct-2010 21:50:31 [World Community Grid] Scheduler request completed: got 1 new tasks
04-Oct-2010 21:50:34 [World Community Grid] Started download of X0000038850105200409151340_X0000038850105200409151340.jp2
04-Oct-2010 21:50:35 [World Community Grid] Finished download of X0000038850105200409151340_X0000038850105200409151340.jp2

are there any other logs that might be useful? where will i find them?

"Freezes are bad. If they last longer than 30 seconds and BOINC cannot reset the tasks to the last checkpoint or beginning, the tasks are declared broken."


i think my freezes are not that long, max 10 seconds probably.
[Oct 5, 2010 8:17:51 AM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Re: Short freeze and then computation errors and tasks aborted

This looks like the culprit,
04-Oct-2010 21:49:17 [---] Suspending computation - CPU usage is too high

<suspend_cpu_usage>25.000000</suspend_cpu_usage>

Change <suspend_cpu_usage> to 0

Boinc is suspending all tasks to disk when you use over 25% of the CPU and making a mess of it.
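
To look for the same pattern in a longer log, a short Python sketch like the following could pull out the suspend messages and the tasks whose output files were reported absent. It assumes the message wording matches the log posted earlier in this thread (other BOINC versions may phrase these lines differently), and scan_log is a made-up name. An absent output file is only a hint that a task failed, so the Result Status page remains the place to confirm it.

```python
import re

# Lines copied from the message log posted earlier in this thread.
SAMPLE_LOG = """\
04-Oct-2010 21:49:17 [World Community Grid] Computation for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 finished
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_0 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [---] Suspending computation - CPU usage is too high
04-Oct-2010 21:49:20 [World Community Grid] Computation for task faah13674_ZINC04227283_xEyeSiteXtl5NI_01_1 finished
04-Oct-2010 21:49:28 [---] Resuming computation
"""

# Assumed message formats, taken from the lines above.
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent$")
SUSPEND_RE = re.compile(r"Suspending computation - (.+)$")


def scan_log(text):
    """Return (tasks with absent output files, suspend reasons) found in text."""
    failed_tasks = []
    suspend_reasons = []
    for line in text.splitlines():
        line = line.strip()
        absent = ABSENT_RE.search(line)
        if absent and absent.group(2) not in failed_tasks:
            failed_tasks.append(absent.group(2))  # list each task only once
        suspend = SUSPEND_RE.search(line)
        if suspend:
            suspend_reasons.append(suspend.group(1))
    return failed_tasks, suspend_reasons


if __name__ == "__main__":
    failed, reasons = scan_log(SAMPLE_LOG)
    print("Tasks with absent output files:", failed)
    print("Suspend reasons:", reasons)
```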
[Oct 5, 2010 9:50:25 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Short freeze and then computation errors and tasks aborted

cool, thx a lot, just switched to that value, will report back if that does not help :-)
[Oct 5, 2010 9:56:08 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Short freeze and then computation errors and tasks aborted

<leave_apps_in_memory>0</leave_apps_in_memory>

also recommend "Leave App in memory when suspended" be checked
[Oct 5, 2010 10:14:39 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Short freeze and then computation errors and tasks aborted

... but the situation already goes wonky at the SIGPIPE line, which is discussed in this BOINC developers forum thread: http://boinc.berkeley.edu/dev/forum_thread.php?id=2427 and which, from what I read there, is exactly that: something timing out due to a freeze.

04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:30 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
SIGPIPE: write on a pipe with no reader
04-Oct-2010 21:48:33 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_0
04-Oct-2010 21:48:33 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:35 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_1
04-Oct-2010 21:48:35 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_2
04-Oct-2010 21:48:44 [World Community Grid] Finished upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_3
04-Oct-2010 21:48:44 [World Community Grid] Started upload of E200409_713_A.26.C18H6N4OS3.30.set1d06_1_4
04-Oct-2010 21:49:17 [World Community Grid] Computation for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 finished
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_0 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_1 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Output file E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1_2 for task E200405_757_A.25.C20H12N2OS2.22.2.set1d06_1 absent
04-Oct-2010 21:49:17 [World Community Grid] Starting X0000038831348200410061550_1
04-Oct-2010 21:49:17 [World Community Grid] Starting task X0000038831348200410061550_1 using hcc1 version 608

You can find the task Result logs by going to the Result Status page and, in the Status column, clicking links such as Invalid, Error, or User Aborted.

Setting LAIM (leave applications in memory) to *ON* and *While processor usage...* to zero (0) definitely helps; the former prevents tasks from being unloaded every second (fixed in 6.12). When there are truly high loads that could interfere with interrupts, it is probably better to set the latter to, for instance, 90%, so tasks don't break during freezes. That's where I've had it set for quite some time now.
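
As a rough way to check a preferences file for these two settings, here is a minimal Python sketch. It assumes the file has the same <global_preferences> layout as the configuration posted earlier in this thread, and that a non-zero suspend_cpu_usage means "suspend when other programs use more than this much CPU". The function name check_prefs and the warning texts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the preferences posted earlier in this thread.
SAMPLE_PREFS = """\
<global_preferences>
<suspend_cpu_usage>25.000000</suspend_cpu_usage>
<leave_apps_in_memory>0</leave_apps_in_memory>
<max_ncpus_pct>50.000000</max_ncpus_pct>
</global_preferences>
"""


def check_prefs(xml_text):
    """Return a list of warnings about the two settings discussed above."""
    root = ET.fromstring(xml_text)
    warnings = []

    # Missing elements are treated as 0 here, which is an assumption.
    suspend = float(root.findtext("suspend_cpu_usage", default="0"))
    if suspend > 0:
        warnings.append(
            f"suspend_cpu_usage is {suspend:g}: tasks are suspended when "
            "other CPU usage exceeds this percentage"
        )

    laim = root.findtext("leave_apps_in_memory", default="0").strip()
    if laim != "1":
        warnings.append(
            "leave_apps_in_memory is off: suspended tasks are removed from memory"
        )
    return warnings


if __name__ == "__main__":
    for warning in check_prefs(SAMPLE_PREFS):
        print("WARNING:", warning)
```

On the configuration from this thread the sketch reports both settings. After switching suspend_cpu_usage to 0 and leave_apps_in_memory to 1 it should report nothing.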

edit: To add, if a system really gets into a state where BOINC doesn't even get a chance to suspend, then there's potential trouble.
[Edit 1 times, last edit by Sekerob at Oct 5, 2010 10:36:13 AM]
[Oct 5, 2010 10:34:11 AM]