Thread Status: Active
Total posts in this thread: 118
Posts: 118   Pages: 12
This topic has been viewed 25941 times and has 117 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Two observations are still unexplained.

1) All 3 "exits with zero status but no finish file" occurred at the same timestamp, to the second, as another unit finishing or starting. That sounds like a BOINC or local system issue; it would be too much of a coincidence otherwise.

2) All 3 of the units that exited then restarted on the same machine. Not a normal situation!
[Aug 18, 2014 9:36:34 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I'm not sure whether this is specific to the beta, but I've noticed more "No heartbeat from core client for 30 sec - exiting" messages than I used to get on one of my machines. So this evening I spent some time watching it.

I observed both active beta tasks reset themselves while I was copying a large video file from the system disk (the one that BOINC uses) to a removable disk. It occurred to me that, even though the CPU was hardly being used, there was obviously continuous reading of the BOINC disk. And if the task communication is not threaded, and one of the beta tasks tried to do some I/O, its low priority would cause it to go into a wait state until the file had completed copying (something that took several minutes).

I therefore conclude that either (a) inter-task "heartbeat" communication needs to be threaded so that it doesn't wait on I/O, or (b -- and not as good) the 30-second timeout needs to be increased quite considerably.
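To make option (a) concrete, here is a minimal Python sketch of the idea. It is not BOINC's actual code: the names `HeartbeatMonitor` and `heartbeat_loop` are made up for illustration, and the 30-second default simply mirrors the message quoted above. The point is only that the heartbeat bookkeeping runs on its own thread, so a worker stuck in a long disk wait does not stop heartbeats from being recorded.

```python
import threading
import time


class HeartbeatMonitor:
    """Records the most recent heartbeat (hypothetical sketch, not BOINC code)."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout          # seconds allowed without a heartbeat
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def beat(self):
        """Note that a heartbeat has just been received."""
        with self._lock:
            self._last = self._clock()

    def expired(self):
        """True once more than `timeout` seconds have passed since the last beat."""
        with self._lock:
            return self._clock() - self._last > self.timeout


def heartbeat_loop(monitor, stop, interval=1.0):
    """Runs on a dedicated thread, independent of the worker's blocking I/O."""
    while not stop.wait(interval):
        monitor.beat()
```

In use, the worker would start `heartbeat_loop` in a `threading.Thread` and check `monitor.expired()` between steps; the I/O wait then delays the science work but not the heartbeat.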

Just my 2p'th.
[Aug 18, 2014 10:24:31 PM]
Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Mine errored out at around 1 hr 20 min, but the log says:
Computation for task BETA_E225106_546_S.326.C37H26N2O4S2.KRRWRYCZFPKTNW-UHFFFAOYSA-N.20_s1_14_4 finished

Mac laptop, OS X 10.9.4, 2.4 GHz i5
[Aug 18, 2014 11:19:36 PM]
KWSN - A Shrubbery
Master Cruncher
Joined: Jan 8, 2006
Post Count: 1585
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

No idea how many my systems grabbed. The Results Status page shows 14 pages' worth. A few have ended rather early.

Hopefully you'll get some useful data off the results.
----------------------------------------

Distributed computing volunteer since September 27, 2000
[Aug 18, 2014 11:55:38 PM]
KLiK
Master Cruncher
Croatia
Joined: Nov 13, 2006
Post Count: 3108
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Still waiting for the results:

BETA_E225108_236_S.328.C46H31N3.OZIJTKWPTJUCKQ-UHFFFAOYSA-N.7_s1_14_1-- | p4l-fsc1410a | In Progress | 8/18/14 17:49:35 | 8/22/14 17:49:35 | 0.00 / 0.00 | 0.0 / 0.0
BETA_E225108_70_S.328.C41H25N7O1.JFONBYKDRGSVKP-UHFFFAOYSA-N.12_s1_14_1-- | VS4 | In Progress | 8/18/14 17:46:52 | 8/22/14 17:46:52 | 0.00 / 0.00 | 0.0 / 0.0
BETA_E225106_551_S.326.C37H25N1O5S2.KZXWWNDVNPIUHY-UHFFFAOYSA-N.5_s1_14_0-- | p4l-fsc1410a | In Progress | 8/15/14 16:42:48 | 8/19/14 16:42:48 | 0.00 / 0.00 | 0.0 / 0.0

One is on a laptop, and one is on a Xeon server machine I use at home.
----------------------------------------
oldies:UDgrid.org & PS3 Life@home


non-profit org. Play4Life in Zagreb, Croatia
[Aug 19, 2014 5:42:56 AM]
astroWX
Advanced Cruncher
USA
Joined: Sep 1, 2007
Post Count: 56
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

My farm caught a grand total of, count 'em, one task.

Nothing to add to what has already been posted.

Task ran 5:20:46 on i5-3550 (Ivy Bridge), no problems. Four upload files, no problems.
[Aug 19, 2014 7:22:02 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Now that I can easily see the Result Log for the units that exited with zero status but no finish file, those exits were all caused by "No heartbeat from core client for 30 sec". In all 3 cases, they were at the same time as another unit starting or finishing. After restarting, all 3 continued successfully to finish during Job#6 with RC = 0x1 and are now in PVal state.

Units that ran on the i5-750 completed either during Job#0 in 1.2 hours or during Job#6 in 5.2 to 8.9 hours (with first checkpoint after 2.5 to 4 hours). On an i7-4770K, completions took 4.9 to 6.8 hours, with similar first checkpoints.
[Aug 19, 2014 8:42:21 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

The heartbeat issue is a classic case of point 2 in your previous post: the zero status is an indicator, and too many of them will cause a task to abort too. Again, yes again, we need staggered starting. Any time a device starts up, or has been out of work and then pulls a new set of CEP2 tasks, all the tasks compete for access to the storage area. This is the number one cause of heartbeat failures and/or dreadful efficiency.

This I will repeat until I'm sick of it!
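For anyone unsure what staggered starting means in practice, here is a rough Python sketch. It is an illustration only, not an existing BOINC or agent feature: `staggered_start_offsets`, `launch_staggered` and the `gap_seconds` value are all assumptions. The idea is simply that tasks are launched one after another with a pause in between, so their initial disk bursts do not overlap.

```python
import time


def staggered_start_offsets(num_tasks, gap_seconds=60.0):
    """Start offsets in seconds, one per task, so no two tasks begin together."""
    return [i * gap_seconds for i in range(num_tasks)]


def launch_staggered(tasks, gap_seconds=60.0, sleep=time.sleep):
    """Call each task launcher in turn, pausing between consecutive launches.

    `tasks` is a list of zero-argument callables (hypothetical launchers).
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, task in enumerate(tasks):
        if index > 0:
            sleep(gap_seconds)  # give the previous task time to finish its start-up I/O
        results.append(task())
    return results
```

The right gap would depend on how long a task's start-up phase hammers the disk, which this sketch does not try to measure.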


The development ticket has been in for a long time now, but they would rather spend amazing effort on transmitting video presentations in the agent notices than focus on getting science computed with the least possible failure. Something to bring up at the SZTAKI workshop! Projects advocating climate change mitigation, which CEP2 is related to, are not going to be light; they are data hungry, and their models keep growing.
----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 19, 2014 9:32:47 AM]
[Aug 19, 2014 9:27:56 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

lavaflow, I agree completely.
[Aug 19, 2014 9:38:52 AM]
Mamajuanauk
Master Cruncher
United Kingdom
Joined: Dec 15, 2012
Post Count: 1900
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I don't know if this has already been said, but all my Betas on one machine (Ubuntu 12.04 server) have errored! The error from one is below; let me know if you want more info...
Result Log 	

Result Name: BETA_E225108_863_S.328.C42F3H27N2O1.KALSHVNKCWZFFK-UHFFFAOYSA-N.17_s1_14_1--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[19:02:18] Number of jobs = 8
[19:02:18] Starting job 0,CPU time has been restored to 0.000000.
[19:03:56] Starting new Job
[19:03:56] Qink name = fldman
[19:04:24] Qink name = gesman
[19:04:38] Qink name = scfman
[20:31:10] Qink name = anlman
[20:31:10] Qink name = drvman
[20:35:35] Qink name = optman
[20:35:36] Qink name = fldman
[20:35:36] Qink name = gesman
[20:35:40] Qink name = scfman
[21:04:15] Qink name = anlman
[21:04:15] Qink name = drvman
[21:08:27] Qink name = optman
[21:08:28] Qink name = fldman
[21:08:28] Qink name = gesman
[21:08:31] Qink name = scfman
[21:35:39] Qink name = anlman
[21:35:40] Qink name = drvman
[21:39:52] Qink name = optman
[21:39:53] Qink name = fldman
[21:39:53] Qink name = gesman
[21:39:57] Qink name = scfman
[22:06:34] Qink name = anlman
[22:06:34] Qink name = drvman
[22:11:06] Qink name = optman
[22:11:07] Qink name = fldman
[22:11:07] Qink name = gesman
[22:11:11] Qink name = scfman
[22:37:47] Qink name = anlman
[22:37:47] Qink name = drvman
[22:42:03] Qink name = optman
[22:42:07] Qink name = fldman
[22:42:07] Qink name = gesman
[22:42:11] Qink name = scfman
[23:08:27] Qink name = anlman
[23:08:27] Qink name = drvman
[23:12:49] Qink name = optman
[23:12:54] Qink name = fldman
[23:12:54] Qink name = gesman
[23:12:58] Qink name = scfman
[23:37:11] Qink name = anlman
[23:37:12] Qink name = drvman
[23:41:37] Qink name = optman
[23:41:38] Qink name = fldman
[23:41:38] Qink name = gesman
[23:41:42] Qink name = scfman
[00:03:58] Qink name = anlman
[00:03:58] Qink name = drvman
[00:08:06] Qink name = optman
[00:08:08] Qink name = fldman
[00:08:08] Qink name = gesman
[00:08:12] Qink name = scfman
[00:30:45] Qink name = anlman
[00:30:45] Qink name = drvman
[00:35:18] Qink name = optman
[00:35:19] Qink name = fldman
[00:35:19] Qink name = gesman
[00:35:25] Qink name = scfman
[00:55:59] Qink name = anlman
[00:55:59] Qink name = drvman
[01:00:01] Qink name = optman
[01:00:02] Qink name = fldman
[01:00:02] Qink name = gesman
[01:00:06] Qink name = scfman
[01:19:55] Qink name = anlman
[01:19:56] Qink name = drvman
[01:23:52] Qink name = optman
[01:23:53] Qink name = fldman
[01:23:53] Qink name = gesman
[01:23:56] Qink name = scfman
[01:43:13] Qink name = anlman
[01:43:14] Qink name = drvman
[01:47:04] Qink name = optman
[01:47:04] Qink name = fldman
[01:47:04] Qink name = gesman
[01:47:08] Qink name = scfman
[02:03:27] Qink name = anlman
[02:03:27] Qink name = drvman
[02:07:17] Qink name = optman
[02:07:18] Qink name = anlman
[02:13:10] End of Job
[02:13:13] Finished Job #0
[02:13:13] Starting job 1,CPU time has been restored to 23050.884000.
[02:13:14] Starting new Job
[02:13:14] Qink name = fldman
[05:00:40] Qink name = gesman
[05:00:40] Qink name = scfman
[06:58:25] Qink name = anlman
[07:04:41] End of Job
[07:04:42] Finished Job #1
[07:04:42] Starting job 2,CPU time has been restored to 30292.588000.
[07:04:43] Starting new Job
[07:04:43] Qink name = fldman
[07:04:45] Qink name = gesman
[07:04:47] Qink name = scfman
[07:30:01] Qink name = anlman
[07:36:16] End of Job
[07:36:18] Finished Job #2
[07:36:18] Starting job 3,CPU time has been restored to 32062.652000.
[07:36:19] Starting new Job
[07:36:19] Qink name = fldman
[07:36:22] Qink name = gesman
[07:36:22] Qink name = scfman
[08:07:54] Qink name = anlman
[08:14:00] End of Job
[08:14:02] Finished Job #3
[08:14:02] Starting job 4,CPU time has been restored to 34252.720000.
[08:14:02] Starting new Job
[08:14:03] Qink name = fldman
[08:14:05] Qink name = gesman
[08:14:06] Qink name = scfman
[08:35:51] Qink name = anlman
[08:42:21] End of Job
[08:42:22] Finished Job #4
[08:42:22] Starting job 5,CPU time has been restored to 35895.344000.
[08:42:23] Starting new Job
[08:42:23] Qink name = fldman
[08:42:26] Qink name = gesman
[08:42:26] Qink name = scfman
[08:56:46] Qink name = anlman
[09:07:39] End of Job
[09:07:43] Finished Job #5
[09:07:43] Starting job 6,CPU time has been restored to 37368.596000.
[09:07:44] Starting new Job
[09:07:45] Qink name = fldman
[09:08:03] Qink name = gesman
[09:08:06] Qink name = scfman

</stderr_txt>
]]>
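
In case it helps anyone reading a log like this, here is a small Python sketch that pulls the per-job wall-clock times out of the stderr text. The function name `job_wall_times` is made up for this post, and the line formats are assumed from the log above. Because the timestamps carry no date, it assumes each job takes less than 24 hours and wraps across midnight.

```python
import re

# Line shapes assumed from the stderr log above.
START = re.compile(r"\[(\d\d):(\d\d):(\d\d)\] Starting job (\d+),")
FINISH = re.compile(r"\[(\d\d):(\d\d):(\d\d)\] Finished Job #(\d+)")


def _seconds(hours, minutes, seconds):
    """Convert an HH, MM, SS triple of strings to seconds since midnight."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def job_wall_times(log_text):
    """Map job number to seconds between its 'Starting job' and 'Finished Job' lines."""
    starts = {}
    durations = {}
    for line in log_text.splitlines():
        match = START.search(line)
        if match:
            starts[int(match.group(4))] = _seconds(*match.group(1, 2, 3))
            continue
        match = FINISH.search(line)
        if match and int(match.group(4)) in starts:
            job = int(match.group(4))
            elapsed = _seconds(*match.group(1, 2, 3)) - starts[job]
            durations[job] = elapsed % 86400  # wrap past midnight; assumes under 24 h
    return durations
```

A job that has a start line but no finish line, like job 6 here, is simply absent from the result, which marks where the task stopped.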

----------------------------------------
Mamajuanauk is the Name! Crunching is the Game!



[Aug 19, 2014 1:49:54 PM]