World Community Grid Forums

Category: Beta Testing

Forum: Beta Test Support Forum

Thread: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Thread Status: Active
Total posts in this thread: 118

This topic has been viewed 26664 times and has 117 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Two observations are still unexplained.

1) All 3 "exits with zero status but no finish file" occurred with the same time stamp to the second as another unit finishing or starting. That sounds like a BOINC or local system issue. Too much of a coincidence otherwise.

2) All 3 of those units that exited then restarted on the same machine. Not a normal situation!

[Aug 18, 2014 9:36:34 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I'm not sure whether this is specific to the beta, but I've noticed more "No heartbeat from core client for 30 sec - exiting" messages than I used to get on one of my machines. So this evening I spent some time watching it.

I observed both active beta tasks reset themselves while I was copying a large video file from the system disk (the one that BOINC uses) to a removable disk. It occurred to me that, even though the CPU was hardly being used, there was obviously continuous reading of the BOINC disk. And if the task communication is not threaded, and one of the beta tasks tried to do some I/O, its low priority would cause it to go into a wait state until the file had completed copying (something that took several minutes).

I therefore conclude that either (a) inter-task "heartbeat" communication needs to be threaded so that it doesn't wait on I/O or (b -- and not as good) the 30 seconds need to be increased quite considerably.
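Option (a) can be illustrated with a toy Python sketch. This is not how the BOINC client or API is actually implemented; the class name `HeartbeatMonitor`, the queue-based delivery, and the short timeouts are all assumptions chosen for illustration. The point is only that heartbeat bookkeeping on its own thread keeps running while the worker is stuck waiting on slow I/O (simulated here with a sleep).

```python
import queue
import threading
import time


class HeartbeatMonitor:
    """Tracks the latest heartbeat on a dedicated thread (hypothetical design)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()
        self._inbox = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._listen, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def beat(self):
        """Called by the (simulated) core client to deliver one heartbeat."""
        self._inbox.put(None)

    def _listen(self):
        # Runs independently of the worker, so slow I/O there cannot delay it.
        while not self._stop.is_set():
            try:
                self._inbox.get(timeout=0.05)
            except queue.Empty:
                continue
            with self._lock:
                self._last_beat = time.monotonic()

    def expired(self):
        """True if no heartbeat has been recorded within the timeout."""
        with self._lock:
            return time.monotonic() - self._last_beat > self.timeout


def run_demo():
    """Return True if the heartbeat expired while the worker was blocked."""
    monitor = HeartbeatMonitor(timeout=0.3)
    monitor.start()

    def client():
        for _ in range(10):
            monitor.beat()
            time.sleep(0.05)

    sender = threading.Thread(target=client)
    sender.start()
    time.sleep(0.5)  # stands in for a worker stuck waiting on disk I/O
    sender.join()
    expired = monitor.expired()
    monitor.stop()
    return expired
```

In this sketch the worker is blocked for longer than the timeout, yet `run_demo()` reports that the heartbeat never expired, because the listener thread kept recording beats in the meantime.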

Just my 2p'th.

[Aug 18, 2014 10:24:31 PM]

Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline
Project Badges:

10 year badge for The Clean Energy Project - Phase 2

90 day badge for GO Fight Against Malaria

100 year badge for Mapping Cancer Markers

50 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

50 year badge for FightAIDS@Home - Phase 2

50 year badge for Smash Childhood Cancer

50 year badge for Microbiome Immunity Project

20 year badge for Africa Rainfall Project

20 year badge for OpenPandemics - COVID-19


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Mine errored out around 1 hr 20 min but log says:
Computation for task BETA_E225106_546_S.326.C37H26N2O4S2.KRRWRYCZFPKTNW-UHFFFAOYSA-N.20_s1_14_4 finished

Mac laptop, OS X 10.9.4, 2.4 GHz i5

----------------------------------------

[Aug 18, 2014 11:19:36 PM]

KWSN - A Shrubbery
Master Cruncher
Joined: Jan 8, 2006
Post Count: 1585
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

No idea how many my systems grabbed. The Results Status page shows 14 pages' worth. A few have ended rather early.

Hopefully you'll get some useful data off the results.

----------------------------------------

Distributed computing volunteer since September 27, 2000

[Aug 18, 2014 11:55:38 PM]

KLiK
Master Cruncher
Croatia
Joined: Nov 13, 2006
Post Count: 3108
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

90 day badge for Help Cure Muscular Dystrophy

90 day badge for Discovering Dengue Drugs - Together

1 year badge for Nutritious Rice for the World

90 day badge for The Clean Energy Project

2 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

1 year badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

10 year badge for Uncovering Genome Mysteries

10 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

10 year badge for Smash Childhood Cancer

10 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

50 year badge for OpenPandemics - COVID-19


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Still waiting for the results:

BETA_E225108_236_S.328.C46H31N3.OZIJTKWPTJUCKQ-UHFFFAOYSA-N.7_s1_14_1-- p4l-fsc1410a In Progress 8/18/14 17:49:35 8/22/14 17:49:35 0.00 / 0.00 0.0 / 0.0
BETA_E225108_70_S.328.C41H25N7O1.JFONBYKDRGSVKP-UHFFFAOYSA-N.12_s1_14_1-- VS4 In Progress 8/18/14 17:46:52 8/22/14 17:46:52 0.00 / 0.00 0.0 / 0.0
BETA_E225106_551_S.326.C37H25N1O5S2.KZXWWNDVNPIUHY-UHFFFAOYSA-N.5_s1_14_0-- p4l-fsc1410a In Progress 8/15/14 16:42:48 8/19/14 16:42:48 0.00 / 0.00 0.0 / 0.0

One is on a laptop, and one is on a Xeon server machine I use at home.

----------------------------------------

oldies:UDgrid.org & PS3 Life@home

non-profit org. Play4Life in Zagreb, Croatia

[Aug 19, 2014 5:42:56 AM]

astroWX
Advanced Cruncher
USA
Joined: Sep 1, 2007
Post Count: 56
Status: Offline
Project Badges:

45 day badge for Human Proteome Folding - Phase 2

1 year badge for Help Fight Childhood Cancer

5 year badge for Computing for Clean Water

2 year badge for Computing for Sustainable Water

180 day badge for Uncovering Genome Mysteries

20 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

My farm caught a grand total of, count 'em, one task.

Nothing to add to what has already been posted.

Task ran 5:20:46 on i5-3550 (Ivy Bridge), no problems. Four upload files, no problems.

[Aug 19, 2014 7:22:02 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Now that I can easily see the Result Log for the units that exited with zero status but no finish file, those exits were all caused by "No heartbeat from core client for 30 sec". In all 3 cases, they were at the same time as another unit starting or finishing. After restarting, all 3 continued successfully to finish during Job#6 with RC = 0x1 and are now in PVal state.

Units that ran on the i5-750 completed either during Job#0 in 1.2 hours or during Job#6 in 5.2 to 8.9 hours (with first checkpoint after 2.5 to 4 hours). On an i7-4770K, completions took 4.9 to 6.8 hours, with similar first checkpoints.

[Aug 19, 2014 8:42:21 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

The heartbeat issue is a classic case of point 2 in your previous post: the zero-status exit is an indicator, and too many of them will cause a task to abort as well. Again, yes again, we need staggered starting. Any time a device starts up or has been out of work and then pulls a new set of CEP2 tasks, all the tasks compete for access to the storage area. This is the number one cause of heartbeat failures and/or dreadful efficiency.

This I will repeat until I'm sick of it!

The development ticket has been in for a long time now, but they would rather spend amazing effort on transmitting video presentations in the agent notices than focus on getting science computed with the fewest possible failures. Something to bring up at the SZTAKI workshop! Projects advocating climate change mitigation, which CEP2 is related to, are not going to be light; they are data hungry, and their models keep growing.
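As a rough illustration of what staggered starting could look like, here is a minimal Python sketch. It is not taken from the BOINC client or from the development ticket; the function names `stagger_offsets` and `launch_staggered` and the gap value are hypothetical. Each task is assumed to be a callable that kicks off its work and returns immediately.

```python
import time
from typing import Callable, List, Sequence


def stagger_offsets(n_tasks: int, gap_seconds: float) -> List[float]:
    """Start offsets such that consecutive tasks begin gap_seconds apart."""
    return [i * gap_seconds for i in range(n_tasks)]


def launch_staggered(
    tasks: Sequence[Callable[[], None]],
    gap_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Kick off each task in turn, pausing gap_seconds between launches.

    The sleep function is injectable so the pacing can be checked
    without actually waiting.
    """
    for index, task in enumerate(tasks):
        if index > 0:
            sleep(gap_seconds)  # let the previous task get past its disk-heavy startup
        task()
```

With a gap of 60 seconds, four tasks would start at offsets 0, 60, 120 and 180 seconds instead of all hitting the storage area at the same instant.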

----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 19, 2014 9:32:47 AM]

[Aug 19, 2014 9:27:56 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

lavaflow, I agree completely.

[Aug 19, 2014 9:38:52 AM]

Mamajuanauk
Master Cruncher
United Kingdom
Joined: Dec 15, 2012
Post Count: 1900
Status: Offline
Project Badges:

45 day badge for Help Fight Childhood Cancer

50 year badge for The Clean Energy Project - Phase 2

45 day badge for Computing for Clean Water

2 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

100 year badge for Uncovering Genome Mysteries

100 year badge for Outsmart Ebola Together


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I don't know if this has already been said, but all my Betas on one machine (Ubuntu 12.04 server) have errored! The error from one is below; let me know if you want more info...

Result Log

Result Name: BETA_E225108_863_S.328.C42F3H27N2O1.KALSHVNKCWZFFK-UHFFFAOYSA-N.17_s1_14_1--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[19:02:18] Number of jobs = 8
[19:02:18] Starting job 0,CPU time has been restored to 0.000000.
[19:03:56] Starting new Job
[19:03:56] Qink name = fldman
[19:04:24] Qink name = gesman
[19:04:38] Qink name = scfman
[20:31:10] Qink name = anlman
[20:31:10] Qink name = drvman
[20:35:35] Qink name = optman
[20:35:36] Qink name = fldman
[20:35:36] Qink name = gesman
[20:35:40] Qink name = scfman
[21:04:15] Qink name = anlman
[21:04:15] Qink name = drvman
[21:08:27] Qink name = optman
[21:08:28] Qink name = fldman
[21:08:28] Qink name = gesman
[21:08:31] Qink name = scfman
[21:35:39] Qink name = anlman
[21:35:40] Qink name = drvman
[21:39:52] Qink name = optman
[21:39:53] Qink name = fldman
[21:39:53] Qink name = gesman
[21:39:57] Qink name = scfman
[22:06:34] Qink name = anlman
[22:06:34] Qink name = drvman
[22:11:06] Qink name = optman
[22:11:07] Qink name = fldman
[22:11:07] Qink name = gesman
[22:11:11] Qink name = scfman
[22:37:47] Qink name = anlman
[22:37:47] Qink name = drvman
[22:42:03] Qink name = optman
[22:42:07] Qink name = fldman
[22:42:07] Qink name = gesman
[22:42:11] Qink name = scfman
[23:08:27] Qink name = anlman
[23:08:27] Qink name = drvman
[23:12:49] Qink name = optman
[23:12:54] Qink name = fldman
[23:12:54] Qink name = gesman
[23:12:58] Qink name = scfman
[23:37:11] Qink name = anlman
[23:37:12] Qink name = drvman
[23:41:37] Qink name = optman
[23:41:38] Qink name = fldman
[23:41:38] Qink name = gesman
[23:41:42] Qink name = scfman
[00:03:58] Qink name = anlman
[00:03:58] Qink name = drvman
[00:08:06] Qink name = optman
[00:08:08] Qink name = fldman
[00:08:08] Qink name = gesman
[00:08:12] Qink name = scfman
[00:30:45] Qink name = anlman
[00:30:45] Qink name = drvman
[00:35:18] Qink name = optman
[00:35:19] Qink name = fldman
[00:35:19] Qink name = gesman
[00:35:25] Qink name = scfman
[00:55:59] Qink name = anlman
[00:55:59] Qink name = drvman
[01:00:01] Qink name = optman
[01:00:02] Qink name = fldman
[01:00:02] Qink name = gesman
[01:00:06] Qink name = scfman
[01:19:55] Qink name = anlman
[01:19:56] Qink name = drvman
[01:23:52] Qink name = optman
[01:23:53] Qink name = fldman
[01:23:53] Qink name = gesman
[01:23:56] Qink name = scfman
[01:43:13] Qink name = anlman
[01:43:14] Qink name = drvman
[01:47:04] Qink name = optman
[01:47:04] Qink name = fldman
[01:47:04] Qink name = gesman
[01:47:08] Qink name = scfman
[02:03:27] Qink name = anlman
[02:03:27] Qink name = drvman
[02:07:17] Qink name = optman
[02:07:18] Qink name = anlman
[02:13:10] End of Job
[02:13:13] Finished Job #0
[02:13:13] Starting job 1,CPU time has been restored to 23050.884000.
[02:13:14] Starting new Job
[02:13:14] Qink name = fldman
[05:00:40] Qink name = gesman
[05:00:40] Qink name = scfman
[06:58:25] Qink name = anlman
[07:04:41] End of Job
[07:04:42] Finished Job #1
[07:04:42] Starting job 2,CPU time has been restored to 30292.588000.
[07:04:43] Starting new Job
[07:04:43] Qink name = fldman
[07:04:45] Qink name = gesman
[07:04:47] Qink name = scfman
[07:30:01] Qink name = anlman
[07:36:16] End of Job
[07:36:18] Finished Job #2
[07:36:18] Starting job 3,CPU time has been restored to 32062.652000.
[07:36:19] Starting new Job
[07:36:19] Qink name = fldman
[07:36:22] Qink name = gesman
[07:36:22] Qink name = scfman
[08:07:54] Qink name = anlman
[08:14:00] End of Job
[08:14:02] Finished Job #3
[08:14:02] Starting job 4,CPU time has been restored to 34252.720000.
[08:14:02] Starting new Job
[08:14:03] Qink name = fldman
[08:14:05] Qink name = gesman
[08:14:06] Qink name = scfman
[08:35:51] Qink name = anlman
[08:42:21] End of Job
[08:42:22] Finished Job #4
[08:42:22] Starting job 5,CPU time has been restored to 35895.344000.
[08:42:23] Starting new Job
[08:42:23] Qink name = fldman
[08:42:26] Qink name = gesman
[08:42:26] Qink name = scfman
[08:56:46] Qink name = anlman
[09:07:39] End of Job
[09:07:43] Finished Job #5
[09:07:43] Starting job 6,CPU time has been restored to 37368.596000.
[09:07:44] Starting new Job
[09:07:45] Qink name = fldman
[09:08:03] Qink name = gesman
[09:08:06] Qink name = scfman

</stderr_txt>
]]>
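For anyone wanting to compare per-job run times across machines, the timestamps in a result log like the one above can be turned into durations. The following Python sketch is only an assumption about how one might do it: the function name `job_durations` is made up, it measures wall-clock time rather than CPU time, and it assumes consecutive log lines are never more than 24 hours apart so that a timestamp going backwards means midnight was crossed.

```python
import re
from datetime import timedelta
from typing import Dict, Iterable

LINE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\] (.*)$")
START = re.compile(r"^Starting job (\d+),")
FINISH = re.compile(r"^Finished Job #(\d+)$")


def job_durations(lines: Iterable[str]) -> Dict[int, timedelta]:
    """Map job number to wall-clock duration between its start and finish lines."""
    day = 0        # number of midnight rollovers seen so far
    previous = -1  # previous timestamp, in seconds since midnight
    started: Dict[int, int] = {}
    durations: Dict[int, timedelta] = {}
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines without a [HH:MM:SS] prefix
        h, m, s, text = match.groups()
        seconds = int(h) * 3600 + int(m) * 60 + int(s)
        if seconds < previous:  # clock went backwards: assume a new day
            day += 1
        previous = seconds
        absolute = day * 86400 + seconds
        start = START.match(text)
        finish = FINISH.match(text)
        if start:
            started[int(start.group(1))] = absolute
        elif finish and int(finish.group(1)) in started:
            job = int(finish.group(1))
            durations[job] = timedelta(seconds=absolute - started[job])
    return durations


# Four lines copied from the log above.
sample = [
    "[19:02:18] Starting job 0,CPU time has been restored to 0.000000.",
    "[02:13:13] Finished Job #0",
    "[02:13:13] Starting job 1,CPU time has been restored to 23050.884000.",
    "[07:04:42] Finished Job #1",
]
result = job_durations(sample)
# result[0] is 7:10:55 and result[1] is 4:51:29
```

Job 6 in the log above has a start line but no finish line, so it would simply be absent from the returned mapping.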

----------------------------------------

Mamajuanauk is the Name! Crunching is the Game!

[Aug 19, 2014 1:49:54 PM]
