Total posts in this thread: 114
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

The purging issue (true obliteration of the slot content of old jobs) was resolved somewhere around BOINC 7.6.9.


The WCG release is 7.2.42 for Linux/Mac and 7.2.47 for Win. If this issue is affecting WUs, then a higher WCG release will need to be organised.
[Feb 28, 2016 11:18:47 AM]
Eric_Kaiser
Veteran Cruncher
Germany (Hessen)
Joined: May 7, 2013
Post Count: 1047
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

Got one that errored out:

Result Name: BETA_E236295_816_S.316.C31H24N4O6S1Si1.MUWKPBKDUBLKIH-UHFFFAOYSA-N.4_s1_14_1--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[11:02:49] Number of jobs = 5
[11:02:49] Starting job 0,CPU time has been restored to 0.000000.
[11:02:49] Starting new Job
[11:02:49] Qink name = fldman
[11:02:51] Qink name = gesman
[11:02:52] Qink name = scfman
[12:57:18] Qink name = anlman
[12:57:19] Qink name = drvman
[12:59:48] Qink name = optman
[12:59:48] Qink name = fldman
[12:59:48] Qink name = gesman
[12:59:49] Qink name = scfman
[13:15:24] Qink name = anlman
[13:15:24] Qink name = drvman
[13:17:51] Qink name = optman
[13:17:51] Qink name = fldman
[13:17:51] Qink name = gesman
[13:17:52] Qink name = scfman
[13:32:52] Qink name = anlman
[13:32:52] Qink name = drvman
[13:35:17] Qink name = optman
[13:35:17] Qink name = fldman
[13:35:17] Qink name = gesman
[13:35:18] Qink name = scfman
[13:49:26] Qink name = anlman
[13:49:26] Qink name = drvman
[13:51:52] Qink name = optman
[13:51:52] Qink name = fldman
[13:51:52] Qink name = gesman
[13:51:54] Qink name = scfman
[14:06:52] Qink name = anlman
[14:06:52] Qink name = drvman
[14:09:18] Qink name = optman
[14:09:18] Qink name = fldman
[14:09:18] Qink name = gesman
[14:09:19] Qink name = scfman
[14:24:22] Qink name = anlman
[14:24:22] Qink name = drvman
[14:26:49] Qink name = optman
[14:26:49] Qink name = fldman
[14:26:49] Qink name = gesman
[14:26:51] Qink name = scfman
[14:41:59] Qink name = anlman
[14:41:59] Qink name = drvman
[14:44:23] Qink name = optman
[14:44:23] Qink name = fldman
[14:44:23] Qink name = gesman
[14:44:24] Qink name = scfman
[14:59:31] Qink name = anlman
[14:59:31] Qink name = drvman
[15:01:56] Qink name = optman
[15:01:56] Qink name = fldman
[15:01:56] Qink name = gesman
[15:01:58] Qink name = scfman
[15:17:06] Qink name = anlman
[15:17:06] Qink name = drvman
[15:19:33] Qink name = optman
[15:19:33] Qink name = fldman
[15:19:33] Qink name = gesman
[15:19:35] Qink name = scfman
[15:33:50] Qink name = anlman
[15:33:50] Qink name = drvman
[15:36:17] Qink name = optman
[15:36:17] Qink name = fldman
[15:36:17] Qink name = gesman
[15:36:18] Qink name = scfman
[15:50:38] Qink name = anlman
[15:50:38] Qink name = drvman
[15:53:05] Qink name = optman
[15:53:05] Qink name = fldman
[15:53:05] Qink name = gesman
[15:53:06] Qink name = scfman
[16:08:15] Qink name = anlman
[16:08:15] Qink name = drvman
[16:10:40] Qink name = optman
[16:10:41] Qink name = fldman
[16:10:41] Qink name = gesman
[16:10:42] Qink name = scfman
[16:24:47] Qink name = anlman
[16:24:47] Qink name = drvman
[16:27:11] Qink name = optman
[16:27:11] Qink name = fldman
[16:27:11] Qink name = gesman
[16:27:13] Qink name = scfman
[16:41:25] Qink name = anlman
[16:41:25] Qink name = drvman
[16:43:51] Qink name = optman
[16:43:51] Qink name = fldman
[16:43:51] Qink name = gesman
[16:43:53] Qink name = scfman
[16:57:46] Qink name = anlman
[16:57:46] Qink name = drvman
[17:00:13] Qink name = optman
[17:00:14] Qink name = fldman
[17:00:14] Qink name = gesman
[17:00:15] Qink name = scfman
[17:13:01] Qink name = anlman
[17:13:01] Qink name = drvman
[17:15:26] Qink name = optman
[17:15:26] Qink name = fldman
[17:15:26] Qink name = gesman
[17:15:28] Qink name = scfman
[17:28:22] Qink name = anlman
[17:28:22] Qink name = drvman

</stderr_txt>
]]>

The WU also errored out on two other clients with "Disk usage limit exceeded", which is not an error raised by the WU itself.
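
For anyone who wants to see where the time goes in a log like the one above, here is a minimal Python sketch that turns the "Qink name" lines into per-module durations. The function name `module_durations` and the assumption that a module runs until the next timestamped "Qink name" line are mine; nothing in this thread documents how the application actually schedules these modules.

```python
import re
from collections import defaultdict

# Matches lines such as "[11:02:49] Qink name = fldman".
LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\] Qink name = (\w+)")

def module_durations(stderr_text):
    """Sum the wall-clock seconds attributed to each Qink module.

    Assumption: a module runs from its own timestamp until the next
    "Qink name" line; the last module has no end time and is skipped.
    """
    events = []
    for match in LINE_RE.finditer(stderr_text):
        h, m, s, name = match.groups()
        events.append((int(h) * 3600 + int(m) * 60 + int(s), name))

    totals = defaultdict(int)
    for (start, name), (end, _next_name) in zip(events, events[1:]):
        # Timestamps carry no date, so a backwards jump is treated as midnight.
        totals[name] += (end - start) % 86400
    return dict(totals)

# First cycle of the log above.
sample = """\
[11:02:49] Qink name = fldman
[11:02:51] Qink name = gesman
[11:02:52] Qink name = scfman
[12:57:18] Qink name = anlman
[12:57:19] Qink name = drvman
[12:59:48] Qink name = optman
"""
print(module_durations(sample))
```

On that first cycle, scfman accounts for 6866 of the 7019 seconds between the first and last line, which is where I would look first for a change in behaviour.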
----------------------------------------

[Feb 28, 2016 11:48:58 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

Don't hold me to it, but the root problem was related to VM jobs larger than 4 GB, which somehow did not get deleted on closing, so the next job would walk into an old, non-empty slot and hit the disk maximum right off the bat. That is not presently a problem for WCG, unless you also run those VM jobs. [I think this was discussed on these forums too: you would gradually run into the maximum number of allowed slots (200), where normally BOINC would delete empty/old slots.]

edit: One of the related fixes

7.6.1.

client: detect errors in directory enumeration.
Previously, the dir_scan() function didn't distinguish between
• reaching the end of the directory.
• errors.
It just returned nonzero in either case. This means that the function that cleans out a slot dir (client_clean_out_dir()) could potentially return success even though the directory is nonempty. This could potentially cause the recently-reported problem where a slot dir contains a VM image from a previous job.

and

client: fix bug when delete > 4GB file.
The function to delete a slot dir file (delete_project_owned_file()) called boinc_file_or_symlink_exists() and returned success (with no message) if that returned false.
boinc_file_or_symlink_exists() incorrectly returned false for a > 4GB file on Win, because it used stat(), which handles only 32-bit file sizes.

Fix: remove the call to boinc_file_or_symlink_exists(); instead, always call DeleteFile(), and check for the ERROR_FILE_NOT_FOUND status. David will fix the stat() problem later.
client, Win: use _stat64() instead of _stat(); _stat() returns error for > 4GB files.
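
The fix quoted above boils down to "try the delete and inspect the result" instead of "check existence first, then delete". A minimal Python sketch of that pattern follows. It is an analogy only, not BOINC's C++ code, and `delete_slot_file` is a hypothetical name.

```python
import os
import tempfile

def delete_slot_file(path):
    """Delete path without asking first whether it exists.

    Returns True if a file was removed and False if there was nothing to
    remove. Any other failure (permissions, file in use, ...) propagates as
    OSError, so a file that could not be deleted is never reported as success.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Python's counterpart of checking for ERROR_FILE_NOT_FOUND.
        return False

# Usage: the first call removes the file, the second finds nothing to remove.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
print(delete_slot_file(tmp_path))
print(delete_slot_file(tmp_path))
```

The point of the design is that a faulty existence check can no longer turn "file still there" into a silent success.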

then in 7.6.11

client: fix a bug introduced in commit [44c82be] which prevented the re-use of empty slots.
This bug affects only Mac / Linux / UNIX builds. It does not affect Windows.

BTW, CEP2 v7.0, which this beta is running on, is 32-bit only.
----------------------------------------
[Edit 2 times, last edit by SekeRob* at Feb 28, 2016 12:02:15 PM]
[Feb 28, 2016 11:52:01 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1316
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

The purging issue (true obliteration of the slot content of old jobs) was resolved somewhere around BOINC 7.6.9.

The WCG release is 7.2.42 for Linux/Mac and 7.2.47 for Win. If this issue is affecting WUs, then a higher WCG release will need to be organised.

The problem here is not the purging of remnants left by previous tasks that used a BOINC slot, which is solved in newer BOINC clients. Rather, the application itself produces more or bigger files that together exceed the slot disk_bound of 2048 MB.

Normally a CEP2 task uses about 800-900 MB of disk space in a slot, spread over about 6570 files.
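
To compare a running task against those numbers, something like the following Python sketch could total up a slot directory. The names `slot_usage` and `over_disk_bound` are hypothetical, reading "2048MB" as 2048 * 1024 * 1024 bytes is an assumption, and the client may well measure disk usage differently.

```python
import os

# Assumption: the 2048 MB disk_bound is meant as 2048 * 1024 * 1024 bytes.
DISK_BOUND_BYTES = 2048 * 1024 * 1024

def slot_usage(slot_dir):
    """Return (total_bytes, file_count) for everything under slot_dir."""
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                total_bytes += os.path.getsize(path)
                file_count += 1
            except OSError:
                # Files can vanish while a task is running; skip those.
                pass
    return total_bytes, file_count

def over_disk_bound(slot_dir, bound=DISK_BOUND_BYTES):
    """True if the slot directory currently holds more than bound bytes."""
    total_bytes, _count = slot_usage(slot_dir)
    return total_bytes > bound
```

Calling `slot_usage` on a slot directory every few minutes while one of these beta tasks runs should show whether usage creeps up from the usual 800-900 MB or jumps at a particular step.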
----------------------------------------

[Feb 28, 2016 11:54:39 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

You introduced the purging angle :O). Anyway, it has been a very, very long time since the CEP2 slot limit was set at 2 GB [with some complaining about it because they could not get CEP2 work, even though run-of-the-mill tasks hardly ever reached that number], but I gather tasks could get close to it. I hope this is not a bug and that only a relatively small increase of the disk_bound setting is needed. If it is an [old] bug, we could be at this longer, but then there may be an opportunity to lower the setting once it is fixed. :)))
[Feb 28, 2016 12:15:46 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1316
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

Dilemma: Run or abort.

I got copy number 5 of a workunit (_4); 3 copies failed due to maximum disk usage and 1 is in PVal (pending validation).
However, this PVal result has a strange application exit, right where the errored tasks got their abort signal.

[05:38:13] Qink name = anlman
Application exited with RC = 0x8b
[05:52:15] Finished Job #3
[05:52:15] Starting job 4,CPU time has been restored to 21391.796000.
[05:52:15] Skipping Job #4
05:52:16 (17591): called boinc_finish
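
A small Python sketch for reading that RC value: 0x8b is 139 in decimal, and under the common POSIX shell convention a status above 128 means "terminated by signal (status - 128)". Whether the CEP2 wrapper reports its status this way is an assumption, and `decode_exit_status` is a hypothetical helper.

```python
import signal

def decode_exit_status(rc):
    """Split a shell-style exit status into (kind, value).

    Assumption: statuses above 128 follow the common POSIX shell
    convention of 128 + signal number.
    """
    if rc > 128:
        return "signal", rc - 128
    return "exit", rc

kind, value = decode_exit_status(0x8B)
print(kind, value)           # 0x8b is 139, i.e. 128 + 11
print(int(signal.SIGSEGV))   # the segmentation-fault signal number
```

If that reading is right, this would be the same signal 11 that shows up as "process got signal 11" in the errored result posted earlier in this thread.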


Maybe this link to the workunit is clickable: http://www.worldcommunitygrid.org/ms/device/v....do?workunitId=1636667981
----------------------------------------

[Feb 28, 2016 12:32:13 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

[Regrettably] we still only get to see the header, not the quorum/distribution details, when following links to other members' results.
[Feb 28, 2016 12:36:05 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

There's one _0 still running in the 17th hour after 1 checkpoint so far.
BETA_E236295_768_S.316.C33H24N10O3.KXJKZDDABLYRQF-UHFFFAOYSA-N.13_s1_14_0--
CPU Time at last checkpoint: 16:04:32
Fraction done: 91.353%
Elapsed time: 17:52:55

Let's have a look.

INFO: No state to restore. Start from the beginning.
[09:18:25] Number of jobs = 5
[09:18:25] Starting job 0,CPU time has been restored to 0.000000.
[09:18:28] Starting new Job
[09:18:28] Qink name = fldman
[09:18:30] Qink name = gesman
[09:18:31] Qink name = scfman
[10:48:04] Qink name = anlman
[10:48:04] Qink name = drvman
[10:51:56] Qink name = optman
...
(This was yesterday, we now return to today, more than 23½ hours later)
...
[10:34:40] Qink name = drvman
[10:38:24] Qink name = optman
[10:38:25] Qink name = anlman
[10:40:25] End of Job
[10:40:27] Finished Job #0
[10:40:27] Starting job 1,CPU time has been restored to 57872.721486.
[10:40:31] Starting new Job
[10:40:31] Qink name = fldman
[10:40:32] Qink name = gesman
[10:40:32] Qink name = scfman
...
(Waiting for the 18th hour mark)
...
(There it is:)
Fraction done: 91.995%
Elapsed time: 18:00:00
(and still running! Watching live!)
...
[13:00:21] Qink name = anlman
[13:02:18] End of Job
[13:02:20] Finished Job #1
[13:02:20] Starting job 2,CPU time has been restored to 63587.578195.
[13:02:24] Starting new Job
[13:02:24] Qink name = fldman
[13:02:26] Qink name = gesman
[13:02:27] Qink name = scfman
...
CPU time: 17:54:35
Elapsed time: 19:23:29
Estimated time remaining: 00:05:48
Fraction done: 99.499%
...
[13:27:37] Qink name = anlman
[13:29:30] End of Job
[13:29:33] Finished Job #2
[13:29:33] Starting job 3,CPU time has been restored to 64611.824290.
[13:29:37] Starting new Job
[13:29:37] Qink name = fldman
[13:29:46] Qink name = gesman
[13:29:48] Qink name = scfman
...
CPU time: 17:59:10
Elapsed time: 19:29:29
Estimated time remaining: 00:00:54
Fraction done: 99.923%
...
CPU time: 18:00:00
Elapsed time: 19:30:21
...
Killing job because cpu time limit has been exceeded. 64611.824290||189.070813||0.000000
[13:34:13] Finished Job #3
13:34:16 (26240): called boinc_finish
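
The numbers in the kill message can be checked with a few lines of Python. The first value equals the CPU time restored at the start of job 3 in the log above. Reading the second value as the CPU seconds used by job 3 itself is my assumption, as is the helper name `parse_kill_line`.

```python
CPU_LIMIT_SECONDS = 18 * 3600  # 18:00:00, the CPU time shown when the task stopped

def parse_kill_line(line):
    """Extract the three '||'-separated numbers from the kill message."""
    numbers = line.rsplit(" ", 1)[-1]
    return [float(part) for part in numbers.split("||")]

line = ("Killing job because cpu time limit has been exceeded. "
        "64611.824290||189.070813||0.000000")
restored, current_job, _unknown = parse_kill_line(line)
total = restored + current_job
print(round(total, 6), total >= CPU_LIMIT_SECONDS)
```

Under that reading the two values add up to just under one second past 64800, so the job was cut off at the first check after the 18-hour CPU mark.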

Uploading
[Feb 28, 2016 12:39:58 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

At least you had one checkpoint, so it continues to PV; with no checkpoint it would have been slammed with a compute error status.
[Feb 28, 2016 12:43:24 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: Clean Energy Project - Phase 2 Beta Feb 24, 2016 [ Issues Thread ]

I have several in PV with the "Killing job" tag. Besides the tag, they all have one thing in common: exactly the same CPU time of 18 hours, not 1 second more or less. The elapsed times were all different. The first log is from a Win7 64-bit machine. The second is from an XP 32-bit machine. None were resends. What exactly did the XP machine do for 18 hours besides finish job #0?
EDIT: Found another from the XP box that got as far as job #3. Added results below.

Result Log

Result Name: BETA_E236295_80_S.318.C35H26N6O1S2.ZJGWWSBLAUPGGW-UHFFFAOYSA-N.1_s1_14_1--
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[19:57:25] Number of jobs = 5
[19:57:25] Starting job 0,CPU time has been restored to 0.000000.
[12:33:29] Finished Job #0
[12:33:29] Starting job 1,CPU time has been restored to 59495.879781.
[13:46:29] Finished Job #1
[13:46:29] Starting job 2,CPU time has been restored to 63871.146228.
Killing job because cpu time has been exceeded. Subjob start time = -1377440680, Subjob current time = 1089417188
[14:02:03] Finished Job #2
14:02:07 (4484): called boinc_finish

</stderr_txt>
]]>


Result Log

Result Name: BETA_E236293_616_S.316.C34H25N9O3.FXWAJNJNCOLDMR-UHFFFAOYSA-N.11_s1_14_1--
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[04:42:28] Number of jobs = 5
[04:42:28] Starting job 0,CPU time has been restored to 0.000000.
Killing job because cpu time has been exceeded. Subjob start time = 0, Subjob current time = 0
[23:36:02] Finished Job #0
23:36:03 (2652): called boinc_finish

</stderr_txt>
]]>


Result Log

Result Name: BETA_E236293_149_S.318.C36H27N5O1S2.XARSRLYGJRENDJ-UHFFFAOYSA-N.14_s1_14_1--
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[04:17:54] Number of jobs = 5
[04:17:54] Starting job 0,CPU time has been restored to 0.000000.
[15:51:06] Finished Job #0
[15:51:06] Starting job 1,CPU time has been restored to 39822.468750.
[17:54:10] Finished Job #1
[17:54:10] Starting job 2,CPU time has been restored to 47101.250000.
[18:18:16] Finished Job #2
[18:18:16] Starting job 3,CPU time has been restored to 48478.890625.
Killing job because cpu time has been exceeded. Subjob start time = -2147483648, Subjob current time = 1088924636
[22:54:53] Finished Job #3
22:54:56 (3672): called boinc_finish

</stderr_txt>
]]>
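
One possible reading of the odd "Subjob start time" / "Subjob current time" pairs in these three logs: if each pair is treated as the low and high 32-bit halves of a single 64-bit double, the result matches the "CPU time has been restored to" value of the job that was killed. That would be consistent with a floating-point value being printed with integer format specifiers, but this is a guess; nothing in the thread confirms how the message is produced. `join_halves` is a hypothetical helper.

```python
import struct

def join_halves(low_word, high_word):
    """Reassemble a little-endian IEEE 754 double from two signed 32-bit ints."""
    return struct.unpack("<d", struct.pack("<ii", low_word, high_word))[0]

# (printed "start time", printed "current time", CPU time restored for that job)
cases = [
    (-1377440680, 1089417188, 63871.146228),
    (0, 0, 0.0),
    (-2147483648, 1088924636, 48478.890625),
]
for low, high, restored in cases:
    print(join_halves(low, high), restored)
```

If this holds, the message carries the CPU time at which the subjob started rather than two separate timestamps, which might be useful to the techs when they look at the 18-hour cutoff.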
[Edit 2 times, last edit by nanoprobe at Feb 28, 2016 1:23:45 PM]
[Feb 28, 2016 1:18:12 PM]