World Community Grid - View Thread - All work units failing to process



Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


All work units failing to process

Hello,
I just installed BOINC 5.8.16 (due to an old GLIBC version) on a CentOS 3.9 system.
After registering the client, it started downloading several work units very quickly (8 CPUs on the system). However, 4 failed within a couple of seconds of downloading, always with the same message:

<core_client_version>5.8.16</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
ERROR:: Exit at: rotamer_functions.cc line:1404

</stderr_txt>
]]>
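
For anyone comparing several failed results, the report above can be picked apart with a short script. This is only a sketch: the function name `summarize_report` and the regular expressions are assumptions made for illustration, based solely on the text of this one report, and are not part of BOINC.

```python
import re

# The failure report exactly as posted above.
REPORT = """<core_client_version>5.8.16</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
ERROR:: Exit at: rotamer_functions.cc line:1404

</stderr_txt>
]]>"""


def summarize_report(text):
    """Extract client version, exit code, and error location from a report.

    Hypothetical helper: the patterns match the wording of the report
    quoted in this thread and may not fit other reports.
    """
    version = re.search(r"<core_client_version>([^<]+)</core_client_version>", text)
    exit_code = re.search(r"process exited with code (-?\d+)", text)
    location = re.search(r"Exit at: (\S+) line:(\d+)", text)
    return {
        "client_version": version.group(1) if version else None,
        "exit_code": int(exit_code.group(1)) if exit_code else None,
        "source_file": location.group(1) if location else None,
        "source_line": int(location.group(2)) if location else None,
    }


if __name__ == "__main__":
    print(summarize_report(REPORT))
```

Running it on each failed result makes it easy to confirm whether they really all stop at the same source line.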

4 other work units are still in progress, at least according to the website; however, the system is >99% idle, so no real processing is going on. Even with "./boinc -return_results_immediately" I do not see anything happening.

What could be going on here?

[Apr 20, 2010 8:32:19 PM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline

Re: All work units failing to process

Welcome here danielfrank!
Could you please post the message log of your BOINC client from the very first line down to the error messages (or down to the end if it is no longer doing anything)?

Read you later.
Jean.

----------------------------------------

Team--> Decrypthon -->Statistics/Join -->Thread

[Apr 20, 2010 9:57:11 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: All work units failing to process

I noticed that everything works fine on another system that runs OpenSolaris with a CentOS 5.4 zone, so I have already rebuilt my BOINC zone and it's running fine now.
I don't know for sure whether the old glibc in CentOS 3.9 is the cause, but it could be.
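
To compare the two systems, the glibc version can be read from Python's standard library. This is a sketch under stated assumptions: `ASSUMED_MIN_GLIBC` is a made-up threshold for illustration, since nothing in this thread says which glibc version the science applications actually need, and the helper names are hypothetical.

```python
import platform
import re

# Hypothetical threshold for illustration only; the thread does not state
# the glibc version the WCG science applications require.
ASSUMED_MIN_GLIBC = (2, 5)


def parse_version(text):
    """Turn a version string such as '2.3.2' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def glibc_is_older_than(version_text, minimum=ASSUMED_MIN_GLIBC):
    """Return True if version_text sorts below the assumed minimum."""
    return parse_version(version_text) < minimum


if __name__ == "__main__":
    # platform.libc_ver() returns a (library, version) pair; both strings
    # are empty when the C library cannot be identified.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        verdict = "older than" if glibc_is_older_than(version) else "at least"
        print(f"glibc {version} is {verdict} the assumed minimum {ASSUMED_MIN_GLIBC}")
    else:
        print("Could not determine a glibc version on this system.")
```

Running the same script inside both zones shows whether the C library version differs between the working and the failing setup.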

[Apr 21, 2010 9:22:25 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: All work units failing to process

Interesting, OpenSolaris with a CentOS zone. This is the first I've read of someone getting it to run on WCG. There has been only one previous mention here of http://www.opensolaris.com/get/index.jsp, exactly one year ago, and a few members run CentOS, but never the two in combination.

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[Apr 21, 2010 10:01:24 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: All work units failing to process

OpenSolaris with a Linux zone doesn't work as well as I thought, at least for WCG. Only Help Cure Muscular Dystrophy - Phase 2 seems to process fine.

All workunits of other projects (of the "available projects") fail in some way:
Some fail within a couple of seconds, others fail after some time. I've already opted out of these projects to avoid causing unnecessary errors.

If someone is interested, I can write a quick guide on how to get WCG running in a Linux zone on OpenSolaris, but considering the moderate results, I'm not sure it would do WCG any good to make it too easy right now.

If there's any interest in debugging the failures during processing of the workunits, I'm available.

[Apr 22, 2010 9:43:30 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: All work units failing to process

If HCMD2 runs, then it might be an available-memory issue, as this science has by far the smallest footprint: 7-9 MB of RAM and 47-380 MB of VM use on a Windows system. The message log stored in the stdoutdae.txt file and the Result Status page > Status links of the errors may give more hints.

Unless there's interest from the members, I don't think there's much to gain from a write-up, but thanks for the offer.

Happy crunching... on HCMD2... every little bit helps.


[Apr 22, 2010 10:25:03 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: All work units failing to process

There shouldn't be any memory issues: I have allocated the zone a total of 4 GB of RAM with 7 WUs processing in parallel, so every WU should have around 512 MB available.
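
As a quick check of that estimate, here is the arithmetic as a small sketch. It assumes the memory is shared evenly between tasks, which the zone does not guarantee; the function name is made up for illustration.

```python
def memory_per_task_mb(total_gb, tasks):
    """Average memory per task in MB, assuming an even split (1 GB = 1024 MB)."""
    return total_gb * 1024 / tasks


if __name__ == "__main__":
    # 4 GB shared by the 7 WUs mentioned in this post.
    print(round(memory_per_task_mb(4, 7)))  # prints 585
    # 4 GB shared by 8 tasks, one per CPU of the system in the first post.
    print(round(memory_per_task_mb(4, 8)))  # prints 512
```

Either way, the per-task average is above the 47-380 MB VM range quoted for HCMD2 earlier in the thread.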
Also, when I check the error output of some workunits, it looks to me like there are differences during calculation.
For example, the WU CMD2_0396-PGTBA.clustersOccur-1YNS_A.clustersOccur_13_80445_81058_80549_80676_0-- ran for 2.9 hours and gave the following result:

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<stderr_txt>
INFO: Initializing Platform.
INFO: No state to restore. Start from the beginning.
called boinc_finish

</stderr_txt>
]]>

To me this looks like the WU finished perfectly fine and there's also no error in the BOINC client log.

For HFCC_s2_01780840_s2_0000_0-- I would assume that some of the calculations just give different results than expected:

INFO:[02:31:03] Start AutoGrid...

autogrid: autogrid4: Successful Completion.
INFO:[02:31:13] End AutoGrid...
Beginning AutoDock...
INFO: Setting num_generations: 27000
WARNING! The population appears to have converged, so this run will shortly terminate.
_maxGenSeenSoFar changed: 6750
About to enter main loop...(dockings already completed: 0)
WARNING! The population appears to have converged, so this run will shortly terminate.
Updating Best Energy for WU: 0.00
Finished Docking number 0
WARNING! The population appears to have converged, so this run will shortly terminate.
Finished Docking number 1
WARNING! The population appears to have converged, so this run will shortly terminate.
Updating Best Energy for WU: -4.78
Finished Docking number 2
WARNING! The population appears to have converged, so this run will shortly terminate.
Finished Docking number 3
WARNING! The population appears to have converged, so this run will shortly terminate.
Updating Best Energy for WU: -4.80
Finished Docking number 4
WARNING! The population appears to have converged, so this run will shortly terminate.
Finished Docking number 5
WARNING! The population appears to have converged, so this run will shortly terminate.
Finished Docking number 6
[...]
WARNING! The population appears to have converged, so this run will shortly terminate.
Finished Docking number 254

________________________________________________________________________________

autodock4: Successful Completion on "World Community Grid device"

________________________________________________________________________________

INFO:[02:37:43] Start AutoGrid...

autogrid: autogrid4: Successful Completion.
INFO:[02:38:13] End AutoGrid...
Beginning AutoDock...
INFO: Setting num_generations: 27000
About to enter main loop...(dockings already completed: 255)
WARNING! The population appears to have converged, so this run will shortly terminate.
Finished Docking number 0

________________________________________________________________________________

autodock4: Successful Completion on "World Community Grid device"

________________________________________________________________________________

called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>HFCC_s2_01780840_s2_0000_0_1</file_name>
<error_code>-131</error_code>
</file_xfer_error>

This seems to happen on all of the HFCC WUs.

[Apr 23, 2010 1:06:02 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: All work units failing to process

Thanks for following through in the discovery. The -131 has been reported a few times in the past; here is one thread: http://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,19794

Going by the official message description, it seems the output file gets too big, maybe indeed because of the many logged messages that we don't generally see for a good result:

ERR_FILE_TOO_BIG -131

One of the output files is bigger than the maximum set by the project for upload.
BOINC will not try to upload this file.

Solution: Go to the project's forums and report this behavior.
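
To get a feel for how much the repeated warning could add, the sketch below counts exact copies of that line in a log and the bytes they take up. It is only an illustration: `repeated_line_cost` and `SAMPLE` are made up here, and the thread does not establish that these messages end up in the file that failed to upload, nor what the project's size limit is.

```python
# The warning text exactly as it appears in the HFCC log posted earlier.
WARNING = ("WARNING! The population appears to have converged, "
           "so this run will shortly terminate.")


def repeated_line_cost(log_text, line=WARNING):
    """Return (occurrences, bytes) that exact copies of `line` add to a log.

    Each occurrence is counted with one trailing newline byte.
    """
    occurrences = sum(1 for row in log_text.splitlines() if row.strip() == line)
    return occurrences, occurrences * (len(line.encode("utf-8")) + 1)


# A tiny made-up excerpt in the style of the posted log, for demonstration.
SAMPLE = "\n".join([
    "Beginning AutoDock...",
    WARNING,
    "Finished Docking number 0",
    WARNING,
    "Finished Docking number 1",
    WARNING,
    "Updating Best Energy for WU: -4.78",
])


if __name__ == "__main__":
    count, size = repeated_line_cost(SAMPLE)
    print(f"{count} repeated warnings add about {size} bytes")
```

Feeding it the full stderr text of a failed result instead of `SAMPLE` gives the real count for that task.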

The first log is normal btw for the HCMD2 task.


[Apr 23, 2010 1:15:40 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: All work units failing to process

Thanks for following through in the discovery. The -131 has been reported a few times in past
[...]
Solution: Go to the project's forums and report this behavior.

Ok, anywhere else (besides here) I should report this?

The first log is normal btw for the HCMD2 task.
Still, the WU is listed with an error for my client and I can see two other results done by other people that are valid.

[Apr 23, 2010 3:11:03 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: All work units failing to process

Sorry, I misunderstood and thought these HCMD2 jobs were turning out valid. I'd have expected a normally ending task to end up in an Invalid state, but from what you say, they never pass through a Pending Validation state and only go bust when the quorum comparison is done.

As for the techs/programmers investigating this... it's a judgment call: this is a very rare configuration/OS. The WCG staff will have to say whether they're in for a look... they have their hobby moments and Eurekas at times.


[Apr 23, 2010 3:36:38 PM]
