World Community Grid - View Thread

World Community Grid Forums

Category: Completed Research

Forum: FightAIDS@Home Phase 2

Thread: Errors in FAHB


Thread Status: Active
Total posts in this thread: 14


This topic has been viewed 3766 times and has 13 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Errors in FAHB

Ran out of Zika so switched to FAHB and have been getting a number of errors as follows:
<core_client_version>7.16.3</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
INFO: result number = 0
%IMPACT-I: Requested file to open for appending md.out Does not exist.
Opening it as a new file.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic
Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.23370
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
%IMPACT-E: Non-valid values generated from rrespa. This is probably
because of bad initial geometry. Please run minimization process for
some steps before running MD

</stderr_txt>
]]>

There have been about 10 in the last hour. Since the website at Temple says everything is 100%, what are these WUs?

[Oct 28, 2019 2:02:48 PM]

Sid2
Senior Cruncher
USA
Joined: Jun 12, 2007
Post Count: 259
Status: Offline


Re: Errors in FAHB

I also had a FAH error out:

FAH2_002410_zinc14744839_000001_000042_177_0--

----------------------------------------

[Oct 28, 2019 3:11:52 PM]

Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline


Re: Errors in FAHB

"There have been about 10 in the last hour. Since the website at Temple says everything is 100%, what are these WUs?"

It seems the chain has started, so the first parts have all been sent out. Still, I would have assumed 100% means everything is done, yet I'm getting a steady flow of them. I've had 4 error out: 3 on the same machine, and the other on a machine in another room on the same internet connection. All errored out immediately. The machines have otherwise had clean results for several days.

The error message on my four was different from yours:

WU download error: couldn't get input files:
<file_xfer_error>
<file_name>1f10ac96799ef1342c547eeca0a61c17.dms</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
<error_message>MD5 check failed</error_message>
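
The -119 error above says the downloaded file's MD5 did not match the value the client expected. As a rough sketch of that kind of check (not the BOINC client's own code), assuming you have a local copy of the input file and know the digest it should have, something like this could be used to re-verify a download by hand. The function names and the idea of passing in the expected digest are illustrative assumptions:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """True if the file hashes to the expected digest (caller-supplied, hypothetical)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If the same file hashes correctly when fetched again later, that would point at the transfer rather than at the file stored on the server.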

----------------------------------------

[Oct 28, 2019 3:39:22 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Errors in FAHB

These look like different work units, as they run in about 6 hours whereas the "older" WUs ran in about 12 to 15. I'm wondering if these are "betas" not labeled as betas. Hybrid beta?

[Oct 28, 2019 9:39:53 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline


Re: Errors in FAHB

I've had one of the "couldn't get input files" WUs too:
FAH2_002384_zinc12100055_000004_000037_182
My copy was repair unit 1; the original wingman had an identical error log.
Repair unit 2 is still In Progress.

[Oct 29, 2019 12:19:30 PM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline


Re: Errors in FAHB

I'm looking into it. It kind of sounds like something is wrong with the files on our end or a transfer issue since multiple people on the same workunit are encountering the problem. It could be something simple, but if you give me some time, I'll see if I can track it down.

Thanks,
-Uplinger

[Oct 29, 2019 2:48:48 PM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline


Re: Errors in FAHB

So far, I have looked at the two workunits that were posted as issues in this thread. Both had trouble downloading the larger of the input files, which is 2.7 MB. I attempted to download the files from the definition in the workunit XML and they downloaded fine. I'm still looking, but things are pointing towards a network issue, because for the second workunit listed, the other member also downloaded the file correctly and completed it.

FAH2_002384_zinc12100055_000004_000037_182
FAH2_002410_zinc14744839_000001_000042_177

I checked the times they were sent out, and they were sent fairly close together:
2019-10-28 14:39:54
2019-10-28 14:46:07

I'm not seeing it as a create unit issue either, because the results were created 2 hours before the first transfer issues. Then the next workunit created worked....

2019-10-28 12:44:04
2019-10-28 14:44:05
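
For reference, the gaps between the timestamp pairs quoted above can be worked out with a few lines of Python. This is only a convenience sketch; the `gap` helper is a made-up name, and it assumes each pair of timestamps is in the same time zone:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def gap(earlier, later):
    """Return the timedelta between two timestamps in the format used above."""
    return datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)


# The two "sent out" times quoted in the post.
sent_gap = gap("2019-10-28 14:39:54", "2019-10-28 14:46:07")
print(sent_gap)  # 0:06:13

# The second pair of timestamps quoted in the post.
second_gap = gap("2019-10-28 12:44:04", "2019-10-28 14:44:05")
print(second_gap)  # 2:00:01
```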

Also, what is strange is that, of the two failures listed, one is from the 28th and the other is from the 29th, almost 19 hours later.

Still looking though...
Thanks,
-Uplinger

[Oct 29, 2019 3:08:38 PM]

Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline


Re: Errors in FAHB

I love your work ethic, Keith! I went through my logs and three are still listed with "couldn't get input files":

1st machine
FAH2_002372_zinc11963773_000004_000037_168_0--
10/27/19 18:10:22
Error for me, valid for next chap.

2nd machine - same home
FAH2_002382_zinc12097868_000001_000047_183_0--
10/29/19 10:12:14
Error for me, valid for next chap.

2nd machine - same home
FAH2_002372_zinc11963773_000001_000006_153_0--
10/29/19 14:33:21
Error for me, error for next chap, valid for 3rd person
The error for me and next person was the same error.

Hope it helps!

Edit: Both of these machines are less than 6 months old, good chips and specs, and internet connection is solid.

----------------------------------------
[Edit 1 times, last edit by Seoulpowergrid at Oct 30, 2019 9:08:55 AM]

[Oct 30, 2019 9:04:56 AM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline


Re: Errors in FAHB

Seoulpowergrid,

Thanks! I'm still investigating; having these 3 should help in figuring out what is happening. I'm going through the transfer logs as well as the scheduler logs.

The thing that is strange is the timing of the failures. Usually the first copy of it fails, almost as if the file can't be found...but it exists and is found later.

Thanks,
-Uplinger

[Oct 30, 2019 12:52:21 PM]

Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline


Re: Errors in FAHB

Over the last three days I found about 15 more with "couldn't get input files". Some are from the two previously mentioned machines in my home, and the others are from machines elsewhere in the same city or in a different city. All are Windows boxes; none of my Linux machines have had this issue. The send time and the error return time are consistently five minutes apart.

The most recent is this one:
FAH2_002387_zinc12164843_000004_000068_172_0--
Sent time for me:
11/3/19 14:54:26

Sent time for 1st wingman, who also errored out:
11/3/19 15:01:01

The next wingman got the WU at the following time, and it seems it didn't error out:
11/3/19 15:06:43

Others are as follows.
FAH2_002447_zinc18137783_000001_000065_168_0--
11/3/19 10:49:34

FAH2_002447_zinc18137783_000002_000038_176_1--
11/3/19 10:47:29

FAH2_002384_zinc12100055_000001_000059_173_0--
11/3/19 07:53:13

FAH2_002410_zinc14744839_000002_000003_177_0--
11/3/19 03:22:06

FAH2_002257_zinc01099260_000004_000023_190_0--
11/2/19 12:55:04

FAH2_002404_zinc14537162_000003_000079_171_0--
11/2/19 11:15:26

FAH2_002404_zinc14537162_000001_000008_176_0--
11/2/19 10:32:10

FAH2_002372_zinc11963773_000002_000093_185_0--
11/2/19 08:22:14

FAH2_002257_zinc01099260_000003_000019_187_0--
11/2/19 01:02:37

FAH2_002372_zinc11963773_000003_000099_185_0--
11/1/19 19:41:10

FAH2_002384_zinc12100055_000004_000078_180_0--
11/1/19 05:40:40
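
To see whether the failures cluster on particular inputs, the names listed above can be grouped by their second and third underscore-separated fields. This is only a sketch: what those fields mean is an assumption (the third looks like a ZINC compound ID), the helper names are made up, and the trailing "--" and the spaces are treated as display residue from the results page:

```python
from collections import Counter


def normalize(name):
    """Drop display spaces and the trailing '--' from a pasted result name."""
    return name.replace(" ", "").rstrip("-")


def second_and_third_field(name):
    """Return fields 2 and 3 of the name; their meaning is assumed, not documented here."""
    parts = normalize(name).split("_")
    return parts[1], parts[2]


# The "couldn't get input files" results listed in this post, as pasted.
pasted = [
    "FAH2_ 002387_ zinc12164843_ 000004_ 000068_ 172_ 0--",
    "FAH2_ 002447_ zinc18137783_ 000001_ 000065_ 168_ 0--",
    "FAH2_ 002447_ zinc18137783_ 000002_ 000038_ 176_ 1--",
    "FAH2_ 002384_ zinc12100055_ 000001_ 000059_ 173_ 0--",
    "FAH2_ 002410_ zinc14744839_ 000002_ 000003_ 177_ 0--",
    "FAH2_ 002257_ zinc01099260_ 000004_ 000023_ 190_ 0--",
    "FAH2_ 002404_ zinc14537162_ 000003_ 000079_ 171_ 0--",
    "FAH2_ 002404_ zinc14537162_ 000001_ 000008_ 176_ 0--",
    "FAH2_ 002372_ zinc11963773_ 000002_ 000093_ 185_ 0--",
    "FAH2_ 002257_ zinc01099260_ 000003_ 000019_ 187_ 0--",
    "FAH2_ 002372_ zinc11963773_ 000003_ 000099_ 185_ 0--",
    "FAH2_ 002384_ zinc12100055_ 000004_ 000078_ 180_ 0--",
]

counts = Counter(second_and_third_field(name) for name in pasted)
for key, count in counts.most_common():
    print(key, count)
```

For the twelve names above this gives seven distinct pairs, five of which appear twice.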

---------
And this one had a different error for me and valid result for wingman.
FAH2_002691_zinc18249840_000002_000098_186_0--
11/2/19 23:57:59

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
Reached the end of the file.
(0x26) - exit code 38 (0x26)</message>
<stderr_txt>
INFO: result number = 0
%IMPACT-I: Requested file to open for appending md.out Does not exist.
Opening it as a new file.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic
Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.64390
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[09:27:43] INFO: Checkpointed. Progress 500 of 10000 steps complete CPU time 870.625000
[09:42:26] INFO: Checkpointed. Progress 1000 of 10000 steps complete CPU time 1623.546875
[09:57:12] INFO: Checkpointed. Progress 1500 of 10000 steps complete CPU time 2362.375000
forrtl: No process is on the other end of the pipe.
forrtl: severe (38): error during write, unit 6, file CONOUT$
Image              PC        Routine   Line     Source
wcgrid_fahb_bedam  00BF95B0  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  00BC36AE  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  00BC1094  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  009F907C  _cwrite_  37       utilities.for
wcgrid_fahb_bedam  008E20C7  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  008E20C7  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  00BA0EAD  Unknown   Unknown  Unknown

</stderr_txt>
]]>

----------------------------------------

[Nov 3, 2019 3:30:49 PM]
