World Community Grid Forums
Category: Completed Research Forum: FightAIDS@Home Phase 2 Thread: Errors in FAHB |
Thread Status: Active Total posts in this thread: 14
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Ran out of Zika so switched to FAHB and have been getting a number of errors as follows:
<core_client_version>7.16.3</core_client_version>
<![CDATA[
<message> process exited with code 1 (0x1, -255)</message>
<stderr_txt>
INFO: result number = 0
%IMPACT-I: Requested file to open for appending md.out Does not exist. Opening it as a new file.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.23370
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
%IMPACT-E: Non-valid values generated from rrespa. This is probably because of bad initial geometry. Please run minimization process for some steps before running MD
</stderr_txt>
]]>
There has been about 10 in the last hour. Since the website at Temple says everything is 100%, what are these WUs? |
||
|
Sid2
Senior Cruncher USA Joined: Jun 12, 2007 Post Count: 259 Status: Offline Project Badges: |
I also had a FAH error out:
FAH2_002410_zinc14744839_000001_000042_177_0 |
||
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 815 Status: Offline Project Badges: |
Quote: "There has been about 10 in the last hour. Since the website at Temple says everything is 100%, what are these WUs?"
Seems to be that the chain has started, so the first parts have all been sent out. But yeah, I would assume 100% means 100% are done, yet I've got a steady flow of them.
I've had 4 error out: 3 were on the same machine and the other was on a machine in the other room with the same internet connection. All errored out immediately. They had had clean results for several days. The error message on my four was different from yours:
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>1f10ac96799ef1342c547eeca0a61c17.dms</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
<error_message>MD5 check failed</error_message> |
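For anyone who wants to check a downloaded input file by hand, here is a minimal Python sketch of an MD5 comparison. It is not the BOINC client's own verification code; the function names are made up for illustration, and the expected checksum is assumed to come from somewhere you trust, such as the workunit definition.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected hex string (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If the digest of the file on disk differs from the expected value, the download was probably truncated or altered in transit, which would be consistent with the -119 error above.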
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
These look like different work units, as they run in about 6 hours, whereas the "older" WUs ran in about 12 to 15. I'm wondering if these are "betas" not labeled as betas. Hybrid beta?
|
||
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline Project Badges: |
I've had one of the "couldn't get input files" WUs too:
FAH2_002384_zinc12100055_000004_000037_182
My copy was repair unit 1; the original wingman had an identical error log. Repair unit 2 is still In Progress. |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
I'm looking into it. It kind of sounds like something is wrong with the files on our end or a transfer issue since multiple people on the same workunit are encountering the problem. It could be something simple, but if you give me some time, I'll see if I can track it down.
Thanks, -Uplinger |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
So far, I have looked at the two that were posted as issues in this thread. Both had problems downloading the larger input file, which is 2.7MB. I attempted to download the files from the definition in the workunit xml and they downloaded fine. I'm still looking, but things are pointing towards a network issue, because for the second workunit listed, the other member downloaded the file correctly and completed it.
FAH2_002384_zinc12100055_000004_000037_182
FAH2_002410_zinc14744839_000001_000042_177
I checked the times they were sent out, and they were sent pretty close to each other:
2019-10-28 14:39:54
2019-10-28 14:46:07
I'm not seeing it as a create unit issue either, because the results were created 2 hours before the first transfer issues. Then the next workunit created worked....
2019-10-28 12:44:04
2019-10-28 14:44:05
Also, what is strange is that of the two failures listed, 1 is from the 28th and the other is from the 29th....almost 19 hours after. Still looking though...
Thanks, -Uplinger |
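The gaps between the timestamps quoted in this post can be worked out with a few lines of Python. This is only an illustrative sketch; the helper name `seconds_between` is made up, and the timestamps are assumed to share one time zone.

```python
from datetime import datetime

# Timestamp layout used in the post, e.g. "2019-10-28 14:39:54".
FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(start, end):
    """Return the whole number of seconds from start to end."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# The two send times listed for the failing workunits.
sent_gap = seconds_between("2019-10-28 14:39:54", "2019-10-28 14:46:07")

# The second pair of timestamps listed in the post.
second_gap = seconds_between("2019-10-28 12:44:04", "2019-10-28 14:44:05")

print(sent_gap, second_gap)
```

The first pair is a little over six minutes apart and the second pair is two hours and one second apart.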
||
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 815 Status: Offline Project Badges: |
I love your work ethic Keith! I went to my logs and three are still listed with "couldn't get input files"
1st machine
FAH2_002372_zinc11963773_000004_000037_168_0
10/27/19 18:10:22
Error for me, valid for next chap.

2nd machine - same home
FAH2_002382_zinc12097868_000001_000047_183_0
10/29/19 10:12:14
Error for me, valid for next chap.

2nd machine - same home
FAH2_002372_zinc11963773_000001_000006_153_0
10/29/19 14:33:21
Error for me, error for next chap, valid for 3rd person. The error for me and next person was the same error.

Hope it helps!
Edit: Both of these machines are less than 6 months old, good chips and specs, and internet connection is solid.
[Edit 1 times, last edit by Seoulpowergrid at Oct 30, 2019 9:08:55 AM] |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
Seoulpowergrid,
Thanks! I'm still investigating, having these 3 should help in figuring out what is happening. I'm into the logs for transfers as well as the scheduler. The thing that is strange is the timing of the failures. Usually the first copy of it fails, almost as if the file can't be found...but it exists and is found later. Thanks, -Uplinger |
||
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 815 Status: Offline Project Badges: |
Over the last three days I found like 15 more with "couldn't get input files". Some are from the previously mentioned two machines that are in my home and the others are from other machines in the same city or even a different city. All are Windows boxes - none of my Linux machines have had this issue. The send time and return time with error are consistently five minutes apart.
Most recent is this:
FAH2_002387_zinc12164843_000004_000068_172_0
Sent time for me: 11/3/19 14:54:26
Sent time for 1st wingman, who also errored out: 11/3/19 15:01:01
Next wingman got the WU at the following time and it seems it didn't error out: 11/3/19 15:06:43

Others are as follows.
FAH2_002447_zinc18137783_000001_000065_168_0  11/3/19 10:49:34
FAH2_002447_zinc18137783_000002_000038_176_1  11/3/19 10:47:29
FAH2_002384_zinc12100055_000001_000059_173_0  11/3/19 07:53:13
FAH2_002410_zinc14744839_000002_000003_177_0  11/3/19 03:22:06
FAH2_002257_zinc01099260_000004_000023_190_0  11/2/19 12:55:04
FAH2_002404_zinc14537162_000003_000079_171_0  11/2/19 11:15:26
FAH2_002404_zinc14537162_000001_000008_176_0  11/2/19 10:32:10
FAH2_002372_zinc11963773_000002_000093_185_0  11/2/19 08:22:14
FAH2_002257_zinc01099260_000003_000019_187_0  11/2/19 01:02:37
FAH2_002372_zinc11963773_000003_000099_185_0  11/1/19 19:41:10
FAH2_002384_zinc12100055_000004_000078_180_0  11/1/19 05:40:40
---------
And this one had a different error for me and a valid result for wingman.
FAH2_002691_zinc18249840_000002_000098_186_0  11/2/19 23:57:59
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message> Reached the end of the file. (0x26) - exit code 38 (0x26)</message>
<stderr_txt>
INFO: result number = 0
%IMPACT-I: Requested file to open for appending md.out Does not exist. Opening it as a new file.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.64390
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[09:27:43] INFO: Checkpointed. Progress 500 of 10000 steps complete CPU time 870.625000
[09:42:26] INFO: Checkpointed. Progress 1000 of 10000 steps complete CPU time 1623.546875
[09:57:12] INFO: Checkpointed. Progress 1500 of 10000 steps complete CPU time 2362.375000
forrtl: No process is on the other end of the pipe.
forrtl: severe (38): error during write, unit 6, file CONOUT$
Image              PC        Routine   Line     Source
wcgrid_fahb_bedam  00BF95B0  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  00BC36AE  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  00BC1094  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  009F907C  _cwrite_  37       utilities.for
wcgrid_fahb_bedam  008E20C7  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  008E20C7  Unknown   Unknown  Unknown
wcgrid_fahb_bedam  00BA0EAD  Unknown   Unknown  Unknown
</stderr_txt>
]]> |
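With this many failing results, it may help to see which name prefixes keep recurring. The Python sketch below cleans up names as they are copied from the results page (embedded spaces, trailing "--") and counts them by their second and third underscore-separated fields. Treating those two fields as a batch number and a compound ID is an assumption, not something confirmed in this thread, and the helper names are made up for illustration.

```python
from collections import Counter


def normalize(name):
    """Drop embedded whitespace and any trailing dashes from a copied result name."""
    return "".join(name.split()).rstrip("-")


def group_key(name):
    """Return fields 2 and 3 of the name (assumed to be batch and compound ID)."""
    parts = normalize(name).split("_")
    return parts[1], parts[2]


# A few of the names listed above, as copied from the results page.
names = [
    "FAH2_ 002387_ zinc12164843_ 000004_ 000068_ 172_ 0--",
    "FAH2_ 002447_ zinc18137783_ 000001_ 000065_ 168_ 0--",
    "FAH2_ 002447_ zinc18137783_ 000002_ 000038_ 176_ 1--",
    "FAH2_ 002404_ zinc14537162_ 000003_ 000079_ 171_ 0--",
    "FAH2_ 002404_ zinc14537162_ 000001_ 000008_ 176_ 0--",
]

counts = Counter(group_key(name) for name in names)
for key, count in counts.most_common():
    print(key, count)
```

Groups that show up repeatedly would be the first ones to compare against the server-side transfer logs.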
||
|
|