World Community Grid Forums
Category: Completed Research Forum: FightAIDS@Home Phase 2 Thread: All jobs are failing with Invalid |
Thread Status: Active Total posts in this thread: 47
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Hi SekeRob. You asked for it.
Project Name: FightAIDS@Home - Phase 2
Created: 04/23/2016 20:15:44
Name: FAH2_000077_avx38672_000072_0022_021
Minimum Quorum: 1
Replication: 2
Result Name | OS type | OS version | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time / Elapsed Time (hours) | Claimed / Granted BOINC Credit
FAH2_000077_avx38672_000072_0022_021_2-- | Linux | 3.10.0-327.4.5.el7.x86_64 | - | In Progress | 4/25/16 05:03:35 | 4/29/16 05:03:35 | 8.32 | 116.6 / 0.0
FAH2_000077_avx38672_000072_0022_021_1-- | Linux | 3.19.0-20-generic | - | In Progress | 4/25/16 05:03:31 | 4/29/16 05:03:31 | 3.63 | 77.7 / 0.0
FAH2_000077_avx38672_000072_0022_021_0-- | Linux | 3.16.0-70-generic | 715 | Invalid | 4/24/16 02:50:20 | 4/25/16 05:02:00 | 13.06 | 490.2 / 0.0
=== ME.
@TECHS, why are two copies sent out after an invalid? What happens when both copies start trickling? What happens if both finish proper? Very contra to what you indicated FAH2 would be. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Seeing the same failure issues for my WUs after my computer is offline (which it regularly is).
No more FA@H for me :( |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
Sekerob,
Yes it is something I've been looking into. Two copies at the same time should not be sent out. I thought I had a fix for it within the transitioner, but that does not appear to be working as expected. I need to enable more logging to see if I can catch where it is getting bumped. Thanks, -Uplinger |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Not just for this project but in general: if a machine has consistently been returning invalid results, it would be nice to stop sending WUs to it for those particular projects. Sometimes an issue on a single machine just causes more work to be done overall. Take OET: if a machine returns a result but is not trusted to return valid results, then when the wingman returns theirs, the check will show that the two are not identical and the WU gets sent to a third computer. So you have gone from the WU needing to be sent to only one computer to three computers needing to work on it.
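To put numbers on that, here is a minimal sketch of the copy count described in this post. It is only a toy model of the scenario as worded here, not the actual WCG or BOINC scheduler logic; the function name and both parameters are hypothetical.

```python
def copies_needed(host_trusted: bool, results_match: bool = True) -> int:
    """Count how many computers end up crunching one WU in the scenario above.

    Toy model (assumption, not real scheduler code):
    - a trusted host's result is accepted on its own -> 1 copy
    - an untrusted host's result is checked against a wingman -> 2 copies
    - if the two results are not identical, a third copy goes out -> 3 copies
    """
    if host_trusted:
        return 1
    return 2 if results_match else 3


if __name__ == "__main__":
    for trusted, match in [(True, True), (False, True), (False, False)]:
        print(f"trusted={trusted}, match={match}: "
              f"{copies_needed(trusted, match)} copies")
```

The last case is the one this post is about: one faulty machine turns a single-copy WU into three.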
|
||
|
supdood
Senior Cruncher USA Joined: Aug 6, 2015 Post Count: 333 Status: Offline Project Badges: |
uplinger wrote:
Sekerob, Yes it is something I've been looking into. Two copies at the same time should not be sent out. I thought I had a fix for it within the transitioner, but that does not appear to be working as expected. I need to enable more logging to see if I can catch where it is getting bumped. Thanks, -Uplinger

Just had one go invalid (unknown reason) and then two additional WUs were created and sent:

Result Log
Result Name: FAH2_000081_avx38741_000032_0054_021_0--
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
[08:25:36] INFO:Turning trickle messaging on.
[08:25:36] INFO:Turning intermediate uploads on.
%IMPACT-I: Requested file to open for appending md.out Does not exist. Opening it as a new file.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.55000
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[08:39:33] INFO: Checkpointed. Progress 1000 of 100000 steps complete CPU time 733.859904
[08:54:02] INFO: Checkpointed. Progress 2000 of 100000 steps complete CPU time 1511.244087
[09:09:24] INFO: Checkpointed. Progress 3000 of 100000 steps complete CPU time 2297.067925
[09:23:59] INFO: Checkpointed. Progress 4000 of 100000 steps complete CPU time 3096.136247
[09:38:59] INFO: Checkpointed. Progress 5000 of 100000 steps complete CPU time 3886.671314
[09:53:01] INFO: Checkpointed. Progress 6000 of 100000 steps complete CPU time 4671.824347
[10:07:43] INFO: Checkpointed. Progress 7000 of 100000 steps complete CPU time 5471.095471
[10:28:14] INFO: Checkpointed. Progress 8000 of 100000 steps complete CPU time 6264.766158
[10:41:56] INFO: Checkpointed. Progress 9000 of 100000 steps complete CPU time 7061.338465
[10:56:53] INFO: Sending trickle message to server.
[10:56:53] INFO: Starting intermediate upload, index = 1
[10:56:53] INFO: Checkpointed. Progress 10000 of 100000 steps complete CPU time 7861.826396
.......
[13:09:31] INFO: Checkpointed. Progress 80000 of 100000 steps complete CPU time 65666.575422
[13:24:29] INFO: Checkpointed. Progress 81000 of 100000 steps complete CPU time 66456.392885
[13:38:27] INFO: Checkpointed. Progress 82000 of 100000 steps complete CPU time 67241.062315
[13:52:58] INFO: Checkpointed. Progress 83000 of 100000 steps complete CPU time 68030.973378
[14:07:03] INFO: Checkpointed. Progress 84000 of 100000 steps complete CPU time 68813.770796
[14:21:08] INFO: Checkpointed. Progress 85000 of 100000 steps complete CPU time 69601.295044
[14:35:09] INFO: Checkpointed. Progress 86000 of 100000 steps complete CPU time 70375.231605
[15:19:28] INFO: Checkpointed. Progress 87000 of 100000 steps complete CPU time 71158.309825
[15:33:41] INFO: Checkpointed. Progress 88000 of 100000 steps complete CPU time 71950.342502
[15:47:10] INFO: Checkpointed. Progress 89000 of 100000 steps complete CPU time 72736.010338
[16:03:28] INFO:Turning trickle messaging on.
[16:03:28] INFO:Turning intermediate uploads on.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.55000
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[16:09:13] INFO: Sending trickle message to server.
[16:09:13] INFO: Starting intermediate upload, index = 9
[16:09:13] INFO: Checkpoint skipped. Progress 90000/100000 CPU time 73079.368196
[08:15:54] INFO:Turning trickle messaging on.
[08:15:54] INFO:Turning intermediate uploads on.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.55000
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[08:32:12] INFO: Sending trickle message to server.
[08:32:12] INFO: Starting intermediate upload, index = 9
[08:32:12] INFO: Checkpointed. Progress 90000 of 100000 steps complete CPU time 73544.297976
[08:46:00] INFO: Checkpointed. Progress 91000 of 100000 steps complete CPU time 74324.302976
[08:59:41] INFO: Checkpointed. Progress 92000 of 100000 steps complete CPU time 75115.633649
[09:15:11] INFO: Checkpointed. Progress 93000 of 100000 steps complete CPU time 75909.865940
[09:29:08] INFO: Checkpointed. Progress 94000 of 100000 steps complete CPU time 76693.692965
[09:43:07] INFO: Checkpointed. Progress 95000 of 100000 steps complete CPU time 77478.596396
[10:01:13] INFO: Checkpointed. Progress 96000 of 100000 steps complete CPU time 78267.571453
[10:14:48] INFO: Checkpointed. Progress 97000 of 100000 steps complete CPU time 79053.098889
[10:29:13] INFO: Checkpointed. Progress 98000 of 100000 steps complete CPU time 79841.527943
[10:43:02] INFO: Checkpointed. Progress 99000 of 100000 steps complete CPU time 80622.780951
[10:57:30] INFO: Checkpointed. Progress 100000 of 100000 steps complete CPU time 81416.061636
%IMPACT-I: Species 1 written to SQL file md-out1.dms
%IMPACT-I: Species 2 written to SQL file md-out2.dms
10:57:32 (2580): called boinc_finish(0)
</stderr_txt>
]]>

FAH2_000081_avx38741_000032_0054_021_2-- | Microsoft Windows 10 Professional x64 Edition, (10.00.10586.00) | - | In Progress | 5/5/16 14:57:59 | 5/9/16 14:57:59 | 0.00 | 0.0 / 0.0
FAH2_000081_avx38741_000032_0054_021_1-- | Microsoft Windows 7 x64 Edition, Service Pack 1, (06.01.7601.00) | - | In Progress | 5/5/16 14:57:58 | 5/9/16 14:57:58 | 0.00 | 0.0 / 0.0
FAH2_000081_avx38741_000032_0054_021_0-- | Microsoft Windows 7 Enterprise x64 Edition, Service Pack 1, (06.01.7601.00) | 714 | Invalid | 5/2/16 12:25:28 | 5/5/16 14:57:45 | 22.62 | 451.0 / 0.0 |
||
|
Ian Cantwell
Cruncher Joined: Jul 19, 2013 Post Count: 15 Status: Offline Project Badges: |
The following is listed as invalid: FAH2_000091_avx38782_000051_0005_020_0--
I've not had this problem before. The result log has
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
[08:53:44] INFO:Turning trickle messaging on.
[08:53:44] INFO:Turning intermediate uploads on.
%IMPACT-I: Requested file to open for appending md.out Does not exist. Opening it as a new file.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.00600
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
Then it checkpoints successfully to the end and finishes with
%IMPACT-I: Species 1 written to SQL file md-out1.dms
%IMPACT-I: Species 2 written to SQL file md-out2.dms
08:35:08 (7752): called boinc_finish(0) |
||
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
The operation with FAH2 is straightforward... do not interrupt communication while computing. If the last trickle is sent and any previous trickle was not already reported, the validation fails. If you set the trickle log flag in cc_config.xml,
<trickle_debug>1</trickle_debug>
then the actual trickle initiation is printed in the event log. |
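For anyone who would rather not hand-edit the file, here is a minimal sketch that switches that flag on in an existing cc_config.xml. It assumes the usual layout, where log flags sit inside `<log_flags>` under the `<cc_config>` root; the function name is made up for illustration, and you still have to point it at the cc_config.xml in your own BOINC data directory.

```python
import xml.etree.ElementTree as ET


def enable_trickle_debug(config_xml: str) -> str:
    """Return the cc_config.xml text with <trickle_debug>1</trickle_debug> set.

    Assumes a <cc_config> root with an optional <log_flags> child.
    Missing elements are created; existing settings are left alone.
    """
    root = ET.fromstring(config_xml)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")

    flag = log_flags.find("trickle_debug")
    if flag is None:
        flag = ET.SubElement(log_flags, "trickle_debug")
    flag.text = "1"

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(enable_trickle_debug("<cc_config><log_flags/></cc_config>"))
```

Write the returned text back to cc_config.xml, then have the client re-read its config files (or restart it) so the flag takes effect.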
||
|
Ian Cantwell
Cruncher Joined: Jul 19, 2013 Post Count: 15 Status: Offline Project Badges: |
I rechecked the result log and found "INFO: Starting intermediate upload, index = N" for N = 1 to 9, so as far as I can see all trickles were reported. Comparing it with a valid unit, I see no difference.
My computers do sometimes spontaneously go to sleep, but if this was an issue more of my units would fail. |
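That kind of check can be scripted. Below is a rough sketch that scans a result log for the "Starting intermediate upload, index = N" lines, reports indices that are missing or appear more than once, and counts restarts by how often "Turning trickle messaging on" shows up. The default of 9 expected uploads comes from the logs posted in this thread (index 1 at 10000 steps, index 9 at 90000); the function name and the returned keys are made up. Note the log only shows that the application started an upload, not that the server received it.

```python
import re
from collections import Counter

UPLOAD_RE = re.compile(r"Starting intermediate upload, index = (\d+)")
START_RE = re.compile(r"INFO:\s*Turning trickle messaging on")


def summarize_stderr(text: str, expected_uploads: int = 9) -> dict:
    """Summarize intermediate-upload indices and restarts in a stderr log.

    expected_uploads=9 is an assumption based on the logs in this thread.
    """
    counts = Counter(int(n) for n in UPLOAD_RE.findall(text))
    missing = [i for i in range(1, expected_uploads + 1) if i not in counts]
    repeated = sorted(i for i, n in counts.items() if n > 1)
    # The first "Turning trickle messaging on" is the initial start,
    # every later one means the task was restarted.
    restarts = max(len(START_RE.findall(text)) - 1, 0)
    return {"missing": missing, "repeated": repeated, "restarts": restarts}


if __name__ == "__main__":
    import sys
    # Usage: python check_log.py < result_log.txt
    print(summarize_stderr(sys.stdin.read()))
```

An elided log (one with "......." in the middle) will of course report the elided indices as missing, so run it on the full result log.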
||
|
Greger
Cruncher Joined: Aug 1, 2013 Post Count: 29 Status: Offline Project Badges: |
There was no info that a network connection was required; I only found this out after a huge number of tasks went invalid.
260 tasks lost. You learn the hard way with each project. |
||
|
Caranthir
Cruncher Joined: May 7, 2016 Post Count: 8 Status: Offline Project Badges: |
I also had lots of invalid results with an old BOINC client. My mistake was downloading the BOINC client from the WCG website ( https://secure.worldcommunitygrid.org/reg/ms/viewDownloadAgain.do ), which is a much older version (7.2.47) of BOINC. I got lots of invalid results with this client and wasted a lot of time. Then, as I was looking for the reason for the invalid results, I saw that there is a much newer version of BOINC (7.6.22) on the official website ( https://boinc.berkeley.edu/download.php ). I started using the newer version and now all my results are valid.
|
||
|
|