World Community Grid Forums
Thread Status: Active Total posts in this thread: 23
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
I am still seeing the 0x100 error in Job #6. Lost 10+ hours on this unit. Two wingmen have errored out also.
This project is OK with wasting our CPU time. These errors keep happening and don't get corrected. That's the way it is. We'll just have to deal with it.
E226230_ 927_ S.298.C38H25N3O3.VUZWCEQNTYAYDC-UHFFFAOYSA-N.10_ s1_ 14_ 2--
[00:51:17] Starting job 6,CPU time has been restored to 22252.462689.
[00:51:17] Starting new Job
[00:51:17] Qink name = fldman
[00:51:26] Qink name = gesman
[00:51:27] Qink name = scfman
Application exited with RC = 0x100
[05:00:44] Finished Job #6
[05:00:44] Starting job 7,CPU time has been restored to 36896.925910.
[05:00:44] Skipping Job #7
05:00:49 (7534): called boinc_finish
</stderr_txt>
]]>
E226230_ 927_ S.298.C38H25N3O3.VUZWCEQNTYAYDC-UHFFFAOYSA-N.10_ s1_ 14_ 2-- 700 Error 11/6/14 23:17:46 11/7/14 09:57:33 10.32 265.4 / 0.0
E226230_ 927_ S.298.C38H25N3O3.VUZWCEQNTYAYDC-UHFFFAOYSA-N.10_ s1_ 14_ 1-- 700 Error 11/6/14 23:11:20 11/7/14 19:19:40 18.00 191.5 / 0.0
E226230_ 927_ S.298.C38H25N3O3.VUZWCEQNTYAYDC-UHFFFAOYSA-N.10_ s1_ 14_ 0-- 700 Error 10/31/14 17:21:12 11/5/14 21:06:51 7.40 273.6 / 0.0
----------------------------------------
[Edit 2 times, last edit by AgrFan at Nov 10, 2014 3:38:17 AM] |
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
Another 9 hours wasted. Wingman errored out also.
E226369_ 821_ S.324.C39H26S4.WYWZASOXIHBODY-UHFFFAOYSA-N.8_ s1_ 14_ 0--
[16:26:44] Starting job 6,CPU time has been restored to 22551.200000.
[16:26:44] Starting new Job
[16:26:44] Qink name = fldman
[16:26:52] Qink name = gesman
[16:26:53] Qink name = scfman
Application exited with RC = 0x100
[19:23:14] Finished Job #6
[19:23:14] Starting job 7,CPU time has been restored to 32939.460000.
[19:23:14] Skipping Job #7
19:23:20 (4530): called boinc_finish
</stderr_txt>
]]>
E226369_ 821_ S.324.C39H26S4.WYWZASOXIHBODY-UHFFFAOYSA-N.8_ s1_ 14_ 1-- 700 Error 11/8/14 15:07:15 11/9/14 04:16:42 4.82 217.7 / 0.0
E226369_ 821_ S.324.C39H26S4.WYWZASOXIHBODY-UHFFFAOYSA-N.8_ s1_ 14_ 0-- 700 Error 11/8/14 15:01:22 11/9/14 00:31:09 9.20 235.2 / 0.0
----------------------------------------
[Edit 1 times, last edit by AgrFan at Nov 10, 2014 3:37:37 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just to clarify, do these show as an Error in the Workunit Status? All current CEP2 units end during Job #6 and I have vague memories of RC = 0x100 being a "success" exit code on Linux (my Windows equivalent showing RC = 0x1). Shame if your examples are indeed Error status.
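One way to read the two return codes side by side is sketched below. It assumes, without confirmation from the project, that the Linux wrapper prints the raw 16-bit wait status as returned by `system()` or `waitpid()`, where the exit code sits in the high byte, while the Windows build prints the exit code directly. The function name `decode_wait_status` is made up for this illustration.

```python
def decode_wait_status(status: int) -> dict:
    """Split a 16-bit POSIX-style wait status into its parts.

    Assumption: the logged RC on Linux is a raw wait status.
    """
    return {
        "exit_code": (status >> 8) & 0xFF,  # same bits WEXITSTATUS reads
        "signal": status & 0x7F,            # non-zero if killed by a signal
    }


linux_rc = decode_wait_status(0x100)
windows_rc = 0x1  # assumed to be the plain exit code on Windows

# True when both platforms report the same underlying exit code.
print(linux_rc["exit_code"] == windows_rc)
```

Under that assumption, RC = 0x100 on Linux and RC = 0x1 on Windows are the same exit code, 1. The sketch says nothing about whether the application treats exit code 1 as success or failure.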
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Just to clarify, do these show as an Error in the Workunit Status? All current CEP2 units end during Job #6 and I have vague memories of RC = 0x100 being a "success" exit code on Linux (my Windows equivalent showing RC = 0x1). Shame if your examples are indeed Error status."
Yes, mine shows as an error.
E226337_ 551_ S.314.C39H22N6O2.MQFBHXKKBRAWIG-UHFFFAOYSA-N.7_ s1_ 14_ 0-- 700 Error 11/6/14 20:56:47 11/7/14 20:24:12 12.02 463.0 / 0.0 |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi all,
I will chat with our friends at IBM about this, but as I said earlier, the internal grid errors are not my area of expertise, so I won't muddy the water! I would like to say that I am sure that if there were a simple fix, it would already be fixed :) This is especially true if the error originates from within the quantum chemistry software, which is very complex.
Your Harvard CEP Team |
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
And yet another 10 hours wasted. Wingman errored out also.
E226372_ 813_ S.318.C40H20N6O2.QCVJGDUPAQTTBK-UHFFFAOYSA-N.13_ s1_ 14_ 1--
[19:48:53] Starting job 6,CPU time has been restored to 22938.537567.
[19:48:53] Starting new Job
[19:48:54] Qink name = fldman
[19:49:03] Qink name = gesman
[19:49:05] Qink name = scfman
Application exited with RC = 0x100
[23:42:06] Finished Job #6
[23:42:06] Starting job 7,CPU time has been restored to 36617.672459.
[23:42:06] Skipping Job #7
23:42:12 (9504): called boinc_finish
</stderr_txt>
]]>
E226372_ 813_ S.318.C40H20N6O2.QCVJGDUPAQTTBK-UHFFFAOYSA-N.13_ s1_ 14_ 1-- 700 Error 11/8/14 18:04:32 11/9/14 04:39:23 10.24 241.0 / 0.0
E226372_ 813_ S.318.C40H20N6O2.QCVJGDUPAQTTBK-UHFFFAOYSA-N.13_ s1_ 14_ 0-- 700 Error 11/8/14 17:56:38 11/9/14 17:40:41 8.28 262.3 / 0.0
----------------------------------------
[Edit 1 times, last edit by AgrFan at Nov 10, 2014 3:36:38 AM] |
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
8+ hours wasted. Three wingmen errored out also.
[10:14:25] Starting job 6,CPU time has been restored to 20040.770000.
[10:14:25] Starting new Job
[10:14:25] Qink name = fldman
[10:14:33] Qink name = gesman
[10:14:34] Qink name = scfman
Application exited with RC = 0x100
[13:14:45] Finished Job #6
[13:14:45] Starting job 7,CPU time has been restored to 30642.560000.
[13:14:45] Skipping Job #7
13:14:50 (8390): called boinc_finish
</stderr_txt>
]]>
E226353_ 379_ S.314.C34H20N6O4S1.STWPYDFUUJKZTQ-UHFFFAOYSA-N.7_ s1_ 14_ 4-- - In Progress 11/13/14 18:29:49 11/17/14 06:29:49 0.00 0.0 / 0.0
E226353_ 379_ S.314.C34H20N6O4S1.STWPYDFUUJKZTQ-UHFFFAOYSA-N.7_ s1_ 14_ 3-- 700 Error 11/13/14 04:32:47 11/13/14 18:23:41 8.56 186.2 / 0.0
E226353_ 379_ S.314.C34H20N6O4S1.STWPYDFUUJKZTQ-UHFFFAOYSA-N.7_ s1_ 14_ 2-- 700 Error 11/13/14 04:32:25 11/13/14 10:00:52 4.57 198.9 / 0.0
E226353_ 379_ S.314.C34H20N6O4S1.STWPYDFUUJKZTQ-UHFFFAOYSA-N.7_ s1_ 14_ 1-- 700 Error 11/7/14 18:07:36 11/8/14 04:10:41 5.06 214.5 / 0.0
E226353_ 379_ S.314.C34H20N6O4S1.STWPYDFUUJKZTQ-UHFFFAOYSA-N.7_ s1_ 14_ 0-- 700 Error 11/7/14 18:03:23 11/13/14 04:32:05 7.80 265.0 / 0.0
----------------------------------------
[Edit 1 times, last edit by AgrFan at Nov 14, 2014 2:52:50 AM] |
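To put a number on the time lost across all copies of this work unit, the status rows can be tallied with a short script. This is only a sketch: it assumes the first number after the two timestamps is CPU time in hours (it matches the "8+ hours" quoted for the 8.56 row) and that the last pair is claimed / granted credit. The names `ERROR_ROW` and `wasted_cpu_hours` are invented for the example.

```python
import re

# Matches: Error <sent date> <sent time> <return date> <return time>
#          <cpu hours> <claimed> / <granted>
# Column meanings are assumptions inferred from the posts in this thread.
ERROR_ROW = re.compile(
    r"Error \S+ \S+ \S+ \S+ (\d+\.\d+) (\d+\.\d+) / (\d+\.\d+)"
)


def wasted_cpu_hours(status_text: str) -> float:
    """Sum the assumed CPU-hours column over every row with status Error."""
    return sum(
        float(hours)
        for hours, _claimed, _granted in ERROR_ROW.findall(status_text)
    )


rows = """
s1_ 14_ 4-- - In Progress 11/13/14 18:29:49 11/17/14 06:29:49 0.00 0.0 / 0.0
s1_ 14_ 3-- 700 Error 11/13/14 04:32:47 11/13/14 18:23:41 8.56 186.2 / 0.0
s1_ 14_ 2-- 700 Error 11/13/14 04:32:25 11/13/14 10:00:52 4.57 198.9 / 0.0
s1_ 14_ 1-- 700 Error 11/7/14 18:07:36 11/8/14 04:10:41 5.06 214.5 / 0.0
s1_ 14_ 0-- 700 Error 11/7/14 18:03:23 11/13/14 04:32:05 7.80 265.0 / 0.0
"""

print(round(wasted_cpu_hours(rows), 2))
```

The result names are shortened in the sample for readability; the regular expression only looks at the columns from the status onward, so the full names work just as well.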
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Result Name: E226281_ 641_ S.320.C42H38N2O2.YDSAJTJFNHKLPF-UHFFFAOYSA-N.18_ s1_ 14_ 3--
<core_client_version>7.4.27</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[02:11:18] Number of jobs = 8
[02:11:18] Starting job 0,CPU time has been restored to 0.000000.
[08:09:49] Finished Job #0
[08:09:49] Starting job 1,CPU time has been restored to 21000.137015.
[08:30:28] Finished Job #1
[08:30:28] Starting job 2,CPU time has been restored to 22231.421708.
[08:51:38] Finished Job #2
[08:51:38] Starting job 3,CPU time has been restored to 23480.568515.
[09:15:11] Finished Job #3
[09:15:11] Starting job 4,CPU time has been restored to 24885.014318.
[09:33:20] Finished Job #4
[09:33:20] Starting job 5,CPU time has been restored to 25966.350850.
[09:50:08] Finished Job #5
[09:50:08] Starting job 6,CPU time has been restored to 26966.379660.
Application exited with RC = 0x1
[13:11:02] Finished Job #6
[13:11:02] Starting job 7,CPU time has been restored to 38841.004179.
[13:11:02] Skipping Job #7
13:11:10 (5372): called boinc_finish
</stderr_txt>
]]>
The above one is marked as PVAL (Pending Validation). On the other hand,
Result Name: E226281_ 641_ S.320.C42H38N2O2.YDSAJTJFNHKLPF-UHFFFAOYSA-N.18_ s1_ 14_ 1--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[20:10:33] Number of jobs = 8
[20:10:33] Starting job 0,CPU time has been restored to 0.000000.
[02:10:58] Finished Job #0
[02:10:58] Starting job 1,CPU time has been restored to 21279.207204.
[02:32:41] Finished Job #1
[02:32:41] Starting job 2,CPU time has been restored to 22573.407100.
[02:53:19] Finished Job #2
[02:53:19] Starting job 3,CPU time has been restored to 23791.119706.
[03:17:42] Finished Job #3
[03:17:42] Starting job 4,CPU time has been restored to 25242.256608.
[03:36:12] Finished Job #4
[03:36:12] Starting job 5,CPU time has been restored to 26339.817244.
[03:52:06] Finished Job #5
[03:52:06] Starting job 6,CPU time has been restored to 27282.812089.
Application exited with RC = 0x1
[07:43:55] Finished Job #6
[07:43:55] Starting job 7,CPU time has been restored to 40976.002665.
[07:43:55] Skipping Job #7
07:44:03 (3916): called boinc_finish
</stderr_txt>
]]>
The above one is marked as Error. What is the difference? I have seen this kind of result many times. As I understand it, RC = 0x1 (Windows) or RC = 0x100 (Linux) on Job #6 is quite usual and should be treated as Valid.
E226281_ 641_ S.320.C42H38N2O2.YDSAJTJFNHKLPF-UHFFFAOYSA-N.18_ s1_ 14_ 3-- 700 Pending Validation 14/11/14 17:09:26 14/11/15 05:10:32 10.79 349.4 / 0.0
E226281_ 641_ S.320.C42H38N2O2.YDSAJTJFNHKLPF-UHFFFAOYSA-N.18_ s1_ 14_ 4-- - In Progress 14/11/14 17:08:56 14/11/18 05:08:56 0.00 0.0 / 0.0
E226281_ 641_ S.320.C42H38N2O2.YDSAJTJFNHKLPF-UHFFFAOYSA-N.18_ s1_ 14_ 2-- 700 Error 14/11/13 12:30:10 14/11/14 17:06:20 10.49 381.5 / 0.0
E226281_ 641_ S.320.C42H38N2O2.YDSAJTJFNHKLPF-UHFFFAOYSA-N.18_ s1_ 14_ 1-- 700 Error 14/11/03 12:29:43 14/11/04 22:47:51 11.38 429.0 / 0.0
E226281_ 641_ S.320.C42H38N2O2.YDSAJTJFNHKLPF-UHFFFAOYSA-N.18_ s1_ 14_ 0-- - No Reply 14/11/03 12:27:16 14/11/13 12:27:16 0.00 0.0 / 0.0 |
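When comparing two logs like these, it can help to see how much CPU time each job consumed. The sketch below assumes that the "CPU time has been restored to" value at the start of each job is the cumulative CPU seconds so far, so the difference between consecutive values is the time spent in the earlier job. That reading fits the logs above (0.000000 at job 0, then increasing) but is not documented here. The names `CHECKPOINT` and `cpu_seconds_per_job` are invented for the example.

```python
import re

# Captures the job number and the cumulative CPU seconds logged at its start.
CHECKPOINT = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)"
)


def cpu_seconds_per_job(stderr_text: str) -> dict:
    """Map job number -> CPU seconds between its start and the next job's start.

    The last job in the log has no following checkpoint and is left out.
    """
    marks = [(int(job), float(secs)) for job, secs in CHECKPOINT.findall(stderr_text)]
    return {
        job: next_secs - secs
        for (job, secs), (_next_job, next_secs) in zip(marks, marks[1:])
    }


log = """
[09:33:20] Starting job 5,CPU time has been restored to 25966.350850.
[09:50:08] Finished Job #5
[09:50:08] Starting job 6,CPU time has been restored to 26966.379660.
Application exited with RC = 0x1
[13:11:02] Finished Job #6
[13:11:02] Starting job 7,CPU time has been restored to 38841.004179.
"""

for job, secs in cpu_seconds_per_job(log).items():
    print(f"job {job}: {secs / 3600:.2f} CPU hours")
```

On the excerpt above, Job #6 accounts for roughly 11,875 CPU seconds, which is the portion that is lost when the result is marked Error.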
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi All,
Our monthly phone call with IBM is coming up. I will bring this up with them then!
Your Harvard CEP Team |
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
14+ hours ... wingman errored out also.
[14:49:59] Starting job 6,CPU time has been restored to 30570.640000.
[14:49:59] Starting new Job
[14:49:59] Qink name = fldman
[14:50:07] Qink name = gesman
[14:50:08] Qink name = scfman
Application exited with RC = 0x100
[20:50:08] Finished Job #6
[20:50:08] Starting job 7,CPU time has been restored to 51850.100000.
[20:50:08] Skipping Job #7
20:50:14 (9299): called boinc_finish
</stderr_txt>
]]>
E226458_ 129_ S.326.C36H18N4S4.WJTXHZYZGYUENY-UHFFFAOYSA-N.2_ s1_ 14_ 2-- - In Progress 11/16/14 03:32:44 11/19/14 15:32:44 0.00 0.0 / 0.0
E226458_ 129_ S.326.C36H18N4S4.WJTXHZYZGYUENY-UHFFFAOYSA-N.2_ s1_ 14_ 3-- - In Progress 11/16/14 03:32:30 11/19/14 15:32:30 0.00 0.0 / 0.0
E226458_ 129_ S.326.C36H18N4S4.WJTXHZYZGYUENY-UHFFFAOYSA-N.2_ s1_ 14_ 1-- 700 Error 11/14/14 11:15:17 11/16/14 03:25:02 9.26 268.1 / 0.0
E226458_ 129_ S.326.C36H18N4S4.WJTXHZYZGYUENY-UHFFFAOYSA-N.2_ s1_ 14_ 0-- 700 Error 11/14/14 11:10:49 11/15/14 01:59:21 14.47 301.8 / 0.0
----------------------------------------
[Edit 1 times, last edit by AgrFan at Nov 16, 2014 4:03:38 AM] |
||