| World Community Grid Forums
Thread Status: Active Total posts in this thread: 11
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
OPN1_0031441_00522_3-- Linux Fedora 717 Error 1/17/21 02:09:18 1/17/21 05:51:31 3.62 72.5 / 0.0
_0: (unknown error) - exit code -1 (0xffffffff)
_1, _2, _3: process exited with code 255 (0xff, -1)
(These are, in fact, bitwise the same error message.) |
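The equivalence noted above can be checked with a little integer arithmetic. The sketch below assumes the usual explanation: a POSIX exit status keeps only the low 8 bits of the exit code, and the bit pattern 0xffffffff read as a signed 32-bit integer is -1. The helper names are illustrative and are not part of BOINC.

```python
def low_byte(code: int) -> int:
    """Keep only the low 8 bits, as a POSIX exit status does."""
    return code & 0xFF


def as_signed(value: int, bits: int) -> int:
    """Reinterpret an unsigned bit pattern as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


print(low_byte(-1))               # 255
print(hex(low_byte(0xFFFFFFFF)))  # 0xff
print(as_signed(0xFF, 8))         # -1
print(as_signed(0xFFFFFFFF, 32))  # -1
```

All four values share the same low byte, which is why "-1 (0xffffffff)" and "255 (0xff, -1)" describe the same exit code.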
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Just had my first error on OPN1 - more of the same, but less output!... I got _3 and _2 and _4 haven't replied yet, but I think the outcome's a foregone conclusion!
Result Name: OPN1_ 0031441_ 00814_ 3--
I collect data on jobs and ligand sizes as OPN1 jobs turn up: this one had a single job with a 58 atom ligand with 36 branches (torsional degrees of freedom) - that's a complex beastie!!! I've seen larger ligands (65 or 66 atoms being the largest I've seen) but the next largest number of branches I've ever seen was 21 (on a 39 atom ligand).
I wonder if these failing jobs all have extremely complex ligands?
Cheers - Al.
[Edited to correct "largest ligand" numbers]
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jan 22, 2021 3:30:46 AM] |
||
|
|
Macromancer
Veteran Cruncher United States Joined: Sep 6, 2016 Post Count: 994 Status: Offline Project Badges:
|
Getting these computation error crashes somewhat frequently, i.e. weekly. Causes the PC to reboot as well.
Result Name: OPN1_ 0033471_ 08268_ 0--
<core_client_version>7.14.3</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
</stderr_txt>
]]>
Macromancer |
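The negative exit code in this log is the signed 32-bit reading of the hexadecimal value printed next to it. A minimal sketch of that conversion follows. The lookup table is illustrative and covers only two well-known Windows NTSTATUS values, 0xC0000005 (STATUS_ACCESS_VIOLATION) and 0x80000003 (STATUS_BREAKPOINT); the function and table names are assumptions made for this example.

```python
# Two well-known NTSTATUS values; this is not a complete list.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0x80000003: "STATUS_BREAKPOINT",
}


def describe_exit_code(code: int) -> str:
    """Show an exit code with its unsigned 32-bit hex form and a name if known."""
    unsigned = code & 0xFFFFFFFF  # map a negative value back to its bit pattern
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"{code} = 0x{unsigned:08x} ({name})"


print(describe_exit_code(-1073741819))
# -1073741819 = 0xc0000005 (STATUS_ACCESS_VIOLATION)
```

An access violation means the process touched memory it was not allowed to use; on its own it does not say whether the work unit, the application, or the hardware is at fault.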
||
|
|
William Albert
Cruncher Joined: Apr 5, 2020 Post Count: 41 Status: Offline Project Badges:
|
"Getting these computation error crashes somewhat frequently, i.e. weekly. Causes the PC to reboot as well." (Macromancer)
A broken work unit will cause that work unit to crash (or otherwise error out), but it shouldn't destabilize the entire computer. If running WU's is causing your computer to reboot completely, that's a clear sign that your computer is having some type of hardware reliability issue, and you should resolve that before continuing to crunch WU's.
[Edit 3 times, last edit by William Albert at Feb 3, 2021 5:16:18 PM] |
||
|
|
Macromancer
Veteran Cruncher United States Joined: Sep 6, 2016 Post Count: 994 Status: Offline Project Badges:
|
"Getting these computation error crashes somewhat frequently, i.e. weekly. Causes the PC to reboot as well." (Macromancer)
"A broken work unit will cause that work unit to crash (or otherwise error out), but it shouldn't destabilize the entire computer. If running WU's is causing your computer to reboot completely, that's a clear sign that your computer is having some type of hardware reliability issue, and you should resolve that before continuing to crunch WU's." (William Albert)
It very well could be a hardware issue, since this is the only PC that has computation errors. Could it be as simple as needing to allocate more disk space?
<core_client_version>7.14.3</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded</message>
<stderr_txt>
INFO:[08:50:35] Start AutoGrid...
autogrid4: Successful Completion.
INFO:[08:50:50] End AutoGrid...
INFO:[08:50:50] Start AutoDock for ZINC000414667010-ACR2.6_RX1--6y84_001_gln110-rot--CYS156_wcgsplit2.dpf(Job #0)...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 27000
About to enter main loop...(dockings already completed: 0)
INFO:[08:53:44] Finished Docking number 0
INFO:[08:56:36] Finished Docking number 1
INFO:[08:59:24] Finished Docking number 2
. . .
INFO:[09:56:30] Finished Docking number 22
INFO:[09:59:19] Finished Docking number 23
INFO:[11:53:17] Start AutoGrid...
autogrid4: Successful Completion.
INFO:[11:53:32] End AutoGrid...
INFO:[11:53:32] Start AutoDock for ZINC000414667010-ACR2.6_RX1--6y84_001_gln110-rot--CYS156_wcgsplit2.dpf(Job #0)...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 27000
About to enter main loop...(dockings already completed: 0)
INFO:[11:56:23] Finished Docking number 0
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x00007FFCC3F39AD2 |
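The timestamps in a stderr log like the one above can be used to estimate how long each docking takes. The sketch below assumes lines of the form `INFO:[HH:MM:SS] Finished Docking number N`, as in the excerpt, and a run that does not cross midnight. The function name is illustrative, and the sample holds only the "Finished Docking" lines copied from the log above.

```python
import re
from datetime import datetime

FINISHED = re.compile(r"INFO:\[(\d{2}:\d{2}:\d{2})\] Finished Docking number (\d+)")


def docking_intervals(log_text: str) -> list:
    """Return seconds between consecutive 'Finished Docking' entries."""
    entries = [
        (int(m.group(2)), datetime.strptime(m.group(1), "%H:%M:%S"))
        for m in FINISHED.finditer(log_text)
    ]
    intervals = []
    for (prev_n, prev_t), (n, t) in zip(entries, entries[1:]):
        if n == prev_n + 1:  # skip elided stretches and restarts
            intervals.append((t - prev_t).total_seconds())
    return intervals


sample = """INFO:[08:53:44] Finished Docking number 0
INFO:[08:56:36] Finished Docking number 1
INFO:[08:59:24] Finished Docking number 2
INFO:[09:56:30] Finished Docking number 22
INFO:[09:59:19] Finished Docking number 23
INFO:[11:56:23] Finished Docking number 0"""

print(docking_intervals(sample))  # [172.0, 168.0, 169.0]
```

On this excerpt each docking takes a little under three minutes. Note also that the second start in the log again reports "dockings already completed: 0".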
||
|
|
ChristianVirtual
Advanced Cruncher Japan Joined: Jan 11, 2014 Post Count: 55 Status: Offline Project Badges:
|
Got one of those
Result Name: OPN1_ 0033471_ 06810_ 3--
<core_client_version>7.11.0</core_client_version>
<![CDATA[
<message>
process exited with code 255 (0xff, -1)</message>
<stderr_txt>
INFO:[16:08:37] Start AutoGrid...
autogrid4: Successful Completion.
INFO:[16:08:54] End AutoGrid...
INFO:[16:08:54] Start AutoDock for ZINC000100467005-ACR2.44_RX1--6y84_001_gln110-rot--CYS156_wcgsplit2.dpf(Job #0)...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
</stderr_txt>
]]>
It failed on four other donors too, but it looks like my account is set to quorum 2 now? Somehow I am piling up a bunch of pages with pending validation/verification.
----------------------------------------
Active with WCG, GPUGrid, F@H
----------------------------------------
[Edit 1 times, last edit by ChristianVirtual at Feb 4, 2021 11:53:51 AM] |
||
|
|
Macromancer
Veteran Cruncher United States Joined: Sep 6, 2016 Post Count: 994 Status: Offline Project Badges:
|
One of my Linux boxes just shut down last night due to an OPN1 error. It's hard to believe I suddenly have hardware issues with two separate PCs. There must be an issue with the OPN1 work units. Bummer.
Macromancer |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
$ wcgstats -wQt2 -aOPN1 -sE
OPN1_0033471_01974_3-- MSWin 10 717 Error 2/4/21 11:02:06 2/4/21 14:33:50 0.01 0.2 / 0.0
To be more specific:
$ wcgstats -wQt2 -aOPN1 -sE -w
OPN1_0033471_01974_3-- MSWin 10 717 Error 2/4/21 11:02:06 2/4/21 14:33:50 0.01 0.2 / 0.0
----------------------------------------
[Edit 2 times, last edit by adriverhoef at Feb 4, 2021 4:33:16 PM] |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
In adriverhoef's case I am going out on a limb to say that unit was a malformed unit, which will be corrected by the techs/scientists and resubmitted. I have seen these occasionally on different machines, and the usual giveaway is the amount of time they use, which is minimal: maybe just seconds or a minute or two. If they are failing on disparate machines, they must be some sort of bad build.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
It's a pity that OPN1 work-units don't write a bit of job-description information to stderr so we can see what the jobs are like... (Compare with MIP1, for instance.)
I've had four of these job crashes so far, on three different machines (including one on my Pi4!) though none of them took the [Linux] system down! - as I noted further up this thread, the crashing (sub)jobs are working with ligands that have very large numbers of branches (and hence high "Torsional Degrees of Freedom").
I've been logging OPN1 job parameters for quite a while now (I have data on over 32,000 work-units containing over 70,000 jobs) and the largest number of branches I've seen in a task that completed without error has been 21. Three of the four tasks that I've had fail had 36 branches and the other one had 30! (I've not had any batch 33471 tasks, so I seem to have missed out on this tranche of "bad" units! ...)
Unfortunately, unless one is willing to dive into the various files that turn up with an OPN1 job before it starts to run, there doesn't seem to be any way to get hold of those numbers; by the time it has crashed, it's too late!
So we are left to speculate on reasons; are the units malformed, or are they simply too complex for some limit in the existing software??? (For instance, if more degrees of freedom need more heavily nested function calls, perhaps some stack limit is being broken...)
Hopefully, if this becomes a more common occurrence something will be done about it (especially if it seems to be able to crash someone's machine on a regular basis!)
Cheers - Al. |
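For anyone who does want to pull those numbers out before a task runs, here is a minimal sketch. It assumes the ligand is supplied in AutoDock's PDBQT format, where atoms appear as ATOM/HETATM records, each rotatable bond opens a BRANCH record, and a TORSDOF line states the torsional degrees of freedom. Which file in an OPN1 work unit holds the ligand is not stated in this thread, and the tiny sample ligand below is made up for illustration.

```python
def ligand_summary(pdbqt_text: str) -> dict:
    """Count atoms and BRANCH records and read TORSDOF from PDBQT text."""
    atoms = 0
    branches = 0
    torsdof = None
    for line in pdbqt_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        record = fields[0]
        if record in ("ATOM", "HETATM"):
            atoms += 1
        elif record == "BRANCH":  # ENDBRANCH is a different token and is ignored
            branches += 1
        elif record == "TORSDOF" and len(fields) > 1:
            torsdof = int(fields[1])
    return {"atoms": atoms, "branches": branches, "torsdof": torsdof}


# Hypothetical three-atom ligand with one rotatable bond.
SAMPLE = """ROOT
ATOM      1  C1  LIG     1       0.000   0.000   0.000  0.00  0.00     0.000 C
ATOM      2  C2  LIG     1       1.500   0.000   0.000  0.00  0.00     0.000 C
ENDROOT
BRANCH   2   3
ATOM      3  O1  LIG     1       2.000   1.200   0.000  0.00  0.00     0.000 OA
ENDBRANCH   2   3
TORSDOF 1
"""

print(ligand_summary(SAMPLE))  # {'atoms': 3, 'branches': 1, 'torsdof': 1}
```

Run against each ligand file as it arrives, a summary like this would let the branch count be logged alongside the eventual result status.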
||
|
|
|