Thread Status: Active
Total posts in this thread: 171
This topic has been viewed 701739 times and has 170 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 971
Status: Offline
Re: Invalid GPU work units

@knreed - any chance something like this could be done as a side-effect of the OPNG down-time?


This is something that we will pass along to the Krembil staff as we transition everything over to them. The next few months will be an intensive focus on migrating the system over to Krembil's infrastructure and cross-training their staff to become familiar with the different pieces of the system (some of this has been going on for a while now, but some couldn't start until the change became official).

As a result, I don't think there will be much capacity to take this on in the short term, but I think that is something that can be revisited once the migration is complete.

Thanks for the acknowledgment - I didn't really think there was much chance for now, especially if it didn't take too long to sort out the validator, but I did think it was worth a mention...

Thanks also for the effort entailed in sorting out the validator; it looks as if it is helping with these awkward tasks!

Good luck with the ongoing work.

Cheers - Al.
[Sep 23, 2021 9:26:42 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Error GPU work units

(slightly off-topic, but closely related ...)
I've had 15 WUs from 3 devices error out, mostly or all when restarting from sleep or hibernation.
I haven't seen this behaviour before.

The detailed error messages are in the form:
Error: Unable to open job description file OPNG_<WU ID number>.job
and the exit codes are:
<message>
The environment is incorrect.
(0xa) - exit code 10 (0xa)</message>


Example of the end part of a log file:
INFO:[08:59:15] Start AutoDock for OB3ZINC001107003131--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #64)...
OpenCL device: GeForce GTX 970
INFO:[08:59:45] End AutoDock...
INFO:[08:59:45] Start AutoDock for OB3ZINC000903000887--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #65)...
OpenCL device: GeForce GTX 970
projects/www.worldcommunitygrid.org/wcgrid_opng_autodockgpu_7.28_windows_x86_64__opencl_nvidia_102 -jobs OPNG_0087276_00021.job -input OPNG_0087276_00021.zip -seed 660440409 -wcgruns 13300 -wcgdpf 266
INFO: Using gpu device from app init data 0
Error: Unable to open job description file OPNG_0087276_00021.job
15:10:10 (7176): called boinc_finish(10)

The machine (3770K, Win7-x64) went into scheduled sleep at 09:00 and was woken manually at around 15:09 (UTC+10, 24/9/21).
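
For anyone who wants to check their own stderr logs for the same pattern, here is a minimal Python sketch (not part of the original post). It assumes the log lines look like the excerpt above: it finds the last timestamped INFO line, the boinc_finish line and the job-file error, and reports the time gap between the first two. The function name `summarise` is made up for illustration, and because the log carries no dates the gap is assumed to be under 24 hours.

```python
import re
from datetime import datetime, timedelta

# Patterns modelled on the log excerpt in this post (an assumption, not a spec).
INFO_RE = re.compile(r"^INFO:\[(\d{2}:\d{2}:\d{2})\]")
FINISH_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \(\d+\): called boinc_finish\((\d+)\)")
JOB_ERROR = "Error: Unable to open job description file"


def summarise(stderr_text):
    """Return (gap, exit_code, job_error) for one stderr log."""
    last_info = finish = exit_code = None
    job_error = False
    for raw in stderr_text.splitlines():
        line = raw.strip()
        m = INFO_RE.match(line)
        if m:
            last_info = datetime.strptime(m.group(1), "%H:%M:%S")
        m = FINISH_RE.match(line)
        if m:
            finish = datetime.strptime(m.group(1), "%H:%M:%S")
            exit_code = int(m.group(2))
        if line.startswith(JOB_ERROR):
            job_error = True
    gap = None
    if last_info is not None and finish is not None:
        gap = finish - last_info
        if gap < timedelta(0):
            gap += timedelta(days=1)  # no dates in the log: assume < 24 h
    return gap, exit_code, job_error


sample = """\
INFO:[08:59:45] End AutoDock...
INFO:[08:59:45] Start AutoDock for OB3ZINC000903000887--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #65)...
OpenCL device: GeForce GTX 970
INFO: Using gpu device from app init data 0
Error: Unable to open job description file OPNG_0087276_00021.job
15:10:10 (7176): called boinc_finish(10)
"""

gap, exit_code, job_error = summarise(sample)
print(gap, exit_code, job_error)
```

A gap of several hours combined with the job-file error would be consistent with the task having been interrupted by sleep and failing on resume, though the log alone does not prove that.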
-----
Off-topic: I trialled putting the parts of the above message that are in Courier font into "small" font, but they become ridiculously tiny. Yes, ridiculously tiny!
Could we perhaps have something in-between?
And when typing in the initial message, I selected some text, then selected Courier in the Font box, and the forum software just inserted <square bracket> font=courier new] and <square bracket> font] around the selected text, in situ.
But when I did this for some other text after the first preview, I got popup windows which duplicated the text upon closing the popups. Inconsistent.

And why is the text in the results status pages so huge? And grey? And why is the "apply filters" button there initially scrolled off-page?
HTH
[Sep 24, 2021 7:48:49 AM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2186
Status: Offline
Re: Error GPU work units

Yes, I noticed the same here. Not that I put my computer to sleep, but I tried suspending the WU, shutting down BOINC, then restarting BOINC and resuming the WU.

Like yours, it seems to have been one of the "big" ones with many "jobs" in it. The "normal" ones are perfectly safe to suspend and resume, and the computer can be put to sleep as well. However, these "big" ones do not seem able to handle that without erroring out.

My WU (Windows 8.1): https://www.worldcommunitygrid.org/contribution/workunit/821144975

Part of data from stderr_txt:

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
Felaktig miljö. (which in English means "The environment is incorrect.")
(0xa) - exit code 10 (0xa)</message>


And at the end of stderr.txt:

INFO:[14:55:43] Start AutoDock for OB3ZINC000923890044_2--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #15)...
OpenCL device: GeForce GTX 980
projects/www.worldcommunitygrid.org/wcgrid_opng_autodockgpu_7.28_windows_x86_64__opencl_nvidia_102 -jobs OPNG_0087329_00019.job -input OPNG_0087329_00019.zip -seed 1054932250 -wcgruns 13150 -wcgdpf 263
INFO: Using gpu device from app init data 0
Error: Unable to open job description file OPNG_0087329_00019.job
15:00:08 (2636): called boinc_finish(10)
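
To put a number on how "big" a WU is, the command line in the log can be pulled apart. This is only a sketch: reading -wcgdpf as the number of docking jobs and -wcgruns as the total number of runs is an assumption inferred from the logs in this thread, not something the project has documented here, and `parse_wcg_args` is a made-up helper name.

```python
import shlex


def parse_wcg_args(cmdline):
    """Split the logged command line into a {flag: value} dict.

    Assumes the layout seen in the log: executable path first,
    then alternating '-flag value' pairs.
    """
    tokens = shlex.split(cmdline)
    return {flag.lstrip("-"): value
            for flag, value in zip(tokens[1::2], tokens[2::2])}


cmdline = (
    "projects/www.worldcommunitygrid.org/"
    "wcgrid_opng_autodockgpu_7.28_windows_x86_64__opencl_nvidia_102 "
    "-jobs OPNG_0087329_00019.job -input OPNG_0087329_00019.zip "
    "-seed 1054932250 -wcgruns 13150 -wcgdpf 263"
)

opts = parse_wcg_args(cmdline)
jobs = int(opts["wcgdpf"])            # assumed: number of docking jobs
runs_per_job = int(opts["wcgruns"]) // jobs
print(opts["jobs"], jobs, runs_per_job)
```

If that reading is right, this WU holds 263 jobs, far more than the 36 shown in the 2023 Linux log later in this thread.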

----------------------------------------
[Edit 3 times, last edit by Grumpy Swede at Sep 24, 2021 1:34:43 PM]
[Sep 24, 2021 1:30:13 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Error GPU work units

Thanks, Grumpy Swede :)
I just found another oddity: A WU where all 7 copies were deemed "Too late".
OPNG_0086420_00045
My log file for it looks normal.
[Sep 24, 2021 3:25:47 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2186
Status: Offline
Re: Error GPU work units

Thanks, Grumpy Swede :)
I just found another oddity: A WU where all 7 copies were deemed "Too late".
OPNG_0086420_00045
My log file for it looks normal.

That's from a batch known to be problematic. The reason for the GPU pause the other day was for the team to try to fix that issue with an updated validator. There are thousands of those problematic batch 0086XXX WUs out there. So, that one you can forget about. :)

Edit: The validator fix didn't solve the "tight" issue totally, but it became at least a bit better. We're now past the 0086XXX batch, so until another equally "tight" batch comes our way, we will hopefully see fewer of the "Too Late" and "Invalid" messages.
----------------------------------------
[Edit 2 times, last edit by Grumpy Swede at Sep 24, 2021 4:08:24 PM]
[Sep 24, 2021 4:02:48 PM]
Richard Haselgrove
Senior Cruncher
United Kingdom
Joined: Feb 19, 2021
Post Count: 360
Status: Offline
Re: Error GPU work units

Something strange and new is happening with the validator. I got replication 7 (_6) of https://www.worldcommunitygrid.org/contribution/workunit/826331107.

None of the others have passed verification yet.

Edit - first and last are valid, the five in the middle are invalid. What's up with that?
----------------------------------------
[Edit 1 times, last edit by Richard Haselgrove at Sep 29, 2021 7:47:30 AM]
[Sep 29, 2021 7:45:18 AM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2186
Status: Offline
Re: Error GPU work units

Something strange and new is happening with the validator. I got replication 7 (_6) of https://www.worldcommunitygrid.org/contribution/workunit/826331107.

None of the others have passed verification yet.

Edit - first and last are valid, the five in the middle are invalid. What's up with that?

I think this one is on its way to the same high number of replications: https://www.worldcommunitygrid.org/contribution/workunit/827428602
[Sep 29, 2021 9:03:49 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2167
Status: Offline
Re: Invalid GPU work units

Something to sort out for someone …

workunit 273672597
OPNG_0172048_00083_0  Linux Ubuntu  Inval  2023-02-28T20:14:15  2023-02-28T21:24:11  0.25/0.25  58.6/0.0
OPNG_0172048_00083_1  Linux Ubuntu  Inval  2023-02-28T20:14:33  2023-02-28T23:40:14  0.13/0.13  58.6/0.0
OPNG_0172048_00083_2  Fedora Linux  Inval  2023-04-03T01:12:33  2023-04-03T01:23:46  0.17/0.18  58.6/0.0
OPNG_0172048_00083_3  Linux Ubuntu  Inval  2023-04-03T01:12:36  2023-04-03T02:13:17  0.18/0.18  58.6/0.0
OPNG_0172048_00083_4  Linux Ubuntu  Inval  2023-04-03T01:12:37  2023-04-03T01:41:05  0.18/0.24  58.6/0.0
OPNG_0172048_00083_5  Linuxmint     SAbrt  2023-04-03T01:43:57  2023-04-03T02:03:13  0.00/0.00   0.0/0.0
OPNG_0172048_00083_6  Linuxmint     Inval  2023-04-03T01:43:59  2023-04-03T01:56:47  0.20/0.20  58.6/0.0

workunit 273984646
OPNG_0172088_00112_0  Linuxmint     Inval  2023-03-01T01:44:42  2023-03-01T01:55:48  0.09/0.09  58.6/0.0
OPNG_0172088_00112_1  Linuxmint     Inval  2023-04-06T14:06:15  2023-04-06T17:16:46  0.31/0.31  58.6/0.0
OPNG_0172088_00112_2  Linuxmint     SAbrt  2023-04-06T14:06:16  2023-04-06T17:51:41  0.00/0.00   0.0/0.0
OPNG_0172088_00112_3  Fedora Linux  Inval  2023-04-06T17:20:29  2023-04-06T17:58:31  0.31/0.32  58.6/0.0
OPNG_0172088_00112_4  Fedora Linux  Inval  2023-04-06T17:20:33  2023-04-06T17:33:03  0.17/0.18  58.6/0.0
OPNG_0172088_00112_5  Linux Ubuntu  Inval  2023-04-06T17:36:24  2023-04-06T17:44:30  0.10/0.10  58.6/0.0
OPNG_0172088_00112_6  Linux Ubuntu  Inval  2023-04-06T17:36:24  2023-04-06T17:51:20  0.18/0.18  58.6/0.0

workunit 276778923
OPNG_0171938_00043_0  Linux Ubuntu  Inval  2023-04-06T09:51:33  2023-04-06T10:01:49  0.03/0.03  58.6/0.0
OPNG_0171938_00043_1  Linuxmint     SAbrt  2023-04-06T20:22:30  2023-04-08T08:22:30  0.00/0.00   0.0/0.0
OPNG_0171938_00043_2  Linux Ubuntu  Inval  2023-04-06T20:22:31  2023-04-06T22:14:40  0.60/0.61  58.6/0.0
OPNG_0171938_00043_3  LinuxMint     Inval  2023-04-06T22:14:56  2023-04-06T22:25:20  0.16/0.16  58.6/0.0
OPNG_0171938_00043_4  Linux Ubuntu  Inval  2023-04-06T22:15:07  2023-04-06T22:19:20  0.05/0.05  58.6/0.0
OPNG_0171938_00043_5  Fedora Linux  Inval  2023-04-06T22:19:32  2023-04-06T22:30:53  0.17/0.17  58.6/0.0
OPNG_0171938_00043_6  Linux Ubuntu  Inval  2023-04-06T22:19:34  2023-04-06T22:26:43  0.11/0.11  58.6/0.0
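
Listings like the ones above can be tallied with a few lines of Python. The sketch below is an illustration only: the column meanings (task name, OS, status, sent time, returned time, then two number pairs) are assumed from the layout, and the field names in the regular expression are invented for this example.

```python
import re
from collections import Counter

TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
# The OS column may contain spaces, so it is matched lazily and the
# timestamps are matched strictly to keep the split unambiguous.
ROW_RE = re.compile(
    r"^(?P<task>OPNG_\d+_\d+_\d+)\s+(?P<os>.+?)\s+(?P<status>\S+)\s+"
    rf"(?P<sent>{TS})\s+(?P<returned>{TS})\s+"
    r"(?P<time_pair>[\d.]+/[\d.]+)\s+(?P<credit_pair>[\d.]+/[\d.]+)$"
)


def parse_rows(text):
    """Parse task rows; lines that do not match the layout are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            rows.append(m.groupdict())
    return rows


listing = """\
OPNG_0171938_00043_0  Linux Ubuntu  Inval  2023-04-06T09:51:33  2023-04-06T10:01:49    0.03/0.03      58.6/0.0
OPNG_0171938_00043_1 Linuxmint SAbrt 2023-04-06T20:22:30 2023-04-08T08:22:30 0.00/0.00 0.0/0.0
OPNG_0171938_00043_2 Linux Ubuntu Inval 2023-04-06T20:22:31 2023-04-06T22:14:40 0.60/0.61 58.6/0.0
"""

rows = parse_rows(listing)
counts = Counter(row["status"] for row in rows)
print(counts)
```

A workunit is "all-Invalid" in the sense used here when no status other than Inval and SAbrt appears in the tally.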

----------------------------------------
[Edit 2 times, last edit by adriverhoef at Apr 29, 2023 9:34:50 PM]
[Apr 3, 2023 9:10:22 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2167
Status: Offline
Re: Invalid GPU work units

It's been a few weeks since I saw an all-Invalid workunit. Here follows one I stumbled across today:

workunit 295237528
OPNG_0187139_00111_0   Fedora Linux  Inval  2023-04-29T12:56:41  2023-04-29T13:34:02  0.15/0.15  58.6/0.0
OPNG_0187139_00111_1   Linux Ubuntu  Inval  2023-04-29T13:34:24  2023-04-29T16:15:57  0.07/0.07  58.6/0.0
OPNG_0187139_00111_2   Linuxmint     SAbrt  2023-04-29T13:34:31  2023-04-29T17:44:33  0.00/0.00   0.0/0.0
OPNG_0187139_00111_3   Linux Ubuntu  Inval  2023-04-29T16:16:10  2023-04-29T17:24:51  0.12/0.12  58.6/0.0
OPNG_0187139_00111_4   Linux Ubuntu  SAbrt  2023-04-29T16:16:28  2023-04-29T17:46:03  0.00/0.00   0.0/0.0
OPNG_0187139_00111_5   Linux GNOME   Inval  2023-04-29T17:25:50  2023-04-29T17:41:54  0.05/0.05  58.6/0.0
OPNG_0187139_00111_6   Linux Ubuntu  Inval  2023-04-29T17:25:36  2023-04-29T17:41:49  0.07/0.09  58.6/0.0
OPNG_0187139_00111_7   Linux Ubuntu  Inval  2023-05-01T16:56:20  2023-05-01T17:08:18  0.16/0.16  58.6/0.0
OPNG_0187139_00111_8   Linuxmint     Inval  2023-05-01T16:56:15  2023-05-01T17:02:35  0.08/0.08  58.6/0.0
OPNG_0187139_00111_9   MSWin 10      Inval  2023-05-13T03:49:23  2023-05-13T03:57:42  0.04/0.06  58.6/0.0
OPNG_0187139_00111_10  MSWin 7       SAbrt  2023-05-13T03:49:24  2023-05-13T04:00:39  0.00/0.00   0.0/0.0

And this is what I found in its error log:
INFO:[18:34:42] Start AutoDock for OB37744998--5rmm_rna5_0--BASN179.dpf(Job #8)...
OpenCL device: NVIDIA GeForce GTX 1660 Ti
Error: Two atoms have the same XYZ coordinates!
INFO:[18:34:42] End AutoDock...
[ERROR] Failed to open either source or destination files while appending wcg_autodock4_sub.dlg to wcg_autodock4.dlg. Error: 2
[ERROR] Failed to open either source or destination files while appending wcg_autodock4_sub.xml to wcg_ad4-result.xml. Error: 2
INFO:[18:34:43] Start AutoDock for OB37708922--5rmm_rna5_0--BASN179.dpf(Job #9)...
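
To illustrate what the "Two atoms have the same XYZ coordinates!" message is complaining about, here is a small Python sketch that looks for atoms sharing a position. It is not AutoDock's own check, whose exact rule is not shown in the log; the atom names and coordinates are hypothetical, and rounding to three decimals is an arbitrary choice for the example.

```python
from collections import defaultdict


def duplicate_positions(atoms, decimals=3):
    """Group atom names by rounded (x, y, z) and keep only shared positions.

    atoms: iterable of (name, x, y, z) tuples.
    """
    seen = defaultdict(list)
    for name, x, y, z in atoms:
        key = (round(x, decimals), round(y, decimals), round(z, decimals))
        seen[key].append(name)
    return {pos: names for pos, names in seen.items() if len(names) > 1}


# Hypothetical ligand atoms; C1 and H7 sit at the same point.
atoms = [
    ("C1", 1.000, 2.000, 3.000),
    ("N2", 1.450, 2.100, 3.300),
    ("H7", 1.000, 2.000, 3.000),
]

clashes = duplicate_positions(atoms)
print(clashes)
```

An input with such a clash would fail the same way on every host, which fits every completed replica ending up Invalid rather than just one.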

EDIT 09-05-2023: It wasn't over yet, wingmen _7 and _8 have been added.
EDIT 15-05-2023: It wasn't over yet, wingmen _9 and _10 have been added.
----------------------------------------
[Edit 8 times, last edit by adriverhoef at May 15, 2023 5:20:35 PM]
[Apr 6, 2023 6:05:04 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2167
Status: Offline
Re: Invalid GPU work units

It's two days later, another workunit with all-Invalid tasks ...
workunit 296839243
OPNG_0185377_00039_0   Fedora Linux  Inval  2023-05-01T16:37:18  2023-05-01T16:41:40  0.06/0.07  58.6/0.0
OPNG_0185377_00039_1   Linuxmint     Inval  2023-05-01T16:41:12  2023-05-01T16:51:44  0.07/0.07  58.6/0.0
OPNG_0185377_00039_2   Linux Ubuntu  SAbrt  2023-05-01T16:51:47  2023-05-01T18:10:44  0.00/0.00   0.0/0.0
OPNG_0185377_00039_3   Linux Debian  Inval  2023-05-01T16:52:14  2023-05-01T16:54:20  0.02/0.02  58.6/0.0
OPNG_0185377_00039_4   Linux GNOME   SAbrt  2023-05-01T16:52:15  2023-05-01T20:16:50  0.00/0.00   0.0/0.0
OPNG_0185377_00039_5   Linux Ubuntu  Inval  2023-05-01T16:54:42  2023-05-01T18:05:26  0.08/0.08  58.6/0.0
OPNG_0185377_00039_6   Linux Ubuntu  Inval  2023-05-01T16:54:47  2023-05-01T17:00:35  0.02/0.02  58.6/0.0
OPNG_0185377_00039_7   Linuxmint     SAbrt  2023-05-01T17:01:28  2023-05-01T18:31:29  0.00/0.00   0.0/0.0
OPNG_0185377_00039_8   LinuxMint     SAbrt  2023-05-01T17:01:31  2023-05-01T18:19:23  0.00/0.00   0.0/0.0
OPNG_0185377_00039_9   MSWin 10      Inval  2023-05-13T03:49:22  2023-05-13T03:57:00  0.09/0.09  58.6/0.0
OPNG_0185377_00039_10  MSWin 10      Inval  2023-05-13T03:49:23  2023-05-13T03:53:33  0.04/0.05  58.6/0.0
---------------------------------------------------------------------------------------------------------------------------------------
Details:
Logfile:
<core_client_version>7.20.2</core_client_version>
<stderr_txt>
../../projects/www.worldcommunitygrid.org/wcgrid_opng_autodockgpu_7.28_x86_64-pc-linux-gnu__opencl_nvidia_102 -jobs OPNG_0185377_00039.job -input OPNG_0185377_00039.zip -seed 454129411 -wcgruns 1800 -wcgdpf 36
INFO: Using gpu device from app init data 0
INFO:[18:37:28] Start AutoGrid...

autogrid4: Successful Completion.
INFO:[18:37:45] End AutoGrid...
INFO:[18:37:45] Start AutoDock for OB37744998--5rl7_nuc_0--ASER289.dpf(Job #0)...
OpenCL device: NVIDIA GeForce GTX 1660 SUPER
Error: Two atoms have the same XYZ coordinates!
INFO:[18:37:47] End AutoDock...
[ERROR] Failed to open either source or destination files while appending wcg_autodock4_sub.dlg to wcg_autodock4.dlg. Error: 2
[ERROR] Failed to open either source or destination files while appending wcg_autodock4_sub.xml to wcg_ad4-result.xml. Error: 2
INFO:[18:37:47] Start AutoDock for OB37736533_4--5rl7_nuc_0--ASER289.dpf(Job #1)...

No further errors until the task was completed and uploaded, same outcome for each completed task as part of the workunit.

Note: task _0 led to Invalid, upon which task _1 was sent.
When that one led to Invalid, tasks _2, _3 and _4 were sent.
When task _3 went Invalid, tasks _5 and _6 were sent out.
When task _6 got an Invalid outcome, tasks _7 and _8 were transmitted.
When task _5 fell into the Invalid trap, too, all possible remaining tasks were Server Aborted.

If you wonder 'where did I see this error before?', well, here it is in thread 44931.

Adri
PS This workunit also had a due time of 2023-05-03T04:37:18+0000 initially (task _0); that's 1½ days instead of the usual three.
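
The 1½-day figure can be checked with a short calculation. This sketch assumes that the first timestamp in the listing for task _0 (2023-05-01T16:37:18) is its sent time and that it is in UTC like the due time; neither is stated explicitly in the listing.

```python
from datetime import datetime, timezone

# Sent time taken from the task _0 row; UTC is an assumption.
sent = datetime.fromisoformat("2023-05-01T16:37:18").replace(tzinfo=timezone.utc)
# Due time as quoted in the PS, with its explicit +0000 offset.
due = datetime.strptime("2023-05-03T04:37:18+0000", "%Y-%m-%dT%H:%M:%S%z")

window = due - sent
hours = window.total_seconds() / 3600
print(f"{hours:.0f} hours = {hours / 24:.1f} days")
```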

EDIT 15-05-2023: It wasn't over yet, wingmen _9 and _10 have been added.
----------------------------------------
[Edit 5 times, last edit by adriverhoef at May 15, 2023 5:23:22 PM]
[May 1, 2023 8:03:28 PM]