Total posts in this thread: 171
This topic has been viewed 731822 times and has 170 replies
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: Invalid GPU work units

----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


----------------------------------------
[Edit 2 times, last edit by nanoprobe at Jun 6, 2021 8:11:00 PM]
[Jun 6, 2021 8:08:05 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2179
Re: Invalid GPU work units

Just today I found an Invalid task on a device that usually never returns any Invalid OPNG task. The original wingman had already returned it as an Invalid. So, a second, third and even a fourth wingman were called.
At first, the fourth wingman returned it as Pending Validation:
$ wcgstats -wQt2 -aOPN1 -m1 -sI
OPNG_0079439_00098_4--   Linuxmint      728   Pending Validation     8/27/21 02:17:50    8/27/21 04:03:35    0.14       1.2 / 0.0
OPNG_0079439_00098_3--   LinuxMint      -     In Progress            8/27/21 02:15:23    8/31/21 02:15:23    0.00       0.0 / 0.0
OPNG_0079439_00098_2--   Linux Ubuntu   -     In Progress            8/27/21 02:03:43    8/31/21 02:03:43    0.00       0.0 / 0.0
OPNG_0079439_00098_1--   Linux Fedora   728   Invalid                8/25/21 13:30:45    8/27/21 02:03:02    0.35       1.3 / 0.0
OPNG_0079439_00098_0--   Linux Fedora   728   Invalid                8/25/21 13:30:39    8/25/21 14:22:13    0.23       0.1 / 0.0
So there was a chance that the second or the third wingman would turn in something hopeful.

That's not what happened though.
Unexpectedly, wingman #3 got a Too Late (17:17:27), before wingman #2 (17:21:47), and the latter one's task was Server Aborted:
workunit 790933198:
OPNG_0079439_00098_4--   Linuxmint      728   Invalid                8/27/21 02:17:50    8/27/21 04:03:35    0.14       1.2 / 0.0
OPNG_0079439_00098_3--   LinuxMint      728   Too Late               8/27/21 02:15:23    8/27/21 17:17:27    0.10       0.9 / 0.0
OPNG_0079439_00098_2--   Linux Ubuntu   728   Server Aborted         8/27/21 02:03:43    8/27/21 17:21:47    0.00       0.0 / 0.0
OPNG_0079439_00098_1--   Linux Fedora   728   Invalid                8/25/21 13:30:45    8/27/21 02:03:02    0.35       1.3 / 0.0
OPNG_0079439_00098_0--   Linux Fedora   728   Invalid                8/25/21 13:30:39    8/25/21 14:22:13    0.23       0.1 / 0.0
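
For anyone who wants to tally listings like the one above, here is a rough Python sketch that splits each row into fields and counts the statuses. The column meanings (OS, app version, status, sent time, due or returned time, hours, claimed / granted credit) are inferred from the listing and are an assumption, as are the names `ROW` and `parse_rows`; this is not part of wcgstats itself.

```python
import re
from collections import Counter

# Hypothetical row pattern, inferred from the listings in this thread.
TIME = r"\d{1,2}/\d{1,2}/\d{2} \d{2}:\d{2}:\d{2}"
ROW = re.compile(
    r"^(?P<task>OPNG_\d+_\d+_\d+)--\s+"
    r"(?P<os>.+?)\s+"
    r"(?P<version>\d+|-)\s+"
    r"(?P<status>[A-Za-z. ]+?)\s+"
    rf"(?P<sent>{TIME})\s+"
    rf"(?P<due_or_returned>{TIME})\s+"
    r"(?P<hours>\d+\.\d+)\s+"
    r"(?P<claimed>\d+\.\d+) / (?P<granted>\d+\.\d+)$"
)


def parse_rows(text):
    """Return one dict per line that looks like a task row; skip the rest."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


SAMPLE = """\
OPNG_0079439_00098_4-- Linuxmint 728 Invalid 8/27/21 02:17:50 8/27/21 04:03:35 0.14 1.2 / 0.0
OPNG_0079439_00098_3-- LinuxMint 728 Too Late 8/27/21 02:15:23 8/27/21 17:17:27 0.10 0.9 / 0.0
OPNG_0079439_00098_2-- Linux Ubuntu 728 Server Aborted 8/27/21 02:03:43 8/27/21 17:21:47 0.00 0.0 / 0.0
OPNG_0079439_00098_1-- Linux Fedora 728 Invalid 8/25/21 13:30:45 8/27/21 02:03:02 0.35 1.3 / 0.0
OPNG_0079439_00098_0-- Linux Fedora 728 Invalid 8/25/21 13:30:39 8/25/21 14:22:13 0.23 0.1 / 0.0
"""

rows = parse_rows(SAMPLE)
status_counts = Counter(row["status"] for row in rows)
print(status_counts)
```

Because the pattern only relies on runs of whitespace between fields, it should cope with both the aligned and the single-spaced form of a row.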

----------------------------------------
[Edit 1 times, last edit by adriverhoef at Aug 27, 2021 9:33:44 PM]
[Aug 27, 2021 9:32:35 PM]
Richard Haselgrove
Senior Cruncher
United Kingdom
Joined: Feb 19, 2021
Post Count: 360
Re: Invalid GPU work units

This is a very different type of problem from the one which prompted this thread. Those were badly-formed tasks which failed immediately they started running.

These are tasks, each of which generates real scientific results, but which fail to agree amongst themselves what the correct answer should be.

They are all assigned to Intel iGPUs - WCG requires each task to validate against another GPU from the same family. But iGPUs have evolved over the years, and different members of the family produce subtly different answers. That's especially true because the WCG iGPU app has been compiled using the -cl-mad-enable build option: Intel themselves say "mad is intended to be used where speed is preferred over accuracy." It's the subtle differences between iGPUs of different generations which cause these problems.
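
To see how a multiply-add can legitimately give different answers on different hardware, here is a minimal Python sketch. It is not taken from the WCG app: the values `a`, `b` and `c` are made up for illustration, and it runs in double precision on the CPU rather than in OpenCL on a GPU. It compares rounding after the multiply and again after the add with rounding only once at the end, which is what a fused multiply-add does. As far as the OpenCL wording goes, a `mad` is free to behave like either, or to be less accurate than both.

```python
from fractions import Fraction

# Illustrative values only: the exact product a*b is 1 - 2**-60,
# which is too close to 1.0 to be represented as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two roundings: the product is rounded to a double, then the sum is rounded.
separate = a * b + c

# One rounding: compute a*b + c exactly, then round the final value once.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print("rounded twice:", separate)
print("rounded once: ", fused)
```

The first result is exactly zero and the second is a tiny negative number (-2**-60). Both are defensible answers to the same arithmetic, which is the kind of small disagreement that could push two otherwise healthy devices apart when their results are compared.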
[Aug 27, 2021 9:54:36 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2179
Re: Invalid GPU work units

Thanks for your answer, Richard.
Like I said, that device which usually never returns any Invalid OPNG task mostly produces Valid results, like this one - between the two Invalids and the other Valid one -

workunit 791631382:
OPNG_0079722_00077_4--   Linux Ubuntu   728   Server Aborted         8/26/21 14:01:19    8/27/21 22:00:35    0.00       0.0 / 0.0
OPNG_0079722_00077_3--   Linux Debian   728   Invalid                8/26/21 10:11:28    8/26/21 14:01:04    0.10       1.0 / 1.0
OPNG_0079722_00077_2--   Linux Fedora   728   Valid                  8/26/21 10:10:41    8/27/21 21:56:27    0.30       1.1 / 776.7
OPNG_0079722_00077_1--   Linux Ubuntu   728   Valid                  8/26/21 04:17:13    8/26/21 10:03:09    0.33       1.0 / 728.6
OPNG_0079722_00077_0--   Linux Fedora   728   Invalid                8/26/21 04:16:22    8/26/21 04:39:44    0.21       0.1 / 0.1
So, if I understand what you're saying: sometimes you're out of luck when meeting other wingmen.
[Aug 28, 2021 12:29:34 AM]
Richard Haselgrove
Senior Cruncher
United Kingdom
Joined: Feb 19, 2021
Post Count: 360
Re: Invalid GPU work units

"So, if I understand what you're saying: sometimes you're out of luck when meeting other wingmen."
Yes, I think that's the implication of how WCG have put this particular app together. But it only applies to the Intel iGPU app.
[Aug 28, 2021 8:55:17 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2179
Re: Invalid GPU work units

This, however, is the first Invalid result for me on an NVIDIA card on which, according to my logfile, I've never seen any Invalids before:

workunit 794505424: ($ wcgstats -wIQ -t 0 -m '22' -a 'OPN1' -s 'I' -p '1' -l1+'0' -rr)
OPNG_0080168_00100_4--   Linux Ubuntu   728   Invalid       8/29/21 00:59:41    8/29/21 01:05:43    0.09       0.7 / 0.0
OPNG_0080168_00100_3--   Linux Ubuntu   728   Invalid       8/29/21 00:59:40    8/29/21 01:08:04    0.07       0.7 / 0.0
OPNG_0080168_00100_2--   Linux Ubuntu   728   Invalid       8/29/21 00:52:21    8/29/21 00:59:32    0.10       0.6 / 0.0
OPNG_0080168_00100_1--   Linuxmint      728   S. Aborted    8/29/21 00:52:19    8/29/21 01:09:50    0.00       0.0 / 0.0
OPNG_0080168_00100_0--   Linux Fedora   728   Invalid       8/29/21 00:47:55    8/29/21 00:52:11    0.05       0.6 / 0.0
---------------------------------------------------------------------------------------------------------------------------------------


Also worth noting is that none of the other wingmen was able to produce a Valid result.
----------------------------------------
[Edit 3 times, last edit by adriverhoef at Aug 29, 2021 10:38:03 AM]
[Aug 29, 2021 9:17:33 AM]
Richard Haselgrove
Senior Cruncher
United Kingdom
Joined: Feb 19, 2021
Post Count: 360
Re: Invalid GPU work units

I've got two like that - one on Linux and one on Windows (both NVidia). Just some random glitch in the cosmos, I reckon.
[Aug 29, 2021 10:43:03 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 988
Re: Invalid GPU work units

OPNG_0080168_00100_3 was mine!

As one doesn't need a wingman for OPNG I doubt they match results to decide on validity, so it has to be something else. Much as I like Richard's "random glitch in the cosmos", I'd like to know what really causes a task to be flagged Invalid -- perhaps the techs might tell us, though they probably aren't too concerned at such low levels of invalid tasks...

I suspect there's still something very near the boundaries of what can be solved properly in the occasional work-unit - if you think back to the tranches of failed tasks in late April there were quite a few "All Invalid" cases then. I also had a handful of "All invalid" tasks for the current receptor in early June (along with a few where some were Valid, some Invalid) but given past experience I just took it as being bound to happen sometimes.

The same receptor is also the current target for OPN1 tasks - I've seen the occasional "Some Valid, some Invalid" cases there too, but it's always been a single system with the problem! (And, unlike at some other projects, we can't look to see if it's always the same system(s) going Invalid [probably just as well, on thinking about it!...]) The odd thing is that I'd expect CPU tasks to be far more likely to be all-or-nothing cases (or produce errors instead of Invalid tasks if there's a hardware glitch), but it is what it is...

What I haven't seen, apart from that batch this month with the bad data, is Errors (other than SIGSEGV and other [probably] non-data-related cases for wingmen!) -- to me, that definitely tends to suggest that whatever is going on here is a case of some "this shouldn't happen" event being detected in the output.

Cheers - Al.
[Aug 30, 2021 2:41:36 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2179
Re: Invalid GPU work units

Thanks, Al. What a coincidence that we 'met'. An 'Invalid' status shouldn't happen with these workunits on devices that normally produce Valid results; when it does, one would say "That's odd!". I guess reporting these cases is the only way to see whether there's a larger underlying problem that the techs can hopefully solve.

Edit: Apart from the one task I already mentioned, I now have two more Invalid ones on two other devices, but the wingmen are still In Progress, of which only one is now showing Too Late. This is what I have thus far, both iGPU:

workunit 794302605:
OPNG_0080505_00146_4--   Linux Ubuntu   728   Too Late               8/30/21 04:27:03    8/30/21 16:06:15    0.37       0.2 / 0.0
OPNG_0080505_00146_3--   Linuxmint      -     In Progress            8/30/21 04:26:33    9/3/21 04:26:33     0.00       0.0 / 0.0
OPNG_0080505_00146_2--   Linuxmint      728   Invalid                8/30/21 04:26:25    8/30/21 08:47:47    0.20       0.8 / 0.0
OPNG_0080505_00146_1--   Linux Debian   728   Invalid                8/28/21 19:54:19    8/29/21 02:54:44    0.09       0.5 / 0.0
OPNG_0080505_00146_0--   Linux Fedora   728   Invalid                8/28/21 19:53:23    8/30/21 04:24:52    0.24       1.0 / 0.0
---------------------------------------------------------------------------------------------------------------------------------------
workunit 793695602:
OPNG_0080326_00358_3--   Linuxmint      -     In Progress            8/30/21 15:37:43    9/3/21 15:37:43     0.00       0.0 / 0.0
OPNG_0080326_00358_4--   Linux Ubuntu   -     In Progress            8/30/21 15:37:05    9/3/21 15:37:05     0.00       0.0 / 0.0
OPNG_0080326_00358_2--   Linux Fedora   728   Invalid                8/29/21 21:49:03    8/30/21 15:36:42    0.23       1.0 / 0.0
OPNG_0080326_00358_1--   Linux Ubuntu   -     In Progress            8/29/21 21:48:55    9/2/21 21:48:55     0.00       0.0 / 0.0
OPNG_0080326_00358_0--   Linux Fedora   728   Invalid                8/28/21 04:56:59    8/29/21 21:47:30    0.32       1.1 / 0.0
---------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------
[Edit 1 times, last edit by adriverhoef at Aug 30, 2021 6:34:43 PM]
[Aug 30, 2021 9:40:47 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2179
Re: Invalid GPU work units

Apart from the two I already mentioned in the previous post, workunit 794302605 and workunit 793695602, I now have two more Invalids, one on an NVIDIA card and one on an Intel GPU:

workunit 796256861:
OPNG_0080516_00188_4--   Linux Fedora   728   Invalid                8/31/21 02:29:20    8/31/21 02:33:36    0.05       0.6 / 0.0
OPNG_0080516_00188_3--   Linux Arch     728   Invalid                8/31/21 02:29:19    8/31/21 03:01:25    0.54       0.6 / 0.0
OPNG_0080516_00188_2--   Linux Ubuntu   728   Server Aborted         8/31/21 02:24:56    8/31/21 02:37:07    0.00       0.0 / 0.0
OPNG_0080516_00188_1--   Linux Fedora   728   Invalid                8/31/21 02:24:55    8/31/21 02:29:12    0.06       0.6 / 0.0
OPNG_0080516_00188_0--   ManjaroLinux   728   Invalid                8/30/21 18:37:21    8/31/21 02:24:48    0.22       0.6 / 0.0
---------------------------------------------------------------------------------------------------------------------------------------

workunit 795243760:
OPNG_0080841_00242_4--   Linux Ubuntu   728   Too Late               8/31/21 02:00:36    8/31/21 04:31:06    0.09       0.7 / 0.0
OPNG_0080841_00242_3--   LinuxMint      728   Valid                  8/31/21 01:55:19    8/31/21 03:47:24    0.06       0.7 / 587.7
OPNG_0080841_00242_1--   Linux Fedora   728   Invalid                8/31/21 00:49:06    8/31/21 01:54:16    0.13       0.0 / 0.0
OPNG_0080841_00242_2--   ManjaroLinux   728   Valid                  8/31/21 00:47:55    8/31/21 01:46:20    0.21       0.9 / 786.9
OPNG_0080841_00242_0--   Linux Fedora   728   Invalid                8/29/21 18:07:07    8/31/21 00:46:50    0.21       0.9 / 0.9
---------------------------------------------------------------------------------------------------------------------------------------

As can be seen, the second one has a mix of Valid and Invalid.

Usually, none of my devices return any Invalid tasks.
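
The difference between an "all Invalid" workunit and one with a mix of Valid and Invalid can be checked mechanically. Below is a small Python sketch; the function name `classify_workunit` and its labels are made up for this example, and the status lists are simply copied from the two workunits above. Results that never reached a verdict (Too Late, Server Aborted, In Progress) are ignored, which is a choice made for this example and not necessarily how WCG treats them.

```python
def classify_workunit(statuses):
    """Label a workunit by the verdicts its results received.

    Only "Valid" and "Invalid" count as verdicts; anything else
    (Too Late, Server Aborted, In Progress, ...) is left out.
    """
    verdicts = [s for s in statuses if s in ("Valid", "Invalid")]
    if not verdicts:
        return "undecided"
    if all(s == "Invalid" for s in verdicts):
        return "all invalid"
    if all(s == "Valid" for s in verdicts):
        return "all valid"
    return "mixed"


# Statuses as listed above, top row first.
wu_796256861 = ["Invalid", "Invalid", "Server Aborted", "Invalid", "Invalid"]
wu_795243760 = ["Too Late", "Valid", "Invalid", "Valid", "Invalid"]

print("796256861:", classify_workunit(wu_796256861))
print("795243760:", classify_workunit(wu_795243760))
```

Kept over a longer period, a tally like this would show whether "all invalid" workunits cluster around particular batches, which is the kind of pattern worth reporting to the techs.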
[Aug 31, 2021 7:39:41 AM]