World Community Grid Forums
Thread Status: Active Total posts in this thread: 171
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2179 Status: Offline |
Just today I found an Invalid task on a device that usually never returns any Invalid OPNG task. The original wingman had already returned it as an Invalid. So a second, a third and even a fourth wingman were called.
At first, the fourth wingman returned it as Pending Validation:
$ wcgstats -wQt2 -aOPN1 -m1 -sI
So there was a chance that the second or the third wingman would turn in something hopeful. That's not what happened though. Unexpectedly, wingman #3 got a Too Late (17:17:27), before wingman #2 (17:21:47), and the latter one's task was Server Aborted:
workunit 790933198:
OPNG_0079439_00098_4-- Linuxmint 728 Invalid 8/27/21 02:17:50 8/27/21 04:03:35 0.14 1.2 / 0.0
[Edit 1 times, last edit by adriverhoef at Aug 27, 2021 9:33:44 PM] |
||
|
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline |
This is a very different type of problem from the one which prompted this thread. Those were badly-formed tasks which failed immediately they started running.
These are tasks, each of which generates real scientific results, but which fail to agree amongst themselves what the correct answer should be. They are all assigned to Intel iGPUs - WCG requires each task to validate against another GPU from the same family. But iGPUs have evolved over the years, and different members of the family produce subtly different answers.
That's especially true because the WCG iGPU app has been compiled using the -cl-mad-enable compiler option: Intel themselves say "mad is intended to be used where speed is preferred over accuracy." It's the subtle differences between iGPUs of different generations which cause these problems. |
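To illustrate the kind of difference described above, here is a minimal Python sketch. It is not the WCG app's code, and it makes no claim about what any particular Intel iGPU does internally. It only compares a multiply-add that is rounded twice with one that is rounded once; the function names and the input values are made up for the illustration.

```python
from fractions import Fraction


def two_step(a: float, b: float, c: float) -> float:
    """Multiply and round to double, then add and round again."""
    return a * b + c


def single_rounding(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Hypothetical inputs chosen so that the product a*b cannot be
# represented exactly in a double and gets rounded to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(two_step(a, b, c))         # 0.0
print(single_rounding(a, b, c))  # -2**-60, about -8.67e-19
```

Neither path is broken; they simply round at different points. If small differences like this build up over a long calculation, two devices could plausibly end up with results that no longer agree closely enough to validate against each other.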
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2179 Status: Offline |
Thanks for your answer, Richard.
Like I said, that device, which usually never returns any Invalid OPNG task, mostly produces Valid results, like this one - between the two Invalids and the other Valid one:
workunit 791631382:
OPNG_0079722_00077_4-- Linux Ubuntu 728 Server Aborted 8/26/21 14:01:19 8/27/21 22:00:35 0.00 0.0 / 0.0
So, if I understand what you're saying: sometimes you're out of luck when meeting other wingmen. |
||
|
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline |
"So, if I understand what you're saying: sometimes you're out of luck when meeting other wingmen."
Yes, I think that's the implication of how WCG have put this particular app together. But it only applies to the Intel iGPU app. |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2179 Status: Offline |
This, however, is the first Invalid result for me on an NVIDIA card on which I had never seen any Invalids before (according to my logfile):
workunit 794505424: ($ wcgstats -wIQ -t 0 -m '22' -a 'OPN1' -s 'I' -p '1' -l1+'0' -rr)
OPNG_0080168_00100_4-- Linux Ubuntu 728 Invalid 8/29/21 00:59:41 8/29/21 01:05:43 0.09 0.7 / 0.0
Also worth noting: none of the other wingmen was able to produce a Valid result either.
[Edit 3 times, last edit by adriverhoef at Aug 29, 2021 10:38:03 AM] |
||
|
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline |
I've got two like that - one on Linux and one on Windows (both NVidia). Just some random glitch in the cosmos, I reckon.
|
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 988 Status: Offline |
OPNG_0080168_00100_3 was mine!
As one doesn't need a wingman for OPNG I doubt they match results to decide on validity, so it has to be something else. Much as I like Richard's "random glitch in the cosmos", I'd like to know what really causes a task to be flagged Invalid -- perhaps the techs might tell us, though they probably aren't too concerned at such low levels of invalid tasks...
I suspect there's still something very near the boundaries of what can be solved properly in the occasional work-unit - if you think back to the tranches of failed tasks in late April there were quite a few "All Invalid" cases then. I also had a handful of "All invalid" tasks for the current receptor in early June (along with a few where some were Valid, some Invalid) but given past experience I just took it as being bound to happen sometimes.
The same receptor is also the current target for OPN1 tasks - I've seen the occasional "Some Valid, some Invalid" cases there too, but it's always been a single system with the problem! (And, unlike at some other projects, we can't look to see if it's always the same system(s) going Invalid [probably just as well, on thinking about it!...]) The odd thing is that I'd expect CPU tasks to be far more likely to be all-or-nothing cases (or produce errors instead of Invalid tasks if there's a hardware glitch), but it is what it is...
What I haven't seen, apart from that batch this month with the bad data, is Errors (other than SIGSEGV and other [probably] non-data-related cases for wingmen!) -- to me, that definitely tends to suggest that whatever is going on here is a case of some "this shouldn't happen" event being detected in the output.
Cheers - Al. |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2179 Status: Offline |
Thanks Al. What a coincidence that we 'met'. An 'Invalid' status is something that shouldn't happen with these workunits on devices that normally produce Valid results; one would say "That's odd!" when it occurs there. So I guess reporting these cases is the only way to see if there's a larger underlying problem that hopefully can be solved by the techs.
Edit: Apart from the one task I already mentioned, I now have two more Invalid ones on two other devices, but the wingmen are still In Progress, and only one of them is now showing Too Late. This is what I have thus far, both iGPU:
workunit 794302605:
OPNG_0080505_00146_4-- Linux Ubuntu 728 Too Late 8/30/21 04:27:03 8/30/21 16:06:15 0.37 0.2 / 0.0
workunit 793695602:
OPNG_0080326_00358_3-- Linuxmint - In Progress 8/30/21 15:37:43 9/3/21 15:37:43 0.00 0.0 / 0.0
[Edit 1 times, last edit by adriverhoef at Aug 30, 2021 6:34:43 PM] |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2179 Status: Offline |
Apart from the two I already mentioned in the previous post, workunit 794302605 and workunit 793695602, I now have two more Invalids, one on an NVIDIA card and one on an Intel GPU:
workunit 796256861:
OPNG_0080516_00188_4-- Linux Fedora 728 Invalid 8/31/21 02:29:20 8/31/21 02:33:36 0.05 0.6 / 0.0
workunit 795243760:
OPNG_0080841_00242_4-- Linux Ubuntu 728 Too Late 8/31/21 02:00:36 8/31/21 04:31:06 0.09 0.7 / 0.0
As can be seen, the second one has a mix of Valid and Invalid. Usually, none of my devices return any Invalid tasks. |
||
|