| World Community Grid Forums
Thread Status: Active Total posts in this thread: 35
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1316 Status: Offline Project Badges:
|
As we are now 24 hours on from when most of the problem tasks were reported, some observations to date...
The majority of the Validation errors seemed to be concentrated between 23:00 UTC on 2025-08-02 and 17:30 UTC on 2025-08-03, with very few after that. Interestingly, there are plenty of examples of results validating during that period as well, so whatever the problem was, it wasn't all-inclusive.
So far we don't seem to have lost any WUs as a result, though there is one that looks to be at risk because of existing Error counts. About half of the reported WUs have managed to validate.
As for the oddities regarding apparent cross-platform validations and possible "anonymous platform", that's probably a matter for separate discussion.
Cheers - Al.
[Edit 1 times, last edit by alanb1951 at Aug 4, 2025 4:45:34 PM] |
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
My device(s) have always been reliable and have not given errors, so I'm really confused why one of my work units (along with a wingman) went "invalid" while the re-sends were valid. That's so weird.
Same with some Errors that showed up after the result was completed and reported with (0) errors. I don't necessarily think my device was the problem, but I'm a bit confused as to exactly what happened.
|
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1316 Status: Offline Project Badges:
|
I have also been hit by several of these "two for one" Validation errors on more than one client device on which the only past ARP1 errors have been download failures! There have also been instances in the past when I've processed a retry for a single ARP1 result that was marked as an Error after validation (and I'm sure I'm not the only person to have experienced this...), so the validator can cope if there is a single result that doesn't meet the criteria for a valid result (e.g. missing or ill-formed file(s)).
I think that it is unlikely that there was a sudden rash of work units that caused bad results to be returned by multiple clients, especially as many of the problem WUs seen by folks like Adri and myself actually got validated results as well -- that tends to suggest that in these cases the failed tasks returned something viable (and possibly should have validated!).
The very fact that it takes out two results instead of flagging one then marking the other as Pending Verification suggests that there was some issue within the validator, and the way some tasks would validate at around the same time others were being marked Invalid suggests that it was a transient problem. Whether that indicates validator restarts or not is [of course] unknown!
If the validator has problems contacting the database it should cope elegantly, so perhaps the issue was with accessing the result files themselves. If the file server is visible but [some of] the files aren't there (or are ill-formed), that's a reasonable Validation error, but if it can't see the file server at all who knows what might happen?!? I'm not sure if it can "fail gracefully" in that case (leaving the results still at Pending Validation), or whether it'll just mark both as Invalid (justified or not!).
(I doubt that their ARP1 validator spins off the unzipping and/or checksum calculations to a sub-process, but if it does and that doesn't run properly, that's another place where a graceful failure may not be possible. Pure speculation, of course!)
However, this is all "informed guesswork" and there are those who would [rightly] point out that the only correct source of information would be WCG tech team (even if the speculation turns out to contain the basics of the actual issue!) -- I am hoping the tech team will eventually tell us what happened on this occasion...
Cheers - Al. |
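To make the "fail gracefully" question above a bit more concrete, here is a minimal Python sketch of how a two-result check could tell an unreachable file server apart from a genuinely bad file. Everything in it is hypothetical (the `Outcome` states, the `StorageUnavailable` and `MalformedResult` exceptions, and `validate_pair` itself); it is not the WCG or BOINC validator code, just an illustration of the distinction being guessed at.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Outcome(Enum):
    VALID = "Valid"
    INVALID = "Invalid"
    PENDING_VALIDATION = "Pending Validation"      # not judged yet, retry later
    PENDING_VERIFICATION = "Pending Verification"  # waiting for another result


class StorageUnavailable(Exception):
    """Hypothetical: the file server could not be reached at all."""


class MalformedResult(Exception):
    """Hypothetical: the server was reachable, but a file was missing or ill-formed."""


Loader = Callable[[], bytes]


def classify(load: Loader) -> Tuple[Outcome, Optional[bytes]]:
    """Try to read one result and say whether it is usable, bad, or unknown."""
    try:
        return Outcome.VALID, load()
    except StorageUnavailable:
        # Transient problem: we learned nothing about the result itself.
        return Outcome.PENDING_VALIDATION, None
    except MalformedResult:
        return Outcome.INVALID, None


def validate_pair(load_a: Loader, load_b: Loader) -> Tuple[Outcome, Outcome]:
    a_state, a_data = classify(load_a)
    b_state, b_data = classify(load_b)

    # Any transient failure: judge neither result and try again later.
    if Outcome.PENDING_VALIDATION in (a_state, b_state):
        return Outcome.PENDING_VALIDATION, Outcome.PENDING_VALIDATION

    # Both files are bad: the only case where "two for one" is justified here.
    if a_state is Outcome.INVALID and b_state is Outcome.INVALID:
        return Outcome.INVALID, Outcome.INVALID

    # Exactly one bad file: flag it, and let the other wait for a retry.
    if a_state is Outcome.INVALID:
        return Outcome.INVALID, Outcome.PENDING_VERIFICATION
    if b_state is Outcome.INVALID:
        return Outcome.PENDING_VERIFICATION, Outcome.INVALID

    # Both readable: they validate only if they agree.
    if a_data == b_data:
        return Outcome.VALID, Outcome.VALID
    return Outcome.PENDING_VERIFICATION, Outcome.PENDING_VERIFICATION
```

In this sketch a storage outage never marks anything Invalid, which is the "graceful" behaviour; a validator that lumped `StorageUnavailable` in with `MalformedResult` would instead take out both results at once, much like the errors reported in this thread.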
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
Yeah, honestly I won't lose sleep over it since it's transient and affected most all of us, so it's not really a personal issue with the stability of our machines that's worth spending much time looking into.
I think you're right that it may have been a server-side problem with the validators or something. I'm getting new ARP1 work (and also re-sends), and they're validating just fine, so all is normal again.
|
||
|
|
MJH333
Senior Cruncher England Joined: Apr 3, 2021 Post Count: 300 Status: Offline Project Badges:
|
I have similar: Just to confirm that this one validated in the end. https://www.worldcommunitygrid.org/contribution/workunit/750378338 (ARP1_0033245_149). Cheers, Mark |
||
|
|
|