World Community Grid Forums
Thread Status: Active | Total posts in this thread: 101
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2346 | Status: Offline
Soon I will be running out of SCC1 tasks, apart from faulty batch 0004176. Faulty? Not entirely; you can fix it yourself! I did that with two tasks and one of them went Valid (see post 686839), just by putting the single missing space back between "ATOM" and "62" in a file.

So what I will do now is repair the remaining erroneous tasks from batch 0004176 in my queue, in the hope that someone else will do as I did, so that the two partnered tasks (wingmen) will match and both go Valid. If you are also running out of SCC1 tasks and are left with defective ones from batch 0004176, just give it a try. There is a tiny chance that you will find a wingman as I did. All you need to do is this as superuser:

# cd ~boinc/projects/www.worldcommunitygrid.org

Adri
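[Editorial sketch] The repair described above boils down to inserting one space. The sketch below is illustrative only: 'flexfile.sample' is a hypothetical stand-in (the real data file's name differs per task and is not given in the post), and the exact text surrounding "ATOM62" is an assumption.

```shell
# Illustration only: flexfile.sample stands in for the real data file,
# and the broken record's layout is an assumption based on the post.
printf 'ATOM62 1H3 MET ...\n' > flexfile.sample   # the broken record
sed -i 's/ATOM62/ATOM 62/' flexfile.sample        # insert the missing space
head -1 flexfile.sample                           # line now begins "ATOM 62"
```

Back up the original file first (e.g. `sed -i.bak`) so the unmodified task data can be restored if the repair does not validate.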
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
Adri,
I wondered about doing that the first time round, but opted against it because I couldn't be sure there wasn't also something else wrong with the file that didn't show as a syntax error... So I just hope that is the only error in that data file :-) One side-effect of users fixing the data file might be to disguise the problem.

[Edit:] I thought they had suspended SCC1 to do some clean-up, as there didn't seem to be any new SCC1 of any type for quite a long time... However, new SCC1 tasks started turning up late this afternoon, so perhaps it was just an overnight precaution (their time, not UTC...)

I'm more concerned about how a second bad batch got turned into active WUs after they'd had to deal with the first one -- if it had already been delivered by the scientists, could it not have been checked[1] (and either repaired before WU generation or suppressed, as appropriate!); if it was a new delivery, why hadn't the scientists checked the flex file and repaired it before shipping? And if there are still more bad batches already in the pipeline, I hope they get culled or cured in advance :-)

Cheers - Al.

[1] I don't know how automated the process of accepting SCC1 work and making WUs is, so that might not be as easy as it sounds :-(

[Edited in light of the [apparent] resumption of SCC1 supply, including bad batch cases...]
[Edit 1 times, last edit by alanb1951 at Jun 2, 2023 9:09:24 PM]
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 2173 | Status: Offline
Well, half a day later, as we are definitely heading into the weekend, WCG is still pushing out the new faulty SCC1 batch. Just like last time.

And from WCG Towers, still crickets. Makes me wonder if their strategy is to just let the batch run until every task has errored out on the users' side, instead of cancelling it on the server side before it wastes anyone's bandwidth...

Ralf
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
Clever workaround, adriverhoef, but yeah, it won't address the root cause of what went wrong in the first place, and the odds are that the whole batch will be invalidated and re-issued anyway.
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline
I can't abort these 4176 tasks fast enough. Keep getting sent new ones. Are WCG techs asleep at the wheel*?
* That's a joke. I'll be here all night.
Speedy51
Veteran Cruncher | New Zealand | Joined: Nov 4, 2005 | Post Count: 1326 | Status: Offline
I can't abort these 4176 tasks fast enough. Keep getting sent new ones. Are WCG techs asleep at the wheel*? * That's a joke. I'll be here all night.

To save you being there all night: have you thought about using BoincTasks? It will let you cancel all tasks that are ready to start. I do recommend setting "no new tasks" before cancelling the tasks waiting to start :-)
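[Editorial sketch] For those without BoincTasks, the stock boinccmd tool (shipped with the BOINC client) can do much the same from a shell. This is only a sketch: the project URL and the assumption that every bad task name contains "SCC1_0004176" should be checked against your own client before running it.

```shell
# Sketch: set "no new tasks", then abort queued tasks from the faulty batch.
# The URL and the name pattern are assumptions; adjust for your setup.
URL=https://www.worldcommunitygrid.org
boinccmd --project "$URL" nomorework            # stop fetching new work first
boinccmd --get_tasks \
  | grep -oE 'SCC1_0004176[A-Za-z0-9_.-]*' \
  | while read -r task; do
      boinccmd --task "$URL" "$task" abort      # abort each matching task
    done
```

Note that this aborts every matching task, including any already running; restrict the list first if you only want to drop the ones that have not started.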
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2346 | Status: Offline
At the moment - as I see it - if you abort your faulty tasks (from batch 0004176), your wingmen's tasks will be Server Aborted:

<8> * SCC1_0004176_MyoD1-C_6035_0  Fedora Linux  User Aborted  2023-06-02T03:22:07  2023-06-03T00:05:59

So there isn't much point anymore in fixing these faulty tasks and getting them to work: as soon as a repaired (and finished) task is returned, the server will Server Abort all wingmen's tasks (if they're not running yet), so the mended task will be marked Too Late sooner or later:

<15> * SCC1_0004176_MyoD1-C_1087_0  Fedora Linux  Too Late  2023-06-01T21:40:10  2023-06-02T20:50:37

Adri
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
Adri,
Thanks for posting about those, as I'd been aborting any I spotted but hadn't followed up to see what happened to them!

It looks as if something was done about these bad WUs some time between about 17:00 and 19:00 UTC on 2nd June (WCG afternoon shift?), as any tasks of mine that failed (or that I aborted, if I spotted them) before that period ended up with retries (up to about the same time interval), whereas after that, any tasks that were sent back (or aborted) didn't get retries!

As for those two examples, I think "Too Late" may also appear for returned tasks that are "Don't need" cases, and as retries don't seem to be going out for MyoD1-C tasks any longer, and tasks already out there are being Server Aborted, it looks as if they may have [finally] marked the bad work units as unwanted!

An unwelcome current side-effect of whatever they've done is that the only available SCC1 work now seems to be retries for MyoD1-A/B work-units :-( -- I hope they post something about what is happening regarding the ongoing problems with MyoD1-C batches[1]...

Cheers - Al.

[1] And if that includes the information that the only thing wrong with the flex file was that missing space, it might legitimize your work-around :-) -- not that tampering with data files should ever be acceptable, even in what seems to be a good cause... :-) :-)

[Edit 1 times, last edit by alanb1951 at Jun 3, 2023 5:58:26 AM]
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1403 | Status: Offline
I gave the HOT FIX a try on a Win10 machine. It's a quorum-1 workunit and still running.

https://www.worldcommunitygrid.org/contribution/workunit/312168628

EDIT: all in vain - Too Late / Quorum 1, Replication 2

[Edit 1 times, last edit by Crystal Pellet at Jun 3, 2023 10:06:20 AM]
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2346 | Status: Offline
Al,
An unwelcome current side-effect of whatever they've done is that the only available SCC1 work now seems to be retries for MyoD1-A/B work-units :-(

That may be the result of (type A and B) tasks needing wingmen to resolve the "unreliable" status that you get after processing each and every task from the faulty batch. All the faulty tasks together are creating a hausse (upturn) in tasks (of types A and B) needing verification. Also, type C is still being sent out at a slow pace, because of the resends for types A and B needing verification. Important: the system is holding up and still hasn't collapsed. Also, new workunits for types A and B are being distributed, albeit still sparsely.

[if] the only thing wrong with the flex file was that missing space, it might legitimize your work-around :-) -- not that tampering with data files should ever be acceptable, even in what seems to be a good cause... :-) :-)

Agreed. It seemed like a good idea at first, but in the end it only led to a lot of wasted cycles (and one Valid(*1)). It should probably never be acceptable in any way, other than to point out and document the error.

Adri

[*1] (Output generated by 'wcgstats -frrre* SCC1_0004176_MyoD1-C_0299')
workunit 311931323
SCC1_0004176_MyoD1-C_0299_0  Fedora Linux  Valid  2023-06-01T21:21:18  2023-06-02T09:46:32  0.77/0.78  69.0/69.0

PS Crystal Pellet, nice try!

[Edit 2 times, last edit by adriverhoef at Jun 3, 2023 11:16:18 AM]