World Community Grid Forums
Category: Active Research Forum: Smash Childhood Cancer Thread: Defective Batch |
Thread Status: Active Total posts in this thread: 35
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2089 Status: Offline Project Badges: |
"And those who successfully abort them before running are causing everyone else's machines to be unreliable."
Is this an accusation, Mike? In any case, it isn't a fair statement, because anyone who doesn't abort one will have to run the faulty task, causing it to fail (and getting the unreliable status), whereupon in a normal case this also leads to more copies being sent out to be crunched by others. [*1]
So, what is the difference, Mike?
Adri
[*1] Up to the point where the maximum number of copies is reached: workunit 316493304 SCC1_0004174_MyoD1-C_6410_0 LinuxMint Error 2023-06-11T09:31:14 2023-06-11T14:16:26
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1316 Status: Offline Project Badges: |
"And those who successfully abort them before running are causing everyone else's machines to be unreliable."
Weird conclusion.
First of all: It's a failing space in the pdbqt-files belonging to the SCC1 MyoD1-C workunits, causing the errors and so making a lot of machines unreliable. The WCG-team seems not to be able to fix it or ask the SCC-team to produce correct input files.
Second: Aborting a task or returning an error task has the same effect on the workunit. If max replication is not reached, in both cases the system will send a resend to a reliable machine.
One difference: Aborting a task does not make a host unreliable. Reliable hosts are important to process error tasks or tasks coming too late/never returning a result. As far as I could discover, 10 valid results in a row of 1 application makes your host reliable again for that application.
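A minimal sketch of the rules described in this post, written as a toy model rather than the actual server logic. The class names, the `max_copies` value and the choice that an abort leaves the valid-result streak untouched are all assumptions made for illustration; the 10-in-a-row threshold is the poster's own observation, not a documented setting.

```python
from dataclasses import dataclass

# Observed by the poster, not an official figure.
RELIABLE_STREAK = 10


@dataclass
class Host:
    """Reliability state of one host for one application (hypothetical model)."""
    reliable: bool = True
    valid_streak: int = 0

    def report(self, outcome: str) -> None:
        if outcome == "valid":
            self.valid_streak += 1
            if self.valid_streak >= RELIABLE_STREAK:
                self.reliable = True
        elif outcome == "error":
            # A returned error marks the host unreliable and restarts the count.
            self.reliable = False
            self.valid_streak = 0
        elif outcome == "aborted":
            # Assumption: an abort changes nothing on the host side.
            pass
        else:
            raise ValueError(f"unknown outcome: {outcome}")


@dataclass
class Workunit:
    """Copy bookkeeping for one workunit (hypothetical model)."""
    max_copies: int
    copies_sent: int = 1

    def needs_resend(self, outcome: str) -> bool:
        # Error and abort look the same from the workunit's point of view.
        if outcome in ("error", "aborted") and self.copies_sent < self.max_copies:
            self.copies_sent += 1
            return True
        return False


host = Host()
host.report("aborted")   # host stays reliable
host.report("error")     # host becomes unreliable
```

In this model the only difference between aborting and erroring is on the host side, which is the point made above: the workunit gets a resend either way until its copy limit is reached.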
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 873 Status: Offline Project Badges: |
Thanks, Crystal Pellet! That's much better than the post I was working on and have now abandoned :-)
"The WCG-team seems not to be able to fix it or ask the SCC-team to produce correct input files."
The parallel, older, "Atom syntax" thread has some analysis of the possible difficulties regarding stopping the bad WUs after delivery. The very specific title of that thread may have left some folks unaware that it's about the same issue :-)
Much of the first couple of pages is just reports of errors - it starts to get a bit more interesting after that...
Spoilers based on that thread: just saying "Suspend/abort all WUs in this range" (where range covers many thousands of WU numbers) won't work for SCC1, and there are probably many tens (or hundreds) of thousands of duff WUs out there...
Cheers - Al.
[Edit to repair the URL for the referenced thread -- oops!]
[Edit 2 times, last edit by alanb1951 at Jun 11, 2023 8:21:03 PM]
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 366 Status: Offline Project Badges: |
@TigerLily,
Can we get an update on the defective SCC batches? Any ETA for a fix? Thanks, AgrFan |
||
|
NixChix
Veteran Cruncher United States Joined: Apr 29, 2007 Post Count: 1187 Status: Offline Project Badges: |
I am aborting the tasks whenever I see them. This is the 4th week of experiencing this problem without even an acknowledgement from Krembil.
Cheers
||
|
NixChix
Veteran Cruncher United States Joined: Apr 29, 2007 Post Count: 1187 Status: Offline Project Badges: |
"And those who successfully abort them before running are causing everyone else's machines to be unreliable. Mike"
Not true at all, Mike. By not addressing this issue, Krembil is allowing machines to be marked unreliable. They are solely responsible. There is nothing that I or any other cruncher can do to change that.
Cheers
||
|
roundup
Veteran Cruncher Switzerland Joined: Jul 25, 2006 Post Count: 831 Status: Offline Project Badges: |
All WU on different machines error out. Here is the last page of bad units:
SCC1_0004165_MyoD1-C_0388_1
SCC1_0004165_MyoD1-C_0386_1
SCC1_0004174_MyoD1-C_3586_3
SCC1_0004165_MyoD1-C_6593_4
SCC1_0004165_MyoD1-C_7160_3
SCC1_0004174_MyoD1-C_2920_4
SCC1_0004165_MyoD1-C_6689_2
SCC1_0004165_MyoD1-C_5427_2
SCC1_0004165_MyoD1-C_5128_4
SCC1_0004174_MyoD1-C_2291_2
SCC1_0004174_MyoD1-C_0692_3
SCC1_0004174_MyoD1-C_3548_3
SCC1_0004174_MyoD1-C_4032_4
SCC1_0004174_MyoD1-C_3835_2
SCC1_0004165_MyoD1-C_5454_3
SCC1_0004174_MyoD1-C_2985_0
SCC1_0004174_MyoD1-C_0097_0
SCC1_0004165_MyoD1-C_1049_0
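For anyone wanting to tally which batches their bad units come from, here is a small sketch that splits task names like the ones listed above into fields and counts them per batch. The field names (`batch`, `target`, `index`, `copy`) are my reading of the name pattern and are assumptions, not documented WCG terminology; `parse_task` and `count_by_batch` are hypothetical helper names.

```python
import re
from collections import Counter

# Pattern inferred from names such as SCC1_0004165_MyoD1-C_0388_1.
# Field meanings are assumed, not taken from any WCG documentation.
TASK_RE = re.compile(
    r"^(?P<project>SCC1)_(?P<batch>\d{7})_(?P<target>[A-Za-z0-9-]+)"
    r"_(?P<index>\d{4})_(?P<copy>\d+)$"
)


def parse_task(name: str) -> dict:
    """Split one task name into its (assumed) fields."""
    match = TASK_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    return match.groupdict()


def count_by_batch(names) -> Counter:
    """Count how many of the given task names belong to each batch."""
    return Counter(parse_task(name)["batch"] for name in names)


bad_units = [
    "SCC1_0004165_MyoD1-C_0388_1",
    "SCC1_0004165_MyoD1-C_0386_1",
    "SCC1_0004174_MyoD1-C_3586_3",
]
for batch, count in sorted(count_by_batch(bad_units).items()):
    print(batch, count)
```

Pasting a full page of names into `bad_units` gives a quick per-batch count, which may make reports about a defective batch easier to compare.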
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2089 Status: Offline Project Badges: |
"By not addressing this issue, Krembil is allowing machines to be marked unreliable. They are solely responsible."
Not true, NixChix. The scientists behind SCC1 are responsible for uploading these faulty batches. After they were alerted to this matter, the WCG techs have been busy finding a way to deal with the faulty batches. You probably missed that, but it was mentioned in the ATOM syntax incorrect thread, something to do with setting Replication to 0. Who knows what more they are trying to do to stop the faulty batches, and it's not an easy task, as one can read in that thread.
Adri
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12146 Status: Offline Project Badges: |
I was just being provocative to try to shame Krembil/scientists into doing something for us long-suffering crunchers.
Mike
|
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2089 Status: Offline Project Badges: |
Since 2023-06-12T14:51:02 I haven't received any MyoD1-C (in short, type C) tasks.
Also, the one that I received at 2023-06-12T11:51:43, three hours earlier, was Server Aborted and its Replication value was set to 0: workunit 316926496 App: Smash Childhood Cancer
Adri
[Edit 2 times, last edit by adriverhoef at Jun 12, 2023 6:24:02 PM]
||
|
|