Thread Status: Active
Total posts in this thread: 35
This topic has been viewed 4829 times and has 34 replies
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Offline
Re: Defective Batch

And those who successfully abort them before running are causing everyone else's machines to be unreliable.

Is this an accusation, Mike?
In any case, it isn't a fair statement: anyone who doesn't abort a faulty task has to run it, which makes it fail (and earns the host the unreliable status), and in the normal case that also leads to more copies being sent out to be crunched by others. [*1]
So, what is the difference, Mike?

Adri

[*1] Up to the point where the maximum of copies is reached:

workunit 316493304
SCC1_0004174_MyoD1-C_6410_0 LinuxMint Error 2023-06-11T09:31:14 2023-06-11T14:16:26
SCC1_0004174_MyoD1-C_6410_1 Linux Ubuntu Error 2023-06-11T09:31:21 2023-06-11T09:32:19
SCC1_0004174_MyoD1-C_6410_2 Linux Ubuntu Error 2023-06-11T09:32:47 2023-06-11T09:36:47
SCC1_0004174_MyoD1-C_6410_3 Linux Debian Error 2023-06-11T09:37:22 2023-06-11T12:50:45
SCC1_0004174_MyoD1-C_6410_4 Fedora Linux Server Aborted 2023-06-11T12:51:14 2023-06-11T14:18:46

[Jun 11, 2023 5:43:10 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1316
Status: Offline
Re: Defective Batch

And those who successfully abort them before running are causing everyone else's machines to be unreliable.
Weird conclusion.

First of all: the errors are caused by a faulty space in the pdbqt files belonging to the SCC1 MyoD1-C workunits, and that is what makes a lot of machines unreliable. The WCG team seems not to be able to fix it or to ask the SCC team to produce correct input files.

Second: Aborting a task or returning an error task has the same effect on the workunit.
If the maximum replication has not been reached, in both cases the system sends another copy to a reliable machine.
One difference: Aborting a task does not make a host unreliable.

Reliable hosts are important for processing error tasks and tasks that come back too late or never return a result.
As far as I could discover, 10 valid results in a row for one application make your host reliable again for that application.
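
A minimal sketch of the rules described above, purely as an illustration. It models only what this post reports (an error makes a host unreliable, an abort does not, 10 valid results in a row restore reliability for that application). The class name, the outcome strings, and the choice that an abort leaves the valid streak untouched are assumptions, not the actual WCG/BOINC scheduler logic.

```python
from dataclasses import dataclass

# Figure reported in the post above; not an official or documented value.
VALID_STREAK_TO_RECOVER = 10


@dataclass
class HostAppState:
    """Hypothetical reliability state of one host for one application."""

    reliable: bool = True
    valid_streak: int = 0

    def record(self, outcome: str) -> None:
        """Update the state for one returned task: 'valid', 'error' or 'aborted'."""
        if outcome == "error":
            # A returned error marks the host unreliable and resets the streak.
            self.reliable = False
            self.valid_streak = 0
        elif outcome == "valid":
            self.valid_streak += 1
            if self.valid_streak >= VALID_STREAK_TO_RECOVER:
                self.reliable = True
        elif outcome == "aborted":
            # Per the post, an abort does not make the host unreliable.
            # Assumption: it also leaves the valid streak unchanged.
            pass
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")


if __name__ == "__main__":
    host = HostAppState()
    host.record("aborted")
    print("after abort:", host.reliable)
    host.record("error")
    print("after error:", host.reliable)
    for _ in range(VALID_STREAK_TO_RECOVER):
        host.record("valid")
    print("after 10 valid results:", host.reliable)
```

In this model the workunit sees the same thing either way (one more copy needed); only the host's own state differs between an abort and an error.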
----------------------------------------

[Jun 11, 2023 5:45:09 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline
Re: Defective Batch

Thanks, Crystal Pellet! That's much better than the post I was working on and have now abandoned :-)

The WCG team seems not to be able to fix it or to ask the SCC team to produce correct input files.
The parallel, older, "Atom syntax" thread has some analysis of the possible difficulties regarding stopping the bad WUs after delivery. The very specific title of that thread may have left some folks unaware that it's about the same issue :-)

Much of the first couple of pages is just reports of errors - it starts to get a bit more interesting after that...

Spoilers based on that thread: just saying "Suspend/abort all WUs in this range" (where the range covers many thousands of WU numbers) won't work for SCC1, and there are probably many tens (or hundreds) of thousands of duff WUs out there...

Cheers - Al.

[Edit to repair the URL for the referenced thread -- oops!]
----------------------------------------
[Edit 2 times, last edit by alanb1951 at Jun 11, 2023 8:21:03 PM]
[Jun 11, 2023 8:04:56 PM]
AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 366
Status: Offline
Re: Defective Batch

@TigerLily,

Can we get an update on the defective SCC batches?

Any ETA for a fix?

Thanks,
AgrFan
[Jun 11, 2023 8:08:08 PM]
NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Status: Offline
Re: Defective Batch

I am aborting the tasks whenever I see them. This is the 4th week of experiencing this problem without even an acknowledgement from Krembil.

Cheers
----------------------------------------

[Jun 11, 2023 8:42:17 PM]
NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Status: Offline
Re: Defective Batch

And those who successfully abort them before running are causing everyone else's machines to be unreliable.

Mike

Not true at all, Mike. By not addressing this issue, Krembil is allowing machines to be marked unreliable. They are solely responsible. There is nothing that I or any other cruncher can do to change that.

Cheers
----------------------------------------

[Jun 12, 2023 5:00:57 AM]
roundup
Veteran Cruncher
Switzerland
Joined: Jul 25, 2006
Post Count: 831
Status: Offline
Re: Defective Batch

All of these WUs error out, on different machines. Here is the last page of bad units:
SCC1_0004165_MyoD1-C_0388_1
SCC1_0004165_MyoD1-C_0386_1
SCC1_0004174_MyoD1-C_3586_3
SCC1_0004165_MyoD1-C_6593_4
SCC1_0004165_MyoD1-C_7160_3
SCC1_0004174_MyoD1-C_2920_4
SCC1_0004165_MyoD1-C_6689_2
SCC1_0004165_MyoD1-C_5427_2
SCC1_0004165_MyoD1-C_5128_4
SCC1_0004174_MyoD1-C_2291_2
SCC1_0004174_MyoD1-C_0692_3
SCC1_0004174_MyoD1-C_3548_3
SCC1_0004174_MyoD1-C_4032_4
SCC1_0004174_MyoD1-C_3835_2
SCC1_0004165_MyoD1-C_5454_3
SCC1_0004174_MyoD1-C_2985_0
SCC1_0004174_MyoD1-C_0097_0
SCC1_0004165_MyoD1-C_1049_0
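
For anyone who wants to sort a list like this by batch, here is a rough sketch. It assumes the names follow the pattern visible in this thread (project, batch number, target, workunit number, copy number, separated by underscores); that reading of the fields is an inference from the names above, not documented WCG naming, and the function and field names are made up for the example.

```python
import re
from collections import Counter
from typing import NamedTuple

# Assumed layout, inferred from names in this thread:
#   <project>_<batch>_<target>_<unit>_<copy>
TASK_NAME = re.compile(
    r"^(?P<project>[A-Z0-9]+)_(?P<batch>\d+)_(?P<target>.+)_(?P<unit>\d+)_(?P<copy>\d+)$"
)


class TaskName(NamedTuple):
    project: str
    batch: str
    target: str
    unit: str
    copy: int


def parse_task_name(name: str) -> TaskName:
    """Split a task name into its assumed fields, or raise ValueError."""
    match = TASK_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    return TaskName(
        match["project"],
        match["batch"],
        match["target"],
        match["unit"],
        int(match["copy"]),
    )


def count_by_batch(names: list[str]) -> Counter:
    """Count how many task names fall into each batch number."""
    return Counter(parse_task_name(n).batch for n in names)


if __name__ == "__main__":
    sample = [
        "SCC1_0004165_MyoD1-C_0388_1",
        "SCC1_0004174_MyoD1-C_3586_3",
        "SCC1_0004165_MyoD1-C_6593_4",
    ]
    for batch, count in sorted(count_by_batch(sample).items()):
        print(batch, count)
```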
[Jun 12, 2023 9:29:47 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Offline
Re: Defective Batch

By not addressing this issue, Krembil is allowing machines to be marked unreliable. They are solely responsible.

Not true, NixChix. The scientists behind SCC1 are responsible for uploading these faulty batches. After they were alerted to the matter, the WCG techs have been busy finding a way to deal with the faulty batches. You probably missed it, but this was mentioned in the "ATOM syntax incorrect" thread: something to do with setting Replication to 0. Who knows what else they are trying in order to stop the faulty batches; it's not an easy task, as one can read in that thread.

Adri
[Jun 12, 2023 11:17:34 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12146
Status: Offline
Re: Defective Batch

I was just being provocative to try to shame Krembil/scientists into doing something for us long-suffering crunchers. Mike
[Jun 12, 2023 12:28:08 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Offline
Re: Defective Batch

Since 2023-06-12T14:51:02 I haven't received any MyoD1-C (in short, type C) tasks.
Also, the one that I received at 2023-06-12T11:51:43, three hours earlier, was Server Aborted and its Replication value was set to 0:

workunit 316926496
App: Smash Childhood Cancer
Workunit: SCC1_0004174_MyoD1-C_7962
Created: 2023-06-05T11:18:54
Quorum: 2
Replication: 0

SCC1_0004174_MyoD1-C_7962_0 Arch Linux Error 2023-06-12T09:04:31 2023-06-12T11:15:57
SCC1_0004174_MyoD1-C_7962_1 Linux Ubuntu Error 2023-06-12T09:04:36 2023-06-12T11:40:02
SCC1_0004174_MyoD1-C_7962_2 Linux Ubuntu Error 2023-06-12T11:16:10 2023-06-12T17:34:16
SCC1_0004174_MyoD1-C_7962_3 Linux Ubuntu User Aborted 2023-06-12T11:47:39 2023-06-12T11:51:34
SCC1_0004174_MyoD1-C_7962_4 Fedora Linux Server Aborted 2023-06-12T11:51:43 2023-06-12T17:36:37
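
A rough sketch for pulling the fields out of result rows like the ones above. The rows are whitespace-separated and the OS name can contain spaces, so this splits off the task name and the two timestamps first and then matches the status against the three values that appear in this thread. Treating the two timestamps as sent and returned times is an assumption (the listing carries no column headers), and the function and field names are invented for the example.

```python
from datetime import datetime
from typing import NamedTuple

# Statuses seen in this thread; multi-word ones are checked before "Error".
KNOWN_STATUSES = ("Server Aborted", "User Aborted", "Error")


class ResultRow(NamedTuple):
    task: str
    os_name: str
    status: str
    sent: datetime      # assumption: first timestamp is the sent time
    returned: datetime  # assumption: second timestamp is the return time


def parse_row(line: str) -> ResultRow:
    """Parse one row: task name, OS, status, then two ISO timestamps."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"too few fields: {line!r}")
    task = fields[0]
    middle = " ".join(fields[1:-2])  # OS name followed by status
    for status in KNOWN_STATUSES:
        if middle.endswith(status):
            os_name = middle[: -len(status)].strip()
            break
    else:
        raise ValueError(f"unknown status in: {line!r}")
    return ResultRow(
        task,
        os_name,
        status,
        datetime.fromisoformat(fields[-2]),
        datetime.fromisoformat(fields[-1]),
    )


if __name__ == "__main__":
    row = parse_row(
        "SCC1_0004174_MyoD1-C_7962_4 Fedora Linux Server Aborted "
        "2023-06-12T11:51:43 2023-06-12T17:36:37"
    )
    print(row.os_name, "|", row.status, "|", row.returned - row.sent)
```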

Adri
----------------------------------------
[Edit 2 times, last edit by adriverhoef at Jun 12, 2023 6:24:02 PM]
[Jun 12, 2023 6:16:54 PM]