World Community Grid Forums
Thread Status: Active | Total posts in this thread: 101
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1296 | Status: Offline
I got a few errors from "C" tasks, and am no longer getting any SCC work. My client keeps asking, but none is being sent.
Am I not getting SCC because of my errors, or are other people not seeing SCC without having had errors? Is it me, or the system?
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
Loads of 4174s and one 4165 here have errored multiple times.
Mike
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
I notice that when I get a batch of ATOM 62 errors, they error in quick succession, so a number of them get uploaded together. However, my cache is only replenished one task at a time, and spasmodically at that. In between I get the dreaded "Tasks are committed to other platforms" message.
Mike
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2346 | Status: Offline
Unixchick, the number of new SCC1 tasks being distributed is dropping fast, because the server (a) needs to find reliable clients - the number of tasks needing verification keeps growing as more and more clients get marked unreliable after tasks from the faulty batch (still with Replication > 0) error out on them immediately - and (b) needs to abort (Server Abort) tasks from that faulty batch that would come back 'Too Late' anyway.
So SCC1 tasks are still being distributed, but the system has difficulty finding reliable clients. This is the same situation as reported in post 686894. The good news is that the server is holding up. Still, in this situation I think it is a good idea to abort (User Abort) the faulty tasks that you receive: you will lose your reliability status if you execute a faulty task, and as long as you have a reliable client your tasks don't need verification, which gives the server more breathing room and a better chance of sending some tasks to you.
Adri
NixChix
Veteran Cruncher | United States | Joined: Apr 29, 2007 | Post Count: 1187 | Status: Offline
I don't understand why this problem is not being addressed by WCG staff.
----------------------------------------
Cheers
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
Adri,
From your post 687153 from a few hours back:

    Nice to see that somebody else (see task _3 below) also (probably automatically(*1) (see post 686915)) aborts incoming tasks from the 'new' faulty batch 0004174

In this case that would've been me :-) I've written a Python script that scans client_state.xml for tasks that could be from invalid work-units, finds the specific flex file, checks it for the fault and invokes boinccmd to abort the task if appropriate. Here's a sample from its log on one of my machines (times are BST [UTC+1] [*1]):

2023-06-08 19:17:11 - SCC1_0004176_MyoD1-C_50786_0: aborted.

If/when it sees a MyoD1-C task that doesn't have the bad flex file, the script will report "valid file!" and leave the task to run :-)

Your logic for aborting parallels mine, and the effect is obvious... The machine from which that log snippet is taken typically returns about 100 valid SCC1 tasks a day; since I introduced the script I've not had any Errors (as expected), so I still manage to keep my [small] cache topped up despite still seeing "Tasks are committed to other platforms" fairly regularly (for reasons stated frequently in this and other threads...). My other systems that run SCC1 are also getting consistent supplies of work (but they don't handle as many SCC1 tasks a day).

Cheers - Al.

[*1] The script is based on the daemon scripts I've written for various other aspects of watching WCG work flow; they all use Python's logger module for the output and I've never bothered to work out how to get it to use UTC instead of local time (if it even can...)
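Al's script itself isn't posted in the thread, but a minimal sketch of the approach he describes could look roughly like the following. The BOINC data directory, project URL, task-name pattern and especially the looks_faulty() check are placeholder assumptions for illustration, not his actual code:

#!/usr/bin/env python3
# Illustrative sketch only (not Al's script): scan client_state.xml for
# SCC1 MyoD1-C tasks, inspect their input ("flex") files and abort suspect
# tasks via boinccmd.  Paths, URL, name pattern and fault test are assumed.
import re
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

BOINC_DIR = Path("/var/lib/boinc-client")            # assumed BOINC data directory
PROJECT_URL = "https://www.worldcommunitygrid.org/"  # assumed WCG master URL
PROJECT_DIR = BOINC_DIR / "projects" / "www.worldcommunitygrid.org"
TASK_PATTERN = re.compile(r"^SCC1_\d+_MyoD1-C_")     # tasks from the suspect batches


def looks_faulty(path: Path) -> bool:
    """Placeholder for the real check on the flex file; here a missing or
    empty file is treated as faulty."""
    return (not path.is_file()) or path.stat().st_size == 0


def input_files(root: ET.Element, wu_name: str) -> list:
    """Return the input file names referenced by the named workunit."""
    for wu in root.iter("workunit"):
        if wu.findtext("name") == wu_name:
            return [fr.findtext("file_name", "") for fr in wu.iter("file_ref")]
    return []


def main() -> None:
    root = ET.parse(BOINC_DIR / "client_state.xml").getroot()
    for result in root.iter("result"):
        task = result.findtext("name", "")
        if not TASK_PATTERN.match(task):
            continue
        flex = [f for f in input_files(root, result.findtext("wu_name", ""))
                if "flex" in f]
        if any(looks_faulty(PROJECT_DIR / f) for f in flex):
            print(f"{task}: aborting")
            subprocess.run(["boinccmd", "--task", PROJECT_URL, task, "abort"],
                           check=False)
        else:
            print(f"{task}: valid file!")


if __name__ == "__main__":
    main()

A real version would also have to deal with GUI RPC authentication for boinccmd and with tasks that have already started running, and would log its decisions (as Al's log excerpt shows) rather than just print them.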
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1296 | Status: Offline
Thanks for the replies. I had a short queue and got a couple of error WUs in a row that ran before I could abort them. I'm guessing that I'm now deemed unreliable for SCC. I've added MCM to my mix for the moment.
I too am surprised by the lack of attention to this problem.
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
And 4099. But they will be off for the weekend now!
Mike
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 2173 | Status: Offline
    And 4099. But they will be off for the weekend now!
    Mike

Well, yes, WCG Towers always does this right before the weekend. Nothing new here, besides that communication over the last week has been even more abysmal than before...

But I can't confirm that SCC1 batch 4099 is bad per se. I just checked several hosts that have some of those tasks and all of them are at least starting and running fine, though I didn't see any that had already finished. So if there is a problem with that particular batch, then it is different from the subject of this thread, for which I have seen WUs from batches 4165, 4174, 4175 and 4176, and which error out right when they are started.

And I do not agree with Adri that they can't do anything about this; the question is rather whether they KNOW how and where to cancel such jobs and, more importantly, whether they can be actually proactive and prevent the root cause of those faulty batches from being created in the first place. But that's something that only WCG Towers could answer (if they are truthful and don't spread more platitudes), and right now they once again ain't talking...

Ralf
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2346 | Status: Offline
Al, thanks for your response.
You wrote:

    I've written a Python script that scans client_state.xml for tasks that could be from invalid work-units, finds the specific flex file, checks it for the fault and invokes boinccmd to abort the task if appropriate.

Great! "And I wonder, still I wonder, who'll stop the ..."(*1) And I wonder if people are getting inquisitive and interested in your script. Still I wonder: how does that script handle the situation where a task is received that needs to be executed right away because its deadline is only 3 days instead of 6? (Occasionally I get a task that has a deadline of 3 days, so it gets a high priority to run, and this will always lead to that task being in Running state - unless I have enough (MCM1/SCC1) tasks with a 3-day deadline in the queue, which is probably never.)

    If/when it sees a MyoD1-C task that doesn't have the bad flex file, the script will report "valid file!" and leave the task to run :-)

So the task stays in the queue, unharmed. Good. The conceivable situation hasn't happened yet, I guess, but - I'm thinking along with you - what will happen when that script sees the same task again? Will it report "valid file!" again?

    they all use Python's logger module for the output and I've never bothered to work out how to get it to use UTC instead of local time

So it isn't as simple as searching for 'python date utc' on the internet and then finding this:

    >>> from datetime import datetime, timezone
    >>> datetime.now(timezone.utc)

Nevertheless, I think you should keep local time and just be aware of it. Logging, a nice feature of Python.

(*1) faulty tasks/workunits/batches

Adri

PS I don't have a weekend puzzle ready at this time.
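For readers wondering what the logging-side answer could look like: Python's logging Formatter formats timestamps with time.localtime by default, and pointing it at time.gmtime switches %(asctime)s to UTC. A small illustrative sketch (not Al's actual setup; the format string and log message are invented):

import logging
import time

# Make logging.Formatter render %(asctime)s in UTC instead of local time.
# Assigning the class attribute affects every Formatter in the process;
# assign to a single Formatter instance instead to keep the change local.
logging.Formatter.converter = time.gmtime

logging.basicConfig(
    format="%(asctime)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

logging.info("SCC1_0004176_MyoD1-C_50786_0: aborted.")  # timestamp now printed in UTC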