World Community Grid Forums

Thread Status: Active. Total posts in this thread: 101
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline
Adri.
[Drifting towards the margins of "off topic" here, perhaps, but...] I'll answer a couple of your questions by saying that the script keeps a catalogue of workunit names it has seen, assessed and either aborted or passed for execution. At the start of each pass through client_state it marks them all as "not seen this time"; if it sees one again it marks it as "seen" but doesn't do anything else! At the end of the pass, any names not seen on that occasion get removed from the catalogue.[*1]

If the script gets shut down, it dumps the current state of that catalogue, which it re-reads the next time it starts up; again, that should prevent repeated abort attempts in the unlikely event that reporting the aborted task has taken a long time!

The script sleeps for 5 minutes between passes, so there's a fair chance that aborted units might've vanished already. As for the "urgent" tasks, they're unlikely to get priority over existing tasks on my systems, as I only allow very small numbers (<10) of tasks for SCC1 (and MCM1, as it happens) at a time...

As for hacking on the logging module(s) to get UTC time, I probably could if I had the time to spare, but...

Cheers - Al.

P.S. [Definitely off topic :-)] I haven't even looked at puzzle creation again yet -- too much else going on at the moment :-)

[*1] All the techniques used for this script had already been employed for daemons I use to collect information on receptors and ligands for OPN1/G and SCC1, control parameters for MCM1, and task completion information for all WCG projects. The cataloguing technique described above is essential for the pre-run data collection scripts, as there are often a lot of files to check out and the checking should only be done once per task! The code of the daemons may not be optimal, but it has a proven track record :-)
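For readers curious how such a mark-and-sweep catalogue might work, here is a minimal Python sketch of the scheme Al describes (mark all as unseen, re-mark on sight, sweep the stragglers, persist on shutdown). The class and method names are illustrative only, not from his actual script:

```python
import json

class Catalogue:
    """Mark-and-sweep catalogue of workunit names, as described above."""

    def __init__(self):
        self.entries = {}  # workunit name -> "seen this pass" flag

    def begin_pass(self):
        # Start of a pass through client_state: mark everything
        # as "not seen this time".
        for name in self.entries:
            self.entries[name] = False

    def see(self, name):
        # Returns True only the first time a name is ever seen, so the
        # assess/abort decision happens exactly once per workunit.
        is_new = name not in self.entries
        self.entries[name] = True
        return is_new

    def end_pass(self):
        # Sweep: drop names not seen on this pass (task has left
        # client_state, e.g. an aborted unit that was reported).
        self.entries = {n: f for n, f in self.entries.items() if f}

    def dump(self, path):
        # On shutdown, persist the catalogue so a restart doesn't
        # trigger repeated abort attempts for the same workunits.
        with open(path, "w") as fh:
            json.dump(sorted(self.entries), fh)

    def load(self, path):
        with open(path) as fh:
            self.entries = {n: True for n in json.load(fh)}
```

The pay-off is the `see()` return value: the caller only assesses (and possibly aborts) a workunit when it returns True, so a task that lingers in client_state across several 5-minute passes is left alone.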
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
Occasionally I get a task with a 3-day deadline, so it gets high priority and that task will always end up in the Running state -- unless I have enough (MCM1/SCC1) tasks with a 3-day deadline in the queue, which is probably never. When you get tasks with an earlier deadline, you have a better chance that tasks will run FIFO when your minimum buffer is set to 0 (zero) and your additional buffer to the maximum work buffer you want, e.g. 2 days. Completed work is reported at least 1 hour after a job has finished; the client will report it and request new work when the buffer drops below the additional amount.
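For anyone wanting to pin those buffer values without touching the web preferences, a sketch of what the 0 / 2-day setup above looks like in BOINC's global_prefs_override.xml (the tag names are BOINC's standard preference tags; the values simply mirror Crystal Pellet's example):

```xml
<!-- global_prefs_override.xml in the BOINC data directory;
     illustrative values matching the 0 / 2-day setup above -->
<global_preferences>
   <work_buf_min_days>0</work_buf_min_days>
   <work_buf_additional_days>2</work_buf_additional_days>
</global_preferences>
```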
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline
In a last-ditch attempt to stop executing any tasks from faulty batches that error out straight away (making the client look unreliable), I've tried setting <max_concurrent> for scc1 to -1 in the file app_config.xml. That worked! Now no SCC1 tasks execute at all, so any tasks from faulty batches can be aborted (User Aborted) before they start immediately upon receipt.
Reason: as soon as you're reliable again, you'll have a better chance of receiving tasks from SCC1.

In the meantime, has anyone noticed that there aren't any new tasks from faulty batch 0004176 around anymore? The last ones I received were SCC1_0004176_MyoD1-C_56409_0 and SCC1_0004176_MyoD1-C_56530_0, received at 2023-06-09T14:00:39.

Something else I noticed: when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline! See below.

<1> * SCC1_0004165_MyoD1-C_4795_0 Fedora Linux User Aborted 2023-06-10T09:35:42 2023-06-10T09:37:50

Adri
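For reference, a minimal sketch of the app_config.xml workaround Adri describes. The scc1 short name and the -1 value are taken from his post; using a negative value this way is a reported trick, not a documented BOINC setting:

```xml
<!-- app_config.xml in the World Community Grid project directory;
     sketch of the workaround described above, not a documented setting -->
<app_config>
   <app>
      <name>scc1</name>
      <max_concurrent>-1</max_concurrent>
   </app>
</app_config>
```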
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
Something that I also noticed was that when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline!

Maybe that is the case when we are still early in a batch. In batch 4176 I noticed that my aborted tasks did not get any resends, but by then we had already progressed into the second half of that batch. [Edit 1 times, last edit by Crystal Pellet at Jun 10, 2023 12:45:34 PM]
Spiderman
Advanced Cruncher United States Joined: Jul 13, 2020 Post Count: 138 Status: Offline
I've not seen any additional SCC1_0004176 tasks since about 24 hours ago.

Unfortunately, four bad SCC1_0004174 tasks floated in overnight and immediately errored. One was on a brand-new machine I had just brought online -- I'm hoping that box doesn't get put on the "bad list" that others noted earlier.
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline
Something that I also noticed was that when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline! See below. <1> * SCC1_0004165_MyoD1-C_4795_0 Fedora Linux User Aborted 2023-06-10T09:35:42 2023-06-10T09:37:50 Adri

Not quite -- judging by the sent time on wingman 1, I'd say it had decided to send two initial tasks out because you weren't eligible for adaptive replication... Only wingman 2 seems to be a response to your User Abort, and I can find lots of evidence of genuine retries getting 3-day deadlines even when the initial failure/abort is almost instant...

To verify the above, I sifted through my recent aborted SCC1 tasks. I actually struggled to find any within the last day or so where I was wingman 0 with Adaptive Replication -- I was getting a lot of retries, so "first, solo" was quite rare :-) I followed up on all of the ones I could easily find, and noted that one or two had the replication set to zero, as was noted upstream in this thread (so no retries!) -- that tallies with what Crystal Pellet has just commented on for batch 4176 and explains Spiderman's observation... Looking at the rest, I saw the same 3-day deadline pattern for all of them!

If I have time (ha, ha!) I might try to look into all tasks, not just ones where I was wingman 0 and an AR candidate, but I suspect I'd find the same behaviour there too -- a random check on a handful of items tends to confirm that.

I'm getting to the stage where I wish they'd just turn SCC1 off until the scientists and WCG folks sort this out properly :-(

Cheers - Al.

P.S. Given your trick with max_concurrent, I have to note that my busiest system got hit by the relative lack of work around 07:00 to 10:00 UTC today and hit the "arrived and started too fast to catch" issue that we discussed earlier (the first time it has run out of SCC1 in a while!) -- however, it only took about 4 or 5 hours to get back to reliable status, so I can live with that for now :-) [Edited to reference Spiderman's comment.] [Edit 1 times, last edit by alanb1951 at Jun 10, 2023 8:30:49 PM]
sptrog1
Master Cruncher Joined: Dec 12, 2017 Post Count: 1592 Status: Offline
I just logged an error on a 4174 task with 5 entries in the results (4 errors and 1 in progress, replication 2). That in-progress guy is going to be disappointed.
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline
Something that I also noticed was that when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline! See below. <1> * SCC1_0004165_MyoD1-C_4795_0 Fedora Linux User Aborted 2023-06-10T09:35:42 2023-06-10T09:37:50 Adri

Not quite -- judging by the sent time on wingman 1 I'd say that it had decided to send two initial tasks out

Yikes! I haven't been paying attention in Mr. Alanb1951's class today. It was indeed a weird observation by me, and this explains why I was wrong. Sorry!
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 396 Status: Recently Active
@TigerLily,

Can we get an update on the defective SCC batches? Any ETA for a fix?

Thanks, AgrFan
NixChix
Veteran Cruncher United States Joined: Apr 29, 2007 Post Count: 1187 Status: Offline
@TigerLily, Can we get an update on the defective SCC batches? Any ETA for a fix? Thanks, AgrFan

+1 - an acknowledgement of the problem would be great too. Cheers [Edit 1 times, last edit by NixChix at Jun 11, 2023 8:27:32 PM]