Thread Status: Active
Total posts in this thread: 101
This topic has been viewed 20329 times and has 100 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

Adri.

[Drifting towards the margins of "off topic" here, perhaps, but...]

I'll answer a couple of your questions by saying that the script keeps a catalogue of workunit names it has seen, assessed and either aborted or passed for execution. At the start of each pass through client_state it marks them as "not seen this time", and if it sees them again it marks them as "seen" but doesn't do anything else! At the end of the pass, any names not seen on that occasion get removed from the catalogue.[*1]

If the script gets shut down, it dumps the current state of that catalogue, which it will re-read the next time it starts up; again, that should stop repeated efforts to abort in the unlikely event that it has taken a long time to report the aborted task!
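The cataloguing idea described above can be sketched roughly as follows. This is only an illustration of the technique, not the actual script: the `Catalogue` class, its method names, the JSON dump format and the `assess` callback are all made up for the example.

```python
import json
from pathlib import Path


class Catalogue:
    """Remembers workunit names that have already been assessed (hypothetical helper)."""

    def __init__(self):
        # name -> {"decision": "aborted" or "passed", "seen": bool}
        self.entries = {}

    def start_pass(self):
        # At the start of a pass, mark everything as "not seen this time".
        for entry in self.entries.values():
            entry["seen"] = False

    def observe(self, name, assess):
        """Return the decision for `name`; `assess` is only called on first sight."""
        entry = self.entries.get(name)
        if entry is None:
            entry = {"decision": assess(name), "seen": True}
            self.entries[name] = entry
        else:
            # Already assessed: mark as seen and do nothing else.
            entry["seen"] = True
        return entry["decision"]

    def end_pass(self):
        # Names not seen during this pass are removed from the catalogue.
        self.entries = {n: e for n, e in self.entries.items() if e["seen"]}

    def dump(self, path):
        # Called on shutdown so the next start-up can pick up where it left off.
        Path(path).write_text(json.dumps(self.entries))

    @classmethod
    def load(cls, path):
        catalogue = cls()
        source = Path(path)
        if source.exists():
            catalogue.entries = json.loads(source.read_text())
        return catalogue
```

A pass would then be `start_pass()`, one `observe()` call per workunit name found in client_state, and `end_pass()`, with a sleep between passes.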

The script sleeps for 5 minutes between passes, so there's a fair chance that aborted units might've vanished already; as for the "urgent" tasks, they're unlikely to get priority over existing tasks on my systems as I only allow very small (<10) numbers of tasks for SCC1 (and MCM1, as it happens) at a time...

As for hacking on the logging module(s) to get UTC time, I probably could if I had the time to spare, but...
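For what it's worth, if the script is Python and uses the standard `logging` module (an assumption, the post doesn't say), UTC timestamps need no hacking on the module itself: a formatter's `converter` can be pointed at `time.gmtime`. A minimal sketch, with the function names and the format string chosen purely for illustration:

```python
import logging
import time


def make_utc_formatter():
    """Build a formatter whose %(asctime)s is rendered in UTC instead of local time."""
    formatter = logging.Formatter(
        fmt="%(asctime)sZ %(levelname)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    # logging calls formatter.converter(record.created) to get a struct_time;
    # the default is time.localtime, so swapping in time.gmtime gives UTC.
    formatter.converter = time.gmtime
    return formatter


def make_utc_logger(name="abort_script"):
    """Return a logger (hypothetical name) that writes UTC-stamped lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(make_utc_formatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Calling `make_utc_logger().info("pass finished")` would then log with a timestamp in the same `2023-06-10T09:35:42` style used elsewhere in this thread.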

Cheers - Al.

P.S. [Definitely off topic :-)] I haven't even looked at puzzle creation again yet -- too much else going on at the moment :-)

[*1] All the techniques used for this script had already been employed for daemons I use to collect information on receptors and ligands for OPN1/G and SCC1, control parameters for MCM1 and task completion information for all WCG projects. The cataloguing technique described above is essential for the pre-run data collection scripts, as there are often a lot of files to check and that should only be done once per task! The code of the daemons may not be optimal but it has a proven track record :-)
[Jun 9, 2023 10:52:43 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

Occasionally I get a task that has a deadline of 3 days, so it gets high priority to run, and this always puts that task straight into the Running state, unless I have enough (MCM1/SCC1) tasks with a 3-day deadline in the queue, which is probably never. :-(
When you get tasks with an earlier deadline, you have a better chance that tasks will run FIFO when your buffer is set to 0 (zero) and your additional buffer to the maximum work buffer you want, e.g. 2.
Work is reported at least 1 hour after a job has finished; the client then reports it and requests new work when your buffer is below the additional buffer.
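If you would rather set those two buffer values in a file than through the Manager, the sketch below generates the contents of a global_prefs_override.xml. It assumes the usual BOINC tag names `work_buf_min_days` and `work_buf_additional_days`, and that the "2" above means 2 days; the function name is made up for the example.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(min_days=0.0, additional_days=2.0):
    """Return XML text with a zero minimum buffer and the chosen additional buffer."""
    root = ET.Element("global_preferences")
    # "Store at least N days of work": the buffer, set to 0 here.
    ET.SubElement(root, "work_buf_min_days").text = str(min_days)
    # "Store up to an additional N days of work": the additional buffer.
    ET.SubElement(root, "work_buf_additional_days").text = str(additional_days)
    return ET.tostring(root, encoding="unicode")


print(build_prefs_override())
```

The client would have to be told to re-read its local preferences before values written this way take effect.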
[Jun 10, 2023 10:24:07 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

In a last-ditch attempt to stop my client from executing tasks from faulty batches that error out straight away (which makes the client unreliable), I tried setting <max_concurrent> for scc1 to -1 in app_config.xml. That worked! SCC1 tasks no longer execute at all, so tasks from faulty batches can be aborted (User Aborted) instead of starting immediately upon receipt.
Reason: as soon as you're reliable, you will have a better chance of receiving tasks from SCC1.
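For reference, a small sketch that generates an app_config.xml of that shape. The `<app_config>`/`<app>`/`<name>`/`<max_concurrent>` layout is the standard BOINC one; that -1 stops the app from running is only what is reported in this post, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name="scc1", max_concurrent=-1):
    """Return app_config.xml text limiting how many tasks of one app may run at once."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    # -1 is the value tried in the post above; a normal limit would be a positive integer.
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


print(build_app_config())
```

The resulting text belongs in the project's directory as app_config.xml, and the client needs to re-read its config files afterwards.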

In the meantime, has anyone noticed that there aren't any new tasks from faulty batch 0004176 around anymore? The last ones I received were SCC1_0004176_MyoD1-C_56409_0 and SCC1_0004176_MyoD1-C_56530_0, received at 2023-06-09T14:00:39.
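Checking which batch a task belongs to is easy to automate. The sketch below assumes task names follow a PROJECT_BATCH_TARGET_INDEX_REPLICA layout, which is only my reading of names like the two above; the field names, the regular expression and the `FAULTY_BATCHES` set are illustrative.

```python
import re

# Assumed layout: PROJECT_BATCH_TARGET_INDEX_REPLICA (field names are guesses).
TASK_NAME = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)_(?P<batch>\d+)_(?P<target>.+)"
    r"_(?P<index>\d+)_(?P<replica>\d+)$"
)

# Batch called faulty in this post; extend with others as they are reported.
FAULTY_BATCHES = {"0004176"}


def parse_task_name(name):
    """Split a task name into its assumed fields, or return None if it doesn't fit."""
    match = TASK_NAME.match(name)
    return match.groupdict() if match else None


def is_from_faulty_batch(name, faulty=FAULTY_BATCHES):
    """True when the name parses and its batch number is in the faulty set."""
    parts = parse_task_name(name)
    return parts is not None and parts["batch"] in faulty


print(parse_task_name("SCC1_0004176_MyoD1-C_56409_0"))
```

Batch numbers are compared as strings, so the leading zeros have to be kept.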

Something that I also noticed was that when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline! See below.

<1> * SCC1_0004165_MyoD1-C_4795_0  Fedora Linux  User Aborted  2023-06-10T09:35:42  2023-06-10T09:37:50
<1>   SCC1_0004165_MyoD1-C_4795_1  Linuxmint     In Progress   2023-06-10T09:35:48  2023-06-16T09:35:48
<1>   SCC1_0004165_MyoD1-C_4795_2  Linux Ubuntu  In Progress   2023-06-10T09:38:15  2023-06-13T09:38:15

Adri
[Jun 10, 2023 10:42:23 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

[quote]Something that I also noticed was that when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline![/quote]

Maybe this is the case when we are still early into a batch.
In batch 4176 I noticed that my aborted tasks did not get any resends, but by then we had already progressed into the second half of that batch.
----------------------------------------
[Edit 1 times, last edit by Crystal Pellet at Jun 10, 2023 12:45:34 PM]
[Jun 10, 2023 12:45:07 PM]
Spiderman
Advanced Cruncher
United States
Joined: Jul 13, 2020
Post Count: 138
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

I've not seen any additional SCC1_0004176 tasks for about 24 hours.

Unfortunately, four bad SCC1_0004174 tasks floated in overnight and immediately errored. One landed on a brand-new machine I just brought online -- hoping that box doesn't get put on the "bad list" that others previously noted.
[Jun 10, 2023 12:54:28 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

[quote]Something that I also noticed was that when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline! See below.

<1> * SCC1_0004165_MyoD1-C_4795_0  Fedora Linux  User Aborted  2023-06-10T09:35:42  2023-06-10T09:37:50
<1>   SCC1_0004165_MyoD1-C_4795_1  Linuxmint     In Progress   2023-06-10T09:35:48  2023-06-16T09:35:48
<1>   SCC1_0004165_MyoD1-C_4795_2  Linux Ubuntu  In Progress   2023-06-10T09:38:15  2023-06-13T09:38:15

Adri[/quote]

Not quite -- judging by the sent time on wingman 1 I'd say that it had decided to send two initial tasks out because you weren't eligible for adaptive replication... Only wingman 2 seems to be a response to your User Abort, and I can find lots of evidence for genuine retries getting 3 day deadlines even if the initial failure/abort is almost instant...
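The timing argument can be checked with a few lines of arithmetic on the timestamps in the quoted listing. This sketch assumes the last column is the return time for the aborted _0 task and the deadline for the two In Progress tasks; the variable and function names are only for illustration.

```python
from datetime import datetime


def ts(text):
    """Parse a timestamp in the format used in the task listing."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


first_sent = ts("2023-06-10T09:35:42")      # _0 sent
abort_reported = ts("2023-06-10T09:37:50")  # _0 last column (assumed return time)

# wingman -> (sent, due), taken from the _1 and _2 rows of the listing
wingmen = {
    "_1": (ts("2023-06-10T09:35:48"), ts("2023-06-16T09:35:48")),
    "_2": (ts("2023-06-10T09:38:15"), ts("2023-06-13T09:38:15")),
}


def summarise(sent, due):
    """Deadline length and where the send falls relative to the abort."""
    return {
        "deadline_days": (due - sent).days,
        "sent_after_abort": sent > abort_reported,
        "seconds_after_first_send": int((sent - first_sent).total_seconds()),
    }


summary = {name: summarise(sent, due) for name, (sent, due) in wingmen.items()}
for name, info in summary.items():
    print(name, info)
```

Wingman _1 went out 6 seconds after _0 and before the abort was reported, so it cannot be a retry; only _2 was sent after the abort, and it is the one with the 3-day deadline.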

To verify the above statement, I sifted through my recent aborted SCC1 tasks. I actually struggled to find any within the last day or so where I was wingman 0 with Adaptive Replication -- I was getting a lot of retries so "first, solo" was quite rare :-)

I followed up on all of the ones I could easily find, and noted that one or two had the replication set to zero as was noted upstream in this thread (so no retries!) -- that tallies with what Crystal Pellet has just commented on for batch 4176 and explains Spiderman's observation...

Looking at the rest, I saw the same 3-day deadline pattern for all of them! If I have time (ha, ha!) I might try to look into all tasks, not just ones where I was wingman 0 and an AR candidate, but I suspect I'd find the same behaviour there too -- a random check on a handful of items tends to confirm that.

I'm getting to the stage where I wish they'd just turn SCC1 off until the scientists and WCG folks sort this out properly :-(

Cheers - Al.

P.S. Given your trick with max_concurrent, I have to note that my busiest system got hit by the relative lack of work around 07:00 to 10:00 UTC today and hit the "arrived and started too fast to catch" issue that we discussed earlier (first time it has run out of SCC1 in a while!) -- however, it only seemed to take it about 4 or 5 hours to get back to reliable status, so I can live with that for now :-)

[Edited to reference Spiderman's comment.]
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jun 10, 2023 8:30:49 PM]
[Jun 10, 2023 8:27:01 PM]
sptrog1
Master Cruncher
Joined: Dec 12, 2017
Post Count: 1592
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

I just logged an error on a 4174 task with 5 entries (4 errors and 1 in progress, replication 2) in results. That in-progress guy is going to be disappointed.
[Jun 10, 2023 10:35:33 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

[quote][quote]Something that I also noticed was that when you abort a faulty _0 task, two tasks are generated, one with a 6-day deadline and another with a 3-day deadline! See below.

<1> * SCC1_0004165_MyoD1-C_4795_0  Fedora Linux  User Aborted  2023-06-10T09:35:42  2023-06-10T09:37:50
<1>   SCC1_0004165_MyoD1-C_4795_1  Linuxmint     In Progress   2023-06-10T09:35:48  2023-06-16T09:35:48
<1>   SCC1_0004165_MyoD1-C_4795_2  Linux Ubuntu  In Progress   2023-06-10T09:38:15  2023-06-13T09:38:15

Adri[/quote]

Not quite -- judging by the sent time on wingman 1 I'd say that it had decided to send two initial tasks out[/quote]

Yikes! I haven't been paying attention in Mr. Alanb1951's class today. It was indeed a weird observation by me and this explains why I was wrong.

Sorry!
[Jun 10, 2023 11:40:51 PM]
AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 396
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

@TigerLily,

Can we get an update on the defective SCC batches?

Any ETA for a fix?

Thanks,
AgrFan
----------------------------------------

  • i5-10400 (Comet Lake, 6C/12T) @ 2.9 GHz
  • i5-7400 (Kaby Lake, 4C/4T) @ 3.0 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3330 (Ivy Bridge, 4C/4T) @ 3.0 GHz

[Jun 11, 2023 8:09:03 PM]
NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Re: [Error] ATOM syntax incorrect: "62 " is not a valid atom number

[quote]@TigerLily,

Can we get an update on the defective SCC batches?

Any ETA for a fix?

Thanks,
AgrFan[/quote]

+1 - an acknowledgement of the problem would be great too.

Cheers
----------------------------------------
[Edit 1 times, last edit by NixChix at Jun 11, 2023 8:27:32 PM]
[Jun 11, 2023 8:26:29 PM]