| World Community Grid Forums
Thread Status: Active Total posts in this thread: 101
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
Why the techs and scientists have not purged the 4176 batch from the system is a mystery. You would think, by now, that someone would have noticed that this batch is defective, probably in its entirety. Any little blurb of news acknowledging the problem would certainly be appreciated.
Cheers
Sgt. Joe
*Minnesota Crunchers*
||
|
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 396 Status: Offline Project Badges:
|
Sgt.Joe wrote: "Why the techs and scientists have not purged the 4176 batch from the system is a mystery. You would think, by now, that someone would have noticed that this batch is defective, probably in its entirety. Any little blurb of news acknowledging the problem would certainly be appreciated. Cheers"

They're busy manually adding new devices to My Contribution pages.
[Edit 1 times, last edit by AgrFan at Jun 3, 2023 2:03:35 PM] |
||
|
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 2173 Status: Offline Project Badges:
|
Sgt.Joe wrote: "Why the techs and scientists have not purged the 4176 batch from the system is a mystery. You would think, by now, that someone would have noticed that this batch is defective, probably in its entirety. Any little blurb of news acknowledging the problem would certainly be appreciated. Cheers"

AgrFan wrote: "They're busy manually adding new devices to My Contribution pages."

Ralf
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Sgt.Joe wrote: "Why the techs and scientists have not purged the 4176 batch from the system is a mystery. You would think, by now, that someone would have noticed that this batch is defective, probably in its entirety. Any little blurb of news acknowledging the problem would certainly be appreciated. Cheers"

May I correct you, Sgt.Joe? They have noticed that batch 0004176 is defective. It's just not that easy to decide how to tackle the problem; otherwise they would probably have used an easy method. I've seen several 'methods' pass by. The latest method seems to be to just let the task execute (which takes less than a second), mark it as Error after it is returned to the server, and then refrain from releasing a _1 task.(*1)

[*1] Since 18:30 UTC last Friday all my 100 returned tasks (BTW, all _0s) from the faulty batch were marked as Error (apart from 13 _0s that were User Aborted) and never got a resend (_1).

Of course this is all based on empirical data: getting results, making an observation, developing an idea, testing the idea, and drawing a conclusion.

Adri
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Recently Active Project Badges:
|
As Adri and I have both observed, there's evidence that they've been working on it, but it would appear to be a non-trivial task (see below). As that's the case, some information would've been welcome but I suspect it's a case of fighting the fire rather than talking about it (too few staff to do otherwise, I fear...) -- that said, I find it a bit disappointing that we don't even get "Yes, we know there's a problem, and we're working on it"[1]...
For information (with apologies to anyone who already knows this...): A standard BOINC set-up has an Ops web page which harvests parameters to cancel jobs. In the source I've looked at, there are three ways of telling the system what to cancel: a range of work unit IDs, an explicit list of work unit IDs, or an SQL where-clause.
If the IDs of the unwanted work units are sequential and uninterrupted, the first option seems like an easy method! However, there is some evidence that this is not always the case -- I don't get enough SCC1 work on any given day to be likely to get a run of consecutive work unit IDs but what I do see suggests that there may be lots of [relatively] short interleaved sequences for the individual targets[2]. And, of course, there might be some non-SCC1 work within the overall sequence as well... So it may take a lot of ID ranges to do things that way (and, of course, they'd have to find out what said ranges were in the first place!)

The ID list method would probably only be useful to kill off a handful of WUs, and we obviously aren't talking such small numbers of problem tasks here!

So the most elegant solution would be to craft an SQL where-clause that picks up work units for the correct application and work-unit name structure (to pick the right "batch" and "target") and ignores WUs that have a canonical result... I think the form offers a list of targeted WUs before submitting the cancellation request, and I suspect that might restrict the number of items that can be done on each pass!

The above relates to "standard BOINC" -- who knows what changes might have been made by IBM for WCG :-)

By the way, I find it interesting that the "within batch" numbers at the end of the work unit names are scattered around, rather than increasing with rising work unit ID... I suspect Adri may have noted this when looking at his data sets, and it sticks out like a sore thumb in my database online displays :-)

Cheers - Al.

[1] Perhaps the response of some users to that sort of message has put them off? It would certainly irritate me if I still had a job that included a support role!...

[2] I use "small" as relative to the total amount of WUs being created -- I've seen evidence of sequences of well under 1000...
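To make the "lots of ID ranges" point concrete, here is a minimal sketch of how a list of unwanted work unit IDs could be collapsed into consecutive ranges. The function name `collapse_ranges` is made up for illustration, and this is not something the BOINC Ops page is known to do itself; it only shows how interleaved work fragments the ranges.

```shell
# Collapse work unit IDs (one per line on stdin) into consecutive ranges.
# Prints "first-last" for a run of adjacent IDs, or the bare ID for a run of one.
# collapse_ranges is a hypothetical helper, not part of BOINC.
collapse_ranges() {
    sort -n | awk '
        function print_range() {
            if (start == prev) print start; else print start "-" prev
        }
        NR == 1        { start = prev = $1; next }
        $1 == prev     { next }                  # ignore duplicate IDs
        $1 == prev + 1 { prev = $1; next }       # still inside the current run
                       { print_range(); start = prev = $1 }
        END            { if (NR > 0) print_range() }'
}
```

For example, `printf '%s\n' 312357526 312357528 312357529 | collapse_ranges` would need two entries for just three work units, because ID 312357527 belongs to another batch.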
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
adriverhoef wrote: "May I correct you, Sgt.Joe?"

Please do. I am always willing to learn more. I do note that Al says, "I find it a bit disappointing that we don't even get 'Yes, we know there's a problem, and we're working on it'". It would take less than 30 seconds to type this and put it in the forum or a news release. Apparently Al is correct that it is not a trivial matter to purge a particular batch, because I just got another one. True, they end almost immediately with an error, but that affects the reliability of the machine and causes some queue anomalies until the reliability status is restored.

Cheers
Sgt. Joe
*Minnesota Crunchers*
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Al wrote: "By the way, I find it interesting that the "within batch" numbers at the end of the work unit names are scattered around, rather than increasing with rising work unit ID... I suspect Adri may have noted this when looking at his data sets, and it sticks out like a sore thumb in my database online displays :-)"

Indeed, it has always been this way as far as I can remember, and it's the reason why the thread Weekend Puzzles was created by SekeRob, or so it seems. Well, almost. SekeRob was showing a list of ResultIDs and their individual WorkunitIDs. Of course, if you look at it that way, a WorkunitID can appear multiple times as it can 'contain' multiple ResultIDs.

Let's have a look at two tasks that I've just received from faulty batch 0004176:

workunit 312357528 SCC1_0004176_MyoD1-C_22087_0 Fedora Linux U.Aborted 2023-06-03T23:59:53 2023-06-04T00:02:00
workunit 312357529 SCC1_0004176_MyoD1-C_22096_0 Fedora Linux U.Aborted 2023-06-03T23:59:53 2023-06-04T00:02:00

The two tasks belong to separate workunits, each with only one result. Although the sequences are 9 numbers apart (the sequence of the first one is 22087, the other one's sequence is 22096), their WorkunitIDs are neighbours, 312357528 and 312357529. Makes you wonder what their neighbours are, doesn't it? Here is the answer:

(Output generated by 'wcgstats -frSS= 312357527')
workunit 312357527 SCC1_0004099_MyoD1-A_1044_0 Darwin In Progr. 2023-06-03T23:59:52 2023-06-09T23:59:52

(Output generated by 'wcgstats -frSS= 312357530')
workunit 312357530 SCC1_0004099_MyoD1-A_1043_0 Fedora Linux In Progr. 2023-06-03T23:59:53 2023-06-09T23:59:53

Oh! Look at the coloured taskname. It means that that coloured task was received on one of my own devices, said the WU hog. (I was sheltering 25 SCC1-tasks on that device at that moment, 2023-06-03T23:59:53.)

Anyway. It's interesting to see that (in batch 0004099 with type MyoD1-A) sequence 1043 from workunit 312357530 and sequence 1044 from workunit 312357527 are 3 workunits apart, while the sequences 1043 and 1044 are adjacent. Makes you curious what's up with workunits 312357526 and 312357525, adjacent to 312357527 above. This is what I see:

(Output generated by 'wcgstats -frSS= 312357526')
workunit 312357526 SCC1_0004176_MyoD1-C_22097_0 Linuxmint Error 2023-06-03T23:59:51 2023-06-04T00:01:58

(Output generated by 'wcgstats -frSS= 312357525')
workunit 312357525 SCC1_0004159_MyoD1-B_19222_0 MSWin 10 In Progr. 2023-06-03T23:59:50 2023-06-09T23:59:50

So, in any case, it must be clear at this point that there is a scatter of types A, B and C if you 'follow' the WorkunitIDs incrementally. It's a mix of sequences within a batch, too, as Al already noted.

Adri

[Edit 1 times, last edit by adriverhoef at Jun 4, 2023 9:58:09 AM]
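The task names above all follow the same underscore-separated pattern, so the batch, type and sequence can be pulled apart mechanically. A minimal sketch, assuming the five-part layout seen in this thread (the field labels and the function name `parse_task_name` are only illustrative, not official WCG terminology):

```shell
# Split a task name such as SCC1_0004176_MyoD1-C_22087_0 into its parts.
# The labels (app, batch, type, sequence, suffix) follow the wording used in
# this thread; they are assumptions, not an official naming scheme.
parse_task_name() {
    printf '%s\n' "$1" | {
        IFS=_ read -r app batch type sequence suffix
        printf 'app=%s batch=%s type=%s sequence=%s suffix=%s\n' \
            "$app" "$batch" "$type" "$sequence" "$suffix"
    }
}
```

Calling `parse_task_name SCC1_0004176_MyoD1-C_22087_0` prints the five fields on one line, which makes it easy to sort or group a list of tasks by batch or by sequence instead of by WorkunitID.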
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Sgt.Joe wrote: "I am always willing to learn more. (...) I just got another one. True, they end almost immediately with an error, but that affects the reliability of the machine and causes some queue anomalies until the reliability status is restored."

If you have enough SCC1-tasks in your queue, the faulty tasks don't start immediately when you receive them, so you can write a script that aborts them automatically within a few minutes (hence 'sleep 120' below), so they don't start running and affect your machine's reliability. This will help:

(cd ~boinc/projects/www.worldcommunitygrid.org/ &&

If you don't have 'wcgresults' installed, you'd have to use this piece of code(*1) instead of the former a= assignment above:

a=$(boinccmd --get_tasks | sed -n /SCC1_0004176_MyoD1-C_/s/WU.name://p | sed s/$/_0/)

Adri

EDIT: [*1] NB: this last piece of code only works for suffix _0; with a little tweak you can make it work for any suffix.

[Edit 4 times, last edit by adriverhoef at Jun 4, 2023 7:56:19 AM]
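As a rough illustration of the "little tweak" for any suffix, the sketch below matches the task's own `name:` line in the `boinccmd --get_tasks` output instead of rebuilding the name from the `WU name:` line. It is not Adri's script: the function names are invented, the `name:` line layout is assumed from typical boinccmd output, and `PROJECT_URL` is an assumed variable that must hold the project URL exactly as your client reports it.

```shell
# Print the names of all tasks (any suffix) whose name starts with the given
# prefix. Reads `boinccmd --get_tasks` output on stdin. The leading-space
# anchor keeps the "WU name:" lines from matching.
faulty_task_names() {
    prefix=$1
    sed -n "s/^ *name: \(${prefix}.*\)\$/\1/p"
}

# Abort every task named on stdin, one per line.
# PROJECT_URL is an assumption: set it to the URL shown by your own client.
abort_tasks() {
    while IFS= read -r task; do
        boinccmd --task "$PROJECT_URL" "$task" abort
    done
}
```

A possible invocation would be `boinccmd --get_tasks | faulty_task_names SCC1_0004176_MyoD1-C_ | abort_tasks`, run from a loop or cron job; try the first two stages alone before piping into the abort step.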
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Al wrote: "I find it a bit disappointing that we don't even get "Yes, we know there's a problem, and we're working on it""

Let's suppose that the techs would have to post something like that. There are things that you can explain and things that you can't. Standing in the shoes of the techs, you could post in the forums saying that you're working on it, but as we all know, this will provoke reactions like "When will it be solved?" (which can't have an exact answer(*1)), or worse: "Why haven't you (so and so)?" and "Why couldn't you have (this and that)?". Then they would have to react to that; it will never end, because answering questions like these isn't productive. What's more, "why"-questions can't be answered logically when it comes to human behaviour. You'll never know everything that's going on: a team meeting, working hours (it never ends), how bad the problem is, whether there are other fires to extinguish first, assigning people to do the job, sick people, etc.

[*1] And even if answered with an estimate, it can overrun its time or even get out of hand and then they would have to post another message. And another, which will elicit even more reactions. Like I said, it's not productive.

Maybe you say: that's what TigerLily is here for. Then you would have to address TigerLily first (maybe to ask to closely follow all forums or to ask politely if there is an answer?). Nobody did. (Or I must have missed it.)(*2) You should ask TigerLily, really.

[*2] Yes, I know there is a need for (quick) answers from the WCG Team, but it just doesn't work that way (especially if there isn't a question). In general, they can't answer questions that aren't directed to them or questions that aren't fair or are plain mean ("You should have done so and so, why didn't you?"). Apart from that, many users expect a reaction within a short period of time when they have a problem. It just doesn't work that way, especially when the problem has to be investigated first: how big the problem is, what its impact is, how to tackle it, whether people are available, who should do it, etc.

Adri

EDIT: All IMHO, of course.

[Edit 1 times, last edit by adriverhoef at Jun 4, 2023 11:15:37 AM]
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
Adri:
All good points. However, a little transparency goes a long way. Not every volunteer will be mollified by platitudes, but at least they know the problem has been acknowledged. I don't feel a short update of a sentence or two on a daily basis is asking too much. OK, I will stop my bellyaching as the point is now made.

On another note, in my results I show 495 completed SCC units with 103 of them listed as "error." I checked a couple of them and they have creation dates of June 3, 2023. So, the faulty work units are still being created. At least they do not take any time so this will be solved eventually.

Cheers
Sgt. Joe
*Minnesota Crunchers*
||
|
|
|