adriverhoef
Master Cruncher · The Netherlands · Joined: Apr 3, 2009 · Post Count: 2346
Al,
It's easy to read a post and forget about the time the author spent before typing it: studying the material, thinking it over and writing up the article. Your article was impressive, not too long, concise enough to fit on one page of reading material, understandable for most people (I hope), and it spoke about the internals of a BOINC server. My compliments. It must have cost you quite some time to delve into (this specific part of) the sources. Thanks for the clarification!

Sgt.Joe wrote: "in my results I show 495 completed SCC units with 103 of them listed as 'error'."

While I understand this is not a contest of any sort, my results (about 65-70 pages) show a mix of 874 S(uccess)(*1), 108 E(rror)(*2) and 27 W(orkunit error)(*3).

Sgt.Joe wrote: "the faulty work units are still being created."

That's a correct observation, Sgt.Joe. They might want to let it blow out while the system is holding up. They are probably also looking for a way to assess the loss of everyone's reliable status while the storm is blowing over.

From the pace at which workunits are being received, we can estimate when this batch (0004176) should - or better, could - all be over. From my observations, the situation seemed to 'stabilize' (FWIW), or is stable enough to be called 'stable', at 00:00 UTC this Sunday (morning). So that time makes a good starting point. Next, we need to know which sequences were being distributed at that time. My records say they were sequences 22087 and 22096 (see post 686913, returned to the server at 2 minutes past 00:00 UTC this Sunday).

While writing this it is almost 15:00 UTC and I am still seeing a slow pace: 24833 at 12:03 UTC, 25207 at 13:23 UTC, 25602 at 14:51 UTC. That's roughly, optimistically, 400 sequences in 80 minutes, or 5 per minute. At 15:00 UTC, looking back at 14:51, this would mean we reach sequence 25602 + (9 minutes left till 15:00 * 5 per minute) = 25602 + 45 = 25647.

Does that match the past 15 hours? Let's see, computing the difference between the sequence numbers at 15:00 and at 00:00: 25647 - 22087 = 3560. Now, 3560 sequences in 15 hours is a pace of 237⅓ per hour (about 4 per minute), or 5696 per day. Expectations are that 99999 will be the last sequence of this batch, so there are still 99999 - 25647 = 74352 sequences to go. That's more than 12 days (12 * 6000 per day = 72000). Let's say the pace increases a bit to 6 per minute (pure speculation, of course): 6/min. * 60 min. * 24 hours = 8640 per day, and then 74352 / 8640 = 8.6 days. So speeding up the distribution a little could(*4) considerably reduce the time it will take to complete this whole faulty batch.

[*1] incl. Pendings
[*2] Server Aborted, User Aborted and Computation Error
[*3] Too Late (thanks to the 'HOT FIX')
[*4] Again, pure speculation

Adri

EDIT: It is Sunday today, not Saturday - so I should have written "this Sunday", not "this Saturday" - I've corrected it now.

[Edit 2 times, last edit by adriverhoef at Jun 4, 2023 7:07:34 PM]
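(For anyone who wants to replay the arithmetic, here is a minimal Python sketch of the same estimate. The helper name eta_days is made up for illustration; the numbers are the observations quoted above, and the 6-per-minute scenario is the same pure speculation as footnote *4.)

```python
# Back-of-the-envelope ETA for the remainder of batch 0004176,
# derived from two observed (hour, sequence) samples.

def eta_days(seq_start, seq_now, hours_elapsed, seq_last=99999):
    """Days until seq_last at the pace observed so far."""
    per_hour = (seq_now - seq_start) / hours_elapsed  # 3560 / 15 = 237.33/hour
    return (seq_last - seq_now) / (per_hour * 24)     # remaining / per-day rate

# Sequence 22087 at 00:00 UTC, extrapolated 25647 at 15:00 UTC:
print(round(eta_days(22087, 25647, 15), 1))   # 13.1 days at ~5696/day

# Speculative faster pace of 6 sequences per minute (8640/day):
print(round((99999 - 25647) / (6 * 60 * 24), 1))   # 8.6 days
```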
Sgt.Joe
Ace Cruncher · USA · Joined: Jul 4, 2006 · Post Count: 7846
Adri:
Interesting analysis. I hadn't really thought about how long it would take for this batch to go through the system, but based on your figures it should be less than 2 weeks. You may have hit on the rationale for the way they are dealing with the problematic batch: just letting all of them blow through the system and then correcting the entire run all at once after the last item crashes out. Potentially, then, we will know the answer in due time. On the bright side, about 80% are doing just fine.

Cheers
Sgt. Joe
*Minnesota Crunchers*
alanb1951
Veteran Cruncher · Joined: Jan 20, 2006 · Post Count: 1317
Adri,
Regarding your recent reply about communication [this is for information, not a complaint!]...

The main point of my post to which you responded was meant to be the technical stuff, not the comment about WCG and information :-) In fact, I cut out quite a large section about the problems inherent in a project team knowing there might be problems, the specific case of WCG's messy forum structure not really having a single obvious place that is specific to reporting failing tasks, and whether there was an easy solution... If I'd retained that section, the post would've failed your "one page of reading material" test by quite a lot :-)

However, your observations about users playing their part in the communication process made up for that (and were probably better phrased than some of mine!) - thanks for that :-)

By the way, my footnote [1] in the post in question was addressing the same point you made at various places in the reply - perhaps it lost something when I culled the bulk of that subject...

Cheers - Al.

P.S. When still employed, I sometimes used in-house stuff where I knew about problems (as an end user) long before the folks in an adjacent office who were responsible for that particular system became aware! Fault detection at the service end isn't always easy :-)

[Edit 1 times, last edit by alanb1951 at Jun 4, 2023 6:13:04 PM]
alanb1951
Veteran Cruncher · Joined: Jan 20, 2006 · Post Count: 1317
Adri,
Interesting analysis of the recent flow of WUs. Nice to know I wasn't hallucinating about the scrambled nature of work-unit ID allocation for SCC1, even if it does suggest that killing a set of WUs for a specific target is a job for the Mission Impossible team :-)

Cheers - Al.
adriverhoef
Master Cruncher · The Netherlands · Joined: Apr 3, 2009 · Post Count: 2346
Al:
Al wrote: "The main point of my post to which you responded was meant to be the technical stuff, not the comment about WCG and information :-)"

Acknowledged, not to worry, all understood.

And now for something completely different(*1), or rather, something almost completely on-topic. We've recently seen batches of type C, numbered 0004175 and 0004176. With this at the back of our minds, I was - by chance - looking at some output from my scripts, just two hours ago, and couldn't help noticing some 'unseen' (read: new) batches, which are now looming on the horizon:

workunit 314811854 SCC1_0004174_MyoD1-C_1315_0 Waiting to be sent...

And while unseen batch 0004174 is only 1 step away from 0004175 (see the start of this thread), just like the current batch 0004176, there is an even bigger distance to another unseen batch, like this one:

workunit 314684240 SCC1_0004165_MyoD1-C_1293_0 Waiting to be sent...

(And perhaps there are more unseen, new batches; I can't tell yet at this moment.) The situation may be that the SCC1 scientists have already uploaded more faulty batches, so that these have already (at least partly) been injected into the 'bloodstream' of BOINC, server-side. We can only wait for them and see what's going to happen with them, or perhaps ask TigerLily if it's possible to let the techs examine this situation - only one task from each new batch is enough: two 'unseen' batches at the moment, so examining two separate tasks is enough.

Anyway, the difference between batches 0004165 and 0004175 is ten. Who knows what lies in between. Or perhaps a better phrasing (c|sh|w)ould be: are these also faulty batches?

Adri

[*1] I've tried to include some video footage, but couldn't quickly find a suitable fragment.

PS I've already adjusted my script to abort any faulty task from any batch of type (MyoD1-)C, should they arrive. The pattern is not "SCC1_0004176_MyoD1-C_.*_.$" anymore; it has changed into "SCC1_000...._MyoD1-C_.*_.$". Just as a precaution.

[Edit 1 times, last edit by adriverhoef at Jun 5, 2023 9:53:38 AM]
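(To show what that widened pattern catches, here is a minimal sketch using Python's re module as a stand-in for whatever does the matching in the actual script; the three task names are examples taken from this thread.)

```python
import re

# Old pattern: abort only batch 0004176; new pattern: any MyoD1-C batch.
old = re.compile(r"SCC1_0004176_MyoD1-C_.*_.$")
new = re.compile(r"SCC1_000...._MyoD1-C_.*_.$")

for task in ("SCC1_0004176_MyoD1-C_28836_0",   # current faulty batch
             "SCC1_0004174_MyoD1-C_1315_0",    # 'unseen' batch
             "SCC1_0004165_MyoD1-C_1293_0"):   # another 'unseen' batch
    print(task, bool(old.match(task)), bool(new.match(task)))
# old matches only the first name; new matches all three.
```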
Hans Sveen
Veteran Cruncher · Norway · Joined: Feb 18, 2008 · Post Count: 984
Hi!
Adri, if you are interested, I just got this WU from batch 4176: SCC1_0004176_MyoD1-C_28836, created May 29, 2023 - 15:08 UTC, so they are still around, ready to error out!!

Hans S.

PS. And because of the error, no new SCC received!!

[Edit 1 times, last edit by Hans Sveen at Jun 5, 2023 7:58:24 AM]
adriverhoef
Master Cruncher · The Netherlands · Joined: Apr 3, 2009 · Post Count: 2346
Hans:
Hans Sveen wrote: "And because of the error, no new SCC received!!"

That's probably caused by a lack of sufficient supply. When I went to bed, on the computer where I'm aborting any faulty task, my queue was still growing, but when I woke up, the size of my SCC1-queue had shrunk by more than 50%. On the computers where I don't abort the faulty tasks there was hardly any decrease in the number of SCC1-tasks (with a 0.7-day queue). The number of SCC1-tasks seems to be stabilizing at the moment. It's a matter of having sufficient MCM1-tasks in my queue(s).

PS In this phase it would be more interesting to notice when a C-type task didn't error out.
KerSamson
Master Cruncher · Switzerland · Joined: Jan 29, 2007 · Post Count: 1684
145 errored WUs on my side; batch: SCC1_0004176_MyoD1-C

Cheers,
Yves
sptrog1
Master Cruncher · Joined: Dec 12, 2017 · Post Count: 1592
4 more batch 0004176 errors received today. Is this because of an error in the program for the batch?
yoro42
Ace Cruncher · United States · Joined: Feb 19, 2011 · Post Count: 8979
ATOM 62 - sounds like an old TV show!

Running Windows 11 Pro, approx 25 GB memory available at time of failure... Result log & properties follow:

Result log:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
INFO: result number = 0
INFO: No state to restore. Start from the beginning.
[21:29:21] Number of tasks = 1
[21:29:21] Running task 0, CPU time at start of task 0 was 0.000000
[21:29:21] ./cmpd-1130725.pdbqt size = 19 3
../../projects/www.worldcommunitygrid.org/scc1.MyoD1-C.pdbqt size = 1268 0
Parse error on line 190 in file "..\..\projects\www.worldcommunitygrid.org\60fef8d136128d73bc38a1c07d4b6f66.pdbqt": ATOM syntax incorrect: "62 " is not a valid atom number
VINA failed. rc = 1. Exiting
</stderr_txt>
]]>

Properties:
Application: Smash Childhood Cancer 7.18
Name: SCC1_0004176_MyoD1-C_28842
State: Computation error
Received: 6/5/2023 3:23:30 AM
Report deadline: 6/11/2023 3:23:30 AM
Estimated computation size: 36,225 GFLOPs
CPU time: ---
Elapsed time: ---
Executable: wcgrid_scc1_vina_7.18_windows_x86_64

[Edit 1 times, last edit by yoro42 at Jun 6, 2023 6:57:51 AM]
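(For what it's worth, that stderr output points at a malformed ATOM record in one of the batch's .pdbqt input files, not at the Vina executable itself. Below is a rough, hypothetical checker for such files. It assumes PDBQT keeps the PDB convention of a right-justified atom serial number in columns 7-11, so a left-shifted field like the "62 " in the log would be flagged; the strictness is illustrative and not a claim about how Vina actually parses.)

```python
import re

# A well-formed serial is right-justified in columns 7-11, e.g. "   62".
# A left-shifted "62   " (as in the log above) fails this test.
SERIAL = re.compile(r"^ *\d+$")

def check_atom_serials(path):
    """Print ATOM/HETATM lines whose serial-number field looks wrong."""
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if line.startswith(("ATOM", "HETATM")):
                field = line[6:11]
                if not SERIAL.match(field):
                    print(f'line {lineno}: "{field}" is not a valid atom number')

# The hashed filename is the one reported in the stderr output above.
check_atom_serials("60fef8d136128d73bc38a1c07d4b6f66.pdbqt")
```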