| World Community Grid Forums
Thread Status: Active Total posts in this thread: 3596
AllanDavie
Cruncher Joined: Nov 17, 2004 Post Count: 1 Status: Offline Project Badges:
|
I am also getting the transient HTTP errors on downloading my 5 work units (Windows 11).
Allan

20/07/2022 15:31:57 | World Community Grid | Temporarily failed download of f2f6ff34c20f6edd4b577dd6d8523a4b.: transient HTTP error
20/07/2022 15:31:57 | World Community Grid | Backing off 05:23:49 on download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:31:58 | World Community Grid | Started download of 5901a82848c8be42fce30c0abcac77cb.7z
20/07/2022 15:32:01 | World Community Grid | Finished download of arp1.RRTMG_LW_DATA
20/07/2022 15:32:01 | World Community Grid | Started download of f9230bb47629061ae1dca64676dcdda3.
20/07/2022 15:32:01 | World Community Grid | Starting task ARP1_0002370_126_0
20/07/2022 15:32:05 | World Community Grid | Finished download of f9230bb47629061ae1dca64676dcdda3.
20/07/2022 15:32:05 | World Community Grid | Started download of bdd8658fcb67bf4aadaafd9ba0d7caae.
20/07/2022 15:32:06 | World Community Grid | Starting task ARP1_0007855_127_1
20/07/2022 15:33:35 | World Community Grid | Finished download of 5901a82848c8be42fce30c0abcac77cb.7z
20/07/2022 15:33:35 | World Community Grid | Started download of 5ca79521f4078b94daaadcf11c79ebb2.7z
20/07/2022 15:33:39 | World Community Grid | Finished download of bdd8658fcb67bf4aadaafd9ba0d7caae.
20/07/2022 15:33:40 | World Community Grid | Starting task ARP1_0035375_127_1
20/07/2022 15:33:45 | World Community Grid | Started download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:33:48 | World Community Grid | Temporarily failed download of f2f6ff34c20f6edd4b577dd6d8523a4b.: transient HTTP error
20/07/2022 15:33:48 | World Community Grid | Backing off 05:22:43 on download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:33:54 | World Community Grid | Started download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:33:57 | World Community Grid | Temporarily failed download of f2f6ff34c20f6edd4b577dd6d8523a4b.: transient HTTP error
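For anyone wanting to check how often a given file is failing, here is a minimal Python sketch that counts "transient HTTP error" entries per file in a saved copy of the event log. It assumes the three-column layout seen in the excerpt above (timestamp | project | message); the function name `count_failed_downloads` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter

# Assumed message wording, copied from the log excerpt in this thread.
FAILED = re.compile(r"Temporarily failed download of (\S+?): transient HTTP error")


def count_failed_downloads(log_text):
    """Return a Counter mapping file name -> number of transient HTTP failures."""
    failures = Counter()
    for line in log_text.splitlines():
        # Assumed layout: "DD/MM/YYYY HH:MM:SS | project | message"
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue  # skip blank lines or anything not in the expected layout
        match = FAILED.search(parts[2])
        if match:
            failures[match.group(1)] += 1
    return failures
```

Calling `count_failed_downloads(text).most_common()` on the pasted log text lists the files with the most failures first, which makes it easy to see whether one file is the culprit or whether the failures are spread around.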
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1328 Status: Offline Project Badges:
|
More Extreme work unit info: I've just received ARP1_0033947_108_1 and ARP1_0034243_102_1.
Unfortunately, I was out when the tasks were requested so I didn't know that the downloads had stalled :-( -- I've just completed the downloads and got started on these but that's nearly 7 hours of the 36-hour deadline gone before starting!...

As it happens, I won't have any problems turning these around; however, it does add to the delay before validation and assimilation. If these issues are indicative of a lack of bandwidth or of file-server stress, progress is likely to slow down, especially if/when they resolve the OPNG work issues...

Cheers - Al
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1328 Status: Offline Project Badges:
|
"There are only 3 units still identifiable from the 6 restarted by Kevin. Your one & another currently in 011 plus 1 in 008.
Mike"

Mike,

I presume the six units you are referring to are those first mentioned in Kevin's post of 25th January 2022, in which six items were identified as very problematic: three were candidates for restarting from scratch and three were not resolving at a 24-second time step. Later he identified the three that were to restart from scratch, and we've seen evidence of these. However, I couldn't find any further reference to the other three (though I only looked in this forum, and I may have missed a key message...)

Do we know what Delft decided to do about them? Did they get unstuck somehow or are they still stalled? Or am I just confused? :-)

If I'm not confused and there are still three units of indeterminate state, perhaps someone on the new WCG team will be able to clarify once they aren't quite as busy "fire-fighting"...

Cheers - Al

P.S. That message also had an interesting comment about catching problem results and re-running with a revised time step. The implication was that it is a manual process, so I wonder if the new WCG folks know about that and what to do about it should problems recur.
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Al

As far as I am aware all were restarted either from zero or from shortly before they got stuck. That included the 6 Ultras and all others that had stuck further on. The later ones were just before Kevin finished on WCG. Some took several attempts and had to have their time steps changed, I suspect due to mountains in their patches.

Mike

[Edit 1 times, last edit by Mike.Gibson at Jul 20, 2022 8:59:29 PM]
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Just a brief summary of the current situation.
There were 2,844 units validated in the last 24 hours and 1,949,569 remain to the end of the project. At that daily rate, my forecast end date for the project is 4 June 2024. However, we expect the rate to pick up. This assumes that ARP1 will finish with a full generation 182. Mike |
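The arithmetic behind a forecast like this can be sketched in a few lines of Python. This is only an illustration of the "remaining units divided by daily rate" idea, not necessarily how Mike does it; the function name is hypothetical, and the start date is an assumption because the post's own date is not shown in this excerpt.

```python
import math
from datetime import date, timedelta


def forecast_end(remaining_units, validated_per_day, start):
    """Project an end date assuming the daily validation rate stays constant."""
    if validated_per_day <= 0:
        raise ValueError("validated_per_day must be positive")
    # Round up: a partly used final day still counts as a day.
    days_needed = math.ceil(remaining_units / validated_per_day)
    return start + timedelta(days=days_needed)


# Figures from the post above; the start date of 20 July 2022 is an assumption.
print(forecast_end(1_949_569, 2_844, date(2022, 7, 20)))
```

With these figures the remaining work comes to roughly 686 days at the current rate, which lands within a day of the 4 June 2024 forecast; the exact date shifts with the chosen start date and the rounding.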
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1328 Status: Offline Project Badges:
|
"Al
As far as I am aware all were restarted either from zero or from shortly before they got stuck. That included the 6 Ultras and all others that had stuck further on. The later ones were just before Kevin finished on WCG. Some took several attempts and had to have their time steps changed, I suspect due to mountains in their patches.
Mike"

Thanks, Mike.

I was thinking specifically of the three units of which Kevin said

"3 cannot be processed even with changing the more granular step_size of 24"

so yes, perhaps going back a few generations and running those with step_size 24 might have got them past the breakdown point. I would've been interested to know what was actually needed to get those three jobs to move on, but the only later reference I found was

"I have one final workunit that I'm rerunning clean jobs on that will get submitted into the grid tomorrow. At that point all of the units will be back running on the grid."

in the post that gave the full unit identifiers of the three that were restarted from zero; that may or may not have been one of the three of interest. However, I suspect Kevin had far more important things on his mind at the time! Ah, well, unsatisfied curiosity... :-)

Cheers - Al.

P.S. It would also be interesting to know precisely what caused the issues in the first place; your suggestion of mountains is a good candidate... Perhaps the project scientists might include such problems in their write-up?
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2351 Status: Offline Project Badges:
|
Quoting some of Kevin's text from the posting that Al directed us to:
"(we will still have to periodically re-run the jobs with a smaller step size)."

Another interesting read from … knreed is this post, where we … can read [*1]:

"We had a meeting with the research team today and those units that had the time step changed to 24 can be moved back to 36 now that they have moved passed the challenging conditions. This will be a technique that we use going forward that when a given unit on a given generation cannot successfully complete the run we will lower the time step and retry the run and then bring it back to 36 for subsequent generations."

Coincidentally, especially after finding this error, it might help the admins/techs correct the error and/or solve the problem where each wingman has their task ending up with Computation Error after a SIGSEGV.

Adri

(*1) sorry for this feeble attempt at linguistical humour.
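The technique Kevin describes (try the normal time step, retry with the smaller one if the run fails, then go back to the normal one for the next generation) could look something like the Python sketch below. Only the values 36 and 24 come from the quoted post; `run_generation` is a hypothetical callable standing in for whatever actually runs a generation, and nothing here reflects the real WCG back-end code.

```python
DEFAULT_STEP = 36   # normal time step mentioned in the quoted post
FALLBACK_STEP = 24  # smaller time step mentioned in the quoted post


def run_generations(run_generation, first_gen, last_gen):
    """Run generations in order and record which time step each one needed.

    run_generation(gen, step) is a hypothetical callable that returns True
    when the run for that generation completes successfully.
    """
    steps_used = {}
    for gen in range(first_gen, last_gen + 1):
        # Every generation starts again at the default step, so a unit that
        # needed the fallback once is brought back to 36 afterwards.
        for step in (DEFAULT_STEP, FALLBACK_STEP):
            if run_generation(gen, step):
                steps_used[gen] = step
                break
        else:
            # Neither step worked: this is the case needing manual attention.
            raise RuntimeError(f"generation {gen} failed at every time step")
    return steps_used
```

A generation that fails at both steps raises an error instead of being skipped, which matches the situation of the units that could not be processed even at 24 and had to be handled by hand.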
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1328 Status: Offline Project Badges:
|
"Coincidentally, especially after finding this error, it might help the admins/techs correcting the error and/or solving the problem, where each wingman has their task ending up with Computation Error after a SIGSEGV."

Adri,

Thanks for the heads-up! We've flagged the error in the forums, and presumably there's also a way the Admins can check for broken tasks too (but is there an auto-notification mechanism or automatic daily report?)

So now we wait to see what happens next. (Cue someone saying that it'll just be ignored...)

Cheers - Al.

P.S. We should probably encourage people to report tasks where every wingman gets SIGSEGV, either here or in your specific thread.

[Edit 2 times, last edit by alanb1951 at Jul 21, 2022 5:01:56 AM]
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2351 Status: Offline Project Badges:
|
"P.S. We should probably encourage people to report tasks where every wingman gets SIGSEGV, either here or in your specific thread."

Yeah, Al, let's roll the promotional video for that.

In the meantime, some more Extremes (generation of tasks <= 120) have arrived here. They are:

ARP1_0033715_111_0 (from generation 111)
ARP1_0033795_107_2
ARP1_0033796_113_2
ARP1_0034251_104_1
ARP1_0034322_114_2
ARP1_0034391_101_2
ARP1_0035156_115_0

Adri

[Edit 1 times, last edit by adriverhoef at Jul 21, 2022 1:59:25 PM]
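Picking the Extremes out of a task list can be sketched in Python as below. It assumes the task-name layout implied above, where the third field is the generation (ARP1_0033715_111_0 is from generation 111). Treating the last field as a replica number is my assumption, and the names `ArpTask`, `parse_task` and `extremes` are made up for illustration.

```python
from typing import NamedTuple


class ArpTask(NamedTuple):
    unit: str        # e.g. "0033715"
    generation: int  # e.g. 111
    replica: int     # trailing field; its meaning is assumed here


def parse_task(name):
    """Split a task name such as ARP1_0033715_111_0 into its fields."""
    prefix, unit, generation, replica = name.split("_")
    if prefix != "ARP1":
        raise ValueError(f"not an ARP1 task name: {name}")
    return ArpTask(unit, int(generation), int(replica))


def extremes(names, max_generation=120):
    """Return the task names at or below the 'Extreme' generation threshold."""
    return [name for name in names if parse_task(name).generation <= max_generation]
```

For example, `extremes(["ARP1_0033715_111_0", "ARP1_0002370_126_0"])` keeps only the first name, since generation 126 is above the threshold of 120 used in this thread.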
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
I am monitoring the extremes in particular so should be able to spot any of them that get stuck within a few days.
Mike |
||
|
|
|