Thread Status: Active
Total posts in this thread: 3596
Posts: 3596   Pages: 360   [ Previous Page | 204 205 206 207 208 209 210 211 212 213 | Next Page ]
This topic has been viewed 5922468 times and has 3595 replies
AllanDavie
Cruncher
Joined: Nov 17, 2004
Post Count: 1
Status: Offline
Re: Work Available

I am also getting the transient HTTP errors on downloading my 5 work units (Windows 11).
Allan

20/07/2022 15:31:57 | World Community Grid | Temporarily failed download of f2f6ff34c20f6edd4b577dd6d8523a4b.: transient HTTP error
20/07/2022 15:31:57 | World Community Grid | Backing off 05:23:49 on download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:31:58 | World Community Grid | Started download of 5901a82848c8be42fce30c0abcac77cb.7z
20/07/2022 15:32:01 | World Community Grid | Finished download of arp1.RRTMG_LW_DATA
20/07/2022 15:32:01 | World Community Grid | Started download of f9230bb47629061ae1dca64676dcdda3.
20/07/2022 15:32:01 | World Community Grid | Starting task ARP1_0002370_126_0
20/07/2022 15:32:05 | World Community Grid | Finished download of f9230bb47629061ae1dca64676dcdda3.
20/07/2022 15:32:05 | World Community Grid | Started download of bdd8658fcb67bf4aadaafd9ba0d7caae.
20/07/2022 15:32:06 | World Community Grid | Starting task ARP1_0007855_127_1
20/07/2022 15:33:35 | World Community Grid | Finished download of 5901a82848c8be42fce30c0abcac77cb.7z
20/07/2022 15:33:35 | World Community Grid | Started download of 5ca79521f4078b94daaadcf11c79ebb2.7z
20/07/2022 15:33:39 | World Community Grid | Finished download of bdd8658fcb67bf4aadaafd9ba0d7caae.
20/07/2022 15:33:40 | World Community Grid | Starting task ARP1_0035375_127_1
20/07/2022 15:33:45 | World Community Grid | Started download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:33:48 | World Community Grid | Temporarily failed download of f2f6ff34c20f6edd4b577dd6d8523a4b.: transient HTTP error
20/07/2022 15:33:48 | World Community Grid | Backing off 05:22:43 on download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:33:54 | World Community Grid | Started download of f2f6ff34c20f6edd4b577dd6d8523a4b.
20/07/2022 15:33:57 | World Community Grid | Temporarily failed download of f2f6ff34c20f6edd4b577dd6d8523a4b.: transient HTTP error
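
For anyone who would rather tally these failures than eyeball them, here is a minimal Python sketch. It assumes every event log line has the `date time | project | message` shape shown above and that failures are worded exactly as in this excerpt. The function name `count_transient_failures` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter

# Message wording taken from the log excerpt above (an assumption about
# other client versions, which may phrase it differently).
FAILED = re.compile(r"Temporarily failed download of (\S+): transient HTTP error")

def count_transient_failures(log_text):
    """Count 'transient HTTP error' download failures per file name."""
    counts = Counter()
    for line in log_text.splitlines():
        # Expected shape: "<date time> | <project> | <message>"
        parts = line.split(" | ", 2)
        if len(parts) != 3:
            continue  # skip anything that is not an event log line
        match = FAILED.search(parts[2])
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the excerpt above would show whether the failures are concentrated on one file or spread across several.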
[Jul 20, 2022 5:39:12 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1328
Status: Offline
Re: Work Available

More Extreme work unit info: I've just received ARP1_0033947_108_1 and ARP1_0034243_102_1.

Unfortunately, I was out when the tasks were requested, so I didn't know that the downloads had stalled :-( I've just completed the downloads and started on these, but that's nearly 7 hours of the 36-hour deadline gone before starting! As it happens, I won't have any problems turning these around; however, it does add to the delay before validation and assimilation. If these issues indicate a lack of bandwidth or file-server stress, progress is likely to slow down, especially if/when they resolve the OPNG work issues...

Cheers - Al
[Jul 20, 2022 7:41:02 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1328
Status: Offline
Re: Work Available

There are only 3 units still identifiable from the 6 restarted by Kevin. Your one & another currently in 011 plus 1 in 008.

Mike
Mike,

I presume the six units you are referring to are those first mentioned in Kevin's post of 25th January 2022, in which six units were identified as very problematic: three were candidates for restarting from scratch and three were not resolving at a 24-second time step.

Later he identified the three that were to restart from scratch, and we've seen evidence of these. However, I couldn't find any further reference to the other three (though I only looked in this forum, and I may have missed a key message...) Do we know what Delft decided to do about them? Did they get unstuck somehow or are they still stalled? Or am I just confused? :-)

If I'm not confused and there are still three units of indeterminate state, perhaps someone on the new WCG team will be able to clarify once they aren't quite as busy "fire-fighting"...

Cheers - Al

P.S. That message also had an interesting comment about catching problem results and re-running with a revised time-step. The implication was that it is a manual process, so I wonder if the new WCG folks know about that and what to do about it should problems recur.
[Jul 20, 2022 8:08:49 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Work Available

Al

As far as I am aware all were restarted either from zero or from shortly before they got stuck. That included the 6 Ultras and all others that had stuck further on. The later ones were just before Kevin finished on WCG. Some took several attempts and had to have their time steps changed, I suspect due to mountains in their patches.

Mike
----------------------------------------
[Edit 1 times, last edit by Mike.Gibson at Jul 20, 2022 8:59:29 PM]
[Jul 20, 2022 8:34:15 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Work Available

Just a brief summary of the current situation.

There were 2,844 units validated in the last 24 hours and 1,949,569 remain to the end of the project. At that daily rate, my forecast end date for the project is 4 June 2024. However, we expect the rate to pick up.

This assumes that ARP1 will finish with a full generation 182.
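
For anyone who wants to re-run this kind of forecast with fresh numbers, here is a minimal Python sketch of the arithmetic. It assumes a constant daily rate, counts whole days only (rounding down), and takes 20 July 2022 as the starting date. The function name `forecast_end` is hypothetical.

```python
from datetime import date, timedelta

def forecast_end(today, remaining, validated_per_day):
    """Project an end date from a constant daily validation rate."""
    # Whole days only, rounded down; this rounding choice is an assumption.
    days_left = remaining // validated_per_day
    return today + timedelta(days=days_left)

# Figures from the summary above: 1,949,569 remaining at 2,844 per day.
print(forecast_end(date(2022, 7, 20), 1_949_569, 2_844))  # 2024-06-04
```

That is 685 whole days at the current rate. Rounding up instead of down would move the date one day later.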

Mike
[Jul 20, 2022 8:56:51 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1328
Status: Offline
Re: Work Available

Al

As far as I am aware all were restarted either from zero or from shortly before they got stuck. That included the 6 Ultras and all others that had stuck further on. The later ones were just before Kevin finished on WCG. Some took several attempts and had to have their time steps changed, I suspect due to mountains in their patches.

Mike
Thanks, Mike.

I was thinking specifically of the three units of which Kevin said "3 cannot be processed even with changing the more granular step_size of 24", so yes, perhaps going back a few generations and running those with step_size 24 might have got them past the breakdown point.

I would've been interested to know what was actually needed to get those three jobs to move on, but the only later reference I found was this, in the post that gave the full unit identifiers of the three that were restarted from zero:
"I have one final workunit that I'm rerunning clean jobs on that will get submitted into the grid tomorrow. At that point all of the units will be back running on the grid."
That may or may not have been one of the three of interest. However, I suspect Kevin had far more important things on his mind at the time! Ah, well, unsatisfied curiosity... :-)

Cheers - Al.

P.S. It would also be interesting to know precisely what caused the issues in the first place; your suggestion of mountains is a good candidate... Perhaps the project scientists might include such problems in their write-up?
[Jul 20, 2022 9:45:08 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2351
Status: Offline
Re: Work Available

Quoting some of Kevin's text from the posting that Al directed us to:
(we will still have to periodically re-run the jobs with a smaller step size).


Another interesting read from … knreed is this post, where we … can read (*1):
We had a meeting with the research team today and those units that had the time step changed to 24 can be moved back to 36 now that they have moved past the challenging conditions. This will be a technique that we use going forward: when a given unit on a given generation cannot successfully complete the run, we will lower the time step and retry the run, and then bring it back to 36 for subsequent generations.
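
The technique described in that quote can be sketched roughly as follows in Python. This is only an illustration of the retry logic as worded in the quote, not how WCG actually implements it (Al's post suggests it was a manual process). The callable `run_generation` and the function `run_unit` are hypothetical names; only the step values 36 and 24 come from the quoted posts.

```python
DEFAULT_STEP = 36  # normal time step, per the quoted post
REDUCED_STEP = 24  # smaller time step used for the retry, per the quoted post

def run_unit(generations, run_generation):
    """Run each generation at 36, retrying a failed one at 24.

    `run_generation(generation, step)` is a hypothetical callable that
    returns True when the run completes successfully.
    Returns a list of (generation, step_used) and stops at the first
    generation that fails at both step sizes.
    """
    history = []
    for generation in generations:
        # Every generation starts again at the default step.
        for step in (DEFAULT_STEP, REDUCED_STEP):
            if run_generation(generation, step):
                history.append((generation, step))
                break
        else:
            # Failed at both step sizes: the unit is stuck.
            break
    return history
```

A unit that trips over generation 2 at step 36 would show up as `[(1, 36), (2, 24), (3, 36)]`: one retry at the smaller step, then back to 36.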


Coincidentally, especially after finding this error, it might help the admins/techs correct the error and/or solve the problem in which each wingman's task ends with a Computation Error after a SIGSEGV.

Adri


(*1) Sorry for this feeble attempt at linguistic humour. ;-)
[Jul 21, 2022 1:18:21 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1328
Status: Offline
Re: Work Available

Coincidentally, especially after finding this error, it might help the admins/techs correct the error and/or solve the problem in which each wingman's task ends with a Computation Error after a SIGSEGV.
Adri,

Thanks for the heads-up! We've flagged the error in the forums, and presumably there's also a way the Admins can check for broken tasks (but is there an auto-notification mechanism or an automatic daily report?). So now we wait to see what happens next. (Cue someone saying that it'll just be ignored...)

Cheers - Al.

P.S. We should probably encourage people to report tasks where every wingman gets SIGSEGV, either here or in your specific thread.
----------------------------------------
[Edit 2 times, last edit by alanb1951 at Jul 21, 2022 5:01:56 AM]
[Jul 21, 2022 4:46:54 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2351
Status: Offline
Re: Work Available

P.S. We should probably encourage people to report tasks where every wingman gets SIGSEGV, either here or in your specific thread.
Yeah, Al, let's roll the promotional video for that. :-D

In the meantime, some more Extremes (generation of tasks <= 120) have arrived here. They are:
ARP1_0033715_111_0 (from generation 111)
ARP1_0033795_107_2
ARP1_0033796_113_2
ARP1_0034251_104_1
ARP1_0034322_114_2
ARP1_0034391_101_2
ARP1_0035156_115_0
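
To pick Extremes out of a longer task list, a small Python sketch like this would do. It assumes the name layout `ARP1_<unit>_<generation>_<suffix>`, where the third field is the generation (as in the first entry above) and the meaning of the final numeric suffix is left open. The helper names are hypothetical, and the cut-off of 120 is the one used in this post.

```python
def parse_task_name(name):
    """Split a name like 'ARP1_0033715_111_0' into its four fields."""
    project, unit, generation, suffix = name.split("_")
    return project, unit, int(generation), int(suffix)

def is_extreme(name, max_generation=120):
    """True if the task's generation is at or below the Extreme cut-off."""
    return parse_task_name(name)[2] <= max_generation

tasks = ["ARP1_0033715_111_0", "ARP1_0002370_126_0", "ARP1_0034391_101_2"]
extremes = [t for t in tasks if is_extreme(t)]
```

Here `ARP1_0002370_126_0` is filtered out because it is at generation 126, leaving the other two.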

Adri
----------------------------------------
[Edit 1 times, last edit by adriverhoef at Jul 21, 2022 1:59:25 PM]
[Jul 21, 2022 1:57:16 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Work Available

I am monitoring the Extremes in particular, so I should be able to spot within a few days any that get stuck.

Mike
[Jul 21, 2022 5:07:36 PM]