| World Community Grid Forums
Thread Status: Active Total posts in this thread: 26
JCMarsh [U.S. Army]
Cruncher Joined: Feb 8, 2012 Post Count: 5 Status: Offline |
I have a problem with HCMD2 failing to download work units. It tries, but then the WUs get stuck at 0% downloaded, time out, and get stuck in Project Backoff. I've tried restarting BOINC, restarting the machine, aborting the transfer and task, resetting the project, cursing and throwing things, but all to no avail.
Windows 7 Pro (64 bit) BOINC 7.0.28(x64)
I have another machine (32 bit XP Pro with 7.0.28) that is crunching and downloading just fine, also only on HCMD2.
From the Event Log of the offending machine...
9/17/2012 9:24:53 AM | World Community Grid | Temporarily failed download of hcmd2.2QOV_P.clustersOccur.pdb.gzb: transient HTTP error
9/17/2012 9:24:53 AM | World Community Grid | Backing off 7 min 4 sec on download of hcmd2.2QOV_P.clustersOccur.pdb.gzb
9/17/2012 9:24:57 AM | | Project communication failed: attempting access to reference site
9/17/2012 9:24:58 AM | | Internet access OK - project servers may be temporarily down.
9/17/2012 9:25:53 AM | World Community Grid | Started download of hcmd2.2QOV_G.clustersOccur.pdb.gzb
9/17/2012 9:25:55 AM | World Community Grid | Temporarily failed download of hcmd2.2QOV_G.clustersOccur.pdb.gzb: transient HTTP error
9/17/2012 9:25:55 AM | World Community Grid | Backing off 4 min 58 sec on download of hcmd2.2QOV_G.clustersOccur.pdb.gzb
9/17/2012 9:25:58 AM | | Project communication failed: attempting access to reference site
9/17/2012 9:25:59 AM | | Internet access OK - project servers may be temporarily down.
9/17/2012 9:30:54 AM | World Community Grid | Started download of hcmd2.2QOV_G.clustersOccur.pdb.gzb
9/17/2012 9:30:56 AM | World Community Grid | Temporarily failed download of hcmd2.2QOV_G.clustersOccur.pdb.gzb: transient HTTP error
9/17/2012 9:30:56 AM | World Community Grid | Backing off 13 min 53 sec on download of hcmd2.2QOV_G.clustersOccur.pdb.gzb
9/17/2012 9:30:59 AM | | Project communication failed: attempting access to reference site
9/17/2012 9:31:01 AM | | Internet access OK - project servers may be temporarily down.
____UPDATE___
I reset the project twice and it kept drawing the same series of WUs and got stuck in the same manner. I left it alone for a bit and was about to add HFCC so I could crunch something, then it finally woke up and started downloading a different series of WUs.
All is fine now on my end, so I suppose that series of WUs may have been pulled. Happy crunching!
----------------------------------------[Edit 1 times, last edit by JCMarsh [U.S. Army] at Sep 17, 2012 3:15:01 PM] |
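For anyone who wants to spot this pattern without reading the Event Log by eye, here is a rough Python sketch that counts "Temporarily failed download" lines per file name. It assumes only the message wording shown in the log excerpt in this post; the names `count_failed_downloads`, `repeat_offenders`, and `SAMPLE` are made up for illustration and are not part of BOINC.

```python
import re
from collections import Counter

# Pattern taken from the message wording in the log excerpt above (an assumption
# about the format; other BOINC versions may word this differently).
FAILED = re.compile(r"Temporarily failed download of (\S+): transient HTTP error")


def count_failed_downloads(log_text):
    """Return a Counter mapping file name -> number of failed download attempts."""
    return Counter(FAILED.findall(log_text))


def repeat_offenders(log_text, threshold=2):
    """Return sorted file names that failed at least `threshold` times."""
    counts = count_failed_downloads(log_text)
    return sorted(name for name, n in counts.items() if n >= threshold)


# A few lines copied from the Event Log excerpt in this post.
SAMPLE = """\
9/17/2012 9:24:53 AM | World Community Grid | Temporarily failed download of hcmd2.2QOV_P.clustersOccur.pdb.gzb: transient HTTP error
9/17/2012 9:24:53 AM | World Community Grid | Backing off 7 min 4 sec on download of hcmd2.2QOV_P.clustersOccur.pdb.gzb
9/17/2012 9:25:55 AM | World Community Grid | Temporarily failed download of hcmd2.2QOV_G.clustersOccur.pdb.gzb: transient HTTP error
9/17/2012 9:30:56 AM | World Community Grid | Temporarily failed download of hcmd2.2QOV_G.clustersOccur.pdb.gzb: transient HTTP error
"""

if __name__ == "__main__":
    for name in repeat_offenders(SAMPLE):
        print("Repeatedly failing download:", name)
```

A file that shows up here more than once or twice is a candidate for the "abort it and move on" treatment; the threshold of 2 is arbitrary.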
||
|
|
BobCat13
Senior Cruncher Joined: Oct 29, 2005 Post Count: 295 Status: Offline Project Badges:
|
All of those files are 2QOV_? input files. I also could not get the 2QOV_? files to download, so I just aborted those downloads and all is working fine now. It appears the 2QOV_? input files may be corrupt on the server.
----------------------------------------[Edit 1 times, last edit by BobCat13 at Sep 17, 2012 2:50:24 PM] |
||
|
|
52 Aces
Cruncher United States Joined: Sep 19, 2009 Post Count: 29 Status: Offline Project Badges:
|
Sounds different from what I saw, so this is just FYI for others:
I had a situation about a week ago where my machine sat idle while one actual FILE tried to download for a few hours, but it somehow mapped to TWO WUs. I didn't study it (i.e., maybe I was my own wingman); instead I just aborted the WUs directly, and the system immediately grabbed and downloaded a new batch of WUs and began crunching again. So what looked like a server download problem, where the servers 'looked' unresponsive, was something else entirely. And the wedged download prevented my system from asking for other WUs.
----------------------------------------[Edit 1 times, last edit by 52 Aces at Sep 17, 2012 7:00:54 PM] |
||
|
|
Paul Schlaffer
Senior Cruncher USA Joined: Jun 12, 2005 Post Count: 279 Status: Offline Project Badges:
|
I had the same issue today with a 2516_2QOV work unit. Thanks to this post I aborted the WU and the problem was resolved. Unfortunately it idled 11 processors until then. Thanks for the info.
----------------------------------------
“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I started crunching again after an idle summer due to cooling issues, and I've been getting these corrupted work units too. Same issue: I had one while I was at work, and all 12 threads idled out because of it. I aborted it, got 12 new clean ones, and am crunching again. I had to clear another one this morning on another box. I hope there is a fix for these soon.
|
||
|
|
KWSN - A Shrubbery
Master Cruncher Joined: Jan 8, 2006 Post Count: 1585 Status: Offline |
Don't expect a fix as this project is also ending. I've had one stuck file on a different machine every day the past three days. Just abort it and move on.
Keep a (relatively) close eye on your machines and this becomes a non-issue. Since it (usually) only affects one file it shouldn't bring the machine to a halt.
----------------------------------------Distributed computing volunteer since September 27, 2000 |
||
|
|
Paul Schlaffer
Senior Cruncher USA Joined: Jun 12, 2005 Post Count: 279 Status: Offline Project Badges:
|
I encountered another one of these today (CMD2_2533_MYH6.clusterOccur_2QOV). Again I aborted the problem WU and downloads returned to normal.
----------------------------------------
“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
|
||
|
|
Byteball_730a2960
Senior Cruncher Joined: Oct 29, 2010 Post Count: 318 Status: Offline Project Badges:
|
Quoting Paul Schlaffer: "I had the same issue today with a 2516_2QOV work unit. Thanks to this post I aborted the WU and the problem was resolved. Unfortunately it idled 11 processors until then. Thanks for the info."
I have the same issue (although not 12 cores). I have a number of machines that are left unattended around the world (family and friends). My buffer on these machines is 0 days, and a stuck download idles cores that I don't want to lose. I have no idea which computers have stuck work units, as the computers are on so randomly that I cannot detect a pattern. Will these WUs eventually time out so the computers come back, or will I be losing these machines? |
||
|
|
Paul Schlaffer
Senior Cruncher USA Joined: Jun 12, 2005 Post Count: 279 Status: Offline Project Badges:
|
These work units should eventually be aborted by the server due to the expired return deadline (or earlier if detected).
----------------------------------------
“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
----------------------------------------[Edit 1 times, last edit by Paul Schlaffer at Sep 21, 2012 1:07:14 AM] |
||
|
|
Byteball_730a2960
Senior Cruncher Joined: Oct 29, 2010 Post Count: 318 Status: Offline Project Badges:
|
Nice. Thanks for that.
|
||
|
|
|