| World Community Grid Forums
Thread Status: Active Total posts in this thread: 352
RCC_Survivor
Veteran Cruncher USA Joined: Apr 28, 2007 Post Count: 1337 Status: Offline Project Badges:
|
IMHO it would make sense that there was an increase in load after CFSW was accelerated 4-6 times. The time remaining on the project dropped significantly without the crunchers being forewarned and the WUs are going fast.
http://i137.photobucket.com/albums/q210/Sekerob/WCGYearsPi1Project.png
Everyone without a badge in CFSW is probably running it exclusively creating a significant increase in the load. Lowering the priority of CFSW may help the problem subside.
----------------------------------------
Be kinder than necessary, for everyone you meet is fighting some battle.
Please join the team The survivors Bilateral Renal, Melanoma, and Squamous Cell cancers |
||
|
|
wplachy
Senior Cruncher Joined: Sep 4, 2007 Post Count: 423 Status: Offline |
I'm not sure whether this question belongs in this thread, but I recall Sekerob saying something about upload problems being counted against a machine, so I believe it does. Starting yesterday evening, one of my machines is always being assigned a wingman for CFSW (for over 20 consecutive WUs). This is after correctly processing hundreds of WUs on that science, and there are no Errors or Invalids showing in Results Status. Is that likely to be an effect of the server problems? My other machines are still happily doing single validation, except when they get repair jobs.
Did you have any "Server Aborts"? What has been happening to me is that the "Server Aborts" are being treated as errors, in that once a device gets one, the following WUs are assigned wingmen, and the single quorum in-progress WUs are sent to Pending Verification and assigned a wingman. After a while it goes back to single quorum until another server abort, and then it's back in the quorum 2 loop. The odd thing is that the device continues to pull repair WUs even though it's assigned a quorum 2 requirement.
Bill P
----------------------------------------
[Edit 1 times, last edit by wplachy at Aug 15, 2012 9:34:57 PM] |
||
|
|
kateiacy
Veteran Cruncher USA Joined: Jan 23, 2010 Post Count: 1027 Status: Offline Project Badges:
|
Thanks for the feedback, Bill. I vaguely recall noticing a server abort recently (I don't remember whether it was on this machine), but there are none hanging around in Results Status now. The machine has returned to single quorum at this point.
|
||
|
|
PecosRiverM
Veteran Cruncher The Great State of Texas Joined: Apr 27, 2007 Post Count: 1054 Status: Offline Project Badges:
|
Starting yesterday evening, one of my machines is always being assigned a wingman for CFSW (for over 20 consecutive WUs). This is after correctly processing hundreds of WUs on that science, and there are no Errors or Invalids showing in Results Status. Is that likely to be an effect of the server problems? My other machines are still happily doing single validation, except when they get repair jobs.
Is it possible you're getting the second WU as the validator for the first WU? |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Maybe it would be useful if you (we) post OS's you (we) are shrubbing (aka crunching) on :-) Cheers and NI!
Windows XP (32 bit):
8/15/2012 4:10:36 PM World Community Grid Reporting 1 completed tasks, requesting new tasks
8/15/2012 4:10:41 PM World Community Grid Scheduler request completed: got 0 new tasks
8/15/2012 4:10:41 PM World Community Grid Message from server: No tasks sent
8/15/2012 4:10:41 PM World Community Grid Message from server: No tasks are available for Computing for Sustainable Water
8/15/2012 4:10:41 PM World Community Grid Message from server: No tasks are available for Computing for Clean Water
8/15/2012 4:10:41 PM World Community Grid Message from server: No tasks are available for Discovering Dengue Drugs - Together - Phase 2
8/15/2012 4:10:41 PM World Community Grid Message from server: No tasks are available for Discovering Dengue Drugs - Together - Phase 2 (Type A)
8/15/2012 4:10:41 PM World Community Grid Message from server: No tasks are available for the applications you have selected.
[Edit 1 times, last edit by Former Member at Aug 16, 2012 3:42:42 AM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
One can almost sync clock on this... around 13:30 CET [DST], the transients kick in [when something in the North Americas slowly awakens]:
92626 World Community Grid 16-8-2012 13:27:59 Starting task cfsw_12336_12336802_0 using cfsw version 612 in slot 5
92627 World Community Grid 16-8-2012 13:28:02 Started upload of cfsw_12335_12335753_0_0
92628 World Community Grid 16-8-2012 13:28:04 [error] Error reported by file upload server: Maintenance underway: file uploads are temporarily disabled.
92629 World Community Grid 16-8-2012 13:28:04 Temporarily failed upload of cfsw_12335_12335753_0_0: transient upload error
92630 World Community Grid 16-8-2012 13:28:04 Backing off 3 min 15 sec on upload of cfsw_12335_12335753_0_0
92631 World Community Grid 16-8-2012 13:29:34 [checkpoint] result cfsw_12335_12335747_0 checkpointed
92632 World Community Grid 16-8-2012 13:29:42 [checkpoint] result cfsw_12335_12335742_0 checkpointed
92633 World Community Grid 16-8-2012 13:29:44 Computation for task cfsw_12335_12335742_0 finished
92634 World Community Grid 16-8-2012 13:29:44 Starting task cfsw_12336_12336445_0 using cfsw version 612 in slot 3
92635 World Community Grid 16-8-2012 13:29:46 Started upload of cfsw_12335_12335742_0_0
92636 World Community Grid 16-8-2012 13:30:10 Temporarily failed upload of cfsw_12335_12335742_0_0: connect() failed
92637 World Community Grid 16-8-2012 13:30:10 Backing off 3 min 9 sec on upload of cfsw_12335_12335742_0_0
92638 16-8-2012 13:30:15 Project communication failed: attempting access to reference site
92639 16-8-2012 13:30:18 Internet access OK - project servers may be temporarily down.
92640 World Community Grid 16-8-2012 13:30:19 [checkpoint] result cfsw_12335_12335681_0 checkpointed
92641 World Community Grid 16-8-2012 13:30:31 [checkpoint] result cfsw_12335_12335340_0 checkpointed
92642 World Community Grid 16-8-2012 13:31:19 Started upload of cfsw_12335_12335753_0_0
92643 World Community Grid 16-8-2012 13:31:25 [error] Error reported by file upload server: Maintenance underway: file uploads are temporarily disabled.
92644 World Community Grid 16-8-2012 13:31:25 Temporarily failed upload of cfsw_12335_12335753_0_0: transient upload error
92645 World Community Grid 16-8-2012 13:31:25 Backing off 4 min 36 sec on upload of cfsw_12335_12335753_0_0
P.S. Surprising it may be to many, but most of South America is east of Florida. We have depending on DST, only 3 hours difference between Rome and Buenos Aires. Workdays in the NAs start 5 to 6
----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 16, 2012 11:43:15 AM] |
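For anyone who wants to tally these retries from their own message log, here is a minimal Python sketch. It only assumes the "Backing off N min M sec on upload of ..." wording seen in the lines quoted above; the function name `upload_backoffs` and the regular expression are illustrative, not anything supplied by BOINC or WCG, and other client versions may word the message differently (for example with hours).

```python
import re
from collections import defaultdict

# Assumption: the pattern is modelled only on the log lines quoted in this
# thread; it is not taken from BOINC documentation.
BACKOFF_RE = re.compile(
    r"Backing off (?:(\d+) min )?(\d+) sec on upload of (\S+)"
)


def upload_backoffs(log_lines):
    """Return {result_file: [backoff_seconds, ...]} in the order seen."""
    backoffs = defaultdict(list)
    for line in log_lines:
        match = BACKOFF_RE.search(line)
        if match:
            minutes, seconds, name = match.groups()
            # The minutes part is optional in the pattern, so treat it as 0.
            backoffs[name].append(int(minutes or 0) * 60 + int(seconds))
    return dict(backoffs)


# Three of the back-off lines from the log excerpt in this post.
sample = [
    "92630 World Community Grid 16-8-2012 13:28:04 Backing off 3 min 15 sec on upload of cfsw_12335_12335753_0_0",
    "92637 World Community Grid 16-8-2012 13:30:10 Backing off 3 min 9 sec on upload of cfsw_12335_12335742_0_0",
    "92645 World Community Grid 16-8-2012 13:31:25 Backing off 4 min 36 sec on upload of cfsw_12335_12335753_0_0",
]

for name, waits in upload_backoffs(sample).items():
    print(name, waits)
```

Grouping by file name makes it easy to see which uploads were retried more than once and how the wait changed between attempts.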
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
References:
kateiacy [Aug 15, 2012 6:09:58 PM] post
wplachy [Aug 15, 2012 9:25:03 PM] post
... Is that likely to be an effect of the server problems? My other machines are still happily doing single validation, except when they get repair jobs. •snip from kateiacy
Did you have any "Server Aborts"? What has been happening to me is that the "Server Aborts" are being treated as errors... •snip from wplachy
One thing is that reliable/trusted machines have a high chance of getting a lot of repair jobs; those machines are, after all, reliable/trusted machines.
Another thing is that if the server issue ever causes a foul-up in the validation process, I can imagine that it would occur only in cases where deadlines were missed due to delays injected from sync difficulties surrounding the server issue; but it doesn't seem fair, in any case, that the machines involved take a hit in their reliability/trust rating as a result of the said foul-up.
Unless the No-Reply_bug is fixed*, this bug lies in wait to greet the done WU incorrectly deemed (due to the said injected delay) as having a status of No-Reply. Validation processing at this point can only be a mess.
Notes: *See also Ingleside [Jul 29, 2012 1:44:21 PM] post ; |
||
|
|
kateiacy
Veteran Cruncher USA Joined: Jan 23, 2010 Post Count: 1027 Status: Offline Project Badges:
|
Thanks for the additional explanation and the link to Ingleside's message. I think I'm seeing a combination of all those factors. I am getting some repair jobs, because all my machines are "reliable." Mostly these are full-10-day-deadline tasks, though. Many times it's that I'm getting the second WU to validate somebody else's. I imagine lots of people are putting machines onto CFSW that haven't been running it before, in order to try to get badges before the science ends. I just got another slew of these overnight.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Notably, if you have a 64-bit machine, and the ZR task assignment is _1, of CFSW, and of version 6.11, it's a 32-bit version. Then you'd see 10 days. As long as it's 10 days exactly, it's *not* a repair task [in case there were any doubt]
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Had to work the update button quite a bit this AM to get all my overnight updates in...
|
||
|
|
|