World Community Grid Forums
Thread Status: Active. Total posts in this thread: 26
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Let me take this one as an example to show why the Server Abort took place (results sorted in order of generation); the default HCC deadline is exactly 7 days **:

1) This original of the 2 copies was overdue at 03/03/11 06:07:58, causing the _2 copy to be sent out upon "No Reply". It still reported at 07:08:29, about an hour late:
X0000063880291200601271046_0 -- 608 Valid 21/02/11 06:07:58 03/03/11 07:08:29 2.91 46.4 / 42.0

2) This second of the 2 original copies came back in 4 days... not swift, but no issue:
X0000063880291200601271046_1 -- 608 Valid 21/02/11 06:08:39 25/02/11 00:44:50 3.09 37.5 / 42.0

3) This is the repair job sent out at 06:09:32 due to the "No Reply" of 1). It was not started immediately, and when 1) above still reported, albeit late, the flag was set for the repair wingman (a trusted client) not to process the task; "Server Abort" was thus signalled to this client at 11:59:55 when the host next talked to the servers:
X0000063880291200601271046_2 -- 640 Server Aborted 03/03/11 06:09:32 03/03/11 11:59:55 0.00 0.0 / 0.0

All is well: some extra intertube bandwidth was used, but no redundant crunch time went to the waist (pun intended).

** Hmmm, did the standard deadline for HCC change back from 7 to 10 days?

Hope this clarifies the mud a bit --//--
[Edit 1 times, last edit by Former Member at Mar 4, 2011 3:21:07 PM]
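The three-step sequence above amounts to a simple server-side rule. Here is a minimal Python sketch of that rule; the function name and signature are my own invention for illustration, not actual BOINC server code:

```python
# Hypothetical sketch of the server-side behaviour described above: an
# unstarted repair copy becomes redundant once enough earlier copies have
# validated to satisfy the quorum, and is flagged "Server Aborted" the next
# time the repair host contacts the servers.
def repair_copy_redundant(quorum_size, valid_results, repair_started):
    """True if a still-unstarted repair task should be server-aborted."""
    return (not repair_started) and valid_results >= quorum_size

# The HCC example: quorum of 2, both _0 (late) and _1 reported Valid,
# repair copy _2 never started on the wingman host.
print(repair_copy_redundant(quorum_size=2, valid_results=2, repair_started=False))
```

If the wingman had already started the task, the server would let it run to completion instead, which is why catching the task before it starts only costs bandwidth, not crunch time.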
rilian
Veteran Cruncher Ukraine - we rule! Joined: Jun 17, 2007 Post Count: 1460 Status: Offline
SekeRob, 32 of my cores were sitting dry for several hours.... There is only 512 MB RAM per 16 cores, so I had only HCC ticked on and not "load work from other machines".
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
rilian, you can trick the client a little. HCMD2 is also small, as is C4CW. By limiting the permitted BOINC RAM on the host for both idle and in-use states, and setting the "if there is no work...", the client would go fetch only the small/lighter sciences, but not the biggies... so goes the theory. I'd be interested to know whether that works, by volunteering you to test this ;P

----------------------------------------
If the trick works I'll be porting this into an FAQ --//--
PS: "load work from other machines" I've not found yet as an option, but we know what you meant ;o)
[Edit 1 times, last edit by Former Member at Mar 4, 2011 3:29:16 PM]
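The theory above can be sketched as a filter: the scheduler only sends tasks whose minimum RAM fits inside the client's permitted memory. The per-application minima below are the values a scheduler log later in this thread reports (the HCMD2 figure is an assumption, inferred from it fitting under a 102.31 MB allowance); the helper itself is illustrative, not WCG code:

```python
# Per-application minimum RAM in MB. HFCC/C4CW/HPF2/DDDT2 figures come from
# the scheduler log quoted later in this thread; HCMD2 is an assumption.
APP_MIN_RAM_MB = {
    "HCMD2": 102.0,
    "HFCC": 119.21,
    "HPF2": 171.66,
    "C4CW": 384.00,
    "DDDT2": 750.00,
}

def fetchable(apps, allowed_mb):
    """Apps the client could still receive under the memory allowance."""
    return sorted(name for name, need in apps.items() if need <= allowed_mb)

print(fetchable(APP_MIN_RAM_MB, 102.31))  # only the smallest-footprint sciences
print(fetchable(APP_MIN_RAM_MB, 380.0))   # everything except C4CW and DDDT2
```

Raising or lowering the allowance moves the cut-off line, which is exactly the control knob the trick exploits.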
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Thanks for the clarification, SekeRob. Just a quick question: is there any info in the result name that tells you it is a repair WU, or is the fact that there is a third WU enough to tell you it is a repair WU?
cheers |
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
And a suggestion for the techs: have the "if there is no work..." option only send work when the number of tasks "In Progress" sinks to or below 1 per core. This way, once preferred work becomes available again, those hosts would return in short order to the sciences the member has elected for the host(s)/profile(s).

Would that work? I think this would increase members' willingness to select this "recommended" alternate work supply. --//--
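As a thought experiment, the suggested gating rule could look like this; the function and its parameters are hypothetical, purely to make the proposal concrete:

```python
# Hypothetical sketch of the suggestion above (not actual WCG scheduler
# logic): only top a host up with alternate, non-preferred sciences once
# its in-progress queue has drained to at most one task per core, and
# never while preferred work is available again.
def send_alternate_work(in_progress, cores, preferred_available):
    if preferred_available:
        return False             # preferred sciences take over again
    return in_progress <= cores  # queue drained to <= 1 task per core

print(send_alternate_work(in_progress=40, cores=32, preferred_available=False))  # False
print(send_alternate_work(in_progress=30, cores=32, preferred_available=False))  # True
```

The point of the `in_progress <= cores` threshold is that the host keeps crunching without a gap, yet carries almost no backlog of alternate tasks when the preferred science returns.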
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Thanks for the clarification, SekeRob. Just a quick question: is there any info in the result name that tells you it is a repair WU, or is the fact that there is a third WU enough to tell you it is a repair WU? cheers

Assuming you know the standard quorum size, you can tell by the suffix and of course the 4-day or shorter "rush" deadline. A suffix of _2 for HCC is a dead giveaway. For Zero Redundancy sciences, those that normally have no wingman, it's less obvious, as sometimes an extra wingman is sent out at the same time, so I'd say the shorter deadline is the more reliable indicator.

Because I run with a near 2-day cache, remaining reliable, yet receiving repair jobs without pushing them up the queue, the slow-boat-to-China cruncher has a little bit more grace time to complete the task before the servers send my hosts an "it's redundant, don't bother" server abort.

I much prefer the BOINCTasks utility because it also has an assignment-date column, which the standard BOINC Manager does not; that makes short-deadline tasks more obvious to me. --//--
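The two indicators described above (suffix at or beyond the quorum size, and a rush deadline) can be combined into a quick check. This is a hedged sketch: the result-name parsing is an assumption about the `..._N` naming format, and the 4-day cutoff is the rush deadline mentioned in this thread:

```python
# Guess whether a result is a repair (resend) task from its name suffix
# and deadline, per the indicators above. Illustrative only.
def looks_like_repair(result_name, quorum_size, deadline_days):
    suffix = int(result_name.rsplit("_", 1)[1])   # "..._2" -> 2 (assumed format)
    beyond_quorum = suffix >= quorum_size         # e.g. _2 when quorum is 2
    rush_deadline = deadline_days <= 4            # shortened resend deadline
    return beyond_quorum or rush_deadline

# The HCC repair copy from the earlier example (quorum of 2):
print(looks_like_repair("X0000063880291200601271046_2", quorum_size=2, deadline_days=4))
# An ordinary original copy with the full 7-day deadline:
print(looks_like_repair("X0000063880291200601271046_0", quorum_size=2, deadline_days=7))
```

For zero-redundancy sciences the suffix test is unreliable, as noted above, so the deadline term carries the weight there.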
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
rilian, you can trick the client a little. HCMD2 is also small If the trick works I'll be porting this into an FAQ --//-- PS: "load work from other machines" I've not found yet as option, but we know what you meant ;o)

Testing this idea: indeed it works. Took my duo and reduced work/idle memory to 5%, which set it down to a 102.31 MB RAM allowance. Selected HFCC & DDDT2 and the "If there is no work...", then upped the cache. The log below reports that not enough memory was assigned for HFCC & DDDT2 (simultaneously revealing the difference between the System Requirements spec pages and the hard RAM minimum). Only the small-footprint alternate sciences were fetched (HCC/HCMD2):

11580 05-03-2011 13:47 Preferences:
11581 05-03-2011 13:47 max memory usage when active: 102.31MB
11582 05-03-2011 13:47 max memory usage when idle: 102.31MB
11583 05-03-2011 13:47 max disk usage: 10.00GB
11584 05-03-2011 13:47 (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
11585 WCG 05-03-2011 13:47 [sched_op] Starting scheduler request
11586 WCG 05-03-2011 13:47 Sending scheduler request: To fetch work.
11587 WCG 05-03-2011 13:47 Requesting new tasks for CPU
11588 WCG 05-03-2011 13:47 [sched_op] CPU work request: 31872.23 seconds; 0.00 CPUs
11589 WCG 05-03-2011 13:47 Scheduler request completed: got 3 new tasks
11590 WCG 05-03-2011 13:47 [sched_op] Server version 601
11591 WCG 05-03-2011 13:47 No work can be sent for the applications you have selected
11592 WCG 05-03-2011 13:47 No work is available for Discovering Dengue Drugs - Together - Phase 2 (Type A)
11593 WCG 05-03-2011 13:47 Help Fight Childhood Cancer needs 119.21 MB RAM but only 102.31 MB is available for use.
11594 WCG 05-03-2011 13:47 Discovering Dengue Drugs - Together - Phase 2 needs 750.00 MB RAM but only 102.31 MB is available for use.
11595 WCG 05-03-2011 13:47 You have selected to receive work from other applications if no work is available for the applications you selected
11596 WCG 05-03-2011 13:47 Sending work from other applications
11597 WCG 05-03-2011 13:47 Project requested delay of 11 seconds
11598 WCG 05-03-2011 13:47 [sched_op] estimated total CPU task duration: 50682 seconds
11599 WCG 05-03-2011 13:47 [sched_op] Deferring communication for 11 sec
11600 WCG 05-03-2011 13:47 [sched_op] Reason: requested by project
11601 WCG 05-03-2011 13:47 Started download of wcg_hcc1_img_6.40_windows_intelx86
11602 WCG 05-03-2011 13:47 Started download of wcg_hcc1_img_graphics_6.40_windows_intelx86
11603 WCG 05-03-2011 13:47 Finished download of wcg_hcc1_img_graphics_6.40_windows_intelx86
11604 WCG 05-03-2011 13:47 Started download of hcc1_image04_6.40.tga
11605 WCG 05-03-2011 13:47 Finished download of wcg_hcc1_img_6.40_windows_intelx86
11606 WCG 05-03-2011 13:47 Finished download of hcc1_image04_6.40.tga
11607 WCG 05-03-2011 13:47 Started download of hcc1_image03_6.40.tga
11608 WCG 05-03-2011 13:47 Started download of hcc1_image02_6.40.tga
11609 WCG 05-03-2011 13:48 Finished download of hcc1_image03_6.40.tga
11610 WCG 05-03-2011 13:48 Finished download of hcc1_image02_6.40.tga
11611 WCG 05-03-2011 13:48 Started download of hcc1_image01_6.40.tga
11612 WCG 05-03-2011 13:48 Started download of X0000065240731200602242024_X0000065240731200602242024.jp2
11613 WCG 05-03-2011 13:48 Finished download of hcc1_image01_6.40.tga
11614 WCG 05-03-2011 13:48 Finished download of X0000065240731200602242024_X0000065240731200602242024.jp2
11615 WCG 05-03-2011 13:48 Started download of X0000065240427200602242029_X0000065240427200602242029.jp2
11616 WCG 05-03-2011 13:48 Started download of 583efc9bc28523c3f2e0a9647b3b8936.dat.gzb
11617 WCG 05-03-2011 13:48 Finished download of X0000065240427200602242029_X0000065240427200602242029.jp2
11618 WCG 05-03-2011 13:48 Finished download of 583efc9bc28523c3f2e0a9647b3b8936.dat.gzb
11619 WCG 05-03-2011 13:48 Started download of cbfbb23e5ae6f9c81628dad4bab38e8d.dat.gzb
11620 WCG 05-03-2011 13:48 Started download of 5be7af669ed63f33b54771c812958e27.pdb.gzb
11621 WCG 05-03-2011 13:48 Finished download of cbfbb23e5ae6f9c81628dad4bab38e8d.dat.gzb
11622 WCG 05-03-2011 13:48 Finished download of 5be7af669ed63f33b54771c812958e27.pdb.gzb
11623 WCG 05-03-2011 13:48 Started download of 93f4e8307bf057ebd259191837a37a6c.pdb.gzb
11624 WCG 05-03-2011 13:48 Started download of b416fac6f940515859406b3d7fb2f4dd.dat.gzb
11625 WCG 05-03-2011 13:48 Finished download of 93f4e8307bf057ebd259191837a37a6c.pdb.gzb
11626 WCG 05-03-2011 13:48 Finished download of b416fac6f940515859406b3d7fb2f4dd.dat.gzb

In the interim, to learn what the hard limit is for the other sciences, I expanded the science selection and got this:

11675 WCG 05-03-2011 14:16 The Clean Energy Project - Phase 2 needs 750.00 MB RAM but only 102.31 MB is available for use.
11676 WCG 05-03-2011 14:16 Help Fight Childhood Cancer needs 119.21 MB RAM but only 102.31 MB is available for use.
11677 WCG 05-03-2011 14:16 Computing for Clean Water needs 384.00 MB RAM but only 102.31 MB is available for use.
11678 WCG 05-03-2011 14:16 Human Proteome Folding - Phase 2 needs 171.66 MB RAM but only 102.31 MB is available for use.
11679 WCG 05-03-2011 14:16 FightAIDS@Home needs 119.21 MB RAM but only 102.31 MB is available for use.
11680 WCG 05-03-2011 14:16 You have selected to receive work from other applications if no work is available for the applications you selected

So effectively, by limiting the memory permission to 118 MB and selecting HCC plus "If there is no work...", the control is to receive only HCMD2 as alternate. Set it to 380 MB and only medical sciences would be received; regrettably DDDT2 would not come along then, though the chance of that is currently slim anyway.

Now that the exact minima have been printed for most sciences, I have to see whether these sciences will run concurrently in any pairing without invoking a "waiting for memory" state, the weakness in the strategy, I fear. ttyl
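The "waiting for memory" concern above boils down to whether the sum of the running tasks' minima fits inside the allowance. A small sketch using the minima the scheduler log printed; the additive budgeting model is an assumption about how BOINC accounts for task RAM:

```python
# Can two tasks run concurrently inside a hard memory allowance?
# Assumes BOINC budgets task RAM additively, which is an approximation.
def can_run_pair(need_a_mb, need_b_mb, allowed_mb):
    return need_a_mb + need_b_mb <= allowed_mb

# HFCC (119.21 MB) alongside HPF2 (171.66 MB) under a 380 MB allowance:
print(can_run_pair(119.21, 171.66, 380.0))  # True
# ...but two C4CW tasks (384.00 MB each) would not fit:
print(can_run_pair(384.0, 384.0, 380.0))    # False
```

If a pairing fails this check, one task would sit in "waiting for memory" until the other finishes, idling a core, which is the weakness feared above.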
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
PS: The temporary HCC shortage knocked on to Saturday's validations... some 35,000 down from Friday, so the projection based on morning data suggests. All other sciences point up, but that could be an after-effect of the server troubles of the past days... delayed returns and validations.
--//-- |
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline
Now that the exact minima were printed for most sciences, have to see if these sciences will run concurrent in any pairing without invoking a "waiting for memory" state, the weakness in the strategy I fear.

Yes, with low memory settings you'll successfully download the work, but you can't run it on all the cores, since there is not enough memory for that. If rilian has a 16-core system, as his message indicates, I would expect he'll need to allow at least 1 GB of memory, even running the smallest-memory WCG project, just to keep all cores loaded. Also, if it starts to hit the memory limit, there's a chance some tasks will be removed from memory regardless of LAIM being on, which can lose a significant amount of time if there is a long time between checkpoints.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
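The 1 GB estimate above is easy to sanity-check. The ~64 MB per-task figure for the smallest-memory WCG tasks is purely an assumption for illustration (the thread itself only gives minima for the larger sciences):

```python
# Back-of-envelope check of the "at least 1 GB for 16 cores" estimate.
# The 64 MB per-task figure is an assumed size for the smallest tasks.
def ram_to_load_all_cores_mb(cores, per_task_mb):
    return cores * per_task_mb

print(ram_to_load_all_cores_mb(16, 64))  # 1024 MB, i.e. about 1 GB
```

With only 512 MB per 16 cores, as rilian reported earlier, roughly half the cores would sit idle or tasks would be swapped out of memory.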
sk..
Master Cruncher http://s17.rimg.info/ccb5d62bd3e856cc0d1df9b0ee2f7f6a.gif Joined: Mar 22, 2007 Post Count: 2324 Status: Offline
I would put these problems down to the server troubles. I ran out of WCG tasks on several systems and for some time. Typically BOINC either got no new work (leaving me dry), backed off trying to download, or got resends (which also backed off) for lost tasks. It may also have affected BOINC more broadly, as some other projects were not automatically downloading at times. Upping the cache and manual updates got work flowing again for the other projects, but not for WCG tasks (with varying multiples of projects selected on different systems).