| World Community Grid Forums
Thread Status: Active Total posts in this thread: 107
gb077492
Advanced Cruncher Joined: Dec 24, 2004 Post Count: 96 Status: Offline |
First off, many thanks to cleanenergy for the detailed feedback. It was extremely helpful. It makes me feel a lot better that my slow old machines really are contributing something useful.
I, too, share concerns about the future if molecule size will grow and calculation time is extended. I certainly don't have a problem with tasks that run much longer (some DDT2 jobs run over 30 hours on my slow crunchers), but what really worries me is the time that would be lost if a long-running step is interrupted. Many "average joes" (he says, taking his wife as an example) like to turn a machine on, use it for a while, and turn it off again. Maybe the idea of 8 hours being a sweet spot is related to the length of the office day (though that does assume a machine running at 100% efficiency)?
But if individual steps can run for many hours, and it seems that there is no checkpointing within a step, then that's a long way to go back after a reboot. And if we're not in an office environment but at home, could we find ourselves in the position where a step can never complete, because machines don't stay on that long and each time the machine is rebooted it goes back to the same checkpoint and just keeps going round and round the same calculations? As step sizes increase, this problem must get more and more likely. I've been watching some WUs on my slowest machine and I've seen over 11 hours (CPU time, let alone wall-clock) between checkpoints. So I think we're potentially there already.
Just a thought. Won't stop me crunching for this project though!
Mike |
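The "round and round" worry above can be made concrete with a small simulation. This is only a sketch under stated assumptions: checkpoints are written solely at step boundaries (the post says "it seems" so), the machine crunches at 100% while it is on, and the function name and the example step lengths are hypothetical.

```python
def cpu_hours_to_finish(step_hours, session_hours, max_sessions=1000):
    """Simulate a work unit that only checkpoints at step boundaries.

    step_hours: CPU hours each step needs (assumed: checkpoint after each step).
    session_hours: uninterrupted CPU hours available per power-on session.
    Returns total CPU hours spent, or None if some step can never complete.
    """
    spent = 0.0
    step = 0
    for _ in range(max_sessions):
        left = session_hours
        # Run whole steps for as long as they fit into this session.
        while step < len(step_hours) and step_hours[step] <= left:
            left -= step_hours[step]
            spent += step_hours[step]
            step += 1
        if step == len(step_hours):
            return spent
        if step_hours[step] > session_hours:
            # Every reboot returns to the same checkpoint: no progress, ever.
            return None
        # Partial work on the next step is thrown away at shutdown.
        spent += left
    return None


print(cpu_hours_to_finish([1, 2, 3], session_hours=4))   # one hour is redone
print(cpu_hours_to_finish([1, 2, 11], session_hours=8))  # the 11 h step never fits
```

With an 8-hour session, a step that needs 11 CPU hours between checkpoints (the figure observed above) never finishes in this model, however many times the machine is switched on.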
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quote: There is actually a soft time-out after 10h (i.e., no new job gets started) and a hard time-out after 12h (i.e., the running job gets interrupted).
Hmm, does the soft time-out really work? E.g. my WU E200807_903_A.26.C21H13NS3Se.3.4.set1d06_0 was killed after 12 hours computing within job 15, but the result log says
...
[08:30:44] Starting job 14,CPU time has been restored to 31467.133311.
[10:31:06] Finished Job #14
[10:31:06] Starting job 15,CPU time has been restored to 38213.224955.
Killing job because cpu time has been exceeded. Subjob start time = 852808706, Subjob current time = 1088596135
[11:55:01] Finished Job #15
11:55:14 (6408): called boinc_finish
Job 15 ran from 10:31 until 11:55, when it was killed after 12 hours runtime. If I calculate correctly it ran for 1:24 hours, i.e. it was started 36 minutes after the soft time-out. Or did I miss something?
Matthias |
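The arithmetic in the post can be cross-checked from the log itself. The sketch below assumes that the "CPU time has been restored to" value is the accumulated CPU time in seconds at the start of the job; the 10 h and 12 h limits are the ones quoted in the thread, and the function name is hypothetical.

```python
SOFT_TIMEOUT_H = 10.0  # soft limit as quoted in the thread
HARD_TIMEOUT_H = 12.0  # hard limit as quoted in the thread


def hours_past_soft_timeout(cpu_seconds_at_start, soft_timeout_h=SOFT_TIMEOUT_H):
    """Hours after the soft limit at which a job started (negative if before it)."""
    return cpu_seconds_at_start / 3600.0 - soft_timeout_h


# Value taken from the log line "Starting job 15,CPU time has been restored to ..."
job15_start_cpu_s = 38213.224955

late_min = hours_past_soft_timeout(job15_start_cpu_s) * 60
remaining_min = (HARD_TIMEOUT_H * 3600 - job15_start_cpu_s) / 60

print(f"Job 15 started {late_min:.1f} min after the 10 h mark")
print(f"CPU time left before the 12 h kill: {remaining_min:.1f} min")
```

Under that assumption the log agrees with the estimate in the post: job 15 began roughly 37 minutes past the 10-hour mark and had about 83 minutes of CPU time left, close to the 1:24 of wall-clock time between 10:31 and 11:55.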
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi everybody,
A few follow-ups on this discussion:
Quote: a while back I proposed an idea of pairing slow machines with fast machines.
Yes, there are a number of ways to optimize grid computing, e.g., smart wingman pairing, smart matching of client and WU (i.e., more powerful computers get more challenging WUs), customized WUs depending on individual hardware setups, etc. IBM is aware of these ideas and of the fact that a one-glove-fits-all approach is not particularly efficient. Unfortunately, smart and individual solutions can add considerable complexity to maintaining and running WCG, so not everything desirable is also practical on such a massive scale. Simplicity is (to some degree) a quality in itself. We had briefly considered a pairing strategy during the early beta phase, and we could kick it around with IBM some more now; maybe it can be done after all.
Quote: I wonder what that does to the memory requirements.
Disc and memory requirements also increase with system size. Our library currently contains about 10 million candidate molecules, which are for now sorted by size. They get sent to the grid in (more or less) continuously ascending order (the current max size is 499 of some complicated unit). The next batch of 2.7 million candidates is made up of recently generated molecules of a size between ~250 and 499. Hence, there will be smaller ones again, which are out of order, but they will only go up to the same size as the currently biggest molecules. So there will be no big jump in size anytime soon. We will be able to make some more quantitative statements when we compile our internal statistics early next year.
Quote: I, too, share concerns about the future if molecule size will grow and calculation time is extended.
The situation is not that dramatic because we have designed our library such that the bulk of the molecules are about the size we think is doable on the grid (at the same time we crunch away the molecules from the large end on our cluster). If we feel that the molecule size gets too big for the grid, we have the machinery in place to produce more work of a size that can efficiently be performed on the WCG. But again, considering the current make-up of the library, we should have plenty of warning time before things get dicey.
Quote: and it seems that there is no checkpointing within a step
The checkpointing in CEP2 is indeed a tricky business since it adds more I/O, which already is an Achilles’ heel. We currently make do with a suboptimal solution. Keeping the application in memory fixes this, but we are also still aiming at a more efficient way of checkpointing within a job.
Quote: does the soft time-out really work?
Well, it should… in theory… we hope… Maybe we should have another look.
Best wishes and, as always, thanks for crunching CEP2.
Your Harvard CEP team |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
cleanenergy,
Quote: A few follow-ups on this discussion:
I certainly want to thank you for your further explanations. I, for one, have no problem with WUs that run much longer than 12 hours. If sufficient information can be obtained from the later jobs (14, 15), I would like to have the WU run to full completion. Of course, this means matching up with a wingman who has the same specifications, since doing otherwise would result in too many WUs being placed in Pending Validation status; it may also increase the number of No Reply results.
My only other comment has to do with another reason for providing a time limit in which to complete any WU. Should RAM or recording to the hard drive get messed up, you want to have a way to terminate a WU so that new processing may take place. I would like to see an enhancement made to BOINC which would provide a message saying that BOINC is being shut down for recovery from an error situation and that the only way to continue is to reboot.
Thanks, Dave |
||
|
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline
|
Hi cleanenergy, with regards to the soft timeout of 10 hrs, that's something that I didn't know/realise until you mentioned it, as I'd never seen it work. Indeed, I've seen numerous WUs start on a new step after the 10 hr mark, so it may be something to investigate...
As a couple of examples, the top 2 WUs in this image should, I believe, have been cut off at the completion of the prior step: [IMG]http://i1195.photobucket.com/albums/aa385/FBeardsell/WCG-image02.jpg[/IMG]
[Edit 2 times, last edit by gb009761 at Dec 20, 2010 8:33:32 PM] |
||
|
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline
|
After now reaching my first major goal in this project (Gold), I've got some additional data to go with the set contained in the opening post. Obviously, the scientists/WCG techs will have access to a far wider range of results than I could possibly have, although the sample set of 80 which I have captured, should give some food for thought...
Both pairs completed in the time limit = 13
Only 1 of the pairs completed in the time limit = 47
Neither pair completed in the time limit = 20
Of the 80 WU's my computer processed;
Completed fully = 11
Aborted = 10
Killed in step #15 = 7
Killed in step #16 = 52
Of my wingmen (of which, 4 WU's had repair units sent);
Completed fully = 45
No Replies = 2
Aborted = 11
Killed in step #04 = 1
Killed in step #08 = 1
Killed in step #09 = 1
Killed in step #10 = 1
Killed in step #11 = 3
Killed in step #12 = 1
Killed in step #13 = 2
Killed in step #14 = 2
Killed in step #15 = 6
Killed in step #16 = 8
At no time did I see any WU's complete in a controlled manner via the 'soft timeout' option after the 10 hr mark - and therefore, this may also be something which needs further investigation/testing.
The remainder of the sample data (these don't include those already posted in the opening post), are;
Both completed okay
E200765_988_A.27.C21H13N3S3.669.1.set1d06_0-- me = 11.65 (completed okay) wingman = 9.72 (completed okay)
E200767_399_A.28.C20H11N5OS2.401.1.set1d06_1-- me = 8.16 (aborted in step #13) wingman = 4.42 (aborted in step #13), 10.50 (aborted in step #13)
E200768_696_A.27.C22H13NOS3.540.4.set1d06_1-- me = 8.42 (aborted in step #13) wingman = 4.62 (aborted in step #13)
E200770_277_A.27.C22H14N2OS2.154.2.set1d06_1-- me = 7.72 (cut short in step #13) wingman = 6.00 (cut short in step #13)
E200770_404_A.25.C22H16SSeSi.22.4.set1d06_0-- me = 11.60 (completed okay) wingman = 8.25 (completed okay)
E200771_223_A.26.C22H14N2SSe.96.1.set1d06_1-- me = 7.61 (aborted in step #13) wingman = 3.89 (aborted in step #13)
E200816_989_A.27.C21H13NOS3Si.410.4.set1d06_0-- me = 7.65 (aborted in step #13) wingman = 5.10 (aborted in step #13)
E200820_058_A.28.C20H11N5S3.869.2.set1d06_0-- me = 6.96 (aborted in step #13) wingman = 4.71 (aborted in step #13)
E200820_507_A.27.C22H13NS4.354.0.set1d06_1-- me = 11.68 (completed okay) wingman = 7.25 (completed okay)
Only 1 completed okay
E200761_476_A.27.C22H13NOS3.254.3.set1d06_0-- me = 12.00 (killed in step #16) wingmen = 7.12 (completed okay), 'No Reply'
E200763_599_A.26.C22H13NS2Se.2.3.set1d06_1-- me = 12.00 (killed in step #15) wingman = 8.51 (completed okay)
E200763_918_A.27.C22H13NOS3.535.1.set1d06_0-- me = 12.00 (killed in step #15) wingman = 7.97 (completed okay)
E200765_156_A.28.C19H11N7S2.73.2.set1d06_0-- me = 12.00 (killed in step #16) wingman = 9.48 (completed okay)
E200765_254_A.27.C22H13NOS3.617.3.set1d06_1-- me = 12.00 (killed in step #16) wingman = 8.42 (completed okay)
E200765_339_A.28.C20H11N5OS2.319.2.set1d06_0-- me = 12.00 (killed in step #16) wingman = 6.97 (completed okay)
E200765_493_A.27.C21H13N3S3.602.0.set1d06_0-- me = 12.00 (killed in step #16) wingman = 9.85 (completed okay)
E200765_512_A.27.C21H13N3S3.618.1.set1d06_1-- me = 12.00 (killed in step #16) wingman = 10.85 (completed okay)
E200765_836_A.28.C21H11N3O2S2.200.4.set1d06_0-- me = 12.00 (killed in step #15) wingman = 8.79 (completed okay)
E200765_910_A.28.C20H11N5OS2.400.0.set1d06_0-- me = 12.00 (killed in step #16) wingman = 8.81 (completed okay)
E200766_170_A.27.C21H13N3S3.523.0.set1d06_0-- me = 11.72 (completed okay) wingman = 4.50 (aborted in step #14)
E200766_176_A.27.C22H13NOS3.608.1.set1d06_0-- me = 12.00 (killed in step #16) wingman = 7.24 (completed okay)
E200766_274_A.26.C20H13N3S2Se.16.2.set1d06_1-- me = 9.93 (aborted in step #15) wingman = 12.00 (killed in step #13)
E200766_350_A.26.C20H15NOS2Si2.100.0.set1d06_0-- me = 12.00 (killed in step #16) wingman = 9.81 (completed okay)
E200766_464_A.26.C20H13N3S2Se.12.3.set1d06_0-- me = 12.00 (killed in step #16) wingman = 7.01 (completed okay)
E200766_583_A.26.C21H15NS3Si.293.4.set1d06_0-- me = 12.00 (killed in step #15) wingman = 11.91 (completed okay)
E200767_394_A.28.C20H11N5OS2.367.2.set1d06_1-- me = 12.00 (killed in step #16) wingman = 11.58 (completed okay)
E200768_042_A.25.C21H13NSSe2.42.1.set1d06_1-- me = 11.64 (completed okay) wingman = 12.00 (killed in step #12)
E200768_064_A.26.C21H15NS3Si.309.3.set1d06_0-- me = 12.00 (killed in step #16) wingman = 9.50 (completed okay)
E200768_542_A.27.C20H13N3OS2Si.247.3.set1d06_1-- me = 12.00 (killed in step #16) wingman = 11.11 (completed okay)
E200768_676_A.26.C20H15NOS2Si2.74.2.set1d06_1-- me = 12.00 (killed in step #16) wingmen = 7.19 (completed okay), 'No Reply'
E200769_461_A.26.C22H16OS2Si.33.1.set1d06_1-- me = 12.00 (killed in step #16) wingman = 11.81 (completed okay)
E200770_050_A.26.C23H14OSeSi.4.set1d06_0-- me = 12.00 (killed in step #15) wingman = 11.05 (completed okay)
E200770_331_A.26.C22H16OS2Si.45.1.set1d06_0-- me = 12.00 (killed in step #16) wingman = 7.49 (completed okay)
E200818_337_A.28.C20H11N5S3.779.1.set1d06_0-- me = 12.00 (killed in step #16) wingman = 8.33 (completed okay)
E200819_328_A.28.C20H11N5S3.735.3.set1d06_1-- me = 11.47 (completed okay) wingman = 12.00 (Killed in step #16)
E200835_333_A.27.C22H14N2S3.713.4.set1d06_1-- me = 12.00 (killed in step #16) wingman = 6.65 (completed okay)
Neither completed okay
E200709_749_A.26.C22H15NSSi2.22.3.set1d06_1-- me = 12.00 (killed in step #15) wingman = 12.00 (killed in step #16)
E200761_454_A.28.C21H11N3O2S2.30.3.set1d06_1-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #16)
E200763_144_A.28.C21H11N3O2S2.128.0.set1d06_0-- me = 6.38 (aborted in step #13) wingman = 12.00 (killed in step #08)
E200763_330_A.28.C20H11N5OS2.59.3.set1d06_1-- me = 12.00 (killed in step #16) wingmen = 12.00 (killed in step #11), 12.00 (killed in step #16)
E200764_308_A.28.C21H11N3O2S2.167.4.set1d06_1-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #11)
E200765_293_A.27.C19H13N5S2Si.53.4.set1d06_1-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #14)
E200765_525_A.27.C22H13NOS3.624.3.set1d06_0-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #15)
E200765_859_A.28.C21H11N3O2S2.182.2.set1d06_0-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #15)
E200766_297_A.26.C21H15NS3Si.301.2.set1d06_1-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #10)
E200766_351_A.26.C21H13NOS2Se.173.2.set1d06_0-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #16)
E200767_969_A.27.C21H13N3S3.533.4.set1d06_0-- me = 12.00 (killed in step #16) wingman = 4.85 (aborted in step #14)
E200768_620_A.27.C20H13N3OS2Si.234.2.set1d06_0-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #15)
E200771_698_A.26.C21H16N2S2Si.4.4.set1d06_1-- me = 12.00 (Killed in step #16) wingman = 12.00 (killed in step #13)
E200835_158_A.28.C21H12N4OS2.670.1.set1d06_1-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #11)
E200842_986_A.27.C22H15NO2SSi.80.0.set1d06_0-- me = 12.00 (killed in step #16) wingman = 12.00 (killed in step #16) |
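For anyone who wants to re-tally a list like this, here is a minimal sketch. It checks that the summary counts in the post add up and parses a few rows copied from it. The row layout comes from the post; the regular expression, the function name, and the reading that the four repair units account for the four extra wingman results are assumptions.

```python
import re
from collections import Counter

# Summary counts exactly as given in the post.
MY_SUMMARY = {"completed fully": 11, "aborted": 10,
              "killed in step #15": 7, "killed in step #16": 52}
WINGMAN_SUMMARY = {"completed fully": 45, "no replies": 2, "aborted": 11,
                   "killed #04": 1, "killed #08": 1, "killed #09": 1,
                   "killed #10": 1, "killed #11": 3, "killed #12": 1,
                   "killed #13": 2, "killed #14": 2, "killed #15": 6,
                   "killed #16": 8}

# A few rows copied from the post; the dashes after the WU name are padding.
ROWS = """\
E200765_988_A.27.C21H13N3S3.669.1.set1d06_0-- me = 11.65 (completed okay) wingman = 9.72 (completed okay)
E200761_476_A.27.C22H13NOS3.254.3.set1d06_0-- me = 12.00 (killed in step #16) wingmen = 7.12 (completed okay), 'No Reply'
E200766_350_A.26.C20H15NOS2Si2.100.0.set1d06_0- me = 12.00 (killed in step #16) wingman = 9.81 (completed okay)
E200709_749_A.26.C22H15NSSi2.22.3.set1d06_1-- me = 12.00 (killed in step #15) wingman = 12.00 (killed in step #16)
""".splitlines()

# Assumed pattern: WU name, padding dashes, then "me = <hours> (<outcome>)".
ROW_RE = re.compile(r"^(?P<wu>\S+?)-+\s+me = (?P<hours>[\d.]+) \((?P<outcome>[^)]*)\)")


def tally_my_outcomes(rows):
    """Count the poster's own outcome per row; non-matching lines are skipped."""
    counts = Counter()
    for row in rows:
        match = ROW_RE.match(row)
        if match:
            counts[match.group("outcome").lower()] += 1
    return counts


my_total = sum(MY_SUMMARY.values())
wingman_total = sum(WINGMAN_SUMMARY.values())
sample_counts = tally_my_outcomes(ROWS)
print(my_total, wingman_total, dict(sample_counts))
```

The poster's own counts sum to 80, and the wingman counts sum to 84, which is consistent with 80 work units plus the four that had repair units sent.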
||
|
|
gb077492
Advanced Cruncher Joined: Dec 24, 2004 Post Count: 96 Status: Offline |
A few observations and comments, prompted by the beta but probably more relevant to this thread.
I have a slow old P4 with HT on that has 1GB RAM and a reasonably fast hard disk that I regularly de-frag. It crunches CEP2 without limit. It does stutter when CEP2 tasks start, and if two start at the same time and pre-empt already running CEP2 tasks it can blow the 30-second heart-beat check, but fortunately that doesn't happen often.
The soft time-out after 10 hours definitely doesn't work. However, that doesn't worry me! What I would suggest is that the soft time-out be fixed, but that it should also depend on getting through enough of the steps for the result to be useful on its own. From what I read I think that means not until step 8 is complete. Personally I would be happy to extend the hard time-out to 15 hours or even more. I have observed some WUs where step 2 has run really long, even beyond the 10 hour mark, and it seems silly to kill a WU that soon.
Just my 2p'th.
Mike |
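The suggestion above amounts to a two-condition rule, which can be sketched as follows. This is a hypothetical illustration of the proposal, not the behaviour of the CEP2 application: the function names are invented, the 10 h and 12 h limits are the figures discussed in the thread, and the minimum of 8 completed steps is the poster's own reading.

```python
def should_stop_before_next_step(cpu_hours, steps_done, soft_h=10.0, min_steps=8):
    """Proposed soft time-out: finish cleanly instead of starting another step,
    but only once the result is useful on its own (assumed: min_steps done)."""
    return cpu_hours >= soft_h and steps_done >= min_steps


def should_kill_running_step(cpu_hours, hard_h=12.0):
    """Hard time-out: interrupt whatever step is currently running."""
    return cpu_hours >= hard_h


# A WU whose step 2 overran the 10 h mark keeps going under this rule...
print(should_stop_before_next_step(cpu_hours=10.5, steps_done=2))   # False
# ...while one past step 14 at 10.6 h would end cleanly instead of starting step 15.
print(should_stop_before_next_step(cpu_hours=10.6, steps_done=14))  # True
# Raising hard_h to 15.0 models the longer hard limit suggested above.
print(should_kill_running_step(cpu_hours=12.5, hard_h=15.0))        # False
```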
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quote: At no time did I see any WU's complete in a controlled manner via the 'soft timeout' option after the 10 hr mark - and therefore, this may also be something which needs further investigation/testing.
On first reading, I went through all the logs, keeping an eye out for that soft end after 10 hours, and can confirm that it never happened. In almost all cases where results ran the full 12:00 hours, jobs had been started after the 10th hour; some were killed in a job that had started before the 10th hour. Personally, if that's the case and the task is already beyond the critical minimum, I'd much prefer the job to end properly, because it is deeply unsatisfying when a job segment is not completed, i.e. ditched.
[Edit 1 times, last edit by Former Member at Jan 11, 2011 12:41:35 PM] |
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline
|
There has been some misunderstanding about the soft timeout: there is no soft timeout in the current release. At one point there was discussion of having such a feature, but it was not implemented. Sorry for the misunderstanding. We have started discussing adding some sort of soft timeout again. Stay tuned.
Thanks, armstrdj |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Sorry about the misunderstanding - this was entirely my fault. I had missed that the soft time-out did not make it into the release version. Mea culpa...
Best, your Harvard CEP team |
||
|
|
|