World Community Grid Forums
Thread Status: Active | Total posts in this thread: 10
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hi!

I have a small problem with the WU length. In particular, my "Castor" computer failed to return a WU in time. The server responded with "Too Late": BOINC Credit Claimed 218, Granted 0. The WU was sent to another computer at the WU deadline. Since no one else was waiting for this particular WU, the server simply discarded the result. The Castor computer also has the property that it has NEVER calculated even a single WU in error, ever.

I don't wish to argue for a longer deadline period or a shorter WU length; those should be set to optimise overall performance. But I would like you to consider the following: if a WU times out and only a single computer is attached (Quorum 1), could the server "hang around" a while and not be so aggressive about sending the WU to another computer? The server could add the computation time of the former computation to the clock before issuing the WU to another computer. If we have 2 computers (Quorum 2), so that another computer needs the result for confirmation/credit calculation, then the issue is different.

Now, someone might suggest that this kind of slow computer should be taken to recycling! But if Castor continues to calculate for another 3 years, the collected work is not that insignificant. A more practical issue is that if I must disconnect computers from the Grid, I will start with fast units where I can save power by keeping the CPU idle.

I have 110 WUs in the queue on my machines. I have contributed approx 24 years to the FightAIDS@Home project.

faah17438_ZINC17213409_EN1md02420CTP_01
faah17792_ZINC08637384_xMut_md18750_00
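The grace-period idea above can be sketched in a few lines. This is a minimal illustration in Python, assuming hypothetical field names (`deadline`, `quorum`, `prior_compute_time`) — it is not actual BOINC server code:

```python
from datetime import timedelta

def should_reissue(wu, now):
    """Decide whether a timed-out work unit should go to another host.

    Quorum-1 results get a grace period equal to the time the original
    host already spent computing, instead of being reissued the moment
    the deadline passes. Hypothetical sketch, not real BOINC code.
    """
    if now < wu["deadline"]:
        return False                 # not timed out yet
    if wu["quorum"] > 1:
        return True                  # a wingman is waiting: reissue now
    # Quorum 1: "hang around" for as long as the host already computed
    grace = wu.get("prior_compute_time", timedelta(0))
    return now >= wu["deadline"] + grace
```

For example, a quorum-1 task with 50 hours of prior computation would not be reissued until 50 hours past its deadline, while a quorum-2 task is reissued immediately.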
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Hi,

You don't say how many CPU hours a FAAH task needs on your Castor, or how many hours per day the computer is on, so it's hard to get a picture. Really slow devices actually get extra time to complete: WCG tracks return times and compensates, and I recently saw a wingman that was given 12 days, so it seems the computer is not that slow.

The shorter you set the buffer/cache, the more time your Castor has to complete a task. With a standard deadline of 10 days, that is 240 hours of computing time if running 24/7. If "connect every..." is set to zero days and the "additional buffer" is set to 0.01 days, the client will fetch a new task only shortly before completing the currently running one, i.e. you then have practically the full 10 days to complete a FAAH task.

Let us know, and please also post a copy of the client startup portion of the message log so we can get an idea of the setup.

--//--
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Dec 12, 2010 7:53:04 PM]
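The buffer arithmetic above can be made concrete with a back-of-the-envelope helper. This is illustrative only — the real BOINC client's work-fetch model is more involved:

```python
def hours_to_complete(deadline_days, buffer_days, uptime_fraction=1.0):
    """Rough computing hours available before a task's deadline.

    A task fetched to keep `buffer_days` of work queued can sit that
    long before starting, which eats into its deadline. Illustrative
    arithmetic only; the actual client scheduler is more complex.
    """
    return (deadline_days - buffer_days) * 24 * uptime_fraction
```

With the 10-day FAAH deadline, a 0.01-day buffer leaves nearly the full 240 hours, while a 5-day cache leaves only 120 hours of 24/7 runtime.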
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The machine runs 24/7, and about 96% of its time is available for this project. The "Results" return time was 268.59 hours. Some clock-hours were unfortunately lost when BOINC was restarted during the calculation. The WU length went up significantly some weeks ago; before that, the computer usually met the deadline. The buffer is approx 0.05 days.

This is not so much about the Grid's production; it is much more that this computer was set up in 2007, with more RAM and a newly installed OS and configuration, and I would like to see how much can be accomplished before it dies. It is some kind of emotional thing.

My problem is merely a symptom of a much more complex issue, where many alternative solutions exist. I just thought: if a WU fails to report, and the WU was sent out to only one single computer, then either the computer is dead, or the WU will be returned later. A cable may be unplugged, or the computer was simply switched off by mistake. In any case, I think the server should sit around and wait a while, as this costs only disk storage at the server. The server should reasonably know whether the unit still communicates ("World Community Grid, Sending scheduler request: Requested by user"), but that is not the same as saying that server code can make an intelligent decision from this. Kindly note again that if another computer is waiting (Pending Validation), things are different. By extension I think that, with such an option for slow computers, the fast computers could do with a much shorter deadline.

... possibly the slowest computer in the project?
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
For that particular machine you could try other, less demanding projects, like HCC and/or HCMD2. I have an old Pentium-3 933MHz crunching HCMD2, with Windows 2000 and just 256MB of RAM. I don't remember it returning a single error in many years.
anhhai
Veteran Cruncher | Joined: Mar 22, 2005 | Post Count: 839 | Status: Offline
268.59 hrs? Now that is a monster WU. Personally I don't think this was a standard WU; it was extra, extra long. Unfortunately, if you crunch HFCC, FAAH, or HPF2 you may encounter WUs that happen to take a very long time. I personally doubt that WCG will change their policy. My only recommendation is to do what bono_vox suggests: run HCC, HCMD2, or C4CW. They are relatively faster and less demanding.
----------------------------------------
FYI... C4CW may be getting longer soon when target 3 gets released. Here is a link to another thread where I mention the problem of WUs getting longer: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,30487 I will bring it up again after the holidays; hopefully the admins/techs can look into it then.
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
TNRG98,

Very sorry to see that your Castor now needs 268 hours to complete a FAAH task. Unless WCG resumes flexing their ears and listens up, I don't see that you'll be able to have that device participate in this science going forward, though from your single sapphire badge covering 24 CPU years, it appears to be your focused interest. Per one of the charts I do, over the past few months the mean power per device, as read from the very stable HCC and C4CW, has increased by some 7 percent, but the task upsizing has been several multiples of that, with FAAH now the leader of the pack. "Bring it on", we read some months ago about 1 million tasks per day... something may have gone down the windpipe instead, going by the name "Reality" :|

--//--
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Dec 13, 2010 11:47:59 AM]
seippel
Former World Community Grid Tech | Joined: Apr 16, 2009 | Post Count: 392 | Status: Offline
On October 12 we increased the runtime for new work units to be around 9 hours to alleviate some disk usage concerns. Those concerns are behind us, so I've updated the build program to target about 7 hours for newly built work units. This won't affect work units that are already built and in the queue (and there are about 2 weeks of those), so it will be a couple of weeks before you should start seeing the shorter work units.
Seippel
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hello TRNG98,

Statistics - By Project does not show Average Time by Result for Yesterday, only for Overall, but it does give enough information to calculate it. I would consider running HCC if my CPU were having trouble finishing tasks. HCC runs a lot of fast integer loops rather than just floating point, which speeds it up on a slow CPU. At least, that is my opinion.

Lawrence
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The above was an ALGORITHM SPECIFICATION. I have worked in software for 30 years, and I know that in this kind of project, only very small and effective code improvements will be considered.

The above suggestion has been very carefully selected to save a large number of WUs while being only 2 lines of code, if you are lucky. The political point is that a server discarding an OK WU is not good policy for a volunteer project. The ALGORITHM helps every time a computer holds completed, or near-completed, WUs and goes down for maintenance on a "human" time-scale of a handful of extra days. Simply cut the "deadline" period by a few hours to compensate!
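The compensation in that last line could be made budget-neutral. Here is a two-line sketch under my own assumption (not the poster's exact numbers) that the grace period is paid for by shaving the issued deadline:

```python
def adjusted_deadline(nominal_hours, grace_hours):
    """Hand out a slightly shorter deadline up front and spend the
    saved hours as a quorum-1 grace period before reissue, so the
    worst-case reissue point is unchanged. Hypothetical sketch,
    not WCG policy.
    """
    issued = nominal_hours - grace_hours  # e.g. 240 - 12 = 228 h shown to the host
    reissue_at = issued + grace_hours     # server reissues at the old 240 h mark
    return issued, reissue_at
```

The batch as a whole finishes no later than before; only the split between the host's visible deadline and the server's quiet waiting changes.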
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hi,

On rereading this thread from last year: the 268 hours was the time from receiving a task to returning it, not the actual computing time. I still think that if you downsize the cache / additional buffer so as not to hold 110 tasks, as you noted on Dec. 12, but just 5-10, your Castor is well able to compute a task in time for it not to return "too late". Uptime of WCG is so high that a large off-line local store is not needed. Generally, if the "too late" is not "too too late", then even 11 or 12 days still works... as long as the backup copy is still on the live system and has not been moved to the master database.

Less aggressive... extending the deadline substantially delays batch completion and builds up server storage requirements. It takes just 1 result in a batch of e.g. 10,000 to hog the space until the last one is declared canonical. As for readiness for 1 million results per day: we actually surpassed that number last week, so any batch that can be moved on will be useful. Once the live DB is cleared, any later result can no longer be recognised and becomes "too late".

--//--

PS: if a device is known to take longer to compute (not to report; that is a caching issue), the WCG algorithm already allocates extra deadline time.