Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
The Work Unit length

Hi!
I have a small problem with the WU length. In particular, my "Castor" computer failed to return a WU in time.
The server responded with "Too Late": BOINC credit claimed 218, granted 0.
The WU was sent to another computer at the WU deadline.

Now, this was a WU that no one else was waiting for, so the server simply discarded the result. The "Castor"
computer also has the property that it has NEVER calculated even a single WU in error, ever.

I don't wish to argue for a longer deadline period or a shorter WU length; those should be set to optimise overall
performance. But I would like you to consider the following:

If a WU times out, and only a single computer (Quorum 1) is attached,
could the server "hang around" for a while, and not be so aggressive about sending the WU to another computer?
The server could add the computation time of a previous result on that host to the clock before issuing
the WU to another computer.

If we have 2 computers (Quorum 2), so that another computer needs the result for confirmation/credit calculation,
then the issue is different.
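To make the idea concrete, here is a rough sketch in Python. It is purely illustrative; the names (wu.quorum, host.previous_run_seconds and so on) are invented for the example and are not actual BOINC or WCG server code:

# Rough sketch only -- invented names, not actual BOINC/WCG server code.
# Idea: when a quorum-1 WU hits its deadline, wait an extra grace period based on
# how long a comparable WU previously took on that host, instead of reissuing at once.
def should_reissue(wu, host, now):
    if now < wu.deadline:
        return False                      # not timed out yet
    if wu.quorum > 1:
        return True                       # a wingman is waiting for the result: reissue as usual
    grace = host.previous_run_seconds     # run time of an earlier WU on this same host
    return now > wu.deadline + grace      # quorum 1: reissue only after the extra grace period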

Now, someone might suggest that this kind of slow computer should be sent to recycling! But if Castor continues
to calculate for another 3 years, the accumulated work is not that insignificant. A more practical issue is that,
if I were to disconnect computers from the Grid, I would start with the fast units, where I can actually save power by keeping the CPU idle.

I have 110 WUs in the queue on my machines.
I have contributed approximately 24 CPU years to the FightAIDS@Home project.



faah17438_ZINC17213409_EN1md02420CTP_01
faah17792_ZINC08637384_xMut_md18750_00
[Dec 12, 2010 7:12:27 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: The Work Unit length

Hi,

You don't say how many CPU hours a FAAH task needs on your Castor, or how many hours per day the computer is on, so it's hard to get a full picture.

Really slow devices actually get extra time to complete: WCG tracks return times and compensates. I recently saw a wingman that was given 12 days, so it seems your computer is not rated as that slow.

The shorter you set the buffer/cache, the more of the standard 10-day deadline (240 hours of computing time if running 24/7) your Castor has available to complete a task. If "connect every..." is set to zero days and the "additional buffer" to 0.01 days, the client will fetch a new task only shortly before completing the currently running one, i.e. you then have practically the full 10 days to complete a FAAH task.
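As a back-of-the-envelope illustration of why the buffer matters so much on a slow host (the 200 task-hours figure is just an assumed example, and this simplifies by assuming tasks are crunched in the order they are fetched):

# Rough illustration only; example numbers, not actual WCG figures.
deadline_hours = 10 * 24              # standard 10-day deadline, machine on 24/7
task_hours     = 200                  # assumed CPU hours per FAAH task on a very slow host

for buffer_days in (0.01, 1, 3, 5):
    queued_hours = buffer_days * 24                   # work already sitting in the cache
    hours_left   = deadline_hours - queued_hours      # time left once the new task starts
    verdict = "meets deadline" if hours_left >= task_hours else "misses deadline"
    print(f"buffer {buffer_days:>4} days -> {hours_left:6.1f} h left: {verdict}")

With a near-zero buffer practically the whole 240 hours is available; every extra day of cached work takes 24 of those hours away.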

Let us know, and please also post a copy of the client's startup message log so we can get an idea of the setup.

--//--
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Dec 12, 2010 7:53:04 PM]
[Dec 12, 2010 7:52:23 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: The Work Unit length

The machine runs 24/7, and about 96% of its time is available for this project. The result's return time was 268.59 hours. Some clock hours were unfortunately lost when BOINC was restarted during the calculation. The WU length went up significantly some weeks ago; before that, the computer usually met the deadline. The buffer is approximately 0.05 days.

This is not so much about the Grid's production; it is much more that this computer was set up in 2007, with more RAM and a freshly installed OS and configuration, and I would like to see how much it can accomplish before it dies. It is some kind of emotional thing. My problem is merely a symptom of a much more complex issue, for which many alternative solutions exist.

I just thought that:
If a WU fails to report, and the WU was sent out to only one single computer, then either the computer is dead or the WU will be returned later. A cable may be unplugged, or the computer was simply switched off by mistake.
In any case I think the server should sit around and wait a while, as this only costs disk storage at the server.
The server should reasonably be able to tell whether the host is still communicating ("World Community Grid, Sending scheduler request: Requested by user"), but that is not the same as saying any server code currently makes an intelligent decision based on it. Kindly note again that if another computer is waiting (Pending Validation), things are different. A small sketch of that check follows below.

By extension I think that, with such an option for the slow computers, the fast computers could do with a much shorter deadline.
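For example, something along these lines, again purely illustrative Python with invented names (host.last_scheduler_contact is not a real BOINC field), complementing the earlier sketch:

# Illustrative only -- invented names, not actual server code.
# If the lone host has contacted the scheduler recently, treat it as alive and
# keep the quorum-1 WU on the shelf a little longer before reissuing it.
RECENT_CONTACT_SECONDS = 2 * 24 * 3600      # "recently" = within the last two days
EXTRA_GRACE_SECONDS    = 3 * 24 * 3600      # hold the WU for a few extra days

def defer_reissue(wu, host, now):
    host_alive = (now - host.last_scheduler_contact) < RECENT_CONTACT_SECONDS
    return (wu.quorum == 1
            and host_alive
            and now < wu.deadline + EXTRA_GRACE_SECONDS)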

<client_state>
<host_info>
<timezone>3600</timezone>
<domain_name>Castor</domain_name>
<ip_addr>192.168.30.80</ip_addr>
<host_cpid>c937688383e88569138aeadc3e4e44df</host_cpid>
<p_ncpus>1</p_ncpus>
<p_fpops>176736649.666957</p_fpops>
<p_iops>243538455.613788</p_iops>
<p_membw>1000000000.000000</p_membw>
<p_calculated>1291770956.684486</p_calculated>
<m_nbytes>301518848.000000</m_nbytes>
<m_cache>1000000.000000</m_cache>
<m_swap>4473044992.000000</m_swap>
<os_name>Microsoft Windows 2000</os_name>
<os_version>Professional Edition, Service Pack 4, (05.00.2195.00)</os_version>
<accelerators>S3 Trio64V2</accelerators>
</host_info>
<time_stats>
<on_frac>0.942140</on_frac>
<connected_frac>-1.000000</connected_frac>
<active_frac>0.999476</active_frac>
<cpu_efficiency>0.966606</cpu_efficiency>
<last_update>1292192087.530903</last_update>
</time_stats>

... possibly the slowest computer in the project?
[Dec 13, 2010 1:52:34 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: The Work Unit length

For that particular machine you could try other less demanding projects, like HCC and/or HCMD2. I have an old Pentium-3 933MHz crunching HCMD2, with Windows 2000 and just 256MB of RAM. I don't remember it returning a single error in many years.
[Dec 13, 2010 2:47:06 AM]
anhhai
Veteran Cruncher
Joined: Mar 22, 2005
Post Count: 839
Re: The Work Unit length

268.59 hrs? Now that is a monster WU. Personally I don't think this is a standard WU; it is extra, extra long. Unfortunately, if you crunch HFCC, FAAH, or HPF2 you may encounter WUs that happen to take a very long time. I personally doubt that WCG will be changing their policy. My only recommendation is to do what bono_vox suggests: run HCC, HCMD2, or C4CW. They are relatively faster and less demanding.

FYI... C4CW may be getting longer soon when target 3 gets released. Here is a link to another thread where I mention the problem of WUs getting longer: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,30487
I will bring it up again after the holidays; hopefully the admins/techs can look into it then.
----------------------------------------

[Dec 13, 2010 4:06:30 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: The Work Unit length

TNRG98,

Very sorry to see that your Castor now needs 268 hours to complete a FAAH task. Unless WCG resumes flexing their ears and listens up, I don't see that you'll be able to have that device participate in this science going forward, though from your single sapphire badge covering 24 CPU years it appears to be your focused interest.

Per one of the charts I maintain (below), over the past few months the mean power per device, as read from the very stable HCC and C4CW, has increased by some 7 percent, but the task upsizing has been several multiples of that, with FAAH now the leader of the pack. "Bring it on", we read some months ago about 1 million tasks per day... something may have gone down the windpipe instead, going by the name of "Reality" :|


----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Dec 13, 2010 11:47:59 AM]
[Dec 13, 2010 11:38:50 AM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Re: The Work Unit length

On October 12 we increased the runtime for new work units to be around 9 hours to alleviate some disk usage concerns. Those concerns are behind us, so I've updated the build program to target about 7 hours for newly built work units. This won't affect work units that are already built and in the queue (and there are about 2 weeks of those), so it will be a couple of weeks before you should start seeing the shorter work units.

Seippel
[Dec 13, 2010 9:27:06 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: The Work Unit length

Hello TRNG98,

Statistics - By Project does not show Average Time by Result for Yesterday, just for Overall, but it does give enough information to calculate it. I would consider running HCC if my CPU were having trouble finishing work units in time. HCC runs a lot of fast integer loops rather than just floating point, which speeds it up on a slow CPU. At least, that is my opinion.

Lawrence
[Dec 13, 2010 11:18:02 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: The Work Unit length

The above was an ALGORITHM SPECIFICATION; I have worked in software for 30 years and know that in this kind of project, only very small yet effective code improvements will be considered.

The above suggestion has been very carefully chosen to save a large number of WUs while requiring only 2 lines of code, if you are lucky.

The political point is that the server discarding an otherwise OK WU is not good policy for a volunteer project.

The ALGORITHM helps every time a computer holds completed, or nearly completed, WUs and goes down for maintenance on a "human" time-scale of a handful of extra days.

Simply cut the "deadline" period by a few hours to compensate!
[May 10, 2011 12:26:42 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: The Work Unit length

Hi,

On rereading this thread from last year: the 268 hours was the time from receiving the task to returning it, not the actual computing time. I still think that if you downsize the cache / additional buffer so it does not hold 110 tasks, as you noted on Dec. 12, but just 5-10, your Castor is well able to compute a task in time for it not to come back ''too late''. WCG's uptime is so high that a large off-line local store is not needed.

Generally, if the ''too late'' is not ''too, too late'', then even 11 or 12 days still works... as long as the backup copy is still on the live system and not yet moved to the master database.

Less aggressive... extending the deadline substantially delays batch completion and builds up server storage requirements. It takes just 1 result in a batch of e.g. 10,000 to hog the space until the last one is declared canonical.

As for readiness for 1 million results per day: we actually surpassed that number last week, so any batch that can be moved on is useful. The live DB is then cleared, at which point any further late results can't be recognised and become ''too late''.

--//--

PS: if a device is known to take longer to compute (not to report, as that is a caching issue), the WCG algorithm already allocates extra deadline time.
[May 10, 2011 12:49:18 PM]