Total posts in this thread: 13
This topic has been viewed 4133 times and has 12 replies
yoerik
Senior Cruncher
Canada
Joined: Mar 24, 2020
Post Count: 413
3 validations - quorum of 2.

I've been carefully watching the results I return over the last week. I've seen at least a dozen WUs (I complete between 20 and 30 a day across my devices and projects) that follow a troubling pattern. The second a WU is labelled "No reply", an additional replication is sent out. That's normal, expected behaviour.

The problem is this: the "No reply" result changes to Valid later, after the unnecessary additional replication has been sent out. But Android devices, by definition, keep smaller queues. 2 of my 3 Androids simply don't have the storage to hold a queue. So what happens is that I receive that replication and start it immediately, so it isn't aborted by the project... and my devices run the task for no reason.

I either need to babysit this manually, which is a hassle, especially with such a small number of WUs being processed, or the project/BOINC itself needs to change this. Since I joined, I'd estimate that my Android devices have done at least 50 WUs, hundreds of hours of CPU time, of unnecessary work, merely running an extra replication that wasn't needed. It usually only happens on my Android devices, even the one with a queue, but it has also happened on my Windows laptops: the original deadline passes, my system prioritizes the resend because of its shorter deadline, and then the "No reply" becomes Valid. What can I, or the project, do to fix this?

It's one thing to invest in a project by volunteering, but it's another to be asked to spend hundreds of hours of CPU time and electricity on unnecessary work. No clue what the solution is, but I needed to start this conversation. I don't know what to do.
----------------------------------------

[Jul 3, 2020 4:10:03 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Re: 3 validations - quorum of 2.

This will happen, particularly with slower devices such as Android. WCG allocate the same deadline to all platforms. The problem would reduce significantly if units allocated to android machines were given longer deadlines. Then there would be fewer missing the deadlines.

When a unit misses its deadline, as you say, another replication is sent out. If the first one then reports its completion before the later one has started, the later one gets aborted by the server. If, however, the later one has already started, then it is allowed to finish and act as a check on the first one. It is only if the later one reports before the first that the first gets recorded as 'Too Late'.
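
For anyone who finds it easier to read as logic, here is a minimal sketch of the three cases just described. It is only an illustration of the rules as stated in this post, not WCG's or BOINC's actual server code; the function name, the Outcome labels and the use of plain hour counts are all made up for the example.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    # Hypothetical labels for the three cases described in the post.
    ABORT_RESEND = "resend aborted by server"
    RESEND_RUNS_AS_CHECK = "resend finishes and acts as a check"
    ORIGINAL_TOO_LATE = "original recorded as Too Late"


def late_result_outcome(original_reported_at: float,
                        resend_started_at: Optional[float],
                        resend_reported_at: Optional[float]) -> Outcome:
    """Classify what happens to a late original result and its resend.

    All times are hours since the resend was issued (an assumption of this
    sketch); None means the event has not happened.
    """
    # The resend reported before the late original did: original is Too Late.
    if resend_reported_at is not None and resend_reported_at < original_reported_at:
        return Outcome.ORIGINAL_TOO_LATE
    # The original reported before the resend had started: resend is aborted.
    if resend_started_at is None or resend_started_at > original_reported_at:
        return Outcome.ABORT_RESEND
    # The resend was already running when the original reported: it finishes.
    return Outcome.RESEND_RUNS_AS_CHECK
```

The middle case is the one the thread is about: a device with no queue starts the resend straight away, so the server no longer gets the chance to abort it.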

It could also be that the android machines which miss the deadlines have been holding too large a cache of units, so can't complete them in time. That might be because they aren't switched on 24/7, which WCG cannot control. Those machines would not be classified as 'reliable' so would not be sent re-sends that have halved deadlines.

Mike
[Jul 3, 2020 4:59:57 AM]
yoerik
Senior Cruncher
Canada
Joined: Mar 24, 2020
Post Count: 413
Re: 3 validations - quorum of 2.

Yeah. It's reasonable to expect some. But there's gotta be some sort of fix. Reducing waste on the grid wherever possible will ultimately benefit the grid by using resources more effectively. I don't know what the solution is, and there is no magical one, but I do wish there was. We need to figure out some sort of solution, because this is getting frustrating. It's just a complete waste.
----------------------------------------

[Jul 3, 2020 7:20:20 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: 3 validations - quorum of 2.

I'm sure the techs have statistics on how often a No Reply turns Valid after late reporting. If you don't like it, increase your cache so those _1, _2 copies can still be cancelled by the server when the NR result does show up.

A possible mechanism: if a task is started before the deadline, the client signals the server and the deadline is adjusted internally, based on the past execution times the device has had for that science. That way there would be fewer extra copies, but it would reduce the incentive to process in a timely manner and push buffers up even further.
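
A rough sketch of that idea, purely hypothetical: nothing like this exists in BOINC as far as this thread establishes, and the function name, the use of a simple mean, and the hour-based times are assumptions for illustration only.

```python
from statistics import mean
from typing import Sequence


def adjusted_deadline(deadline: float,
                      started_at: float,
                      past_runtimes: Sequence[float]) -> float:
    """Return the deadline the server would wait for (all values in hours).

    Hypothetical rule: if the task was started before its original deadline,
    give it until start time plus the device's average past runtime for this
    science. Otherwise the original deadline stands.
    """
    # No history, or the task was not started in time: nothing changes.
    if not past_runtimes or started_at >= deadline:
        return deadline
    expected_finish = started_at + mean(past_runtimes)
    # Never shorten the deadline, only push it out when needed.
    return max(deadline, expected_finish)
```

For example, with a 168 hour deadline, a task started at hour 160 on a device that has averaged 12 hours per task would be waited for until hour 172, while a task started at hour 100 keeps its original deadline.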

WCG is lenient; other projects abort a computing task after the deadline or simply don't give credit. You'd be seeing a bunch more upset members than the few that get upset over computing a redundant task.
[Jul 3, 2020 8:13:08 AM]
yoerik
Senior Cruncher
Canada
Joined: Mar 24, 2020
Post Count: 413
Re: 3 validations - quorum of 2.

Not here to complain or whine - I'm just a tad frustrated. No offence is meant, lavaflow.

And increasing the queue isn't doable on 2 of my 3 Androids, as stated. The 3rd is my daily-use phone. Nor does it explain the Windows devices, where storage isn't an issue. It still occurs on my Windows laptops, at a lower frequency, but it does occur, and the queue is over 1.5 days for both and includes Rosetta@home work.

I make next to no impact on the project overall - just made a comment because if it's happening to my devices, it surely is occurring to others, more frequently.

A small part of me was hopeful that there was a magic solution, and that I was pulling a noob cruncher moment again; my cpu time wasn't being wasted, it was just a misunderstanding, etc.
----------------------------------------

[Jul 3, 2020 8:40:55 AM]
jackielan2000
Advanced Cruncher
China
Joined: Dec 31, 2005
Post Count: 115
Re: 3 validations - quorum of 2.

I think Mike.Gibson is right. You probably picked up one of my WUs that passed the deadline. I saw the status change to No Response and a replica get sent out. But later I plugged in the phone and it finished in about 3 hours. The status then showed Valid. So I think the replica should be aborted by the system.
----------------------------------------
AMD Athlon64X2 5400+ 2.8G | 2c
MT6735 1.4G | 4c
Helio G85 1.8G |8c
Allwinner H2 1G | 4c
SnapDragon 810 2.1G | 8c
SnapDragon 801 2.5G | 4c
[Jul 4, 2020 6:56:40 PM]
yoerik
Senior Cruncher
Canada
Joined: Mar 24, 2020
Post Count: 413
Re: 3 validations - quorum of 2.

It's been an issue for a while, jackie. It's not just one time; it occurs far more frequently than I am comfortable with, so you can't take the blame onto yourself.
----------------------------------------

[Jul 4, 2020 7:08:27 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: 3 validations - quorum of 2.

An option might be to extend the deadline by the average runtime in hours for the WU that missed the deadline. It would only get one extension. This would allow any WU still in a client's queue to cancel and report in as "missed the deadline". The replica would then be sent out. If the WU didn't report back in a reasonably short time (minutes), the assumption would be, it is still running on the client but late. The additional hours might allow it to report in before the next iteration is sent out. Is it fool-proof? NO. But it might reduce the number of needless replicas sent out. It won't catch the situation where a machine was abandoned with work still in the queue, but in that case, there wouldn't be needless work as the original work will never come back.
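
To make the proposal concrete, here is a small sketch of the one-time extension. It is not how WCG works today; the Task fields, the return strings and the hour-based clock are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    deadline: float                    # hours since the task was sent
    extended: bool = False             # the single extension has been used
    reported: bool = False             # a result has come back
    cancelled_unstarted: bool = False  # client reported "missed the deadline"


def on_deadline_check(task: Task, now: float, avg_runtime: float) -> str:
    """Decide what the server does with a task at time `now` (hours)."""
    if task.reported:
        return "done"
    # The client cancelled an unstarted task: the replica is needed right away.
    if task.cancelled_unstarted:
        return "send_replica"
    if now < task.deadline:
        return "wait"
    if not task.extended:
        # First miss with no word from the client: assume it is still
        # running and grant one extension of the average runtime.
        task.deadline += avg_runtime
        task.extended = True
        return "extended"
    # The extension was used and there is still no result.
    return "send_replica"
```

With a 168 hour deadline and a 6 hour average runtime, a silent task would be extended once to hour 174, and the replica would only go out if nothing has come back by then.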
[Jul 4, 2020 7:39:33 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: 3 validations - quorum of 2.

What about trickle messaging?
It allows work to be completed quicker and for the researchers to receive valuable results sooner.
[Jul 4, 2020 9:31:54 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: 3 validations - quorum of 2.

That's what climateprediction.net uses, because their work units can take days or even weeks on modern CPUs. If the Android devices take that long to complete a task it might be useful, because the time between trickles could serve as a measure of the odds that a given WU will be completed on time.
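
As a sketch of how that estimate could work: assume, purely for illustration, that each trickle carries the fraction of the task done so far, and extrapolate in a straight line from the first to the latest trickle. The function names and the (hours, fraction) format are hypothetical and not taken from BOINC's trickle API.

```python
from typing import Sequence, Tuple

# One trickle: (hours since the task was sent, fraction done between 0 and 1).
Trickle = Tuple[float, float]


def projected_finish(trickles: Sequence[Trickle]) -> float:
    """Linearly extrapolate the hour at which progress reaches 1.0.

    Expects trickles in increasing time order and uses the first and last.
    """
    if len(trickles) < 2:
        raise ValueError("need at least two trickles")
    (t0, p0), (t1, p1) = trickles[0], trickles[-1]
    if t1 <= t0 or p1 <= p0:
        raise ValueError("no measurable progress between trickles")
    rate = (p1 - p0) / (t1 - t0)  # fraction done per hour
    return t1 + (1.0 - p1) / rate


def likely_on_time(trickles: Sequence[Trickle], deadline: float) -> bool:
    """True if the projected finish falls on or before the deadline (hours)."""
    return projected_finish(trickles) <= deadline
```

A server could hold back the extra replication while likely_on_time stays true. A straight-line estimate is crude for devices that are not switched on 24/7, so treat it as a starting point only.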
[Jul 4, 2020 9:42:24 PM]