Thread Status: Active
Total posts in this thread: 9
This topic has been viewed 3469 times and has 8 replies
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Question re single-redundancy protocol after Invalid result

I thought that after a computer returns an invalid result for a single-redundancy WU, it has to return 15 (?) valid results before being trusted to run single-redundancy again. Yesterday, one of my two 4-core computers dropped an invalid DDDT WU due to overcrowding of its 1GB of memory, probably not helped by the slowness of paging on its 7-year-old 8.4GB HDD. DDDT and flu WUs received after the return of the invalid result have the normal ratio of single- to double-redundancy. I am reporting this in case the single-redundancy protocol has a problem.

Fortunately the WCG system did detect the invalid result, and more copies were sent out after my result was returned. The error log contained:
| Result Name: dddt1702a0023_ 100393_ 0--
| <core_client_version>6.2.28</core_client_version>
... near the top of the log:
| Finished Docking number 3
| No heartbeat from core client for 30 sec - exiting
| ERROR: could not initialize graphics pointer in shared memory.
| AG Check: Found receptor.A.map
| Beginning AutoDock...
. ...
Scattered throughout the remainder of the log were about 37 instances of:
| ERROR: Failed to save graphics state during checkpoint. Not critical. Continuing.
The end of the log looked normal.
Should my computer have been put on probation before being allowed to run single-redundancy again?
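For anyone following along, the rule as I understand it could be sketched roughly like this. It is only an illustration of the idea: the figure of 15 is the number I recalled above (with a question mark), and the class and function names are made up for the sketch, not anything from the actual WCG or BOINC server code.

```python
from dataclasses import dataclass

# Assumption: 15 is the figure recalled (with a question mark) in this thread.
REQUIRED_VALID_STREAK = 15


@dataclass
class HostTrust:
    """Hypothetical per-host record; not the actual WCG or BOINC data model."""

    # Consecutive valid results since the last invalid one; start out trusted.
    valid_streak: int = REQUIRED_VALID_STREAK

    def record_result(self, valid: bool) -> None:
        # A valid result extends the streak; an invalid one resets it to zero.
        self.valid_streak = self.valid_streak + 1 if valid else 0

    def eligible_for_single_redundancy(self) -> bool:
        return self.valid_streak >= REQUIRED_VALID_STREAK


def valid_results_needed(host: HostTrust) -> int:
    """How many more consecutive valid results the host still has to return."""
    return max(0, REQUIRED_VALID_STREAK - host.valid_streak)
```

Under this reading, a single invalid result would push a host back to quorum-2 work until it had returned 15 valid results in a row.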
[Jul 19, 2009 1:14:17 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

Hmmm, how many were still in the buffer at the point of the invalid result? Was the invalid flagged immediately, or was it only rated hours later (at validation time), until which you'd still be receiving work as if nothing had changed in your rating? Whether this is smart I don't know, but if there were 15 or more in the buffer, then you already have that many with which to prove worthiness again, so why waste extras on quorum 2?

Just 2 cents
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jul 19, 2009 1:44:33 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

"How many were still in buffer at point of the invalid result?" I don't know, but there are 28 right now.
"Was the invalid immediate or ...?" I don't know. I only found the problem when I did one of my sporadic checks for invalid/error results about 12 hrs later.
Before posting the above, I examined only results with sent times after the return time of the invalid result. I have just re-checked about 1 in 3 of all Autodock results for the device since the error WU was sent, and the only ones with a quorum of 2 are repair units (names end in 1) and the 3 Inconclusives (out of 81 results).
In case it's useful, I just uploaded the entire results summary list (not individual result pages) to http://www.2shared.com/file/6764188/92df305f/Tasks_9650A_090720.htm .
I think the WCG system has missed sending me to the sin bin in this case.
----------------------------------------
[Edit 1 times, last edit by Rickjb at Jul 20, 2009 11:09:57 AM]
[Jul 20, 2009 6:51:55 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

May I recommend you use a pastebin rather than a free (ad and popup ridden) "download" service?

I use pastebin.ca mostly, but there are others.

It doesn't follow that names ending in _1 are repair tasks - you would need to check the sent time and deadline (and even that's not 100% reliable).
[Jul 20, 2009 7:11:51 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

@Didactylos: Thanks for the tip re pastebins - I didn't know they existed. 2shared seems OK, but I've only used it a couple of times.

To readers who are wondering what that is about: I wanted to make an 80-line text file available here, but without wasting everyone's bandwidth to download it from WCG plus their time to scroll past it in the forum thread. So, I uploaded it to a website where it is publicly accessible, and included a link in my post. Only people who want it need download it. [Hint] ;-)

Point taken re the "_1" WUs not all being repair units. Anyway, there were only 2 for the results from single-redundancy projects ... my work cache plus micromanaging CEP WUs to run 1 at a time seems to be keeping me just outside Fast Returner status.
I wondered whether the system might have sent out 2nd copies of WUs that were still In Progress when the invalid result was returned, but there is no sign of that in the earlier results that are still accessible. Some of them would still be waiting for the wingmen anyway.
[Edit]: Checking all of the next 16 Autodock WUs sent immediately after the bad one was returned, I found 2 with names ending in "_0" and a quorum of 2. Not 15, and within normal check-WU proportions I think.
I can't figure it out, so over to you guys.
----------------------------------------
[Edit 1 times, last edit by Rickjb at Jul 20, 2009 11:17:41 AM]
[Jul 20, 2009 11:03:42 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

Rickjb, you really need to string it together: first sort on return time, then for each quorum look at the validation times and annotate them next to the sorted list, single or duo, along with transmission times and when the quorum validated. I see at least one job uploaded at the same moment as the invalid one that is valid, but that is a quorum 2 anyhow. I also see a bunch of jobs that were downloaded before the invalid upload and returned afterwards, and they are all marked valid. Did any of these have an additional copy sent out after your return time?

Anyway, so far I see nothing that would make me doubt the system (defensive posture ;>)
[Jul 20, 2009 11:23:24 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

I don't know how to find validation times, but I have gone through all current results pages for the device, doing on each page: (1) Sort on Return Time, (2) Copy list and paste into a text file, (3) Examine every result, and put an extra entry on the end of the result line in my file, denoting the quorum size - Q1 = 1, Q2 = 2, etc.
Edited my file to make it a bit more readable, replacing space-tab column separators with " | ", deleting device name (all the same), etc.
Uploaded it to http://pastebin.ca/1501511 with expiry in 1 year.
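In case anyone wants to tally the quorum sizes rather than eyeball them, here is a rough sketch of how the annotated list could be counted. It assumes only what is described above: fields separated by " | " and a Q1/Q2/... tag as the last field. The function name and the sample rows are made up for illustration and are not taken from the uploaded file.

```python
from collections import Counter


def quorum_counts(lines):
    """Count Q1/Q2/... tags, assuming the tag is the last ' | '-separated field."""
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.split(" | ")]
        tag = fields[-1]
        # Skip headers, blank lines and anything without a trailing Qn tag.
        if len(tag) > 1 and tag[0] == "Q" and tag[1:].isdigit():
            counts[int(tag[1:])] += 1
    return counts


# Made-up sample rows for illustration only.
sample = [
    "example_wu_a_0 | Valid | Q1",
    "example_wu_b_0 | Valid | Q2",
    "example_wu_c_1 | Valid | Q2",
    "a header or malformed line",
]
counts = quorum_counts(sample)
print(dict(counts))
```

Feeding it only the rows returned after the invalid result would show whether the share of Q2 work changed.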
HTH - Rick
[Jul 20, 2009 11:49:54 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

It's a needle in a haystack, partly because the validators at times get paused automatically to let other processes that have a backlog catch up. In those periods work is still sent out. Because the actual validation time is not printed, we assume the time stamp of the last result in a quorum. Thus if yours gets in at 10:48 am and the quorum completer at e.g. 12:15 pm, the latter is when the rating is considered to have taken place. It could, though, have been at any unknown time afterwards, hour(s) later, whilst your quad was still pumping out more tasks and requesting more, which is even harder to track with a multi-(part-)day buffer. This does not compromise the actual validation, and that's the important part. I see enough Q2s there for ZR projects to think the system did at least the random 're-rating'. By the letter, as far as we mortals know, the client should have lost its 'rating', but given it was a one-off, it's not worth further head scratching.
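To make the timing argument concrete, here is a small sketch of the assumption described above: the last return time in a quorum is taken as the earliest moment the rating could have changed, so any task sent before that moment tells us nothing. The function names are made up for illustration, and the real validation may have run at any later time.

```python
from datetime import datetime


def earliest_possible_validation(return_times):
    """Latest return time in a quorum; validation cannot happen before it."""
    return max(return_times)


def sent_before_rating_could_change(sent_time, quorum_return_times):
    """True if a task was sent before the quorum could have been validated."""
    return sent_time < earliest_possible_validation(quorum_return_times)


# Clock times from the example above; the calendar date is a placeholder.
my_return = datetime(2009, 7, 19, 10, 48)
quorum_completer = datetime(2009, 7, 19, 12, 15)
bound = earliest_possible_validation([my_return, quorum_completer])
```

Only tasks sent after `bound` can say anything about the new rating, and even those are inconclusive if the validator was paused at the time.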

BTW, interesting mix of hard to get CEP, FAAH (presently), DDDT and FLU. Fair amount of CEP in that mix.
----------------------------------------
[Jul 21, 2009 7:19:13 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Question re single-redundancy protocol after Invalid result

Thanks for taking the time to look into this, Sek. One thing that I hadn't thought about is the question of what specifically went wrong inside wcg_dddt_autodock_6.07_windows_intelx86 when the problem occurred, and why it continued executing to give an invalid result instead of aborting, and whether this behaviour could be fixed.
Maybe I could repeat the circumstances ... 4 cores, 1GB RAM, limited pagefile space on the HDD, about 4 tasks suspended in memory (maybe I was trying for a 5th :-/ ) just when BOINC started reporting & downloading more data, and then I tried to get BOINC Manager to swap in so I could see what was going on ...

Yeah, the project mix ... I think it's good on a multi-core, especially an Intel LGA775 quad, to run a mix of projects, preferably with 1 or 2 that have high memory cache success rates, so there's less crowding on the memory bus. On my AMD, which has only 512 KB cache/core, Perfmonitor shows that FAAH, HCC and rice(?) get lots of cache misses, while CEP and HCMD2 do not. So I tried some CEP and HCMD2 on the Intels. It turns out that neither HCMD2 nor CEP scores good Points Awarded on the Intels. Autodock jobs score best, possibly because the 3MB cache/core is higher than the fleet average, and I think I'll revert to running only/mostly those on the Intels.
On the other hand, the AMD likes HCMD2 and CEP, which have high cache success rates, and Points Awarded for these seem to be only about 20% behind the Intels, clock for clock.
[Edit] - Forgot to add: I hope the CHARMM page-faulting problem gets fixed before DDDT-2 gets under way. If not, I might run it on 1 core of the AMD, but not on the Intel quads. Sorry, but I figure that running my devices on projects that maximise the points they score maximises my contribution to WCG. I'm doing some 'flu jobs for you though, Stan.
----------------------------------------
[Edit 1 times, last edit by Rickjb at Jul 22, 2009 5:09:30 AM]
[Jul 21, 2009 9:39:40 AM]