World Community Grid Forums
Category: Completed Research
Forum: Discovering Dengue Drugs - Together - Phase 2 Forum
Thread: Changes to distribution of error work units |
Thread Status: Active Total posts in this thread: 20
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
Greetings all,
For Discovering Dengue Drugs - Together Phase 2, we are going to change the max errors for a work unit to 3. In the past this was set to 5. Almost all of the errors we have seen with the application have been consistent errors, meaning the false positives that the researchers expect. The change will decrease the number of copies that are sent out with the 16MB input file only to fail quickly.
The main reason for doing this is to decrease the number of large downloads with quick errors for the members. It will also increase the speed at which we can return batches to the researchers.
Thanks,
-Uplinger
PS: If you have any questions, feel free to ask. |
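To put rough numbers on the download saving described above, here is a minimal sketch. It assumes the worst case where every copy of a work unit fails, and it assumes that no further copies go out once the error count reaches the limit; the real scheduler may count differently. The function name is hypothetical and only the 16MB figure and the limits 5 and 3 come from the post.

```python
INPUT_FILE_MB = 16  # input file size quoted in the post


def wasted_download_mb(max_errors: int) -> int:
    """Worst-case download volume spent on failing copies of one work unit.

    Assumption (not confirmed by the post): every copy fails, and
    distribution stops as soon as the error count reaches max_errors,
    so exactly max_errors copies are downloaded.
    """
    return max_errors * INPUT_FILE_MB


old_limit_mb = wasted_download_mb(5)  # previous setting
new_limit_mb = wasted_download_mb(3)  # new setting
print(old_limit_mb, new_limit_mb)
```

Under those assumptions a consistently failing work unit costs members 48MB of downloads in total instead of 80MB.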
||
|
pirogue
Veteran Cruncher USA Joined: Dec 8, 2008 Post Count: 685 Status: Offline Project Badges: |
PS: If you have any questions, feel free to ask. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Apres nous....
|
||
|
JSYKES
Senior Cruncher Joined: Apr 28, 2007 Post Count: 200 Status: Offline Project Badges: |
Thanks for the update Uplinger, but will that have any effect on the distribution of the WUs away from crunchers who have had high percentages of errors to positive returns? I guess everyone has had errors, but the distribution seems to have been very uneven, and hence some of us have only had a very small number of WUs in total (I've had fewer than 20, with only 14 validated), which could skew the stats against us for a while (or how long?) despite having very quick PCs. What's the score with this?
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
JSYKES: "Thanks for the update Uplinger, but will that have any effect on the distribution of the WUs away from crunchers who have had high percentages of errors to positive returns? I guess everyone has had errors, but the distribution seems to have been very uneven, and hence some of us have only had a very small number of WUs in total (I've had fewer than 20, with only 14 validated), which could skew the stats against us for a while (or how long?) despite having very quick PCs. What's the score with this?"
It needn't be DDDT-2 results returned to make up for the error WUs... HCC gives the smallest downloads, but HCMD-2 appears to have the lowest average turnaround time for a good result. For each error WU you had, grab a dozen or so WUs from either of those tasks on each machine and (assuming they're returned without error) your machine(s) should be rated 'reliable' again within a day and eligible for A and B types. At least that's what I did after every one of my machines got hit with at least 1 bad WU; a couple are now crunching A's and B's as I type. |
||
|
JSYKES
Senior Cruncher Joined: Apr 28, 2007 Post Count: 200 Status: Offline Project Badges: |
Thanks ZoSo - I've been crunching loads of other stuff, as DDDT2 is so unreliable for continuity, and all of it without a single error, so it sounds as though it should be self-correcting in due course.
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
The good (if now old) news is that you really only need a 1:1 ratio of good results to errors to stay in the RR class once you have arrived there, after having done 77 good ones! Case in point: my client flunked a DDDT2 job and shortly after got an HCC repair job.
----------------------------------------
Be Happy, Crunch Happy.
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All! |
||
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline Project Badges: |
uplinger: "If you have any questions, feel free to ask" ...
1. When does the change happen to the max no of errors on a WU that prevents further copies being sent out? Perhaps it is already in place for the next batch of new WUs, but does not apply to the ones that have already started. For example, the 7th copy (name *_06) has just been sent out for the WU that I described in The Bad Type A WUs Thread.
2. Within a quorum, should the values of all of the data in each of the members' running science programs be bit-for-bit identical throughout the run, so that if errors occur, they occur at the same place? In the abovementioned TS05/ps WU, 3 copies terminated at pctComplete=0.688000 and 1 at 0.447600. In thread DDDT2 - now an Intermittent project, mweisensee says that ts05_a193_ps0000 gave 3 error exits with different % completion. JmBoullier mentions more WUs that terminated at different points, including one where 2 members completed the WU successfully. Furthermore, mweisensee thinks that forcing periodic restarts from checkpoints avoids the errors. Do these things make sense? Does CHARMM use Monte Carlo methods deep inside its ancient FORTRAN machinery?
[Edit 5 times, last edit by Rickjb at Apr 17, 2010 12:59:42 PM] |
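The consistency question in point 2 can be made concrete with a small sketch. It only tallies the termination points reported above (three copies at 0.688000 and one at 0.447600) and checks whether they all agree; the function name is hypothetical, and this is not how the project's validator works as far as the thread shows.

```python
from collections import Counter


def summarize_error_points(pct_complete_values):
    """Count how many copies stopped at each pctComplete value.

    Returns (all_copies_agree, counts). Exact float comparison is used
    on purpose: a bit-for-bit identical run would report the same value.
    """
    counts = Counter(pct_complete_values)
    return len(counts) == 1, counts


# Termination points reported for the TS05/ps work unit in this post.
reported = [0.688000, 0.688000, 0.688000, 0.447600]
all_agree, counts = summarize_error_points(reported)
print(all_agree, counts.most_common())
```

For the reported values the check comes back False: three copies agree and one stopped earlier, which is what prompts the question about determinism.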
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
uplinger: "PS: If you have any questions, feel free to ask."
I submitted the 3rd error for ts05_a256_ps0000 this morning at about 2:15, yet at 3:24 another copy of that WU was sent out. How are the 3 errors being counted? Thanks.
[edit1 - added screen grab]
All were exit code 29 (0x1d), by the way. [edit2 - added exit code]
[Edit 2 times, last edit by Former Member at Apr 17, 2010 1:20:47 PM] |
||
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3715 Status: Offline Project Badges: |
Rickjb: "JmBoullier mentions more WUs that terminated at different points, including one where 2 members completed the WU successfully."
Rick,
Sorry if my wording was confusing, but in my sentence "the most consistent one is the only one which completed fine for both my wingman and me", "one" stands for "quorum". Obviously this sentence applies to a particular WU which was valid for both wingmen, not to a WU with two valid results and x errors. |
||
|
|