Thread Status: Active
Total posts in this thread: 15
This topic has been viewed 4910 times and has 14 replies
Sabrina Tarson
Advanced Cruncher
United States
Joined: Jun 27, 2012
Post Count: 149
Status: Offline
Large amounts of Invalid Results

Hello all. This summer I upgraded my main system to a Ryzen Threadripper 2950X and have been happily crunching away on all projects on World Community Grid.

However, I started noticing a trend in the number of Invalid results the Threadripper machine returns. It's not a precise ratio, but roughly 1 in every 100 results turns out to be Invalid.

I have done no overclocking to the system, and have not touched any of the precision boost settings.

This is the only computer in my grid with this issue, and there is no consistent pattern to which results come back Invalid. No other project on World Community Grid has this issue, nor does any other BOINC project I've run outside of World Community Grid; it's only Mapping Cancer Markers.

I can provide more information if needed.

Computer Settings:
OS: Windows 10 Professional 1903 Build (18362.239)
CPU: AMD Ryzen Threadripper 2950X @ Stock Settings

One Invalid Result Log:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.41_windows_x86_64 -SettingsFile MCM1_0153488_2055.txt -DatabaseFile dataset-curatedOvarian_EarlyLate_v1.0.txt
Settings File
DateOfDesign = 08/05/2014
Designer = PMCC_OCI_0.1
WorkOrderID = 0153488_2055
DatasetID = curatedOvarian_EarlyLate_v1.0
NumberOfGenesInStartingSignature = 30
NumberOfGenesInSignatureMin = 30
NumberOfGenesInSignatureMax = 30
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 60
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 0
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 0
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
FitnessFn = 0
MinFitness = -1.0
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"

SvmLearnLimit = 500000
RSeed = 491292056


[14:35:19] Initializing
[14:35:27] Running
[14:35:27] EvaluateFitnessOfStartingGeneSignatures 60
[18:24:47] Writing final output
[18:24:47] Closing Output Stream
[18:24:47] Cleaning up
Result.out = 16745.000000
Run complete, CPU time: 12532.656250
18:24:47 (27584): called boinc_finish(0)

</stderr_txt>
]]>

----------------------------------------
[Aug 28, 2019 7:35:53 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Large amounts of Invalid Results

How does a valid result differ from an invalid result? If you post both an invalid and a valid one, we could compare.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Aug 29, 2019 1:28:45 AM]
Sabrina Tarson
Advanced Cruncher
United States
Joined: Jun 27, 2012
Post Count: 149
Status: Offline
Re: Large amounts of Invalid Results

It was my understanding that an Invalid result occurs when the results of two computations don't match. In this case, I don't understand why the same machine would run some correctly and others not.
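The matching idea described above can be sketched roughly as follows. This is a minimal illustration only, not the actual BOINC or World Community Grid validator: the function name `validate`, the quorum of 2, the host names, and the result values are all assumptions made up for the example.

```python
from collections import Counter

def validate(results, quorum=2):
    """Return (canonical_value, statuses) for one work unit.

    `results` maps a host name to the value that host returned.
    A value becomes canonical once `quorum` hosts agree on it; hosts that
    returned anything else are marked invalid. Without a quorum, every
    result stays pending and another copy would have to be sent out.
    """
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0] if counts else (None, 0)
    if votes < quorum:
        return None, {host: "pending" for host in results}
    statuses = {
        host: ("valid" if returned == value else "invalid")
        for host, returned in results.items()
    }
    return value, statuses

# Made-up values: two hosts disagree, so nothing can be decided yet.
first_pass = {"my_host": 1.0, "wingman": 2.0}
print(validate(first_pass))

# A third host agrees with the wingman, so the odd one out becomes invalid.
second_pass = dict(first_pass, third_host=2.0)
print(validate(second_pass))
```

Under this model a single host cannot tell on its own that a result is wrong; the mismatch only shows up once another host's answer is available to compare against.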

Invalid Result:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.41_windows_x86_64 -SettingsFile MCM1_0153553_8347.txt -DatabaseFile dataset-curatedOvarian_EarlyLate_v1.0.txt
Settings File
DateOfDesign = 08/05/2014
Designer = PMCC_OCI_0.1
WorkOrderID = 0153553_8347
DatasetID = curatedOvarian_EarlyLate_v1.0
NumberOfGenesInStartingSignature = 60
NumberOfGenesInSignatureMin = 60
NumberOfGenesInSignatureMax = 60
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 576
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 0
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 0
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
FitnessFn = 0
MinFitness = -1.0
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"

SvmLearnLimit = 500000
RSeed = 491948348


[08:16:43] Initializing
[08:16:50] Running
[08:16:50] EvaluateFitnessOfStartingGeneSignatures 576
[11:44:52] Writing final output
[11:44:52] Closing Output Stream
[11:44:52] Cleaning up
Result.out = 244194.000000
Run complete, CPU time: 12056.203125
11:44:52 (31968): called boinc_finish(0)

</stderr_txt>
]]>


Valid Result:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.41_windows_intelx86 -SettingsFile MCM1_0153532_3477.txt -DatabaseFile dataset-curatedOvarian_EarlyLate_v1.0.txt
Settings File
DateOfDesign = 08/05/2014
Designer = PMCC_OCI_0.1
WorkOrderID = 0153532_3477
DatasetID = curatedOvarian_EarlyLate_v1.0
NumberOfGenesInStartingSignature = 15
NumberOfGenesInSignatureMin = 15
NumberOfGenesInSignatureMax = 15
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 76
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 0
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 0
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
FitnessFn = 0
MinFitness = -1.0
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"

SvmLearnLimit = 500000
RSeed = 491733478


[13:20:13] Initializing
[13:20:23] Running
[13:20:23] EvaluateFitnessOfStartingGeneSignatures 76
[18:08:18] Writing final output
[18:08:18] Closing Output Stream
[18:08:18] Cleaning up
Result.out = 15587.000000
Run complete, CPU time: 15771.171875
18:08:18 (8752): called boinc_finish(0)

</stderr_txt>
]]>

----------------------------------------
[Aug 29, 2019 2:23:37 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Large amounts of Invalid Results

The only thing that really looks any different are the lines:
SearchAlgorithmNumberToCreate = 576
SearchAlgorithmNumberToCreate = 76

The first one is the invalid one and the second is the valid one. Not having looked at any other MCM work units in this depth, I wonder if the value in the first one is too big.
Did the wingman complete the invalid unit to a valid conclusion? If the wingman is completing all or most of your invalids to a valid state, then the problem is with your machine. Just what it might be, I am clueless.
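For anyone wanting to make this comparison mechanically, here is a small sketch that pulls the `Key = Value` lines out of two stderr logs and lists the keys that differ. The helper names `parse_settings` and `diff_settings` are hypothetical, and the two log strings are short excerpts of the invalid and valid logs posted above, not the full files.

```python
def parse_settings(stderr_text):
    """Collect 'Key = Value' lines from a stderr log into a dict."""
    settings = {}
    for line in stderr_text.splitlines():
        key, sep, value = line.partition(" = ")
        # Skip timestamped progress lines such as "[08:16:43] Initializing".
        if sep and not line.startswith("["):
            settings[key.strip()] = value.strip()
    return settings

def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Excerpts copied from the two logs earlier in the thread.
invalid_log = """\
WorkOrderID = 0153553_8347
NumberOfGenesInStartingSignature = 60
SearchAlgorithmNumberToCreate = 576
SvmLearnLimit = 500000
RSeed = 491948348
"""
valid_log = """\
WorkOrderID = 0153532_3477
NumberOfGenesInStartingSignature = 15
SearchAlgorithmNumberToCreate = 76
SvmLearnLimit = 500000
RSeed = 491733478
"""

differences = diff_settings(parse_settings(invalid_log), parse_settings(valid_log))
for key, (bad, good) in differences.items():
    print(f"{key}: invalid={bad} valid={good}")
```

On these excerpts the diff flags `NumberOfGenesInStartingSignature`, `WorkOrderID`, and `RSeed` in addition to `SearchAlgorithmNumberToCreate`. These are two different work units, so differing parameters by themselves do not show which setting, if any, is related to the invalid result.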
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Aug 30, 2019 1:33:51 AM]
Sabrina Tarson
Advanced Cruncher
United States
Joined: Jun 27, 2012
Post Count: 149
Status: Offline
Re: Large amounts of Invalid Results

On almost all of them (I believe maybe 1 or 2 of them all were invalid, but it's still such a small sample size of them), when the workunit is sent to a third person, it comes back valid.

What's also weird is, sometimes there's a couple days where there are no invalid results, followed by a few days where 6 or 7 of the 450 results done that day were invalid.
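As a rough sanity check on those numbers, the arithmetic can be sketched like this. It assumes (this is not from the stats pages) that each result fails independently with a fixed probability and that a clean day also had about 450 results; the function names are made up for the example.

```python
def invalid_rate(invalid, total):
    """Fraction of results that came back invalid."""
    return invalid / total

def prob_no_invalids(rate, total):
    """Chance of zero invalid results among `total` independent results."""
    return (1.0 - rate) ** total

# Bad days: 6 or 7 invalid out of 450.
for n in (6, 7):
    print(f"{n}/450 = {invalid_rate(n, 450):.2%}")

# If the rate were a steady 1 in 100, how likely is a day with no invalids at all?
print(f"P(no invalids in 450 at 1%) = {prob_no_invalids(0.01, 450):.4f}")
```

Under those assumptions a completely clean day of 450 results would happen only about 1% of the time at a steady 1-in-100 rate, so several clean days in a row suggest the error rate is changing from day to day rather than staying constant.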
----------------------------------------
[Aug 30, 2019 6:41:31 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Large amounts of Invalid Results

On almost all of them (I believe maybe 1 or 2 of them all were invalid, but it's still such a small sample size of them), when the workunit is sent to a third person, it comes back valid.

What's also weird is, sometimes there's a couple days where there are no invalid results, followed by a few days where 6 or 7 of the 450 results done that day were invalid.

Transient heat, memory, or voltage variations, perhaps?
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Aug 30, 2019 4:16:21 PM]
Sabrina Tarson
Advanced Cruncher
United States
Joined: Jun 27, 2012
Post Count: 149
Status: Offline
Re: Large amounts of Invalid Results

Until I can get it sorted out, or until it's found that something else is going on, I've decided to not run Mapping Cancer Markers on that machine, as I don't like the idea of wasting hours working on a workunit for it to come out invalid.

It's very weird that it only affects some workunits and not others, and only for Mapping Cancer Markers. Regardless, there are other machines in my grid that can crunch the project, and have done so without issue in the past.
----------------------------------------
[Sep 4, 2019 5:27:39 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Large amounts of Invalid Results

Just in the last couple of days I am seeing re-sends of MCM units in a greater proportion than previously. I might normally see 1 or 2 a day, but now I am seeing perhaps 10 to 12. The majority of them are "no reply." Just an observation.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 6, 2019 6:31:44 PM]
ThreadRipper
Veteran Cruncher
Sweden
Joined: Apr 26, 2007
Post Count: 1324
Status: Offline
Re: Large amounts of Invalid Results

The problem you describe is something I have seen on my ThreadRipper 2990WX as well (before I adjusted down the RAM frequency).

What is your RAM running at? Is XMP profile enabled in BIOS?
In my case, XMP at 3466 MHz did not work. Most probably my motherboard did not like the RAM kit I have (it is not on the QVL), since I upgraded from a ThreadRipper 1950X to a 2990WX and the RAM was not stable above 2933 MHz. However, I was able to lower the latencies from 16-18-18-36-1T to 14-13-13-28-1T, and I no longer have any invalid WUs.

So, since CPU is at stock I would check the RAM first. Just dial in a speed that is one step lower than the one you are running now and see if the frequency of Invalid results lowers or disappears altogether.
----------------------------------------

Join The International Team: https://www.worldcommunitygrid.org/team/viewTeamInfo.do?teamId=CK9RP1BKX1

AMD TR2990WX @ PBO, 64GB Quad 3200MHz 14-17-17-17-1T, RX6900XT @ Stock
AMD 3800X @ PBO
AMD 2700X @ 4GHz
[Nov 7, 2019 9:54:08 AM]
Sabrina Tarson
Advanced Cruncher
United States
Joined: Jun 27, 2012
Post Count: 149
Status: Offline
Re: Large amounts of Invalid Results

This is a late bump, but I wanted to post that I figured out what the problem was.

Despite the RAM being marketed as 2666MHz, the system became more unstable as time went on after my last post. Since then, I had been experimenting with other projects away from World Community Grid. During my adventures, I crunched a couple of numbers for GIMPS, and using Prime95 I found that the system was unstable and would error out on results.

Sure enough, once I brought the RAM down to 2400MHz, the system passed Prime95. So I must have gotten an incorrectly binned kit of RAM.
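The principle behind that kind of check can be illustrated with a toy sketch: run the same deterministic calculation several times and confirm the answers are identical. This is only an illustration with made-up names (`workload`, `self_check`); a short Python loop puts almost no stress on memory and is not a substitute for Prime95 or a dedicated memory tester.

```python
import hashlib
import math

def workload(seed, steps=50_000):
    """Deterministic floating-point loop; the same seed must give the same digest."""
    x = float(seed)
    h = hashlib.sha256()
    for i in range(1, steps + 1):
        x = math.sin(x) * 1e4 + math.sqrt(i)
        h.update(repr(x).encode())
    return h.hexdigest()

def self_check(rounds=5, seed=12345):
    """Run the workload repeatedly and return the set of distinct digests."""
    return {workload(seed) for _ in range(rounds)}

digests = self_check()
if len(digests) == 1:
    print("stable: every round produced the same digest")
else:
    print(f"MISMATCH: {len(digests)} different digests from identical input")
```

On healthy hardware every round gives the same digest. A mismatch on identical input would point at the machine rather than the software, which is the same reasoning as a work unit that validates on other hosts but not on this one.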

Anyway, since coming back to World Community Grid a couple of days ago, I have yet to return an Invalid MCM workunit. So, mystery solved. Thanks to those above who reached out.
----------------------------------------
[Mar 7, 2020 1:41:56 AM]