This might be more of a math question than a stock market or C# question.

I've been running a lot of back tests with various combinations of parameters, such as Moving Average periods, to try to find the best configuration. However, I'm finding that the back tests with the highest ending balances aren't always the ones that perform best in live trading. Comparing the ones that do best in back testing (Sample A) with the ones that do better in live trading (Sample B), I'm starting to see a pattern.

Sample A tests have fewer trades with large swings up and down, whereas Sample B tests have many more trades with much smaller swings. Sample A tests end with higher balances because of a few really good trades, but that seems more like luck than accuracy.

So, let's say I have these result sets, where the numbers represent the balance at each period:

Set 1: [ 10, 8, 22, 22, 22, 12, 42, 42, 42, 38, 55 ] (Few big spikes up and down)

Set 2: [ 10, 10, 10, 10, 53, 53, 53, 53, 53, 53, 53 ] (Single huge spike up)

Set 3: [ 10, 13, 14, 13, 16, 19, 22, 20, 23, 25, 28 ] (Gradual and more or less consistent incline)

Set 4: [ 10, 12, 14, 17, 21, 25, 30, 35, 41, 46, 52 ] (Consistent incline with steadily growing gains)

How can I statistically determine that Sets 3 and 4 are better than Sets 1 and 2? A simple average of the balances doesn't work well because it doesn't factor in consistency.
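
To make the question concrete, here is a minimal Python sketch (not my actual C# back-testing code) of the kind of measure I'm after. It compares the plain average of the balances with one candidate consistency measure: the mean per-period change divided by the standard deviation of those changes, which is similar in spirit to a Sharpe ratio. The function names `period_changes` and `consistency_score` are made up for this illustration, and using absolute rather than percentage changes is an assumption. This may well not be the right statistic.

```python
from statistics import mean, pstdev

# Balance per period for each result set, copied from the question.
sets = {
    "Set 1": [10, 8, 22, 22, 22, 12, 42, 42, 42, 38, 55],
    "Set 2": [10, 10, 10, 10, 53, 53, 53, 53, 53, 53, 53],
    "Set 3": [10, 13, 14, 13, 16, 19, 22, 20, 23, 25, 28],
    "Set 4": [10, 12, 14, 17, 21, 25, 30, 35, 41, 46, 52],
}


def period_changes(balances):
    """Return the change in balance from each period to the next."""
    return [b - a for a, b in zip(balances, balances[1:])]


def consistency_score(balances):
    """Hypothetical measure: mean change / std deviation of changes.

    Higher means steadier growth. Assumes absolute (not percentage)
    changes and the population standard deviation.
    """
    changes = period_changes(balances)
    spread = pstdev(changes)
    if spread == 0:
        # Perfectly even changes: treat steady growth as the best case.
        return float("inf") if mean(changes) > 0 else 0.0
    return mean(changes) / spread


for name, balances in sets.items():
    print(f"{name}: average balance = {mean(balances):.2f}, "
          f"consistency score = {consistency_score(balances):.2f}")
```

On these four sets, the plain average ranks Set 2 highest, while the consistency score ranks Sets 4 and 3 above Sets 1 and 2, which is the ordering I'm looking for. I don't know whether this holds up on real back test results or whether there is a more standard statistic for this.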