Great topic -
You are right, Colin - stats are bandied about without people understanding their true significance.
A key thing to note about using statistics is to be clear about the question you are attempting to answer. Here, what you are really asking is 'how confident can I be that the profit/loss performance of this trainer's runners at this course (as represented by the data) will be replicated in future?' - I think that's what you are asking?
In this case a couple of things are important to think about -
Important to note that, in statistical terms, your historical numbers aren't a sample - they are the entire population of that trainer's runners over the period you've looked at. There are therefore things you can say with certainty about that population (e.g. number of winners, prices, etc.).
But what you are looking to establish is the likely statistical relationship between that population (the historical data you have) and a totally separate population (the future results of that trainer).
That is an important distinction - looking at a sample (part of a data set) of a population (the entire data set) is, from a statistics viewpoint, very different to looking at one data set and attempting to predict the make-up of another data set.
In this case your working hypothesis is that the yard have a modus operandi (consciously or not) which means that the horses they run at a certain course are over-priced.
For horse racing prediction the difficulty is that there are so many variables that you cannot establish a clear figure for how many historical data points would be required to give firm or meaningful confidence about how the future will look. That's the basic problem with 'back-fitting' (which is what you are doing here) - there is no guarantee that the future will look like the past, and with so many random variables in play you cannot put any meaningful number (calculated or estimated) on how likely it is that it will.
But there is a useful rule of thumb, which is the 'law of large numbers'. In layman's terms it says, among other things, that the more observations you have, the closer your observed averages will sit to the true underlying ones - and so the greater confidence you can have that the future will resemble the past. The insurance industry is based on this law.
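To make that concrete, here is a rough Python sketch with entirely made-up numbers (a 25% true win rate at decimal odds of 4.2, i.e. a small positive edge of about 5p per £1 staked). The observed profit per bet bounces around wildly over a handful of bets and only settles near the true figure once the number of bets gets large:

```python
import random

random.seed(1)

# Purely illustrative, made-up figures: a bet that wins 25% of the time
# at decimal odds of 4.2, giving a true edge of 0.25 * 4.2 - 1 = +0.05
# per unit staked.
TRUE_WIN_PROB = 0.25
DECIMAL_ODDS = 4.2

def average_profit(n_bets: int) -> float:
    """Average profit per 1-unit bet over n_bets simulated bets."""
    total = 0.0
    for _ in range(n_bets):
        if random.random() < TRUE_WIN_PROB:
            total += DECIMAL_ODDS - 1.0   # win: returns minus stake
        else:
            total -= 1.0                  # loss: stake gone
    return total / n_bets

# Observed average profit per bet for increasingly large samples -
# it wanders a lot early on and settles towards +0.05 as n grows.
for n in (20, 200, 2000, 20000):
    print(n, round(average_profit(n), 3))
```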
You'll find, Colin, that if you look at trainer/course/race type relationships you will turn up hundreds (maybe thousands) of historically positive profit/loss correlations. If you test these you'll quickly find that they are wholly unreliable in isolation - the data sets are just too small. Sometimes they will pay off, but that will be down to chance (or rather, you will have no way of establishing whether it was chance or otherwise, no matter how intuitively appealing the hypothesis may be) and the odds will be heavily against you.
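Here is a quick illustration of why those small data sets mislead. The sketch invents 1,000 hypothetical trainer/course angles with 15 runners each and no edge whatsoever (every bet priced exactly fairly), then counts how many still show a historical profit purely by chance - all the numbers are assumptions for illustration only:

```python
import random

random.seed(2)

# Illustrative only: 1,000 hypothetical trainer/course "angles", each with
# 15 past runners and NO real edge at all - win probability 20% at fair
# decimal odds of 5.0, so the true expected profit per bet is exactly zero.
N_ANGLES = 1000
RUNS_PER_ANGLE = 15
WIN_PROB = 0.20
FAIR_ODDS = 5.0

profitable = 0
for _ in range(N_ANGLES):
    profit = 0.0
    for _ in range(RUNS_PER_ANGLE):
        if random.random() < WIN_PROB:
            profit += FAIR_ODDS - 1.0
        else:
            profit -= 1.0
    if profit > 0:
        profitable += 1

# Even with zero edge anywhere, a large chunk of the angles show a
# historical profit purely by chance.
print(f"{profitable} of {N_ANGLES} zero-edge angles look profitable historically")
```

On a typical run somewhere around a third of those zero-edge angles come out in historical profit - which is exactly the trap.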
One way of improving your odds (reducing the risk) is to apply the law of large numbers and aggregate such instances - find maybe twenty or thirty trainer/course correlations that have worked historically and follow them as a group. That will reduce variation and so reduce the risk.
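A rough sketch of why the aggregation helps. Again using made-up, fair-priced bets (so there is no edge either way and the difference is purely about variance), it compares the spread of results from following a single 15-runner angle against a basket of 25 such angles - the basket's outcomes cluster far more tightly around the true expectation:

```python
import random
import statistics

random.seed(3)

# Illustrative assumptions: every bet wins 20% of the time at fair decimal
# odds of 5.0 (zero expected profit), and each angle contributes 15 bets.
WIN_PROB, FAIR_ODDS, BETS_PER_ANGLE = 0.20, 5.0, 15

def profit_per_bet(n_angles: int) -> float:
    """Average profit per bet from following n_angles angles."""
    total, bets = 0.0, n_angles * BETS_PER_ANGLE
    for _ in range(bets):
        total += (FAIR_ODDS - 1.0) if random.random() < WIN_PROB else -1.0
    return total / bets

# Repeat each strategy 2,000 times and compare how widely the results vary.
single = [profit_per_bet(1) for _ in range(2000)]
basket = [profit_per_bet(25) for _ in range(2000)]

# Aggregating angles shrinks the spread of outcomes (the risk), though it
# does nothing to the expected value itself.
print("std dev, single angle :", round(statistics.stdev(single), 3))
print("std dev, 25-angle mix :", round(statistics.stdev(basket), 3))
```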
But, putting numbers against the risks/confidence levels based on data set sizes - almost impossible. If it were that simple we'd all be rich.