A Maths Issue (how to solve this?)

Warbler

OK, without going right through it, I've found myself tracking the coronavirus 'new infections' data in an attempt to establish the direction of travel we're heading in (it's my belief, incidentally, that we're now at a level below the one we locked down at), but as ever, it's a bit muddy.

Now I'm using the daily DoH data, and running the reports as a 7-day moving/rolling average to smooth out the peaks on Tuesdays and the troughs on Sundays and Mondays. That's fine; I'm happy that's the right way of doing it. The problem of course is with the raw data, which can show an increase in positive tests simply because the more people you test, the more cases you detect.
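
For what it's worth, this is a minimal sketch of the rolling-average step (assuming the daily counts are already in a list; the figures and names are purely illustrative, not the real DoH data):

[CODE]
import pandas as pd

# Illustrative daily new-case counts (made-up numbers, not the real DoH figures)
daily_cases = [4500, 5200, 6100, 4800, 4300, 3900, 4700, 5100, 4950, 4600]

series = pd.Series(daily_cases)

# 7-day rolling mean: each value is the average of that day and the previous six,
# which smooths out the Tuesday peaks and the Sunday/Monday troughs
rolling_avg = series.rolling(window=7).mean()

print(rolling_avg)
[/CODE]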

To try and put this on a level footing, I've introduced a baseline of 10,000 tests. That is to say, divide the number of positives by the number of people tested, and multiply by 10,000. This figure then goes into the moving average until it drops out of the seven-day window.
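
Put as a formula, that's positives_per_10k = (positives / people_tested) * 10,000, fed into the same rolling average. A rough sketch of the same step with made-up figures:

[CODE]
import pandas as pd

# Hypothetical daily figures: positives detected and people tested that day
positives     = [1200, 1350, 1500, 1100, 1000,  950, 1250]
people_tested = [9000, 9800, 11000, 8500, 7800, 7200, 9500]

df = pd.DataFrame({"positives": positives, "tested": people_tested})

# Normalise to a baseline of 10,000 tests: positives / tested * 10,000
df["per_10k"] = df["positives"] / df["tested"] * 10_000

# The normalised figure then goes into the 7-day rolling average
df["per_10k_7day"] = df["per_10k"].rolling(window=7).mean()

print(df)
[/CODE]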

What I'm wrestling with, however, is the sample construct. Throughout March the only people tested were those presenting with symptoms or those who had been in contact with people who'd tested positive. That creates a skewed sample, but it's OK for analytical purposes so long as it's evenly applied, as you still get a trend line. In April (certainly the second half of the month) we began testing key workers (people who weren't reporting symptoms). Naturally the proportion of people testing positive falls, but that fall might also be attributable to the virus being less prevalent (which probably takes us into the realms of coefficients of determination).
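
A quick made-up illustration of the problem: the numbers below are invented, but they show how the same underlying prevalence can produce very different 'per 10,000 tests' figures once the sampling rule changes.

[CODE]
# Illustrative only: same underlying prevalence, different sampling rules.
# In March, tests went to symptomatic people and contacts, who are far more
# likely to be infected than a randomly chosen person; in April, asymptomatic
# key workers were added, who look more like the general population.

true_prevalence = 0.02          # assume 2% of the community infected in both periods

# Assumed (hypothetical) chance that a tested person is positive, by sampling rule
p_positive_targeted = 0.30      # symptomatic / contact-traced sample
p_positive_mixed    = 0.12      # sample diluted with asymptomatic key workers

tests = 10_000
print("March-style sample:", p_positive_targeted * tests, "positives per 10,000 tests")
print("April-style sample:", p_positive_mixed * tests, "positives per 10,000 tests")
# Same prevalence, very different normalised figures: the change in the sample
# construct moves the measured rate even when the virus hasn't moved.
[/CODE]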

Altering a sample during a survey period is of course a nightmare

Is there any method in quantitative analysis that anyone is aware of that could adjust for this?

I'm looking at an output at the moment which has probably inflated March's position because the sample was targeted. If you then project onto the 10,000-tested baseline, you get a higher figure than was probably the case. This in turn means I'm potentially over-estimating the prevalence of Covid-19 from circa March 15th onwards, which could render the whole conclusion (that we're broadly back at our March 21st position) wrong.

 
Is this essentially a 'goodness of fit' problem?

If so, employing statistical 'smoothing' tests might help

Chi-squared tests and the Poisson distribution spring to mind, both of which are available in Excel.

Not that I'd have a clue how to use them: I just know the names :)

This forum might prove useful:
https://math.stackexchange.com/

This pandemic, caused by such a seemingly atypical pathogen with so many unknowns and so many variables, should have statisticians salivating, so good luck with the research.
 
Doesn't the DoH data include re-tests, which (I'm guessing) would skew the figures upwards anyway? The premise being that negative tests are less likely to be repeated.
 
Cheers
Drone.

I think you're broadly in the right area for treating the predictive element of the data, but I can't make it stick.

The data roughly conforms to a normal distribution, so the null hypothesis of normality wouldn't be rejected, which makes a Pearson-style chi-squared test possible. I could re-examine it.
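
For what it's worth, this is roughly how I'd re-check that (a quick sketch in Python rather than Excel, with made-up figures standing in for the normalised series):

[CODE]
import numpy as np
from scipy import stats

# Hypothetical 7-day-average figures (per 10,000 tests) over a few weeks
values = np.array([210, 225, 198, 240, 233, 219, 205, 228, 214, 236,
                   222, 208, 231, 217, 226, 201, 243, 212, 229, 220])

# D'Agostino-Pearson normality test: combines skew and kurtosis into a
# chi-squared statistic under the null hypothesis that the sample is normal
stat, p_value = stats.normaltest(values)

print(f"statistic = {stat:.2f}, p = {p_value:.3f}")
# A large p-value means the null (normality) isn't rejected, which would
# support going on to use chi-squared / Pearson-style machinery.
[/CODE]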

Poisson I know less about in terms of how to apply the correction, as I wouldn't know what to test for. It's perhaps closer to something like a confidence test by way of a forecast.

The issue really stems from how the sampling changed during the survey period, and it can clearly fall apart on a small, targeted sample. You obviously can't extrapolate 4 positives from 8 tests into 5,000 positives per 10,000 tests and a 50% infection rate.
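
To put a number on how shaky that extrapolation is, here's a rough sketch using a Wilson confidence interval for a proportion (purely illustrative):

[CODE]
from statsmodels.stats.proportion import proportion_confint

# 4 positives out of 8 tests: the point estimate is 50%, but the sample is tiny
low, high = proportion_confint(count=4, nobs=8, alpha=0.05, method="wilson")

print(f"point estimate: {4 / 8:.0%}")
print(f"95% Wilson interval: {low:.0%} to {high:.0%}")
# The interval runs from roughly the low 20s% to the high 70s%, so projecting
# '50% of 10,000 tests' from 8 observations is meaningless.
[/CODE]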

The problem I've really got is how to adjust March's data safely. Something crude like using the lower quartile or the standard error might be more accurate than what I've done, but it's pure guesswork. I suspect March's data over-reports the amount of infection because the testing that was done was much more targeted. Therefore, when I say we've recovered our position as of March 20th, the chances are the amount of infection in the community on March 20th wasn't as high as I'm modelling.
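
If I did go the crude route, a sketch of what 'use the standard error to take a conservative lower bound for March' might look like (the two-standard-error knock-down is pure illustration, not a recommendation):

[CODE]
import math

# Hypothetical March day: heavily targeted testing
positives = 1_500
tested    = 6_000

p_hat = positives / tested                       # observed positivity
se    = math.sqrt(p_hat * (1 - p_hat) / tested)  # standard error of a proportion

# Crude conservative estimate: knock the positivity down by ~2 standard errors
# before projecting onto the 10,000-test baseline
p_lower = p_hat - 2 * se
per_10k_raw   = p_hat   * 10_000
per_10k_lower = p_lower * 10_000

print(f"raw projection:          {per_10k_raw:.0f} per 10,000 tests")
print(f"conservative projection: {per_10k_lower:.0f} per 10,000 tests")
# This only addresses sampling noise, not the targeting bias itself, which is
# why it still feels like guesswork.
[/CODE]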
 
Doesn't the DoH data include re-tests, which (I'm guessing) would skew the figures upwards anyway? The premise being that negative tests are less likely to be repeated.

I've been able to source reports for number of people tested as opposed to number of tests conducted.

In any event, it's the trend line that I'm after. The actual Y value needn't be that important so long as the method used to generate it is consistent across the survey period.
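
On that note, a minimal sketch of pulling a trend line out of the normalised series (a straight least-squares fit via numpy; the figures are made up):

[CODE]
import numpy as np

# Hypothetical 7-day-average 'positives per 10,000 tests' series
per_10k_7day = np.array([310, 295, 288, 270, 262, 255, 240, 231, 225, 214])
days = np.arange(len(per_10k_7day))

# Least-squares straight line: the slope gives the direction of travel
slope, intercept = np.polyfit(days, per_10k_7day, deg=1)

print(f"slope: {slope:.1f} per 10,000 tests per day")
# A negative slope means the normalised rate is falling, whatever the absolute
# Y values happen to be -- which is the point about method consistency.
[/CODE]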
 