Friday, March 18, 2011

False Positives (Part 2 of 4)

Pregnant, huh? Not this time, sorry, don't believe everything you read. But if at first you don't succeed, try, try again.
Suppose we pick a person off the street and want to know the probability that this person wears glasses. In our city, the only statistics available say that 65% of females and 40% of males wear glasses. We also know that 51% of the population is female and 49% male. How can we combine these to get the probability we want?

Call Y the event that our selected person is female. The event we are really interested in is X, that our selected person wears glasses. We can split X into two smaller events: X & Y (female and wears glasses) and X & not Y (male and wears glasses).

These two events are mutually exclusive, so:

P(X) = P(X & Y) + P(X & not Y)

Expressing P(X & Y) in terms of conditional probability, we get

P(X & Y) = P(X/Y)P(Y)

and similarly

P(X & not Y) = P(X/ not Y)P(not Y)

Putting these into the formula for P(X), we get what is known as the law of total probability:

P(X) = P(X/Y)P(Y) + P(X/ not Y)P(not Y)

At this point we can use the data we have: P(X/Y) = 0.65, P(X/ not Y) = 0.4, P(Y) = 0.51 and P(not Y) = 0.49. Putting this together:

P(X) = 0.65 x 0.51 + 0.4 x 0.49 = 0.5275
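
For anyone who would rather check this on a computer, here is a minimal Python sketch of the same calculation. The function name `total_probability` and its parameter names are made up for illustration; the numbers are the ones from the glasses example.

```python
def total_probability(p_x_given_y, p_y, p_x_given_not_y):
    """Return P(X) = P(X/Y)P(Y) + P(X/ not Y)P(not Y)."""
    p_not_y = 1.0 - p_y  # Y and not Y cover everyone, so they sum to 1
    return p_x_given_y * p_y + p_x_given_not_y * p_not_y


# Glasses example: 65% of females, 40% of males, 51% of people are female.
p_glasses = total_probability(p_x_given_y=0.65, p_y=0.51, p_x_given_not_y=0.40)
print(round(p_glasses, 4))  # 0.5275
```

So a little over half of the people we might pick wear glasses.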

Now for a second example, using the same idea. A test for a certain disease has the following accuracy: if someone has the disease, the test will produce a positive result 99% of the time, and give a false negative 1% of the time.

If someone does not have the disease, the test gives a negative result 95% of the time, and gives a false positive 5% of the time (pictured above).

The disease itself is quite rare (unlike in the picture). It occurs in just 0.03% of the population. A person, Harold, is picked at random from the population and tested. He tests positive. The critical question is: what is the probability he has the disease?

We need to translate this into mathematics.

Let X be the event that Harold (or his wife, just kidding) has the disease.

Before we factor in the test data, P(X) = 0.0003.

Let Y be the event that Harold tests positive. We can split this event to write ...

P(Y) = P(Y/X)P(X) + P(Y/ not X)P(not X), which comes out as ...

0.99 x 0.0003 + 0.05 x 0.9997 = 0.050282

We are really interested in P(X/Y), the probability that he has the disease, given that he has tested positive, and can work this out using Bayes' Theorem:

P(X/Y) = P(Y/X) x (P(X)/P(Y)) = 0.99 x (0.0003/0.050282) = 0.0059

So the probability that Harold has the disease, given that he has tested positive for it, is a little under 0.6%.
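
As a cross-check, here is a short Python sketch of Bayes' Theorem applied to the figures stated above (99% positive rate for sufferers, 5% false positive rate, 0.03% prevalence). The function name `posterior` and its parameter names are illustrative choices, not anything from the book.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(X/Y): the chance of having the disease given a positive test."""
    # P(Y) = P(Y/X)P(X) + P(Y/ not X)P(not X)
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    # Bayes' Theorem: P(X/Y) = P(Y/X)P(X) / P(Y)
    return sensitivity * prior / p_positive


p_disease_given_positive = posterior(
    prior=0.0003, sensitivity=0.99, false_positive_rate=0.05
)
print(f"{p_disease_given_positive:.4f}")  # 0.0059
```

Changing `prior` is a quick way to see how strongly the answer depends on how rare the disease is.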

The explanation for this surprising result is that the true positives form a high proportion of the very small number of disease sufferers.

These are greatly outnumbered by the false positives, a fairly small proportion of a hugely larger number of non-sufferers.

So, despite the seeming accuracy of the test, when randomly chosen people are tested, a large majority of the positive results will be false.
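
The same point can be made by counting heads. The sketch below imagines testing a hypothetical population of one million people (the population size is an assumption chosen only to give whole numbers) and uses exact fractions to avoid rounding noise.

```python
from fractions import Fraction

population = 1_000_000                   # hypothetical population size
prevalence = Fraction(3, 10_000)         # 0.03% have the disease
sensitivity = Fraction(99, 100)          # sufferers who test positive
false_positive_rate = Fraction(5, 100)   # non-sufferers who test positive

sufferers = population * prevalence                      # 300 people
non_sufferers = population - sufferers                   # 999,700 people
true_positives = sufferers * sensitivity                 # 297 people
false_positives = non_sufferers * false_positive_rate    # 49,985 people

# Share of all positive results that belong to real sufferers.
share_true = true_positives / (true_positives + false_positives)

print(true_positives, false_positives)   # 297 49985
print(f"{float(share_true):.4f}")        # 0.0059
```

Out of roughly fifty thousand positive results, fewer than three hundred belong to people who actually have the disease.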

From: Mathematics 1001, by Dr. Richard Elwes
