Is gender associated with marks scored or salary? How do you find out?
Association between a categorical variable and a numerical variable.
Today I thought I would cover something that really excited me when I was first learning statistics.
Association between a categorical variable and a numerical variable. What exactly does that mean? Let us consider an example of Gender and Salary or Gender and Marks. Salary and Marks are numerical variables. They take numbers as values.
Gender is a categorical variable.
Given a set of data, how can we check whether salary and gender are related? Or marks and gender are related?
When the categorical variable has exactly two categories, the correlation between these kinds of variables can be measured using what is called the point-biserial correlation coefficient.
The formula for the point biserial correlation coefficient is given by:
$$R_{pb} = \frac{Y_0 - Y_1}{S_x}\sqrt{P_0 P_1}$$
$$Y_0: \text{ mean of the numerical variable for the category encoded as } 0$$
$$Y_1: \text{ mean of the numerical variable for the category encoded as } 1$$
$$S_x: \text{ sample standard deviation of the numerical variable}$$
$$P_0: \text{ proportion of the data belonging to the category encoded as } 0$$
$$P_1: \text{ proportion of the data belonging to the category encoded as } 1$$
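When the point-biserial coefficient is close to 1 or -1, the categorical variable and the numerical variable are strongly associated; a value near 0 indicates little or no association. With the formula above, a positive value means the category encoded as 0 has the higher mean.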
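To make the formula concrete, here is a minimal Python sketch of it. The function name `point_biserial` and the toy data are my own, made up purely for illustration; only the standard library is used, and `statistics.stdev` is assumed to match the sample standard deviation (with n - 1 in the denominator) that the formula calls for.

```python
from math import sqrt
from statistics import mean, stdev


def point_biserial(values, codes):
    """Point-biserial coefficient as defined above: (Y0 - Y1) / S_x * sqrt(P0 * P1).

    `values` holds the numerical variable, `codes` the matching 0/1 category labels.
    """
    group0 = [v for v, c in zip(values, codes) if c == 0]
    group1 = [v for v, c in zip(values, codes) if c == 1]
    if not group0 or not group1:
        raise ValueError("both categories need at least one observation")

    n = len(values)
    p0, p1 = len(group0) / n, len(group1) / n
    s_x = stdev(values)  # sample standard deviation of all values
    return (mean(group0) - mean(group1)) / s_x * sqrt(p0 * p1)


# Hypothetical toy data, not taken from the course example.
toy_values = [1, 2, 3, 4]
toy_codes = [0, 0, 1, 1]
r_toy = point_biserial(toy_values, toy_codes)
print(round(r_toy, 3))
```

For the toy data the result is negative, because the category encoded as 1 has the larger mean.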
Let us see how the point-biserial coefficient can be put to use with an example.
The example below is from IIT Madras's course content. It gives the salaries of a sample of nurses.
Let us calculate each of the values one by one.
Let us encode F as 0 and M as 1.
That way:
$$\text{Average salary of females: } Y_0 = \frac{46+47+40+45+50+55+60+69+70+75+80}{11}$$
$$\therefore Y_0 = 57.909$$
$$\text{Average salary of males: } Y_1 = \frac{34+18+22+34+36+35+28+44+33}{9}$$
$$\therefore Y_1 = 31.556$$
$$\text{Proportion of females: } P_0 = \frac{11}{9+11} = 0.55$$
$$\text{Proportion of males: } P_1 = \frac{9}{9+11} = 0.45$$
$$\text{Sample standard deviation of the salaries: } S_x = 17.47 \text{ (calculated using a spreadsheet)}$$
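As a cross-check on the spreadsheet figure, the same intermediate values can be reproduced with Python's standard library. This is only a sketch: the variable names are mine, and the salaries are the ones listed in the two averages above.

```python
from statistics import mean, stdev

# Salaries as listed in the averages above (F encoded as 0, M as 1).
female = [46, 47, 40, 45, 50, 55, 60, 69, 70, 75, 80]
male = [34, 18, 22, 34, 36, 35, 28, 44, 33]

y0 = mean(female)  # average salary of females
y1 = mean(male)    # average salary of males

n = len(female) + len(male)
p0 = len(female) / n  # proportion of females
p1 = len(male) / n    # proportion of males

# Sample standard deviation (n - 1 denominator) over all 20 salaries.
s_x = stdev(female + male)

print(round(y0, 3), round(y1, 3), p0, p1, round(s_x, 2))
```

The standard deviation is taken over all 20 salaries together, not within each group.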
Now applying the formula,
$$R_{pb} = \frac{57.909 - 31.556}{17.47}\sqrt{0.55 \times 0.45}$$
$$R_{pb} \approx 0.75$$
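The result can also be checked end to end in code. The sketch below is my own addition, not part of the course material. It applies the formula to the raw salaries and then compares it with a plain Pearson correlation between the 0/1 codes and the salaries, computed by hand. Two differences are worth expecting: the Pearson value comes out negative, because it measures how salary changes as the code goes from 0 to 1, and its magnitude is slightly larger, because it corresponds to using the population standard deviation (`statistics.pstdev`) in place of the sample one.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

female = [46, 47, 40, 45, 50, 55, 60, 69, 70, 75, 80]  # encoded as 0
male = [34, 18, 22, 34, 36, 35, 28, 44, 33]            # encoded as 1
salaries = female + male
codes = [0] * len(female) + [1] * len(male)

n = len(salaries)
p0, p1 = len(female) / n, len(male) / n
diff = mean(female) - mean(male)  # Y0 - Y1

# Formula as used in this post (sample standard deviation).
r_pb = diff / stdev(salaries) * sqrt(p0 * p1)
# Same formula with the population standard deviation, for comparison.
r_pb_pop = diff / pstdev(salaries) * sqrt(p0 * p1)

# Pearson correlation between the 0/1 codes and the salaries.
mx, my = mean(codes), mean(salaries)
cov = sum((x - mx) * (y - my) for x, y in zip(codes, salaries))
pearson = cov / sqrt(
    sum((x - mx) ** 2 for x in codes) * sum((y - my) ** 2 for y in salaries)
)

print(round(r_pb, 2), round(r_pb_pop, 2), round(pearson, 2))
```

Either way the conclusion is the same: the association between gender and salary in this sample is strong.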
The point-biserial correlation coefficient of 0.75 is fairly close to 1, which indicates a strong association between gender and salary in the nurses data provided here. Because females were encoded as 0 and the coefficient is positive, the female nurses in this sample are, on average, better paid.