The Use of Notation in Basic Statistics – Part I
One thing that confuses students very frequently, and I would say more than necessary, is the liberal use of mathematical notation in Statistics, even at basic levels. More often than is desirable, instructors use notation that students are unsure about. Teachers rightly see notation as a way of expressing ideas precisely, unambiguously, and compactly. But as ideas build up, the notation can become convoluted enough to leave students confused and lost.
In the following paragraphs we will attempt to clarify the use of notation in Statistics from the bottom up, from notations in the most basic descriptive statistics, to the notation used in more sophisticated hypothesis tests.
Notation in Descriptive Statistics
The following symbols are commonly used when working with descriptive statistics. They remain in use throughout most of a Statistics course.
\(\bar{X}\): This is the sample mean, which corresponds to the arithmetic average of the values from a sample \({{X}_{1}}\), \({{X}_{2}}\),...,\({{X}_{n}}\). This is a statistic (because it is constructed with sample information). In some courses, especially in the Social and Behavioral Sciences, \(M\) is used to refer to the sample mean.
\({s}^{2}\): This is the sample variance, which is computed as
\[{{s}^{2}}=\frac{1}{n-1}\left( \sum\limits_{i=1}^{n}{X_{i}^{2}}-\frac{1}{n}{{\left( \sum\limits_{i=1}^{n}{{{X}_{i}}} \right)}^{2}} \right)\]
This is a statistic (because it is constructed with sample information). There are other versions of the above formula, but they all lead to the same numerical value.
\(s\): This is the sample standard deviation, which is computed by taking the square root of the sample variance, or directly from the sample data \({X}_{1}\), \({{X}_{2}}\),...,\({X}_{n}\) with the following formula:
\[s=\sqrt{\frac{1}{n-1}\left( \sum\limits_{i=1}^{n}{X_{i}^{2}}-\frac{1}{n}{{\left( \sum\limits_{i=1}^{n}{{{X}_{i}}} \right)}^{2}} \right)}\]
This is a statistic (because it is constructed with sample information). There are other versions of the above formula, but they all lead to the same numerical value.
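To make the formulas above concrete, here is a minimal Python sketch that computes \(\bar{X}\), \({s}^{2}\) and \(s\) with the computational formula shown above. The function names `sample_variance` and `sample_std` and the data values are made up for illustration; the standard library function `statistics.variance` is used only as a cross-check.

```python
import math
import statistics


def sample_variance(xs):
    """Sample variance: (sum of X_i^2 - (sum of X_i)^2 / n) / (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    sum_x = sum(xs)                  # sum of X_i
    sum_x2 = sum(x * x for x in xs)  # sum of X_i^2
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical sample X_1, ..., X_6 (illustrative values only)
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

x_bar = sum(data) / len(data)  # sample mean
s2 = sample_variance(data)     # sample variance
s = sample_std(data)           # sample standard deviation

print(x_bar, s2, s)
print(statistics.variance(data))  # cross-check against the standard library
```

For this sample, \(\sum X_i = 30\) and \(\sum X_i^2 = 174\), so \({s}^{2} = (174 - 900/6)/5 = 4.8\), which agrees with the library result.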
\(SS\): This is the "sum of squares". This statistic measures the squared variation of a variable \(X\) with respect to the sample mean. If you have a sample \({{X}_{1}}\), \({X}_{2}\),...,\({{X}_{n}}\), the formula used to compute it is
\[SS=\sum\limits_{i=1}^{n}{{{\left( {{X}_{i}}-\bar{X} \right)}^{2}}}\]
Often, a subscript is used to indicate which variable we refer to, if it is not clear from context. For example, you can write \(S{{S}_{X}}\) to refer to the sum of squares of variable \(X\), or \(S{{S}_{Y}}\) to refer to the sum of squares of variable \(Y\). In the Social and Behavioral Sciences, you will typically see the sum of squares of \(X\) written as \(SS_{XX}\) instead of \(SS_{X}\), but this is simply a matter of which notation is preferred. There are also equivalent expressions for the sum of squares. For example, here are two alternative ways to write it:
\[S{{S}_{XX}}=\sum\limits_{i=1}^{n}{{{\left( {{X}_{i}}-\bar{X} \right)}^{2}}}=\sum\limits_{i=1}^{n}{X_{i}^{2}}-\frac{1}{n}{{\left( \sum\limits_{i=1}^{n}{{{X}_{i}}} \right)}^{2}}\]
Based on the above, there is a clear link between the sample variance and the sum of squares:
\[{{s}^{2}}=\frac{S{{S}_{XX}}}{n-1}\]
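The equivalence of the two expressions for the sum of squares, and its link to the sample variance, can be checked numerically. The following is a small sketch in Python; the function names `ss_definitional` and `ss_computational` and the sample values are hypothetical, chosen only for illustration.

```python
def ss_definitional(xs):
    """Sum of squares as the sum of squared deviations from the sample mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs)


def ss_computational(xs):
    """Sum of squares as (sum of X_i^2) - (sum of X_i)^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n


# Hypothetical sample (illustrative values only)
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

ss_def = ss_definitional(data)    # 9 + 1 + 1 + 0 + 4 + 9
ss_comp = ss_computational(data)  # 174 - 900 / 6

# Link to the sample variance: s^2 = SS_XX / (n - 1)
s2_from_ss = ss_def / (len(data) - 1)

print(ss_def, ss_comp, s2_from_ss)
```

Both versions give 24 for this sample, and dividing by \(n-1=5\) gives \({s}^{2}=4.8\). In floating-point arithmetic the second form can lose precision when the values are large relative to their spread, so the deviation-based form is usually the safer choice in code.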
Notice that notation is sometimes excessive, and sometimes inconsistent. Indeed, it is very common to use a subscript for the sum of squares (as in \(S{{S}_{XX}}\)) to indicate which variable we are referring to (\(X\) in this case). For the variance or the standard deviation, however, such subscripts are less common, although still acceptable. For example, you can write \({{s}_{X}}\) to specify the sample standard deviation of variable \(X\) or, more precisely, the sample standard deviation computed from the sample \({{X}_{1}}\), \({{X}_{2}}\),...,\({{X}_{n}}\) that comes from the random variable \(X\).
\(m\): Sample median. The point (or interpolated point) that marks the middle of the distribution. There is no universal agreement on referring to the sample median as \(m\), but it is a common practice.
\({{Q}_{j}}\): This is the \(j\)-th quartile, with \(j=1,2,3\). These are the three points (or interpolated points) that divide the distribution into quarters. Notice that \({{Q}_{2}}\) is the median.
\({{P}_{x}}\): This is the \(x\)-th percentile, with \(0\le x\le 100\). It is the point (or interpolated point) such that \(x\) percent of the distribution lies to its left. Observe that \(m={{Q}_{2}}={{P}_{50}}\).
IQR: This is the interquartile range, defined as \(IQR={{Q}_{3}}-{{Q}_{1}}\), which is the difference between the third and first quartiles. It is commonly used as a measure of dispersion and to detect outliers.
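As an illustration of the median, quartiles, percentiles and IQR, here is a short Python sketch based on the standard library `statistics` module. The sample values are made up, and the choice of `method="inclusive"` is an assumption: textbooks and software packages use different interpolation rules, so quartiles computed elsewhere may differ slightly. The 1.5 × IQR fences at the end are one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical sample (illustrative values only)
data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

m = statistics.median(data)  # sample median

# n=4 returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# n=100 returns 99 cut points; P_x sits at index x - 1 for x = 1, ..., 99
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p50 = percentiles[50 - 1]

iqr = q3 - q1  # interquartile range

# Assumed convention: flag values beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(m, q1, q2, q3, p50, iqr, outliers)
```

With this sample the three notations for the middle of the distribution coincide, \(m={{Q}_{2}}={{P}_{50}}\), as stated above.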
Other descriptive statistics: There are many less commonly used descriptive statistics for which there are no universal symbols. For example, skewness, kurtosis, and moments of higher order are sometimes used, but no compact symbols are universally used to denote them.