Random Variables

Up until now, we have been talking about events and outcomes in a fairly abstract way. Outcomes of experiments can be pretty much anything: the colours of objects, the names of people, etc. In order to do some more satisfying mathematics, and eventually talk about statistics, we want to assign values to each outcomes in a sample space.

Definition: A random variable $X:\mathcal{S}\to\mathbb{R}$ is a function from the sample space of an experiment to the real numbers.

This is really just another way of saying that for each outcome in a sample space, we assign a real number to it. We can then define an event space by grouping the outcomes that have the same value together. Typically, we use the capital $X$ to refer to the random variable in an abstract sense, and a lower case $x$ when we're referring to the values that the random variable takes. So if $E$ is an outcome in the sample space that we have assigned the value $x$, we would write $X(E) = x$.

Example: When flipping a coin, there are two possible outcomes: heads, $H$, and tails $T$. We can assign a value of $1$ to $H$, and a value of $0$ to $T$. Then in this experiment we would have $$\mathcal{S} = \{H,T\}$$ and $$X(s) = \begin{cases} 1 &\text{if}~s=H\\ 0 &\text{if}~s=T. \end{cases} $$

There are many reasons we might want to deal with random numbers rather than abstract events and outcomes. For example, we can put numbers in order, which is hard to do for things like colours. We can also add, subtract, multiply, and divide numbers, which would be hard to do with names of people.

Random variables can be split into two types based on the values that they take. A random variable is called discrete if it can only take a countable number of values. When we say that there are "a countable number of values", this doesn't necessarily mean that there are a finite number of possible values, rather that we can come up with a method to start count all of them, even if we had to take an infinite amount of time to do so. A random variable is continuous if it is not discrete.

In most cases, there will be a correspondence between whether a quantity is measured or counted, and whether a random variable is continuous or discrete. You should be able to write the range of a discrete random variable as a list, or the start of a list, while the range of a continuous random variable will have to expressed as an interval, or union of intervals.

Examples

  • The random variable from the coin flipping example is discrete, since there are only two possible values, and we can count to two.
  • The exact volume of water in a randomly chosen body of water is a continuous random variable, since we can't count volume. The volume can take on any value larger than $0$ liters.
  • Is the number of stars in the universe discrete or continuous?

Discrete Random Variables

The intuition for dealing with discrete random variables is easier to develop than with continuous random variables, although the mathematics is a little harder. For each random variable, we can talk about its distribution, which is a description of how the probabilities of each possible value are arranged. There are two main ways to do this for discrete random variables.

Definition: The probability mass function (pmf) of a discrete random variable is the function $$f(x) = P(\{s\in\mathcal{S} ~|~ X(s) = x \}).$$ I.e., the probability that the random variable $X$ takes the specific value $x$.

Definition: The cumulative distribution function (cdf) of a discrete random variable is the function $$F(x) = P(\{s\in\mathcal{S}~|~X(s) \le x\}).$$ I.e. the probability that the random variable $X$ takes any value less than or equal to a specific value $x$.

Example: Consider an experiment where two dice are rolled, and let $X$ be the sum of the pips on both dice.

  1. Find the pmf and plot it as a histogram.
  2. Find and plot the cdf.

I. As discussed in other examples, there are 36 possible outcomes for the roll of two dice. There isn't exactly a nice closed form expression for the probabilities, but we can see that 1 outcome leads to a sum of 2, 2 outcomes lead to a sum of 3, and so on. The pmf is most easily summarized using a histogram or bar chart. Probability mass function for the sum of two dice.

II. Again, there will not be any nice closed form for the cdf, but we can find the value of cdf at $x$ by summing up the values of the pmf that are less than $x$. $$F(x) = \sum_{j = 1}^{x} f(j)$$ For example, if we want to know $F(4)$ we calculate \begin{align}F(4) &= \sum_{j=1}^4 f(j)\\ &= f(1)+f(2)+f(3)+f(4)\\ &= 0+\frac{1}{36} + \frac{2}{36} + \frac{3}{36}\\ &= \frac{1}{6}. \end{align} The cdf is also nicely summarized using a histogram or bar chart. Cumulative distribution function for the sum of two dice.

Exercises:

  • Consider an experiment where a coin is flipped three times. Find and plot the pmf and cdf for the number of heads.
  • For each of the following random variables, determine whether it is discrete or continuous, and determine a possible range for the values.
    1. The amount words in a randomly chosen book
    2. The density of a randomly selected pine 2x4
    3. The ratio of blue candies to the total number of candies in a randomly selected package of smarties