Statistical Thinking

A distribution is a function that shows all possible values of a variable and how frequently each value occurs.
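As a rough illustration, a distribution of a small data set can be written out as a frequency table. The sketch below uses made-up values (the `pets` list is hypothetical) and Python's standard `collections.Counter` to count how often each value occurs.

```python
from collections import Counter

# Hypothetical observations of one variable (pets per household).
pets = [0, 1, 1, 2, 0, 1, 3, 1, 2, 0]

# Count how frequently each possible value occurs.
frequencies = Counter(pets)

# Print the values in order with their frequencies.
for value in sorted(frequencies):
    print(value, frequencies[value])
```

Each printed row pairs a possible value with its frequency, which is the information a histogram of the same data would show.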

Mean and Standard Deviation

The mean, or average, describes the center of a numeric distribution. It is calculated by adding all values and dividing by the count.
The standard deviation describes the spread of values in a numeric distribution relative to the mean. It is calculated by finding the average squared distance from each data point to the mean and taking the square root of the result.
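The two calculations can be sketched step by step as follows. The sample values are invented for illustration, and the code divides by the count exactly as described here; some tools divide by the count minus one instead (the sample standard deviation), so their results can differ slightly.

```python
import math

# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Mean: add all values and divide by the count.
mean = sum(values) / len(values)

# Standard deviation: average the squared distances to the mean,
# then take the square root of that average.
squared_distances = [(v - mean) ** 2 for v in values]
variance = sum(squared_distances) / len(values)
std_dev = math.sqrt(variance)

print(mean, std_dev)
```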

Skewed

A skewed distribution is asymmetrical with a steep change in frequency on one side and a flatter, trailing change in frequency on the other.

Median and IQR

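The median is the middle value when all values are arranged from smallest to largest.
IQR: A quartile is simply a marker for a quarter (25%) of the data. The interquartile range (IQR) is the distance between the first quartile and the third quartile, so it covers the middle 50% of the values.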
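A minimal sketch of both measures, assuming the common definition of the IQR (interquartile range) as the third quartile minus the first. The values are hypothetical, and `method="inclusive"` is just one of several quartile conventions; other conventions can give slightly different cut points.

```python
import statistics

# Hypothetical values, deliberately unsorted.
values = [7, 1, 9, 3, 5, 2, 8, 4, 6]

# Median: the middle value once the data is sorted.
median = statistics.median(values)

# Quartiles: cut points at 25%, 50%, and 75% of the sorted data.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# IQR: the width of the middle 50% of the data.
iqr = q3 - q1

print(median, q1, q3, iqr)
```

The second quartile is the median, so `q2` and `median` agree here.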

Outliers and Robust

Outliers are extreme values that are distant from the rest of the distribution. Like skewness, outliers tend to influence the mean more heavily than the median.
Because the median and IQR are NOT heavily influenced by extreme values, we say they are robust. Robust statistics are often a better choice to measure the center and spread of a distribution that is skewed or has outliers.
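One way to see robustness is to add a single extreme value to a small data set and compare how far the mean and the median move. The numbers below are invented for illustration.

```python
import statistics

# Hypothetical values without and with one extreme value.
typical = [10, 12, 13, 15, 16]
with_outlier = typical + [200]

# How far each measure of center moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)

print(mean_shift, median_shift)
```

In this sketch the mean moves by more than 30 while the median moves by 1, which is the behavior the term robust refers to.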

Aggregate Data and Variable Relationships

The mode is defined as the value with the highest frequency, but we can also think of the mode as the value where the peak of the distribution occurs.
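As a small sketch, the mode can be found by counting each value and taking the most frequent one. The `colors` list is hypothetical data.

```python
from collections import Counter

# Hypothetical categorical observations.
colors = ["red", "blue", "red", "green", "blue", "red"]

# Count each value, then take the one with the highest frequency.
counts = Counter(colors)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```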

Aggregated data is created by summarizing a numeric variable across each value of a categorical variable.
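A minimal sketch of aggregation in plain Python, assuming the summary of interest is the mean. The `records` list and its category names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (categorical value, numeric value).
records = [
    ("small", 10.0),
    ("large", 30.0),
    ("small", 14.0),
    ("large", 34.0),
    ("medium", 20.0),
]

# Collect the numeric values under each category.
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# Summarize each group with its mean.
mean_by_category = {c: sum(v) / len(v) for c, v in groups.items()}

print(mean_by_category)
```

The same pattern works with other summaries, such as the median or the count, by swapping the expression inside the final dictionary comprehension.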

Aggregating data is a way of exploring variable relationships. We specifically looked at relationships between a numeric variable and a categorical variable, but we should also examine relationships between two numeric variables.

We can describe this relationship more precisely by measuring the correlation coefficient. This number ranges from -1 to +1 and tells us two things about a linear relationship:

Direction: A positive coefficient means that higher values in one variable are associated with higher values in the other. A negative coefficient means higher values in one variable are associated with lower values of the other.
Strength: The farther the coefficient is from 0, the stronger the relationship and the more the points in a scatter plot look like a line.
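To make direction and strength concrete, here is a sketch that computes the Pearson correlation coefficient, the most common form of the correlation coefficient, for a few invented data sets. The function name `pearson_r` and all of the values are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

# Hypothetical data sets.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # perfectly linear, increasing
falling = [10, 8, 6, 4, 2]   # perfectly linear, decreasing
noisy = [2, 5, 3, 8, 6]      # increasing overall, but scattered

r_rising = pearson_r(hours, rising)
r_falling = pearson_r(hours, falling)
r_noisy = pearson_r(hours, noisy)

print(r_rising, r_falling, r_noisy)
```

The sign of each result gives the direction, and the scattered data set produces a coefficient closer to 0 than the two perfectly linear ones.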