Statistics

Statistics is essential for extracting insights and drawing conclusions from data.

Basic Formulas

Mean (μ) - the average result of a test, survey, or experiment; add the values and divide the sum by the number of values; x̄ is the symbol for the sample mean

Median - the score that divides the sorted results in half; the middle value; if the set has an even number of elements, take the average of the middle two values

Mode - the most common/frequent value of a test, survey, or experiment; there can be no mode or more than one mode

Range - the difference between the largest and smallest elements in a set

Proportion - fraction of the population that has a certain characteristic (p for population proportion, p̂ for sample proportion)

mean: μ = Σxᵢ / N
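A quick way to sanity-check these definitions is Python's built-in statistics module; a minimal sketch with a made-up sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample values

print(statistics.mean(data))    # add the values, divide by the count -> 5.0
print(statistics.median(data))  # middle value of the sorted data -> 4.5
print(statistics.mode(data))    # most frequent value -> 4
print(max(data) - min(data))    # range: largest minus smallest -> 7

# sample proportion p̂: fraction of values with some characteristic,
# e.g. the share of values greater than 4
p_hat = sum(1 for x in data if x > 4) / len(data)
print(p_hat)                    # -> 0.5
```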


Distribution

Statistic vs. Parameter - a statistic describes a sample (x̄, s, p̂), while a parameter describes the entire population (μ, σ, p)

Variance - a measure of how far a set of numbers is spread out from the mean; tells you the degree of spread in the data set, i.e., a measure of uncertainty/fluctuation/variability; expressed in squared units, e.g., meters²

Standard Deviation - the square root of the variance; on average, how far each value lies from the mean; more commonly reported than variance because it is expressed in the same units as the original values

Sample Standard Deviation - used when a sample serves as an estimate of the whole population; divide by N − 1 instead of N (Bessel's correction)

variance: var(x) = σ² = Σ(xᵢ − μ)² / N

standard deviation: σ = √( Σ(xᵢ − μ)² / N )

sample standard deviation: s = √( Σ(xᵢ − x̄)² / (N − 1) )
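The same module covers all three formulas; a minimal sketch (pvariance/pstdev divide by N, stdev divides by N − 1):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print(statistics.pvariance(data))  # population variance, divides by N -> 4.0
print(statistics.pstdev(data))     # population standard deviation -> 2.0
print(statistics.stdev(data))      # sample standard deviation, divides by N - 1 -> ~2.14
```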

 

Normal Distribution - also called “the bell curve” or the Gaussian distribution

  • Values cluster around the mean (central tendency); extreme results tend to be followed by results closer to the mean (“regression to the mean”)

  • Standard deviation measures the variation on either side of the mean

  • The curve is symmetrical; the mean, median, and mode are all equal

  • The Empirical Rule states that:

‣ 68% of the data falls within 1σ of the mean
‣ 95% of the data falls within 2σ of the mean
‣ 99.7% of the data falls within 3σ of the mean

  • Total area under the curve equals 100% or 1

  • Use the z-table to find the area under the curve up to a given data point, i.e., its cumulative probability (see the sketch below)
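The z-table values are just the standard normal cumulative distribution, Φ(z) = (1 + erf(z/√2)) / 2; a minimal sketch using only the standard library to reproduce the Empirical Rule:

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z (the z-table value)."""
    return (1 + erf(z / sqrt(2))) / 2

# area within k standard deviations of the mean is phi(k) - phi(-k)
for k in (1, 2, 3):
    print(f"within {k}σ: {phi(k) - phi(-k):.3f}")
# -> 0.683, 0.954, 0.997 (the Empirical Rule)
```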

Uniform Distribution - all outcomes are equally probable; e.g., the probability of rolling a 1, 2, 3, 4, 5, or 6 on a fair die is 1/6 each

Z-score - (or standard score) tells you how many standard deviations a data point is from the mean; formula is z = (x − μ) / σ

Standard Error (SE) - the standard deviation of the sampling distribution of the sample mean, i.e., how far a sample mean tends to fall from the population mean; SE = σ / √n; the z-score for a sample mean is z = (x̄ − μ) / (σ / √n)
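For example, a minimal sketch with made-up numbers: a sample of n = 25 with mean 105, drawn from a population with μ = 100 and σ = 15:

```python
from math import sqrt

mu, sigma = 100, 15  # hypothetical population mean and standard deviation
x_bar, n = 105, 25   # hypothetical sample mean and sample size

se = sigma / sqrt(n)   # standard error of the mean: 15 / 5 = 3.0
z = (x_bar - mu) / se  # z-score of the sample mean: 5 / 3 ≈ 1.67
print(se, z)
```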

Student’s T-test - a method for testing the significance of the differences between sample means when the population standard deviation is unknown; i.e., “did this happen by chance?” (a sketch of all three variants follows the list below)

‣ Independent Samples t-test - compares the means for two groups
‣ Paired sample t-test - compares means from the same group at different times
‣ One sample t-test - tests the mean of a single group against a known mean
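A minimal sketch of all three variants, assuming SciPy is available and using made-up measurements; each call returns the t-statistic and its p-value:

```python
from scipy import stats

group_a = [5.1, 4.9, 6.2, 5.8, 5.5]  # hypothetical data
group_b = [4.2, 4.8, 4.5, 5.0, 4.1]
before  = [1.2, 1.5, 1.1, 1.9, 1.4]
after   = [1.5, 1.7, 1.4, 2.2, 1.6]

# independent samples: means of two separate groups
print(stats.ttest_ind(group_a, group_b))

# paired samples: the same group measured at two different times
print(stats.ttest_rel(before, after))

# one sample: one group's mean against a known mean (here 5.0)
print(stats.ttest_1samp(group_a, popmean=5.0))
```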

T-score - (or T-value) the ratio of the difference between two groups to the variability within the groups

‣ The larger the t-score, the more difference there is between groups
‣ The smaller the t-score, the more similarity there is between groups
‣ A t-score of 3 means the groups are three times as different from each other as they are within themselves
‣ When running a t-test, the bigger the t-value, the more likely it is that the results are repeatable
‣ Every t-value has a p-value to go with it

Student’s T Distribution (T-distribution)

  • a probability distribution similar to the normal distribution but with heavier tails

  • extreme values are more likely than under a normal distribution

  • used when the standard deviation for the population is unknown

  • the smaller the degrees of freedom (df), the heavier the tails

Degrees of Freedom - (or “df”) the maximum number of logically independent values, i.e., values that have the freedom to vary, in the data sample; formula is the sample size minus one, or df = N − 1

Ex. The sets [3, 5, 10], [2, 7, 9], and [11, 1, 6] each have a mean of 6. To build such a set, you have the freedom to choose any first two numbers, but the third number is then fixed in order to achieve the mean of 6. So df = 2.

z-score vs. t-score

  • Use the t-score when the sample size is small (n < 30) or the population standard deviation is unknown

  • Use the z-score when the sample size is large (n ≥ 30) and the population standard deviation is known

Confidence Interval

  • a range of values, above and below the sample statistic, that is likely to contain an unknown population parameter; equivalently, the proportion of such intervals that would contain the true population parameter if you drew a random sample many times

  • expresses the degree of uncertainty associated with a sample estimate of a population parameter; most often, researchers choose 90%, 95%, or 99%

  • Not to be confused: 95% confidence interval does not mean 95% chance that the population parameter falls between a and b. It means that 95% of the interval estimates would include the parameter.

  • often reported alongside statistical methods like the t-test

  • formula is sample statistic ± margin of error

Margin of error - the range of values above and below the sample statistic in a confidence interval; formula is critical value × standard deviation of the statistic, or critical value × standard error of the statistic

Critical value - a cut-off value that tells us how far from the sample statistic the interval can extend while remaining at the chosen confidence level; this is usually a t-score or z-score (e.g., 1.96 standard errors for a 95% z-interval)
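Putting the last three definitions together, a minimal sketch of a 95% confidence interval for a mean, with SciPy assumed for the t critical value and a made-up sample:

```python
from math import sqrt
import statistics
from scipy import stats

data = [4.8, 5.2, 5.9, 4.6, 5.5, 5.1, 4.9, 5.3]  # hypothetical sample

n = len(data)
x_bar = statistics.mean(data)
se = statistics.stdev(data) / sqrt(n)  # standard error of the statistic

t_crit = stats.t.ppf(0.975, df=n - 1)  # critical value for 95% confidence, df = n - 1
margin = t_crit * se                   # margin of error = critical value × standard error

print(f"{x_bar:.2f} ± {margin:.2f}")   # sample statistic ± margin of error
```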


Hypothesis & Testing

Significance (α) - the threshold used to judge whether the results of research were due to chance; the significance level is generally fixed at 0.05 (this is just a statistical benchmark)

P-value - “probability value”; the way in which significance is quantified and reported statistically

‣ the probability of getting a result that is the same as, or more extreme (rare/special) than, what was actually observed, assuming the null hypothesis is true; “did the result from the sample data occur by chance?”
‣ used as an alternative to a fixed rejection point: the p-value is the smallest significance level at which the null hypothesis would be rejected (reject when p < α, e.g., α = 0.05)
‣ the smaller the p-value, the stronger the evidence in favor of the alternative hypothesis

P > 0.10 (>10%) - No evidence against H0; nothing special
P > 0.05 (>5%) - Weak evidence against H0
P = 0.05 (5%) - Moderate evidence against H0; statistically significant
P < 0.05 (<5%) - Good evidence against H0
P < 0.01 (<1%) - Strong evidence against H0
P < 0.001 (<0.1%) - Very strong evidence against H0; highly significant

P-value approach to hypothesis testing - the p-value of the hypothesis test is reported, and readers can interpret the statistical significance themselves

Null Hypothesis (H0) - also called H-null, H-zero, H-nought, or “the conjecture”; the commonly accepted fact. This is the default hypothesis that a quantity to be measured is zero (or null). It proposes that no statistical significance exists in a set of given observations.

Alternative Hypothesis (H1 or Ha) - a “hunch”; this is the driver of our data collection and experimental design

Upper-tailed test - μ1 > μ0, where an increase is hypothesized

Lower-tailed test - μ1 < μ0, where a decrease is hypothesized

Two-tailed test - μ1 ≠ μ0, where a difference is hypothesized
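In terms of the standard normal CDF Φ from the z-table sketch above, the three tests correspond to three different p-value calculations; a minimal sketch with a hypothetical test statistic z = 1.8:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return (1 + erf(z / sqrt(2))) / 2

z = 1.8  # hypothetical test statistic

print(1 - phi(z))             # upper-tailed: P(Z >= z)   -> ~0.036
print(phi(z))                 # lower-tailed: P(Z <= z)   -> ~0.964
print(2 * (1 - phi(abs(z))))  # two-tailed: P(|Z| >= |z|) -> ~0.072
```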


Statistical Relationship

Correlation - the degree to which two factors appear to be related (not the same as causation)

Causation - a relationship where one factor causes the other

Regression - a statistical method that determines the relationship between a dependent variable (Y-axis) and one or more independent variables (X-axis)

Coefficient of Determination - or R-squared (R²); the percentage of variation in y explained by the x-variables; i.e., how much the differences in one variable can be explained by the differences in another variable; expressed as a percentage or a value between 0 and 1
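A minimal sketch of both ideas using the standard library (statistics.linear_regression and statistics.correlation need Python 3.10+); for simple linear regression, R² is the square of the correlation coefficient r:

```python
import statistics

x = [1, 2, 3, 4, 5, 6]               # hypothetical independent variable
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1]  # hypothetical dependent variable

slope, intercept = statistics.linear_regression(x, y)
r = statistics.correlation(x, y)

print(f"y = {slope:.2f}x + {intercept:.2f}")  # fitted regression line
print(f"R² = {r ** 2:.3f}")                   # share of variation in y explained by x
```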

Anscombe’s Quartet - an example in which four datasets have nearly identical statistical metrics (e.g., mean, variance, correlation) but look very different when the data is plotted on a graph. Don’t rely on statistical metrics alone; use data visualization to get the full picture.

[Figure: Anscombe’s Quartet - four scatter plots with near-identical summary statistics]
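A minimal sketch with the first two of Anscombe's four published datasets (values transcribed from Anscombe, 1973; statistics.correlation needs Python 3.10+): the summary statistics match, but set I is roughly linear while set II is a clear curve:

```python
import statistics

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values shared by sets I and II
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]  # set I
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]   # set II

for y in (y1, y2):
    print(round(statistics.mean(y), 2),            # -> 7.5 for both
          round(statistics.variance(y), 2),        # -> ~4.13 for both
          round(statistics.correlation(x, y), 2))  # -> ~0.82 for both
```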


Probability

Experiment - (in probability theory) a procedure that involves chance and can be infinitely repeated; an experiment is random if there is more than one possible outcome, and deterministic if it has only one

Outcome - the result of the experiment

Conditional Probability - the likelihood of an outcome based on the occurrence of a previous outcome; calculated by multiplying the probability of the preceding event by the updated probability of the succeeding event

Bayes’ Theorem - a mathematical formula for determining conditional probability; formula is P(A|B) = P(A ∩ B) / P(B) = P(A) × P(B|A) / P(B)

  • named after 18th-century mathematician Thomas Bayes

  • allows one to update the predicted probabilities of an event by incorporating new information

  • often used in finance for calculating or updating risk evaluation

  • useful when implementing machine learning
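A minimal sketch with made-up numbers: updating the probability that a condition is present (A) after a positive test result (B), for a test with 99% sensitivity, a 5% false-positive rate, and 1% prevalence:

```python
p_a = 0.01              # P(A): prevalence of the condition (hypothetical)
p_b_given_a = 0.99      # P(B|A): positive test given the condition is present
p_b_given_not_a = 0.05  # P(B|not A): false-positive rate

# total probability of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(A) × P(B|A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b
print(round(p_a_given_b, 3))  # -> 0.167: most positive results are false positives
```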

independent probability: P(A ∩ B) = P(A) × P(B)

conditional (dependent) probability: P(A ∩ B) = P(A) × P(B|A)

Formulas and Symbols

  • Probability of ‘A’ and ‘B’ = P(A ∩ B)

  • Probability of ‘A’ or ‘B’ = P(A ∪ B)

  • Probability of ‘A’ given event ‘B’ occurred = P(A | B)

Combination - order does not matter; e.g., ingredients for a salad; formula is C(n, r) = n! / ( r! (n − r)! )

Permutation - order matters; there are more possible permutations than combinations because order matters; e.g., a phone number or lock code; formula is P(n, r) = n! / (n − r)!

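Both formulas ship with the standard library (Python 3.8+) as math.comb and math.perm; a minimal sketch:

```python
from math import comb, perm

# combination: 3-ingredient salads from 10 ingredients, order does not matter
print(comb(10, 3))  # C(10, 3) = 10! / (3! · 7!) -> 120

# permutation: 4-digit codes using distinct digits 0-9, order matters
print(perm(10, 4))  # P(10, 4) = 10! / 6! -> 5040
```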