statistics

essential for extracting insights and drawing conclusions

Basic Formulas

Mean (μ) - the average result of a test, survey, or experiment; add the values and divide the sum by the number of values; μ denotes the population mean and x̄ the sample mean

Median - the score that divides the sorted results in half; the middle value; if the set has an even number of elements, take the average of the middle two values

Mode - the most common/frequent value of a test, survey, or experiment; there can be no mode or more than one mode

Range - the difference between the largest and smallest elements in a set
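A minimal sketch of the four measures above, using Python's standard `statistics` module. The `scores` list is made-up illustration data, not from any real survey.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]  # hypothetical results (made-up values)

mean = statistics.mean(scores)           # sum of the values / number of values
median = statistics.median(scores)       # even count: average of the middle two
modes = statistics.multimode(scores)     # list of all most-frequent values
value_range = max(scores) - min(scores)  # largest element minus smallest

print(mean, median, modes, value_range)
```

`multimode` returns a list, which fits the note that a set can have more than one mode; when every value occurs once it returns all of them.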

Proportion - fraction of the population that has a certain characteristic (p for population proportion, p̂ for sample proportion)

Distribution

Statistic vs. Parameter - a statistic describes a sample (x̄, s, p̂), whereas a parameter describes the entire population (μ, σ, p)

Variance - a measure of how far a set of numbers is spread out from its mean; tells you the degree of spread in the data set, i.e. a measure of uncertainty/fluctuation/variability; expressed in squared units, e.g., meters² for data measured in meters

Standard Deviation - square root of variance; roughly, how far a typical value lies from the mean; more commonly reported than variance because it is expressed in the same unit as the original values

Sample Standard Deviation - used when a sample serves as an estimate of the whole population; divide by N-1 instead of N when computing the variance
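The difference between dividing by N and by N-1 can be seen with the standard `statistics` module, which offers both versions. The `data` list is made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements (made-up values)

pop_var = statistics.pvariance(data)    # population variance: divides by N
pop_sd = statistics.pstdev(data)        # square root of the population variance
sample_var = statistics.variance(data)  # sample variance: divides by N - 1
sample_sd = statistics.stdev(data)      # square root of the sample variance

print(pop_var, pop_sd, sample_var, sample_sd)
```

Because N-1 is smaller than N, the sample versions always come out slightly larger than the population versions for the same data.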

Normal Distribution

“Regression to the mean” or central tendency
Standard Deviation is the variation on either side of the mean
Bell is symmetrical; mean, median, mode are all equal
The Empirical Rule states that:

68% of the data falls within 1σ from the mean
95% of the data falls within 2σ from the mean
99.7% of the data falls within 3σ from the mean

Total area under the curve equals 100% or 1
Use the z-table to find the area under the curve to the left of a data point's z-score
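The Empirical Rule percentages can be checked numerically. This sketch uses `statistics.NormalDist` from the standard library in place of a printed z-table; `share_within` is a helper name chosen here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # cdf(k) is the area to the left of z = k, which is what a z-table lists.
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")
```

The three results round to roughly 68%, 95%, and 99.7%.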

Uniform Distribution - all outcomes are equally probable; e.g., the probability of rolling a 1, 2, 3, 4, 5, or 6 on a die

Z-score - (or standard score) tells you how far the data point is from the mean; formula is z = (x – μ) / σ

Standard Error (SE) - the standard deviation of the sampling distribution of the sample mean, i.e. how far sample means typically fall from the population mean; formula is SE = σ / √n; the z-score of a sample mean is (sample mean - population mean) / (standard deviation / square root of the sample size); z = (x̄ – μ) / (σ / √n)
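A short sketch of the two z-score formulas: one for a single data point and one for a sample mean, where σ is replaced by the standard error σ / √n. The function names and the population values (mean 100, standard deviation 15) are assumptions made for the example.

```python
import math

def z_score(x, mu, sigma):
    """z = (x - mu) / sigma for a single data point."""
    return (x - mu) / sigma

def standard_error(sigma, n):
    """SE = sigma / sqrt(n) for the mean of n observations."""
    return sigma / math.sqrt(n)

def z_score_of_sample_mean(sample_mean, mu, sigma, n):
    """z = (sample_mean - mu) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / standard_error(sigma, n)

# Hypothetical population: mean 100, standard deviation 15.
z_single = z_score(130, mu=100, sigma=15)
z_mean = z_score_of_sample_mean(105, mu=100, sigma=15, n=36)
print(z_single, z_mean)
```

A sample mean of 105 from 36 items is as unusual as a single value of 130, because averaging shrinks the spread by a factor of √n.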

Student’s T-test - a method for testing the significance of the differences between sample groups when the population standard deviation is unknown; i.e. “did this happen by chance?”

‣ Independent Samples t-test - compares the means for two groups
‣ Paired sample t-test - compares means from the same group at different times
‣ One sample t-test - tests the mean of a single group against a known mean

T-score - (or T-value) the ratio between the difference between two groups and the difference within the groups

‣ The larger the t score, the more difference there is between groups
‣ The smaller the t score, the more similarity there is between groups
‣ T score of “3” means that the groups are 3x as different from each other as they are within each other
‣ When running a t-test, the bigger the absolute t-value, the less likely it is that the difference arose by chance alone
‣ Every t-value has a p-value to go with it
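A rough sketch of the t-score as a ratio of the difference between groups to the variation within them. It uses Welch's form of the independent-samples statistic, which is one common choice and an assumption here, not something the notes specify. The two groups are made-up data.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """t = difference between group means / combined standard error of the groups."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Sample variances (N - 1 in the denominator) measure spread within each group.
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    noise = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / noise

group_a = [2, 3, 4, 5, 6]  # hypothetical scores (made-up values)
group_b = [1, 2, 3, 4, 5]
t_value = welch_t(group_a, group_b)
print(t_value)
```

Turning the t-value into a p-value needs the t-distribution, which the standard library does not provide; a statistics package would normally be used for that step.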

T-Distribution; smaller the df, heavier the tails

Student’s T Distribution (T-distribution)

a type of probability distribution similar to the normal distribution but has heavier tails
there is a greater chance for extreme values vs. normal distributions
used when standard deviation for the population is unknown

Degrees of Freedom - (or “df”) the maximum number of logically independent values, i.e. values that have the freedom to vary, in the data sample; for a single sample the formula is sample size minus one, or df = N−1

Ex. If you have sets [3, 5, 10], [2, 7, 9], [11, 1, 6] the mean is 6. You have the freedom to choose any first two numbers, but the third number is a fixed number in order to achieve the average of 6. So the df = 2
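The example can be replayed in code: once the mean and the first two values are chosen, the third value is forced. `last_value_for_mean` is a hypothetical helper written for this illustration.

```python
def last_value_for_mean(free_values, target_mean, n):
    """Return the one remaining value that makes n values average target_mean."""
    # The n values must sum to target_mean * n, so the last one is whatever is left.
    return target_mean * n - sum(free_values)

for first_two in ([3, 5], [2, 7], [11, 1]):
    third = last_value_for_mean(first_two, target_mean=6, n=3)
    print(first_two, "->", third)
```

Two values were free to vary and one was not, which is why df = 3 - 1 = 2.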

z-score vs. t-score

Use t-score when sample size < 30 and standard deviation is unknown
Use z-score when sample size > 30 and standard deviation is known

Confidence Interval

a range of values, above and below the sample statistic (e.g., the sample mean), that is likely to contain an unknown population parameter; the confidence level is the proportion of such intervals that would contain the true population parameter if you drew random samples many times
degree of uncertainty associated with a sample estimate of a population; most often, researchers choose 90%, 95%, or 99%
Not to be confused: 95% confidence interval does not mean 95% chance that the population parameter falls between a and b. It means that 95% of the interval estimates would include the parameter.
conducted during statistical methods like the t-test
formula is sample statistic ± margin of error

Margin of error - the range of values above and below the sample statistic in a confidence interval; formula is critical value * standard deviation of statistic or critical value * standard error of statistic

Critical value - a cut-off value that tells us how many standard errors from the sample statistic we can go and remain within the chosen confidence level, e.g., about 1.96 for a 95% interval based on the normal distribution; this is usually a t-score or z-score
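A minimal sketch of how the pieces fit together: confidence interval = sample statistic ± critical value × standard error. It assumes a z critical value with the sample standard deviation standing in for σ, which is only a reasonable approximation for larger samples; a t critical value would be the usual choice for small ones. The sample is made up, and `critical_z` and `confidence_interval` are names chosen for this example.

```python
import math
import statistics
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical value from the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) as sample mean +/- critical value * standard error."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin_of_error = critical_z(confidence) * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical sample of 40 measurements (made-up values).
sample = [50 + (i % 7) for i in range(40)]
lower, upper = confidence_interval(sample, confidence=0.95)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Raising the confidence level to 99% increases the critical value and therefore widens the interval.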

Hypothesis & Testing

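Significance (α) - the threshold for deciding whether results are unlikely to be due to chance alone; it is the probability of rejecting the null hypothesis when it is actually true; the significance level is generally fixed at 0.05 (this is just a conventional benchmark)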

P-value - “probability value”; the way in which significance is quantified and reported statistically

‣ the probability of getting a result that is the same as or more extreme (rare/special) than the one actually observed, assuming the null hypothesis is true; “could the result from the sample data have occurred by chance?”
‣ used as an alternative to the rejection point; it is the smallest significance level at which the null hypothesis would be rejected (e.g., reject when p < α = 0.05)
‣ the smaller the p-value, the stronger the evidence in favor of the alternative hypothesis, given the observed and expected frequencies

P > 0.10 (>10%) - No evidence against Null Hypothesis; Nothing special
0.05 < P ≤ 0.10 (5-10%) - Weak evidence against H0
P ≈ 0.05 (5%) - Moderate evidence against H0; the conventional threshold for statistical significance
0.01 ≤ P < 0.05 (1-5%) - Good evidence against H0
0.001 ≤ P < 0.01 (0.1-1%) - Strong evidence against H0
P < 0.001 (<0.1%) - Very strong evidence against H0; Highly significant

P-value approach to hypothesis testing - P-value of the hypothesis test is reported and readers can interpret the statistical significance themselves

Null Hypothesis (H0) - also called H-null, H-zero, H-nought, or “the conjecture”; the commonly accepted fact. This is the default hypothesis that a quantity to be measured is zero (or null). It proposes that no real effect or difference exists in a set of given observations.

Alternative Hypothesis (H1 or Ha) - a “hunch”; this is the driver of our data collection and experimental design

Upper-tailed test - μ1 > μ0, and an increase is hypothesized

Lower-tailed test - μ1 < μ0, where a decrease is hypothesized

Two-tailed test - μ1 ≠ μ0, where a difference is hypothesized
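The three kinds of test differ only in which tail area counts as "more extreme". The sketch below assumes the test statistic is a z-score, so the standard normal distribution applies; a t-test would need the t-distribution instead. `p_value_from_z` is a name made up for this example.

```python
from statistics import NormalDist

def p_value_from_z(z, tail="two"):
    """p-value for a z statistic under an upper-, lower-, or two-tailed test."""
    cdf = NormalDist().cdf  # area under the standard normal curve to the left of z
    if tail == "upper":     # H1: mu1 > mu0
        return 1 - cdf(z)
    if tail == "lower":     # H1: mu1 < mu0
        return cdf(z)
    if tail == "two":       # H1: mu1 != mu0
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

alpha = 0.05
p = p_value_from_z(2.5, tail="two")
print(p, "reject H0" if p < alpha else "fail to reject H0")
```

For the same z, a two-tailed p-value is twice the one-tailed value in the matching direction, so a two-tailed test needs stronger evidence to reject H0.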

Statistical Relationship

Correlation - the degree to which two factors appear to be related (not the same as causation)

Causation - a relationship where one factor causes the other

Regression - a statistical method that estimates the relationship between a dependent variable (Y-axis) and one or more independent variables (X-axis)

Coefficient of Determination - or R-squared (R²); the proportion of the variation in y explained by the x-variables; i.e. how much of the differences in one variable can be explained by the differences in another variable; expressed as a percentage or as a value between 0 and 1
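A small sketch of simple (one x-variable) regression and R², written out by hand so each step is visible. It assumes ordinary least squares, the most common fitting method; the data points and the function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y explained by the fitted line (0 to 1)."""
    mean_y = sum(ys) / len(ys)
    unexplained = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total

xs = [1, 2, 3, 4, 5]              # hypothetical independent variable
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical dependent variable
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(slope, intercept, r2)
```

An R² close to 1 here means the straight line accounts for nearly all of the variation in `ys`; it says nothing about whether x causes y.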

Anscombe’s Quartet - an example where 4 datasets have similar statistical metrics (e.g. mean, variance, correlation) but significant differences when data is plotted on a graph. Don’t rely on statistical metrics alone; use data visualization to get the full picture.

Probability

Experiment - (in probability theory) a procedure that involves chance and can be infinitely repeated; an experiment is random if it has more than one possible outcome, and deterministic if it has only one

Outcome - the result of the experiment

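Conditional Probability - the likelihood of an outcome given that another outcome has already occurred; calculated by dividing the probability of both events occurring by the probability of the preceding event, P(A∣B) = P(A∩B) / P(B); equivalently, the probability of both events is the probability of the preceding event multiplied by the updated probability of the succeeding event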

Bayes’ Theorem - mathematical formula for determining conditional probability; formula is P(A∣B) = P(A∩B) / P(B) = P(A) × P(B∣A) / P(B)

named after 18th-century mathematician Thomas Bayes
allows one to update the predicted probabilities of an event by incorporating new information
often used in finance for calculating or updating risk evaluation
useful when implementing machine learning
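A brief sketch of updating a probability with new information. P(B) is expanded with the law of total probability, P(B) = P(B∣A) × P(A) + P(B∣not A) × P(not A), a standard identity that the notes do not spell out. The screening-test numbers are invented for illustration.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(A) * P(B|A) / P(B), with P(B) from the law of total probability."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return prior_a * p_b_given_a / p_b

# Hypothetical screening test: A = "has the condition", B = "tests positive".
posterior = bayes_posterior(
    prior_a=0.01,          # assumed: 1% of people have the condition
    p_b_given_a=0.95,      # assumed: 95% of those with it test positive
    p_b_given_not_a=0.05,  # assumed: 5% of those without it also test positive
)
print(round(posterior, 3))
```

With these made-up inputs a positive result raises the probability from 1% to roughly 16%, far lower than the 95% test accuracy might suggest, because the condition is rare to begin with.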

Formulas and Symbols

Probability of ‘A’ and ‘B’ = P(A ∩ B)
Probability of ‘A’ or ‘B’ = P(A ∪ B)
Probability of ‘A’ given event ‘B’ occurred = P(A | B)
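The three symbols can be tried out on a fair six-sided die by treating events as sets of outcomes. The choice of events (A = "greater than 3", B = "even") is an arbitrary example.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                # fair six-sided die
a = {o for o in outcomes if o > 3}         # event A: roll is 4, 5, or 6
b = {o for o in outcomes if o % 2 == 0}    # event B: roll is 2, 4, or 6

def probability(event):
    """Equally likely outcomes: favourable outcomes / all outcomes."""
    return Fraction(len(event), len(outcomes))

p_a_and_b = probability(a & b)             # P(A ∩ B): both happen
p_a_or_b = probability(a | b)              # P(A ∪ B): at least one happens
p_a_given_b = p_a_and_b / probability(b)   # P(A | B) = P(A ∩ B) / P(B)

print(p_a_and_b, p_a_or_b, p_a_given_b)
```

`Fraction` keeps the answers exact, so they can be compared directly with hand calculations.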

Combination - order does not matter; e.g. ingredients for a salad; formula is C(n, r) = n! / (r! (n - r)!)

Permutation - order matters; there are more possible arrangements when order matters; e.g. phone number or lock; formula is P(n, r) = n! / (n - r)!
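Both counts are available in Python's standard `math` module (`math.comb` and `math.perm`, Python 3.8 or later). The sketch also recomputes them from the factorial formulas as a cross-check; choosing 3 items out of 5 is an arbitrary example.

```python
import math

n, r = 5, 3  # hypothetical: choose 3 items out of 5

combinations = math.comb(n, r)  # order does not matter
permutations = math.perm(n, r)  # order matters

# The same counts from the factorial formulas.
comb_by_formula = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
perm_by_formula = math.factorial(n) // math.factorial(n - r)

print(combinations, permutations)
```

Each combination of r items can be ordered in r! ways, which is why the permutation count is r! times the combination count.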
