Why is this problem different from other problems?
Data Description. You are given data to plot or given a plot of data to interpret.
Experimental Design. You are asked to comment on a procedure for a study or asked to design a procedure for a study.
Probability. You are asked to find the probability of an event occurring or the expected value of a random variable.
Statistical Inference. You are asked to compute a confidence interval or to carry out a hypothesis test.
Keywords: mean, median, mode, skewed, boxplot, modified boxplot, stemplot, dotplot, scatterplot, range, outlier, interquartile range (IQR), slope, intercept, predicted value, correlation, regression outlier, influential point, exponential regression, power regression.
Skewed to the right means that the distribution has a long right tail, which typically pulls the mean above the median. Think of the distribution of wealth. Bill Gates is a lot farther above the median than the poorest person is below the median. His wealth raises the mean wealth in the population well above the median wealth, so the distribution is skewed to the right.
The rule of thumb for an outlier uses the magic number of 1.5 times the IQR. Add that amount to the third quartile (Q3) and subtract it from the first quartile (Q1) to get the boundaries; any point outside those boundaries is flagged as an outlier.
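The 1.5 x IQR rule can be sketched in a few lines of Python. The data set is made up for illustration, and the quartiles come from Python's statistics module, whose convention may differ slightly from a calculator's.

```python
from statistics import quantiles

# Hypothetical data set, chosen only for illustration.
data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 40]

# quantiles(..., n=4) returns the three quartile cut points (Q1, median, Q3).
# Quartile conventions vary, so Q1 and Q3 may differ a little from a
# calculator's values.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Boundaries (fences) for outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```

In this example only the value 40 falls outside the fences.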
A scatterplot shows the relationship between two quantitative variables. A side-by-side boxplot compares the distribution of one variable across two or more groups. Boxplots, stemplots, and dotplots can indicate whether a distribution is roughly symmetric or skewed.
A power regression is y = aX^b. For b between zero and one (with a positive), it is concave down. An exponential regression is y = ab^X.
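There are two ways to calculate the slope of a regression: b = r(s_y/s_x) or b = s_xy/s_x^2, where r is the correlation, s_x and s_y are the sample standard deviations, and s_xy is the sample covariance.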
The intercept is a = ȳ - b·x̄, where x̄ and ȳ are the sample means of X and Y.
For a given X_i, the predicted value of Y, written Ŷ_i, is equal to a + bX_i. The residual is equal to Y_i minus the predicted value.
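As a rough check on the slope, intercept, and residual formulas, here is a minimal Python sketch. The five data points are invented for illustration, and the variable names are assumptions, not part of the course materials.

```python
from statistics import mean, stdev

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations

# Sample covariance s_xy and correlation r.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = s_xy / (s_x * s_y)

# The two slope formulas give the same answer.
b_from_r = r * s_y / s_x
b_from_cov = s_xy / s_x ** 2

# Intercept: a = y_bar - b * x_bar.
b = b_from_cov
a = y_bar - b * x_bar

# Predicted values and residuals (observed minus predicted).
predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

print(b, a, residuals)
```

For these points the slope works out to 0.6 and the intercept to 2.2, and the residuals sum to zero (up to rounding), as they should for a least-squares line.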
Keywords: observational study, experiment, random sample, stratified random sample, voluntary response sample, convenience sample, block design, blind, double-blind, placebo effect, lurking variable, confounding effect, causality, simulation.
It is easier to infer causality from an experiment than from an observational study.
Block design is a way to reduce the amount of random variation in the results. In that sense, it is like taking a larger sample size.
Keywords: conditional probability, Venn diagram, tree diagram, independent, probability of success (failure), sample space, expected value, random variable, normal random variable, binomial random variable, geometric random variable.
Words that indicate conditional probability include "given," "if," and "of." Words that indicate joint probability include "both" and "and."
Two events A and B are independent if and only if p(A and B) = p(A)p(B).
If all of the engines have to fail in order for the boat to be stranded, then the probability of getting home is one minus the probability that all of the engines fail. If the engines fail independently and each has the same probability of failing, the probability of all engines failing is the probability of one engine failing taken to the nth power, where n is the number of engines.
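The engine calculation can be sketched as a small Python function. The function name and the failure probability of 0.1 are hypothetical, and the sketch assumes the engines fail independently with the same probability.

```python
def prob_getting_home(p_fail: float, n_engines: int) -> float:
    """Probability that at least one engine keeps working.

    Assumes each engine fails independently with probability p_fail.
    """
    p_all_fail = p_fail ** n_engines
    return 1 - p_all_fail


# Hypothetical example: two engines, each failing with probability 0.1.
print(prob_getting_home(0.1, 2))
```

With two engines that each fail 10% of the time, the chance that both fail is 0.01, so the chance of getting home is 0.99.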
If Y = a + bX, then E(Y) = a + bE(X) and variance(Y) = b^2 variance(X).
If W = X + Y, then E(W) = E(X) + E(Y). If X and Y are independent, then the variance of W = variance of X + variance of Y, and the standard deviation of W = the square root of (variance of X + variance of Y). Note that variances add, but standard deviations do not.
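These rules can be checked exactly on a small example. The sketch below uses a fair six-sided die for X, with a = 3 and b = 2 picked arbitrarily, and two independent dice for the sum; the helper functions are illustrative, not from the course.

```python
from itertools import product

faces = [1, 2, 3, 4, 5, 6]  # outcomes of a fair die, all equally likely


def expected_value(values):
    """Mean of a list of equally likely outcomes."""
    return sum(values) / len(values)


def variance(values):
    """Population variance of a list of equally likely outcomes."""
    mu = expected_value(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


# Linear transformation Y = a + bX (a and b are arbitrary choices).
a, b = 3, 2
y_values = [a + b * x for x in faces]

# Sum of two independent dice, W = X1 + X2: list all 36 equally likely pairs.
w_values = [x1 + x2 for x1, x2 in product(faces, repeat=2)]

print(expected_value(y_values), a + b * expected_value(faces))
print(variance(y_values), b ** 2 * variance(faces))
print(variance(w_values), 2 * variance(faces))
```

Each printed pair should match: the left value is computed directly from the outcomes, and the right value comes from the rule.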
If you have a scalar variable that is normally distributed, use the Z transformation and normcdf() to find the probability that the variable will fall in a particular range.
To go from a percentile to a z-score (the number of standard deviations from the mean), use invNorm(). To go from a z-score to natural units, reverse the Z transformation: X = mean + Z times the standard deviation.
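For readers working away from the calculator, the same two steps can be sketched with Python's statistics.NormalDist, where cdf() plays the role of normcdf() and inv_cdf() the role of invNorm(). The mean of 500 and standard deviation of 100 are made-up values.

```python
from statistics import NormalDist

# Hypothetical normal variable: mean 500, standard deviation 100.
mu, sigma = 500, 100
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Probability that X falls between 450 and 600:
# apply the Z transformation to each endpoint, then use the normal CDF.
z_low = (450 - mu) / sigma
z_high = (600 - mu) / sigma
prob_between = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

# 90th percentile: percentile -> z-score -> natural units.
z_90 = standard_normal.inv_cdf(0.90)
score_90 = mu + z_90 * sigma

print(round(prob_between, 4), round(score_90, 1))
```

Here the probability is about 0.53 and the 90th percentile is about 628.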
If you have a binary variable, and the question is about the number of successes in n trials, then use the binomial distribution. binompdf(n,p,k) is the probability of getting exactly k successes in n trials; binomcdf(n,p,k) is the probability of getting k successes or fewer in n trials.
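The two calculator functions can be mimicked in Python as a sanity check. The function names and the argument order (n, p, k) simply mirror the description above; these are illustrative helpers, not a library API.

```python
from math import comb


def binompdf(n, p, k):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomcdf(n, p, k):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binompdf(n, p, i) for i in range(k + 1))


# Hypothetical example: 10 tosses of a fair coin.
print(binompdf(10, 0.5, 3))      # exactly 3 heads
print(binomcdf(10, 0.5, 3))      # 3 heads or fewer
print(1 - binomcdf(10, 0.5, 3))  # more than 3 heads
```

The last line shows a common pattern: "more than k" is one minus "k or fewer."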
If you have a binary variable, and the question is about the number of trials until the first success, then use the geometric distribution.
The mean of a binomial distribution is np, and the standard deviation is the square root of np(1-p).
The mean of a geometric distribution is 1/p; its standard deviation is outside the scope of the course.
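A short sketch of the geometric and binomial summaries follows. It relies on the standard geometric formula P(first success on trial k) = (1-p)^(k-1) p, which is not stated in these notes, and the values p = 0.2 and n = 50 are hypothetical.

```python
from math import sqrt


def geometric_pmf(p, k):
    """Probability that the first success occurs on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p


p = 0.2  # hypothetical probability of success on each trial

# Two failures followed by a success.
prob_third_trial = geometric_pmf(p, 3)

# Approximate the mean by summing k * P(k); terms beyond k = 1000 are negligible.
approx_mean = sum(k * geometric_pmf(p, k) for k in range(1, 1001))

# Binomial mean and standard deviation for n trials.
n = 50
binomial_mean = n * p
binomial_sd = sqrt(n * p * (1 - p))

print(prob_third_trial, approx_mean, binomial_mean, binomial_sd)
```

The approximate geometric mean should agree with 1/p = 5.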
Keywords: parameter, statistic, central limit theorem, bias, sampling variability, confidence interval, null hypothesis, alternative hypothesis, two-sided alternative, type I error, type II error, significance level, p-value, critical value, power, t-test, matched-pair t-test, two-sample t-test, one-proportion z-test, two-proportion z-test, chi-square, goodness of fit, test for independence, regression equation, R-square, residual, standard error of slope, t-statistic for slope, p-value for slope, power regression, exponential regression
The central limit theorem states that as the sample size increases, the distribution of the sample mean approaches a normal distribution with a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of n. This is true even though the population may not be normally distributed.
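The mean and standard deviation of the sample mean can be verified exactly on a tiny example. The sketch below lists every possible sample of n = 3 rolls of a fair die (a decidedly non-normal population); it checks only the mean and the standard deviation, not the normal shape.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

faces = [1, 2, 3, 4, 5, 6]  # population: one roll of a fair die
n = 3                       # sample size (kept small so we can list every sample)

# Every possible ordered sample of n rolls is equally likely,
# so this list is the exact sampling distribution of the sample mean.
sample_means = [mean(sample) for sample in product(faces, repeat=n)]

pop_mean = mean(faces)
pop_sd = pstdev(faces)  # population standard deviation

print(mean(sample_means), pop_mean)               # should match
print(pstdev(sample_means), pop_sd / sqrt(n))     # should match
```

Raising n narrows the spread of the sample means by a factor of the square root of n.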
The null hypothesis is always an equality. The alternative hypothesis is always an inequality (except when calculating power).
When the p-value is lower than the significance level, you reject the null hypothesis.
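Type I error is rejecting the null hypothesis when it is true (sending an innocent person to jail). Type II error is failing to reject the null hypothesis when it is false (letting a guilty person go free).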
Power is one minus the probability of a type II error (the closer the power is to 1, the better the test). To calculate power, (a) pretend that the null hypothesis is true and find the critical value at which you would just reject the null hypothesis. Then, (b) forget the null hypothesis and instead pretend that a specific alternative hypothesis is true. Then figure out the probability of reaching or exceeding the critical value that you found in (a).
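Steps (a) and (b) can be sketched numerically. The example assumes a one-sided one-proportion z-test using the normal approximation, with made-up values: null p = 0.5, specific alternative p = 0.6, n = 100, and a 5% significance level.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

# Hypothetical setup: H0: p = 0.5 versus Ha: p > 0.5.
p_null, p_alt = 0.5, 0.6
n, alpha = 100, 0.05

# (a) Pretend the null is true: find the sample proportion at which
#     we would just reject the null hypothesis.
se_null = sqrt(p_null * (1 - p_null) / n)
critical_p_hat = p_null + z.inv_cdf(1 - alpha) * se_null

# (b) Pretend the specific alternative is true: find the probability
#     of reaching or exceeding that critical value.
se_alt = sqrt(p_alt * (1 - p_alt) / n)
power = 1 - z.cdf((critical_p_hat - p_alt) / se_alt)

type_ii_error = 1 - power

print(round(critical_p_hat, 4), round(power, 4), round(type_ii_error, 4))
```

In this setup the critical sample proportion is about 0.582 and the power is about 0.64, so a true proportion of 0.6 would go undetected roughly a third of the time.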
The one-prop z-test and two-prop z-test are used for sample proportions (binary variables). We did not use the regular z-test in this course.
A goodness of fit test compares the actual distribution of responses to a distribution that you expected to find based on some previous results or assumptions.
A test for independence tests for any association between two categorical variables. The null hypothesis is that the rows and columns are independent (e.g., that your letter grade in statistics is independent of your number score on the AP).
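The chi-square statistic for a test of independence can be sketched as follows. The 2-by-2 table of counts is invented, and the sketch stops at the statistic and degrees of freedom; the p-value would still come from the calculator or a chi-square table.

```python
# Hypothetical two-way table of observed counts (rows by columns).
observed = [
    [20, 30],
    [30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, chi_square, df)
```

Every expected count here is 25, giving a chi-square statistic of 4 with 1 degree of freedom.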
Always state the null hypothesis in words. Define the parameter. Write H0 and Ha. State the type of test (e.g., two-sample t-test) and list all of the inputs to the calculation.
Certain assumptions must hold for tests to be valid. State these assumptions, even if you are not asked to do so.
State your conclusion in words.
For a regression, the important inference concerns the slope coefficient of the regression. The ratio of the coefficient to its standard error is a t-ratio, with the number of degrees of freedom equal to n-2, where n is the number of observations in the sample (the number of points in the scatterplot). A low p-value means that you can reject the null hypothesis of no relationship between X and Y in favor of the alternative hypothesis, which is that there is a relationship.
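The t-ratio for the slope can be sketched on a small made-up data set. The sketch uses the standard formula for the standard error of the slope, s divided by the square root of the sum of squared deviations of X, where s is the residual standard deviation with n-2 degrees of freedom; the p-value itself is left to the calculator or a t table.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Least-squares slope and intercept.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual standard deviation, using n - 2 degrees of freedom.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
df = n - 2
s = sqrt(sum(e ** 2 for e in residuals) / df)

# Standard error of the slope and the t-ratio for H0: slope = 0.
se_b = s / sqrt(s_xx)
t_ratio = b / se_b

print(round(b, 3), round(se_b, 3), round(t_ratio, 3), df)
```

For these five points the t-ratio is about 2.12 with 3 degrees of freedom.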