Hypothesis Testing, SVMs
Hypothesis Testing and Python applications (access in incognito mode if you have trouble):
https://towardsdatascience.com/demystifying-hypothesis-testing-with-simple-python-examples-4997ad3c5294
1) Read and follow along with the tutorial listed above. It contains four Python examples: One Population Proportion, A Difference in Population Proportions, One Population Mean, and The Difference in Population Means.
a) One Population Proportion: let’s create some data to work with. You can create 2 distributions for a variable “ages” like so:
Let’s assume the following:
The population mean is ~43.0
The null hypothesis is that the mean remains less than or equal to ~43.0.
The alternative hypothesis is that the mean of the sample distribution (Minnesota ages) is meaningfully larger than the population mean.
Use a 95% confidence level (a 5% significance level) to draw your conclusions.
Run a Z-Test and explain your conclusion.
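As a rough sketch of how the Z-Test above could be set up, the snippet below runs a one-sided, one-sample z-test against the stated population mean of 43.0. The sample here is hypothetical (200 ages drawn from a normal distribution with a seeded generator), and names such as sample_ages and reject_null are placeholders; substitute the Minnesota ages data you actually created.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical sample of ages; NOT the assignment's data, replace with your own.
sample_ages = rng.normal(loc=45.0, scale=12.0, size=200)

pop_mean = 43.0  # mean stated under the null hypothesis
alpha = 0.05     # 5% significance level

# Z statistic: how many standard errors the sample mean lies from pop_mean.
std_err = sample_ages.std(ddof=1) / np.sqrt(sample_ages.size)
z_stat = (sample_ages.mean() - pop_mean) / std_err

# Upper-tail p-value, because the alternative says the mean is larger.
p_value = stats.norm.sf(z_stat)

reject_null = p_value < alpha
print(f"z = {z_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

If p_value falls below alpha, the null hypothesis is rejected in favor of the alternative; otherwise the data do not give enough evidence to reject it.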
b) A Difference in Population Proportions: create 2 populations like so:
import numpy as np
pct_1 = np.random.random()
pct_2 = np.random.random()
p1 = [pct_1, 1-pct_1]
p2 = [pct_2, 1-pct_2]
pop1 = np.random.choice([0,1], size=(1000,), p=p1)
pop2 = np.random.choice([1,0], size=(1000,), p=p2)
Interpret 1’s in the datasets as parents who say their child has had swimming lessons. 0’s mean that the child has not had swimming lessons.
For each population, print the percentage of children who have had some swimming lessons.
Let the null hypothesis be: There is no meaningful difference between the proportion of children in each population who have had swimming lessons.
What is the alternative hypothesis?
Perform a T-Test and explain the results.
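One possible way to carry out the steps above is sketched below. It rebuilds two populations in the same manner as the assignment's snippet, but with a seeded generator so the run is repeatable (the seed and the variable names are assumptions). It then prints the percentages and runs a two-sided Welch t-test with scipy.stats.ttest_ind; choosing Welch's variant (equal_var=False) is also an assumption, not something the assignment specifies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in populations, built like the assignment's but seeded for repeatability.
pct_1, pct_2 = rng.random(), rng.random()
pop1 = rng.choice([0, 1], size=1000, p=[pct_1, 1 - pct_1])
pop2 = rng.choice([1, 0], size=1000, p=[pct_2, 1 - pct_2])

# The mean of a 0/1 array is the proportion of 1's (children with lessons).
print(f"pop1: {pop1.mean():.1%} had lessons, pop2: {pop2.mean():.1%} had lessons")

# Two-sided Welch t-test: H0 says the two proportions do not differ.
t_stat, p_value = stats.ttest_ind(pop1, pop2, equal_var=False)

alpha = 0.05
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, reject H0: {reject_null}")
```

Because the two proportions are drawn at random, whether the null hypothesis is rejected depends on how far apart they happen to land.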
2) SVMs – You can follow along and get some useful ideas from here:
https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/
a) Using the winequality dataset from previous homeworks, use an SVM to try and predict the quality. For this assignment:
i) You will need to split your dataset into train and test sets
ii) Try at least 3 different kernels in your SVM. For each kernel, train and evaluate the model 5 times, recording the performance of each run so that you can report the average performance across all 5 runs. Which kernel performs best?
iii) Perform this experiment once using the raw data, and once using standardized data. Which performs better? Why do you think that is?
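The experiment in i) to iii) could be organized roughly as follows. This is only a sketch under several assumptions: the data is a synthetic stand-in generated with make_classification (not the winequality dataset), the three kernels and the 80/20 split are arbitrary choices, and accuracy is used as the performance measure because the assignment does not name one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the winequality data (hypothetical; use your own X, y).
X, y = make_classification(n_samples=300, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
# Give the features different scales, as raw measurements usually have.
X = X * np.linspace(1.0, 10.0, X.shape[1])

kernels = ["linear", "poly", "rbf"]
n_runs = 5
results = {}

for scaled in (False, True):
    for kernel in kernels:
        scores = []
        for run in range(n_runs):
            # A different split per run, so the average covers split-to-split variation.
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.2, random_state=run, stratify=y)
            svc = SVC(kernel=kernel)
            # The scaler is fit on the training split only, via the pipeline.
            model = make_pipeline(StandardScaler(), svc) if scaled else svc
            model.fit(X_tr, y_tr)
            scores.append(model.score(X_te, y_te))
        results[(kernel, scaled)] = float(np.mean(scores))

for (kernel, scaled), acc in results.items():
    label = "standardized" if scaled else "raw"
    print(f"{kernel:8s} {label:13s} mean accuracy = {acc:.3f}")
```

Putting StandardScaler inside the pipeline keeps test-set statistics out of the fitted scaler, which makes the raw versus standardized comparison fair.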
3) Try using PCA as part of your preprocessing pipeline. On this dataset, is there any meaningful improvement? How many principal components did you try, and why? Explain your results. PCA may or may not help; explain why this may be the case.
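A minimal sketch of adding PCA to the pipeline is shown below, again on a synthetic stand-in dataset rather than the winequality data. The component counts tried (2, 5, 8, 11), the RBF kernel, and the use of 5-fold cross-validation in place of repeated train/test splits are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the winequality data (hypothetical; use your own X, y).
X, y = make_classification(n_samples=300, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)

# Mean cross-validated accuracy for several numbers of principal components.
pca_scores = {}
for n_components in (2, 5, 8, 11):
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=n_components),
                          SVC(kernel="rbf"))
    pca_scores[n_components] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{n_components:2d} components: mean accuracy = {pca_scores[n_components]:.3f}")

# Cumulative explained variance is one common guide for choosing n_components.
pca = PCA().fit(StandardScaler().fit_transform(X))
cum_var = pca.explained_variance_ratio_.cumsum()
print("cumulative explained variance:", cum_var.round(3))
```

Comparing these scores with the same pipeline without the PCA step shows whether the reduction helps; with only 11 features to begin with, discarding components can just as easily remove useful signal as noise.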