STATS 3DA3
Homework Assignment 3
Instructions
• Due before 10:00 PM on Friday, February 28, 2025.
• Upload a PDF copy of your solutions to Avenue to Learn. You do not need to rewrite the questions in your submission.
• Late Submission Penalty: A 15% deduction per day will be applied to assignments submitted after the deadline.
• Late Submission Limit: Assignments submitted more than 72 hours late will receive a grade of zero.
• Grace Period for Accommodations: A 72-hour extension beyond the due date is granted for students with approved accommodations through SAS.
• Your submission must follow the Assignment Standards listed below.
Assignment Standards
• Include a title page with your name and student number. Assignments without a title page will not be graded.
• Use Quarto Jupyter Notebook for your work (strongly recommended).
• Format your document with an 11-point font (Times or similar), 1.5 line spacing, and 1-inch margins on all sides.
• Use a new page for the solution to each question (e.g., Question 1, Question 2, Question 3).
– Clearly number all solutions and sub-parts.
• Do not include screenshots in your submission; they will not be accepted.
• Ensure your writing and referencing are appropriate for the undergraduate level.
• You may discuss homework problems with other students, but you must prepare and submit your own written work.
• The originality of submitted work will be checked using various tools, including publicly available internet tools.
Assignment Policy on the Use of Generative AI
• The use of Generative AI is not permitted in assignments, except for using GitHub Copilot as a coding assistant.
– If GitHub Copilot is used, you must clearly indicate this in the code comments.
• In alignment with the McMaster academic integrity policy, it “shall be an offence knowingly to submit academic work for assessment that was purchased or acquired from another source”. This includes work created by generative AI tools. Also stated in the policy: “Contract Cheating is the act of ‘outsourcing of student work to third parties’ with or without payment.” Using Generative AI tools is a form of contract cheating. Charges of academic dishonesty will be brought forward to the Office of Academic Integrity.
Question:
In this assignment, you will explore K-Nearest Neighbors (KNN) and Decision Tree classification algorithms. You will apply both techniques to a dataset from the UCI Machine Learning Repository, gaining hands-on experience in data retrieval, preprocessing, model building, and evaluation. This exercise is designed to strengthen your understanding of classification methods and their applications in real-world scenarios.
Dataset:
The dataset for this assignment is the Wine Quality Database, which includes 12 input attributes to predict the wine quality. Your objective is to build classifiers that accurately predict the wine quality category based on these attributes.
• Dataset Link: https://archive.ics.uci.edu/dataset/186/wine%2Bquality.
1) How many observations (rows) and features (variables) are present in the dataset?
2) What types of attributes are included in the dataset? Identify which attributes are numerical, categorical, or of other types.
3) Which variable serves as the response (target) if our goal is to build a classifier to predict the wine quality?
4) Are there any missing values in the dataset? If so, describe how you would handle them.
5) Display five rows from the original dataset, which includes both predictors and the response variable.
Hint: You can access the predictors and response by using data.original in the fetched dataset.
6) Is any transformation necessary for the response variable? Apply the transformation if needed. Additionally, how balanced is the dataset in terms of the response variable?
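As an illustration of the balance check in question 6, the class frequencies of the response can be inspected with `value_counts` (shown here on a short hypothetical quality series, not the real data):

```python
import pandas as pd

# Hypothetical quality scores standing in for the real response column.
quality = pd.Series([5, 6, 6, 7, 5, 6, 3, 8, 6, 5])

# Relative class frequencies reveal how balanced the response is.
balance = quality.value_counts(normalize=True).sort_index()
```

A large spread between the largest and smallest frequencies indicates class imbalance, which matters for both model training and accuracy interpretation.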
7) Remove observations with quality scores of 3, 4, 8, and 9 from the original dataset. Use this filtered data to complete questions 8 through 19.
Hint: Use isin([3, 4, 8, 9]) to identify the observations to drop.
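The hint above can be sketched as follows, using a small made-up DataFrame in place of the actual wine data:

```python
import pandas as pd

# Toy stand-in for the wine data; the real frame comes from the UCI fetch.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.0],
    "quality": [3, 5, 6, 7, 8, 4],
})

# Keep only rows whose quality is NOT in the dropped set.
filtered = df[~df["quality"].isin([3, 4, 8, 9])]
```

Note the `~` negation: `isin` flags the rows to drop, so it must be inverted to keep the remaining observations.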
8) After filtering, how many unique quality scores remain in the dataset?
9) Are there any potential outliers in the filtered dataset? Describe the method(s) you would use to identify them.
Note: You do not need to handle the outliers, only describe how to detect them.
10) Separate the predictors and the response variable from the filtered dataset.
11) Are any data transformations necessary for the features before training a classification tree model? If so, explain the rationale and apply the transformation.
12) Split the dataset (filtered in Part (10) and transformed in Part (11)) into training (80%) and testing (20%) subsets.
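A minimal sketch of the 80/20 split, assuming scikit-learn and using synthetic arrays in place of the wine predictors and response (stratifying by the response keeps class proportions similar across the two subsets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))       # stand-in predictors
y = rng.integers(5, 8, size=100)    # stand-in quality labels (5, 6, 7)

# 80% training, 20% testing, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1
)
```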
13) Train a classification tree model using the training data and perform model selection through cross-validation (e.g., tuning tree depth). After identifying the best model based on validation performance, evaluate its final performance on the test data.
Hint: Use the Gini index to grow the tree and classification accuracy for model selection.
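One possible shape for this workflow (a sketch only, assuming scikit-learn; synthetic data stands in for the wine features, and the depth grid is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the filtered wine data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated tuning of tree depth; Gini index grows the tree,
# classification accuracy selects the model.
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=0),
    param_grid={"max_depth": range(1, 11)},
    scoring="accuracy", cv=5,
)
grid.fit(X_train, y_train)

# Final evaluation of the selected model on held-out test data.
test_acc = grid.best_estimator_.score(X_test, y_test)
```

The key discipline is that the test set is touched only once, after cross-validation has already chosen the depth.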
14) Using the best classification tree model, identify the two most important features for predicting wine quality.
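One common way to read off the top features is the tree's `feature_importances_` attribute (sketched here on synthetic data; the placeholder names `x0`–`x5` stand in for the real wine attributes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

feature_names = [f"x{i}" for i in range(6)]   # hypothetical feature names
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              random_state=0).fit(X, y)

# Indices of the two largest Gini-based importances, descending.
top2 = np.argsort(tree.feature_importances_)[::-1][:2]
top_features = [feature_names[i] for i in top2]
```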
15) Write at least one statement summarizing the classification tree model’s performance and its implications in the context of the dataset and the problem.
16) Create copies of X_train and X_test from Part (12) and save them as X_train2 and X_test2.
17) Is any additional data transformation necessary for features before training a KNN classifier model? If so, write the rationale for the transformation and then apply the transformation to the features in X_train2 and X_test2.
Hint: Explain why feature scaling may or may not be necessary for KNN and how it could affect model performance.
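If you conclude that scaling is warranted, one standard approach is to standardize, fitting the scaler on the training data only and reusing those statistics on the test data (sketched on synthetic arrays standing in for `X_train2` and `X_test2`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train2 = rng.normal(loc=10, scale=3, size=(80, 4))   # stand-in copies
X_test2 = rng.normal(loc=10, scale=3, size=(20, 4))

# Fit on training data only, then apply the same transform to both,
# so no test-set information leaks into the scaling.
scaler = StandardScaler().fit(X_train2)
X_train2 = scaler.transform(X_train2)
X_test2 = scaler.transform(X_test2)
```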
18) Using the training data (X_train2, y_train), train a K-Nearest Neighbors (KNN) classifier and perform model selection through cross-validation (e.g., tuning the neighborhood size). After selecting the best model based on validation performance, evaluate its final performance on the test data (X_test2, y_test).
Note:
1) If any transformations were applied to X_train2 and X_test2 in Part 17, ensure those transformed datasets are used here.
2) Begin tuning the neighborhood size for cross-validation starting from 2.
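The notes above can be sketched as follows (assuming scikit-learn, with synthetic standardized data in place of `X_train2`/`X_test2`; the upper end of the k grid is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the filtered wine data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standardize using training statistics only (per the Note above).
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Cross-validated tuning of the neighborhood size, starting at k = 2.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(2, 21)},
    scoring="accuracy", cv=5,
)
grid.fit(X_tr, y_tr)

# Final evaluation of the selected model on held-out test data.
knn_test_acc = grid.best_estimator_.score(X_te, y_te)
```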
19) Write at least one statement summarizing the KNN classifier model’s performance and its implications in the context of the dataset and the problem.
20) Write at least two statements that compare and contrast the classification tree and KNN classifiers’ performance and interpretation on the test set.
Grading scheme
1. Answer [1]
2. Answer [1]
3. Answer [1]
4. Codes and answer [2]
5. Codes [1]
6. Codes and answer [2]
7. Codes [1]
8. Codes and answer [1]
9. Codes and answer to detect outliers [2]
10. Codes [1]
11. Rationale and codes [2]
12. Codes [1]
13. Codes for cross-validation [1], rationale for best model selection [1], codes for test performance [1]
14. Codes and answer [1]
15. 1 statement [1]
16. Codes [1]
17. Rationale and codes [2]
18. Codes for cross-validation [1], rationale for best model selection [1], codes for test performance [1]
19. 1 statement [1]
20. 2 statements to compare and contrast [2]
The maximum score for this assignment is 30 points, which will be converted to a percentage out of 100%.