BUS 235C Data Mining Discussion Assignment
1. Applying classification trees to the sonar data.
a) Read in the files sonar_train.csv and sonar_test.csv. Note that these csv files don’t have a header row with column names, so use the commands:
train <- read.csv("sonar_train.csv",header=FALSE)
test <- read.csv("sonar_test.csv",header=FALSE)
Note that since the columns aren’t named, R will name the columns “V1”, “V2” etc.
b) Using the training data, fit a classification tree using the first 60 columns (the sonar readings) to predict the 61st column (the type of object). Use the rpart argument control=rpart.control(minsplit=0,cp=0) to override R’s default stopping rules.
c) Print out your tree.
d) How large is the tree with the lowest cross validation error (i.e. how many splits does it have)?
e) How large is the tree that you would choose using the 1-SE rule?
f) Use the prune function to create the tree from part (d). Print out the tree.
g) Using this tree, make predictions for the training and testing sets and report your accuracy.
h) Look back at the cross validation error rates (xerror) and choose the next larger tree than the one you chose in (d). What is its accuracy when applied to the training and testing set?
i) Which of the two trees do you think would make more accurate predictions on new data?
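The cp-table workflow in parts (d) through (g) can be sketched as follows. This is a minimal illustration, not a solution: it uses the kyphosis data set that ships with rpart as a stand-in for the sonar files, and the names train.idx, cp.table, best, one.se, and pruned are hypothetical. The cross-validation folds are random, so the rows selected can change with the seed.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart (this is NOT the sonar data).
set.seed(1)  # cross-validation folds and the split below are random
train.idx <- sample(nrow(kyphosis), 60)
train <- kyphosis[train.idx, ]
test  <- kyphosis[-train.idx, ]

# Grow a large tree by overriding the default stopping rules.
fit <- rpart(Kyphosis ~ ., data = train, method = "class",
             control = rpart.control(minsplit = 0, cp = 0))

# One row per candidate subtree: CP, nsplit, rel error, xerror, xstd.
cp.table <- fit$cptable
printcp(fit)

# Row with the lowest cross-validation error, and its number of splits.
best <- which.min(cp.table[, "xerror"])
best.splits <- cp.table[best, "nsplit"]

# 1-SE rule: smallest tree whose xerror is within one standard error
# of the minimum xerror.
threshold <- cp.table[best, "xerror"] + cp.table[best, "xstd"]
one.se <- min(which(cp.table[, "xerror"] <= threshold))

# Prune back to the cp value of the lowest-xerror row.
pruned <- prune(fit, cp = cp.table[best, "CP"])
print(pruned)

# Accuracy = share of rows whose predicted class matches the actual class.
train.acc <- mean(predict(pruned, train, type = "class") == train$Kyphosis)
test.acc  <- mean(predict(pruned, test,  type = "class") == test$Kyphosis)
```

With the sonar files, the formula would presumably be V61 ~ . with data = train, since R names the unnamed columns V1 through V61.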
2. In this exercise, you will try to distinguish between two different wine varietals based on their hue and color intensity. Read in the data set “wine-train-hw.csv” and use it as your training set. Read in “wine-test-hw.csv” as your testing set.
a) Fit a classification tree to the training set using the default values of rpart. Print out your tree.
b) Print the confusion matrix for your tree.
c) Report the accuracy for your tree.
d) Using your tree, make predictions for the testing set. Compute the accuracy. How does it compare to the accuracy in (c)?
e) Report the sensitivity and specificity using your testing set.
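The quantities in parts (b), (c), and (e) all come from the same two-way table of actual against predicted classes. The sketch below uses made-up label vectors rather than the wine data, and it assumes class "A" is the positive class; the assignment does not say which varietal to treat as positive, so that choice needs to be stated in your answer.

```r
# Hypothetical labels for illustration only; "A" is treated as the positive class.
actual    <- factor(c("A", "A", "A", "B", "B", "B", "B", "B"), levels = c("A", "B"))
predicted <- factor(c("A", "B", "A", "B", "B", "A", "B", "B"), levels = c("A", "B"))

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(actual = actual, predicted = predicted)
print(cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: true positives over all actual positives.
sensitivity <- cm["A", "A"] / sum(cm["A", ])

# Specificity: true negatives over all actual negatives.
specificity <- cm["B", "B"] / sum(cm["B", ])
```

For an rpart classification tree, the predicted vector would come from predict(fit, newdata, type = "class").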
3. Applying regression trees to the airfare data.
a) Using R, read in the files airfare-train.csv, airfare-validate.csv, and airfare-test.csv.
b) Fit a regression tree to predict the fare using all of the other variables. Print out your tree.
c) For which combination of variable values does your tree predict the lowest fare?
d) For which combination of variable values does your tree predict the highest fare?
e) Which (if any) variables does your tree not use?
f) What is the predicted fare of a route that has COUPON = 1, VACATION = No, SW = No, SLOT=Free, GATE = Constrained, DISTANCE = 1000, and PAX = 6000?
g) Compute your model’s MAPE on the training, validation, and testing data.
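A rough sketch of the regression-tree steps, using the built-in mtcars data set as a stand-in for the airfare files. The helper mape and the names used, unused, and train.mape are hypothetical, and the MAPE formula shown is the usual definition (mean of absolute errors divided by actual values, in percent); check it against the definition used in class.

```r
library(rpart)

# Mean absolute percentage error, in percent.
# Assumes no actual value is zero.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Stand-in data: mtcars is built into R (this is NOT the airfare data).
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
print(fit)  # each terminal node lists its split path and predicted value

# Variables that appear in at least one split, and those that never do.
used   <- unique(as.character(fit$frame$var[fit$frame$var != "<leaf>"]))
unused <- setdiff(setdiff(names(mtcars), "mpg"), used)

# Prediction for a single row supplied as a one-row data frame.
one.pred <- predict(fit, newdata = mtcars[1, ])

# MAPE on the data the tree was fit to; repeat with newdata set to the
# validation and testing sets to get the other two values.
train.mape <- mape(mtcars$mpg, predict(fit, mtcars))
```

For a route described by specific variable values, as in part (f), the one-row data frame would need the same column names and factor levels as the training set.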
4. In this problem we will apply k-nearest neighbors to some income data from the US Census. The data contain demographic and employment information. The last column (Income) indicates whether the person earned more or less than $50,000. We will try to predict this column based on the first 11 columns.
a) Read in the file income-small.csv. Use the “stringsAsFactors = TRUE” option as follows so that character columns are automatically converted to factors.
my.df <- read.csv("income-small.csv",
stringsAsFactors = TRUE)
b) Randomly split the data into a training set with 200 rows and a testing set with 100 rows. Use the command
before you do your sampling so that we all end up with the same training and testing sets.
c) After you split the data, use the following commands to separate the training set (which I’ve assumed you called train) into x (the first 11 columns) and y (the 12th column) and convert the x data to numeric:
x <- sapply(train[,1:11], as.numeric)
y <- train[,12]
d) Normalize the columns of x by subtracting their means and dividing by their standard deviations.
e) Write a for loop to try different values of k and use cross validation to choose the optimal value of k (i.e. the best number of neighbors to use).
f) Using the best value of k, make predictions for your testing set. Report the accuracy. (Note you will also have to separate, convert, and normalize your testing set as you did with your training set.)
g) How does the accuracy on the testing set compare to what was predicted by cross validation?
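Parts (d) through (g) can be sketched with the class package, which provides knn and knn.cv (leave-one-out cross validation). The example uses the built-in iris data set in place of the income data, and the names idx, x.new, y.new, ks, and cv.acc are hypothetical. It normalizes the testing columns with the training means and standard deviations, which is one common choice; the assignment's wording could also be read as normalizing the testing set with its own statistics.

```r
library(class)

# Stand-in data: iris is built into R (this is NOT the income data).
set.seed(1)  # the split is random, and knn breaks ties at random
idx   <- sample(nrow(iris), 100)
x     <- sapply(iris[idx, 1:4], as.numeric)
y     <- iris[idx, 5]
x.new <- sapply(iris[-idx, 1:4], as.numeric)
y.new <- iris[-idx, 5]

# Normalize the training columns, then apply the same training means and
# standard deviations to the testing columns (an assumption, see above).
mu    <- colMeans(x)
sigma <- apply(x, 2, sd)
x     <- scale(x, center = mu, scale = sigma)
x.new <- scale(x.new, center = mu, scale = sigma)

# Leave-one-out cross-validation accuracy for each odd k from 1 to 19.
ks     <- seq(1, 19, by = 2)
cv.acc <- numeric(length(ks))
for (i in seq_along(ks)) {
  cv.pred   <- knn.cv(x, y, k = ks[i])
  cv.acc[i] <- mean(cv.pred == y)
}
best.k <- ks[which.max(cv.acc)]

# Testing-set accuracy with the chosen k, for comparison with max(cv.acc).
test.pred <- knn(x, x.new, y, k = best.k)
test.acc  <- mean(test.pred == y.new)
```

Comparing max(cv.acc) with test.acc is the comparison part (g) asks about.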
Now, we will repeat this analysis using the much larger data sets income-train-hw.csv and income-test-hw.csv.
h) Load the datasets (again, using stringsAsFactors = TRUE), separate them into x and y, convert x to numeric, and normalize x as you did before. (Note you don’t have to take a random sample since they’ve already been split up into training and testing.)
i) How many rows does each data set have?
j) Repeat parts e-g using the larger data sets. Limit your search to values of k up to 20. The for loop will take a while, so be prepared to set your computer aside or work on something else in the meantime. (One way to speed things up is to only try odd values of k.)