Background
· For this assignment, you will use the Cleansing_Week4.R script and the data.csv file.
· The code is written in the R programming language. Open RStudio, open the file Cleansing_Week4.R, follow the steps in the code, and answer each of the questions below.
· Some manipulation and rework of the code is required. The steps are explained in detail in the code.
· Complete Step 0 through Step 5, inclusive.
Instructions
You should complete all the steps provided in the code and answer the following questions in a report.
After you complete your readings and watch the provided videos (required), proceed with this implementation and report.
1. Introduction
· Provide information about the language, GUI, and data file you are using in this assignment. Use references to support the importance of the language you are using, its advantages and disadvantages, and how it relates to other languages used in Data Science.
· Provide the value stored in the variable Randomizer in your code and your Student ID in this section. Take a screenshot of the output in your Console and paste it here.
2. Data Presentation before Cleansing
Run Step 0 and answer the following questions.
A. What is the data file format, and which command did you use to read the data? Does the file have headers?
B. How many observations are there?
C. How many variables are in the data?
D. What is the purpose of the command str(df)? Take a screenshot of the output in your Console and paste it here.
E. What does the command summary(df) report? Find out what its output means and explain it in your paper.
F. Answer the following questions:
a. What types of variables does your file include?
b. What are their specific data types?
c. Were they read properly?
d. Are there any issues?
e. Does your file include both NAs and blanks? How did you identify them?
f. How many NAs do you have?
g. How many blanks do you have?
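As an illustration of the inspection questions above, here is a minimal sketch in R. It does not reproduce Step 0 of Cleansing_Week4.R; the inline CSV text and its column names (id, age, city) are hypothetical stand-ins for data.csv, and na_counts and blank_counts are illustrative names.

```r
# Hypothetical stand-in for data.csv; the real file's columns are not known here.
csv_text <- "id,age,city
1,34,Boston
2,NA,
3,41,Denver
4,,Austin"

# read.csv() assumes a header row by default (header = TRUE).
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

nrow(df)     # number of observations
ncol(df)     # number of variables
str(df)      # compact view: type and first values of each column
summary(df)  # per-column summary statistics

# Count NAs per column.
na_counts <- colSums(is.na(df))

# Count blanks (empty strings) per column.
# Excluding NAs first keeps the comparison from returning NA.
blank_counts <- sapply(df, function(col) sum(!is.na(col) & col == ""))

print(na_counts)
print(blank_counts)
```

Note that read.csv() treats an empty field in a numeric column as NA but keeps it as an empty string in a character column, which is one reason a file can contain both NAs and blanks. If blanks should be read as missing from the start, read.csv() accepts na.strings = c("NA", "").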
3. Data Preprocessing
A. Before you run the remaining steps in your code, summarize the preprocessing steps you expect to complete. Recommend methods of imputing NAs in each of the variables and/or observations where needed. Review the literature and suggest methods of imputation for categorical and numeric variables.
B. Run Step 1 in your code. How did this step affect the NAs and the blanks in your variables? (You can run summary(df) to determine this.) Take a screenshot of the output in your Console and paste it here.
C. For each of the numeric variables, record the mean and the median; for the categorical variables, record the counts. Present them in a table in your paper.
D. Run Steps 2, 3, and 4. How many observations include NAs? How many variables include NAs? What percentage of rows and of columns have NAs? If we were to eliminate those, what would be the approximate size of the remaining dataset? Is this a proper method of handling the missing values?
E. Run Step 5 and answer the following questions:
a. What method of imputation is described? What does linear interpolation mean? Research and discuss whether it is an appropriate method here. This imputation has now changed some of the statistics of your variables.
· Run summary(df) and compare with the previous statistics. Take a printscreen of the output in your Console and paste it here.
· Do you observe any undesired changes? Explain in detail, how could you have avoided this?
· Are there any NAs remaining in your file?
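The quantities asked for in Steps 2 through 5 can be sketched as follows. This is an illustration under stated assumptions, not the code in Cleansing_Week4.R: the data frame and its columns (x, g) are hypothetical, and linear interpolation is done with base R's approx(), which may differ from the function the script uses. The helper name interpolate_linear is also an assumption.

```r
# Hypothetical data; column names and values are illustrative only.
df <- data.frame(
  x = c(10, NA, NA, 40, 100, 60),
  g = c("a", "b", "a", "a", NA, "a"),
  stringsAsFactors = FALSE
)

# Number and share of rows and columns containing at least one NA.
rows_with_na <- sum(!complete.cases(df))
cols_with_na <- sum(colSums(is.na(df)) > 0)
pct_rows <- 100 * rows_with_na / nrow(df)
pct_cols <- 100 * cols_with_na / ncol(df)

# Size of the dataset if every row containing an NA were dropped.
remaining_dim <- dim(na.omit(df))

# Linear interpolation of a numeric vector with base R's approx():
# each interior NA is replaced by the point on the straight line between
# its nearest non-missing neighbours, using row position as the x-axis.
# With rule = 1, leading or trailing NAs stay NA.
interpolate_linear <- function(v) {
  idx <- seq_along(v)
  known <- !is.na(v)
  approx(x = idx[known], y = v[known], xout = idx, rule = 1)$y
}

# Compare statistics before and after imputation.
before <- c(mean = mean(df$x, na.rm = TRUE), median = median(df$x, na.rm = TRUE))
df$x <- interpolate_linear(df$x)
after <- c(mean = mean(df$x), median = median(df$x))

print(rbind(before, after))
print(colSums(is.na(df)))  # NAs remaining per column
```

Interpolation of this kind applies only to numeric columns, so the categorical column g keeps its NA here. It can also produce fractional values in a column that originally held whole numbers, which is one kind of change worth checking for when comparing summary(df) before and after.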
Length: This assignment must be 4-5 pages (excluding the title and reference pages).