Preface

Motivation

Malignant mesotheliomas (MM) are very aggressive tumors. Although MM is associated with asbestos exposure, it may also be related to previous simian virus 40 (SV40) infection and quite possibly to genetic predisposition. Molecular mechanisms and rural living have also been associated with the development of mesothelioma.

Overall Objective

The goal of this assignment is to develop a model to predict the diagnosis of MM. Early detection of mesothelioma helps physicians design personalized treatments and save lives. You will explore the data and practice building classification models. You will also explore the limitations and uncertainties of various approaches. During your analysis there will be many options to explore and various approaches that you may take. Part of your job will be to choose among these options and focus on a specific approach that you find most interesting or that has the greatest potential for success.

Data

The data come from the UCI Machine Learning repository. As stated on the website:

The mesotheliomas disease data set were prepared at Dicle University Faculty of Medicine in Turkey. Three hundred and twenty-four Mesothelioma patient data. In the dataset, all samples have 34 features.

You can access the data here and read about the 34 features on the UCI Machine Learning page.

Problem 1: Exploring the Mesotheliomas data set

Problem 1.1

Write code to test whether the dataset (Mesothelioma data set.xlsx) exists locally on your computer. If it does not, download the file. Then read the Data sheet of Mesothelioma data set.xlsx using the readxl package.

Note: Check out the Attribute and Description sheets in the .xlsx file.
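One possible shape for this step is sketched below. It is only an illustration: the helper names `fetch_if_missing` and `read_meso_data` and the `data_url` argument are assumptions rather than part of the assignment, and the download link itself is not reproduced here, so you need to supply it.

```r
# Minimal sketch; helper names and arguments are illustrative assumptions.

# Download `data_url` to `data_file` only when the file is not already present.
fetch_if_missing <- function(data_file, data_url) {
  if (!file.exists(data_file)) {
    # mode = "wb" keeps the binary .xlsx file intact on Windows
    download.file(data_url, destfile = data_file, mode = "wb")
  }
  invisible(data_file)
}

# Read the "Data" sheet with readxl (the package must be installed).
read_meso_data <- function(data_file, data_url) {
  fetch_if_missing(data_file, data_url)
  readxl::read_excel(data_file, sheet = "Data")
}
```

A call might look like `meso <- read_meso_data("Mesothelioma data set.xlsx", data_url)`, where `data_url` holds the link from the UCI page.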

## add your code here 

Problem 1.2

Use your exploratory data analysis skills to explore the features in the dataset. For example, create summary statistics, tables and/or plots to better understand the features. Summarize your findings.

## add your code here 

Add a summary of your findings here

Problem 1.3

Use your exploratory data analysis skills to explore how the diagnosis of MM differs across age and gender. Are there any interesting features related to age or gender? Summarize your findings.

## add your code here 

Add a summary of your findings here

Problem 1.4

Use your exploratory data analysis skills to explore whether there are any other interesting features that contribute to the diagnosis. Summarize your findings.

## add your code here 

Add a summary of your findings here

Problem 1.5

Split the data into a train_set (70% of the data) and test_set (30% of the data).
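A 70/30 split can be sketched in base R as follows. The data frame `toy` is a made-up stand-in for the real data (its columns `id` and `y` are assumptions for illustration), and the seed is arbitrary; it is set only so that the split is reproducible.

```r
# Minimal sketch on simulated data; `toy` stands in for the real data set.
set.seed(1)
toy <- data.frame(id = 1:324, y = rbinom(324, 1, 0.3))

n <- nrow(toy)
# Sample 70% of the row indices without replacement for training.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]  # the remaining 30% of rows
```

Because the rows are sampled without regard to the outcome, the class balance can differ slightly between the two sets; a stratified split is an alternative worth considering.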

## add your code here 

Problem 2: Model building

In this problem, you will explore different approaches to predict the diagnosis of MM. In each case, use 10-fold cross-validation to build a model using the train_set.
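To make the idea of 10-fold cross-validation concrete, the sketch below does it by hand in base R on simulated data. Everything about the data (`toy`, `x`, `y`) is an assumption for illustration, and a logistic regression is used only as a placeholder model. In practice a package may handle the folds for you.

```r
# Minimal sketch of manual 10-fold cross-validation on simulated data.
set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x))

k <- 10
# Assign each row to one of k folds of (nearly) equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

acc <- numeric(k)
for (i in seq_len(k)) {
  train <- toy[fold_id != i, ]
  held_out <- toy[fold_id == i, ]
  fit <- glm(y ~ x, data = train, family = binomial)
  p <- predict(fit, newdata = held_out, type = "response")
  # Classify at a 0.5 threshold and record the held-out accuracy.
  acc[i] <- mean((p > 0.5) == (held_out$y == 1))
}

cv_accuracy <- mean(acc)  # average accuracy across the 10 folds
```

Each observation is held out exactly once, so `cv_accuracy` estimates out-of-sample performance using only the training data.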

Problem 2.1: GLM approach

Fit a logistic regression model to predict the diagnosis of MM. Which variables did you include in your model? Summarize your findings.

Things you may want to consider:

  • It looks like the variable diagnosis method perfectly matches the class of diagnosis, so you may want to exclude this variable from the following analysis.
  • Model selection. Consider different approaches for model selection. For example, check out the step() function in the stats R package. What approach did you choose and why?
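As an illustration of the `step()` function mentioned above, the sketch below runs backward selection by AIC on simulated data. The variables `x1`, `x2`, `x3` and `y` are invented for the example, and backward elimination is only one of several reasonable selection strategies.

```r
# Minimal sketch of AIC-based backward selection on simulated data.
set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# By construction, only x1 and x2 influence the outcome.
toy$y <- rbinom(n, 1, plogis(1.2 * toy$x1 - 0.8 * toy$x2))

# Start from the full logistic regression model.
full <- glm(y ~ ., data = toy, family = binomial)

# Drop terms one at a time while doing so lowers the AIC; trace = 0 silences the log.
sel <- step(full, direction = "backward", trace = 0)

formula(sel)  # the formula of the selected model
```

Stepwise selection reuses the same data for choosing and fitting the model, so the p-values of the selected model tend to be optimistic.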
## add your code here 

Add a summary of your findings here

Problem 2.2

Train and tune a support vector machine to predict the diagnosis of MM. Summarize your findings.

## add your code here 

Add a summary of your findings here

Problem 2.3

Tune and fit a random forest model to predict the diagnosis of MM. Summarize your findings.

## add your code here 

Add a summary of your findings here

Problem 2.4

Predict the diagnosis of MM using the test_set for each of the three approaches. Explore performance metrics and plots for each of the three approaches. Summarize your findings.
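A few common performance metrics can be computed from a confusion matrix, as in the sketch below. The `truth` and `pred` vectors are made-up labels (1 = MM, 0 = not MM, an assumed coding), not results from the real data.

```r
# Minimal sketch with made-up labels; 1 = MM, 0 = not MM (assumed coding).
truth <- factor(c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0), levels = c(0, 1))
pred  <- factor(c(1, 0, 1, 0, 0, 1, 0, 1, 0, 0), levels = c(0, 1))

# Rows are predicted classes, columns are actual classes.
cm <- table(predicted = pred, actual = truth)

accuracy    <- sum(diag(cm)) / sum(cm)           # share of correct predictions
sensitivity <- cm["1", "1"] / sum(cm[, "1"])     # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"])     # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Accuracy alone can mislead when the classes are imbalanced, which is why sensitivity and specificity are reported alongside it.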

## add your code here 

Add a summary of your findings here

Problem 3: Narrative

Given the results you produced in Problems 1 and 2, it’s time to summarize which method performed best and the choices you made in model fitting and tuning.

To get full credit for the problem, you will need the following:

  • Write a one-paragraph abstract summarizing your findings from Problems 1-2 above.

  • Produce one summary figure/table/graphic/etc. that provides evidence for which approach worked best.

  • Produce a figure/table/graphic/etc. that describes other possible features that would be interesting to explore and useful in predicting the diagnosis of MM.

  • Write one paragraph outlining the limitations of your analysis and what could be done in the future to address those limitations.