Welcome to Advanced Data Science! This course will focus on hands-on data analyses with a main objective of solving real-world problems. We will teach the necessary skills to gather, manage and analyze data using the R programming language. The course will cover an introduction to data wrangling, exploratory data analysis, statistical inference and modeling, machine learning, and high-dimensional data analysis. We will teach the necessary skills to develop data products including reproducible reports that can be used to effectively communicate results from data analyses. We will train students to become data scientists capable of both applied data analysis and critical evaluation of the next generation of statistical methods.

Course objectives

Upon successfully completing this course, students will be able to:

  1. Formulate quantitative models to address scientific questions
  2. Obtain, clean, transform, and process raw data into usable formats
  3. Organize and perform a complete data analysis, from exploration, to analysis, to synthesis, to communication
  4. Apply a range of statistical methods for inference and prediction
  5. Build data science products that can be used by a broad audience

Course logistics

Pre-requisites

This course is designed for PhD students in the Biostatistics department at Johns Hopkins Bloomberg School of Public Health. It assumes a fair amount of statistical knowledge and moves relatively quickly. I am open to anyone taking the class, but since it is a core requirement for our PhD program I will not be slowing down or allowing auditors for the class.

Required Textbook

None. Instead, we have a list of recommended readings on the web site available at Resources.

Course Communication

We will use Slack to organize course discussions. There are channels to ask questions and discuss the lectures, homework assignments, and final projects. The channels will be monitored by the TA during class. We will also use Slack for all announcements, so it is important that you are signed up. Feel free to ask questions during class, or anytime.

Office hours

Term 1

Staff       Day, Time             Location
Guoqing     Tues, 3-4pm           E3037
Stephanie   Wed, 2:20-3:00pm      E3545
Bingkai     Mon, 3-4pm            E3030
Roger       Fri, 10:30-11:30am    E3535

Course components

We break down the course components and grading into two terms:

140.711.01 Advanced Data Science I

We will learn these concepts through hands-on data analysis assignments. Specifically, grades will be based on:

  • 3 homeworks (33.3% each)

140.712.01 Advanced Data Science II

We will learn these concepts through hands-on data analysis assignments. Specifically, grades will be based on:

  • 2 homeworks (25% each)
  • 1 final project (50%)
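The weights listed for each term combine into a term grade as a simple weighted sum. The following is a minimal sketch of that arithmetic, assuming every component is scored on a 0-100 scale; the function name `weighted_grade` and the example scores are hypothetical and not part of the course materials.

```python
import math

# Weights as listed for each term (assumed to apply to 0-100 scores).
WEIGHTS_711 = [1 / 3, 1 / 3, 1 / 3]   # three homeworks, 33.3% each
WEIGHTS_712 = [0.25, 0.25, 0.50]      # two homeworks, then the final project


def weighted_grade(scores, weights):
    """Return the weighted sum of scores; weights must sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one score per graded component")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical scores: homeworks of 80 and 90, final project of 70.
term2 = weighted_grade([80, 90, 70], WEIGHTS_712)
```

With these made-up scores, the final project counts as much as both homeworks combined, so a weaker project pulls the term grade down more than a weaker homework would.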

Homework

Homework will be submitted using git/GitHub and will be due at midnight on Fridays (unless otherwise stated). We will cover more about this in the second lecture. The last commit before midnight will be used to grade the assignment.

Collaboration Policy

You are welcome and encouraged to discuss the lectures and homework problems with others in order to better understand them, but the work you turn in must be your own. For example, you must write your own code, run your own data analyses, and communicate and explain the results in your own words and with your own visualizations. You may not submit the same or similar work to this course that you have submitted or will submit to another. All students turning in plagiarized solutions will be reported to the Office of Academic Integrity, and will fail the assignment.

Quoting Sources

You must acknowledge any source code that was not written by you by mentioning the original author(s) directly in your source code (comment or header). You can also acknowledge sources in a README.txt file if you used whole classes or libraries. Do not remove any original copyright notices and headers. However, you are encouraged to use libraries, unless explicitly stated otherwise!

You may use examples you find on the web as a starting point, provided their license allows you to reuse them. You must quote the source using proper citations (author, year, title, time accessed, URL) both in the source code and in any publicly visible material. You may not use existing complex combinations or large examples. For example, you may not use a ready-to-use visualization with multiple linked views. You may use parts of such examples.

Final Project

At the beginning of the second term, you will start to work on a data science final project. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results. You will have the opportunity to meet with either a TA or instructor at the beginning to initially help guide you in this project. You will have approximately three weeks at the end of term to focus on the final project.

You will work closely with other classmates in a 2-4 (max) person project team. You can come up with your own teams and use Slack to find prospective team members. We recognize that individual schedules, preferences, and other constraints might limit your ability to work in a team. If this is the case, ask us for permission to work alone. However, you will still be expected to complete all portions of the final project on your own.

Missed Activities and Assignment Deadlines

Projects and homework must be turned in on time, with the exception of late days for homeworks as stated below. It is important that everybody attends and proactively participates in class and online. We understand, however, that certain factors may occasionally interfere with your ability to participate or to hand in work on time. If that factor is an extenuating circumstance, we will ask you to provide documentation directly issued by the University, and we will try to work out an agreeable solution with you (and/or your teammates).

Late Day Policy

Each student is given three late days for homework at the beginning of each term (711 and 712). A late day extends the individual homework deadline by 24 hours without penalty. No more than two late days may be used on any one assignment. Late days are intended to give you flexibility: you can use them for any reason, no questions asked. You do not get any bonus points for not using your late days. Also, you can only use late days for the homework deadlines.

Although each student is only given a total of 3 late days per term, we will accept homework from students who exceed this limit. However, we will deduct 10% for each extra late day. For example, if you have already used all of your late days for the term, we will deduct 10% for an assignment that is <24 hours late, and 20% for an assignment that is 24-48 hours late.
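The late day arithmetic above can be sketched in a few lines of code. This is an illustration under stated assumptions, not an official grading tool: the function name `late_penalty` is hypothetical, a submission exactly 24 hours late is assumed to count as two days late (matching the "<24 hours" and "24-48 hours" brackets), and days beyond the two-per-assignment cap are assumed to be charged as extra late days at 10% each.

```python
MAX_FREE_DAYS_PER_ASSIGNMENT = 2   # "No more than two late days ... on any one assignment"
PENALTY_PER_EXTRA_DAY = 10         # percentage deducted per extra late day


def late_penalty(hours_late, late_days_remaining):
    """Return (free_late_days_used, penalty_percent) for one assignment.

    Assumption: any started 24-hour block counts as a full late day, and a
    submission exactly on a 24-hour boundary falls into the next block.
    """
    if hours_late <= 0:
        return 0, 0
    days_late = int(hours_late // 24) + 1
    free_used = min(days_late, late_days_remaining, MAX_FREE_DAYS_PER_ASSIGNMENT)
    extra_days = days_late - free_used
    # Assumption: the deduction is capped at 100% of the assignment.
    penalty = min(100, extra_days * PENALTY_PER_EXTRA_DAY)
    return free_used, penalty


# The two cases from the text: all late days already used.
under_24h = late_penalty(5, late_days_remaining=0)    # (0, 10)
under_48h = late_penalty(30, late_days_remaining=0)   # (0, 20)
```

A student with all three late days left who submits 30 hours late would, under these assumptions, spend two late days and lose no points.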

Regrading Policy

It is very important to us that all assignments are properly graded. If you believe there is an error in your assignment grading, please send an email to one of the instructors within 7 days of receiving the grade. No re-grade requests will be accepted orally, and no regrade requests will be accepted more than 7 days after you receive the grade for the assignment.

Additional Information

Accessibility

If you have a documented disability (physical or cognitive) that may impair your ability to complete assignments or otherwise participate in the course and satisfy course criteria, please meet with us at your earliest convenience to identify, discuss, and document any feasible instructional modifications or accommodations. You should also contact the Office of Student Disability Services to request an official letter outlining authorized accommodations.