Statistics 5525: Data Analytics I

Syllabus

August 28, 2017

Statistics 5525 will be a comprehensive course in Data Mining, Machine Learning, and Probabilistic Modeling techniques. The course covers techniques in supervised, unsupervised, and visualized learning in high-dimensional spaces, and addresses the theoretical, probabilistic, and applied aspects of data analytics. Methods include generalized linear models in high-dimensional spaces, regularization, the lasso and related methods, principal component regression (PCA), tree methods, and random forests. Clustering methods, including K-means, hierarchical clustering, biclustering, and model-based clustering, will be thoroughly examined. Distance-based learning methods include multidimensional scaling, the self-organizing map, graphical/network models, and Isomap. Supervised learning will consist of discriminant analyses, supervised PCA, support vector machines, and kernel methods.

How are 5525 and 5526 different?

STAT 5526 will be a comprehensive course focusing on the more theoretical concepts that come up in this class. The purpose of STAT 5525 is to present many of the algorithms and techniques used in data analytics (DA) and to apply them to data. The exercises in 5525 will focus on both theoretical and practical issues. Discussions of Reproducing Kernel Hilbert Spaces (RKHS) and the mathematical details of kernel methods will be reserved for 5526. Familiarity with a variety of techniques will be developed in 5525 and expanded upon in 5526.

We are drowning in information and starving for knowledge. -R. D. Rogers

Grading policies, office hours, and general information


Course Objectives

  • To develop an understanding of techniques in machine learning, data mining, and probability-based modeling.
  • To compare and contrast algorithmic and model based learning techniques.
  • To understand the theory behind these techniques and implement them.

Logistics

  • Lecture Times and Location:  M/W/F, 8:00 - 8:50 PM,  in Hutcheson 204.
  • Instructor: Professor Scotland Leman, 401A Hutcheson Building, leman(AT)vt(DOT)edu
  • Instructor's Office Hours:   After each class and group meetings
  • Teaching Assistants:    Sumin Shen
  • TAs' Office Hours:  (TBA)

Prerequisites

Students should have completed a graduate-level inference class (required) and, preferably, Multivariate Statistics (recommended). Courses like 5444 and 5314 (Bayes and Simulation, respectively) will also help you tremendously. Ability to program in a high-level programming language such as R or Matlab is assumed.

Readings

The primary text is:

Hastie, Tibshirani, and Friedman (2009). The Elements of Statistical Learning: Second Edition, ISBN: 978-0-387-84857-0. Springer

This is a very comprehensive book on Statistical Learning models and algorithms; however, this text should not limit your reading from other relevant texts.

A good supplementary text is:

Christopher Bishop (2007). Pattern Recognition and Machine Learning: Second Edition, ISBN: 978-0387310732. Springer

This book takes Bayesian inference as a primitive, and extends theory to machine learning.

Computing

For computing, you may use any high-level language of your choosing. For instance, C/C++, Java, Matlab, and R all make for reasonable choices.

Graded work

Graded work for the course will consist of problem sets, computational problems, and some mini quizzes. You may work in teams of 2-3 people for bi-weekly homework (bi-weekly HW teams may be altered throughout the semester). For the final project, you will be in teams of 4-5, but this team will be permanent and changes will not be allowed. Your final grade will be determined as follows:

 
Quizzes 10 %
Homeworks 40 %
Project 50 %

There are no make-ups for exams, in-class problems, or homework problems except in the case of a medical or family emergency or with prior approval of the instructor. See the instructor in advance of relevant due dates to discuss possible alternatives.

Cumulative numerical averages of 90 - 100 are guaranteed at least an A-.   Cumulative numerical averages of 80 - 89 are guaranteed at least a B-.   Cumulative numerical averages of 70 - 79 are guaranteed at least a C-.   Cumulative numerical averages of 60 - 69 are guaranteed at least a D-.  These ranges may be lowered, but they will not be raised (e.g., if everyone has averages in the 90s, everyone gets at least an A-).
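As a rough illustration of how the weights and guaranteed ranges above combine, here is a minimal Python sketch. The function names and the sample scores are hypothetical, and the treatment of fractional averages (for example, an average between 89 and 90) is not stated in this syllabus; the sketch simply compares the unrounded average against each lower bound, which is an assumption.

```python
# Component weights as listed above: quizzes 10%, homework 40%, project 50%.
WEIGHTS = {"quizzes": 0.10, "homework": 0.40, "project": 0.50}

# Lower bounds of the guaranteed ranges, highest first.
# Assumption: no rounding is applied before comparing to a bound.
CUTOFFS = [(90, "A-"), (80, "B-"), (70, "C-"), (60, "D-")]


def cumulative_average(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def guaranteed_minimum(average):
    """Return the guaranteed minimum letter grade, or None below 60."""
    for lower_bound, letter in CUTOFFS:
        if average >= lower_bound:
            return letter
    return None


# Hypothetical student, for illustration only.
scores = {"quizzes": 80, "homework": 90, "project": 85}
average = cumulative_average(scores)
print(average, guaranteed_minimum(average))
```

For these sample scores the average is 0.10(80) + 0.40(90) + 0.50(85) = 86.5, which falls in the 80 - 89 range and so is guaranteed at least a B-. Because the ranges may be lowered, the actual letter grade could be higher than the value this sketch returns.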


Academic honesty

You are expected to abide by Virginia Tech's Community Standard for all work for this course.  Violations of the Standard will result in a failing final grade for this course and will be reported to the Dean of Students for adjudication.  Ignorance of what constitutes academic dishonesty is not a justifiable excuse for violations.

For the homework problems, you may work with others in a study group but must submit your own answers, unless otherwise indicated. For exams, you are required to work alone and for only the specified time period.

Procedures if you suspect your work has been graded incorrectly

Every effort will be made to mark your work accurately.    You should be credited with all the points you've worked hard to earn!   However, sometimes grading mistakes happen.  If you believe that an error has been made on an in-class problem or exam, return the paper to the instructor immediately, stating your claim in writing.

The following claims will be considered for re-grading:

(i)    points are not totaled correctly;
(ii)   the grader did not see a correct answer that is on your paper;
(iii)  your answer is the same as the correct answer, but in a different form (e.g., you wrote a correct answer as 1/3 and the grader was looking for .333);
(iv)  your answer to a free response question is essentially correct but stated slightly differently than the grader's interpretation.

The following claims will not be considered for re-grading:

(v)   arguments about the number of points lost;
(vi)  arguments about question wording.

Considering re-grades takes up valuable time and resources that TAs and the instructor would rather spend helping you understand material.  Please be considerate and only bring claims of type (i), (ii), (iii), or (iv) to our attention.
