Big Data Mining

Welcome to Big Data Mining V 4.0!

This course will focus on understanding the statistical structure of large-scale (big) datasets using machine learning (ML) algorithms. We will cover the basics of ML and study scalable versions of these algorithms for implementation within distributed computing frameworks. We will cover ML techniques such as matrix factorization, convex optimization, dimensionality reduction, clustering, classification, graph analytics and deep learning, among others. We will emphasize algorithmic development for big data mining in three different, but general, scenarios: (1) when available memory is extremely large (e.g., a shared-memory architecture like the Cray Urika); (2) when available memory is small, but can be distributed across a cluster (e.g., cloud-like environments); and (3) when available memory is small and data has to be analyzed “in situ” or “online” (e.g., streaming environments). The course will be project driven (3 mini-projects), with source material from a variety of real-world applications. There will also be one final course project, along with a presentation. Students will be expected to design, implement and test their ML solutions in Apache Spark. Class information will be available on this page.

Time and Location

Class Logistics

  • Class Timings: 8:00 AM – 8:50 AM Mon/Wed/Fri
  • Location: Min Kao 406
  • Office Hours: After class (9:00 AM – 10:00 AM on Wed)


  • If you’d like to meet, please give me a heads-up by email first.
  • Before coming to office hours, please post your question to Piazza first; if you have a question, it is almost guaranteed that others do too.

Contact E-mail:

Please email me at

Teaching Assistant

  • Name: Yongli Zhu
  • Office hours: Monday 12:00 – 1:30 PM and Wednesday 3:30 – 5:00 PM @ Min Kao 206


  • Thank you for attending the first class!
  • The class link has been updated. Also, the instructions for installing Spark, PySpark and the Jupyter notebook integration are available here. Windows installation instructions are slightly different; the link is here.

Outline of Topics

The course will provide an overview of a number of topics related to data mining. In particular, we will learn how to design our algorithms to deal with large-scale datasets.

  1. Big Data Architectures: MapReduce/Hadoop, Apache Spark, an outline of the computing models used for tackling big datasets, and storing and managing large datasets (2-3 classes)
  2. Clustering: How do we find groups of items that are similar? Similarity metrics, the K-means algorithm, approaches for tackling streaming data and in-memory analyses, stochastic gradient descent, etc. Supervised/unsupervised approaches (6 classes)
  3. Classification: Logistic regression, some basic introduction to deep learning theory, support vector machines, regression, and topics on how to modify these algorithms for large datasets (4-5 classes)
  4. Graph analytics: Using graph searches to find patterns in data. A graph theoretic view of data analytics. (2 classes)
  5. Dimensionality Reduction: Singular value decomposition, Principal component analysis, independent component analysis, non-negative matrix factorization, approaches to scale these algorithms for large datasets (4-5 classes)
  6. Deep Learning: Basics and theory, implementation of simple learning networks, representation of networks, deep neural nets, convolutional neural networks (3 classes)
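To give a flavor of the MapReduce model in topic 1, the classic word-count computation can be sketched in plain Python (a toy, single-machine illustration; in Hadoop or Spark the same map and reduce functions run distributed across a cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (pairs are sorted by key first,
    which plays the role of the shuffle step)."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["big data mining", "mining big datasets"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(pairs))
# counts == {'big': 2, 'data': 1, 'datasets': 1, 'mining': 2}
```

In PySpark the same computation is roughly `sc.textFile(path).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.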


There are no required textbooks for the class. However, some suggested references are listed below:

  • LRU’14: Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of Massive Datasets. Cambridge University Press (2014). Link
  • BHK’16: Avrim Blum, John Hopcroft, Ravi Kannan, Foundations of Data Science. Link
  • BBL’11: Ron Bekkerman, Mikhail Bilenko, John Langford, Scaling Up Machine Learning: Parallel and Distributed Applications. Cambridge University Press (2011). Link
  • HKP’11: Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann (2011). Link

The lectures, suggested readings, slides and lecture notes are provided below.

  • Week 1: Jan 10 – Jan 13, 2018
    • Jan 10 – Class 1: Introduction – Part 1
      • Lecture notes. Link
      • Python Tutorials. Link
    • Jan 12 – Class 2: Introduction – Part 2
      • Lecture notes. Link
      • Installation of Apache Spark. Link
  • Week 2: Jan 15 – Jan 19, 2018
    • Jan 15 – Holiday (Martin Luther King Jr. Day)
    • Jan 17 – Class cancelled due to weather.
    • Jan 19 – Class 3: MapReduce and Spark. Link
      • Reference notes for MapReduce. Link
      • Apache Spark. Link
      • Resilient Distributed Datasets. Link
      • Other research. Link
  • Week 3: Jan 22 – Jan 26, 2018
    • Jan 22 – Class 4: MapReduce and Spark (Continued)
      • Reference material – same as Class 3 (Jan 19)
    • Jan 24 – Class 5: Naive Bayes and your first data mining approach! Link
      • Reference notes for Naive Bayes. Link
    • Jan 26 – Class 6: Naive Bayes (continued) – same as class 5
  • Week 4: Jan 29 – Feb 2, 2018
    • Jan 29 – Class 7: Naive Bayes (streaming). Link
      • LRU’14: Chapter 3-4
    • Jan 31 – Class 8: Practical aspects of Apache Spark. Link.  (Demo 1)
      • Shang Gao – Data Science Ph.D. student, UTK
    • Feb 2 – Class 9: Practical aspects of Apache Spark. Link. (Demo 2)
      • Shang Gao – Data Science Ph.D. student, UTK
  • Week 5: Feb 5 – Feb 9, 2018
    • Feb 5 – Class 10: Naive Bayes (streaming). Link
      • LRU’14: Chapter 3-4
    • Feb 7 – Class 11: Strategies for streaming data (part I). [Same as in class 10]
      • LRU’14: Chapter 3-4
    • Feb 9 – Class 12: Strategies for streaming data (part II). [Same as in class 10]
      • LRU’14: Chapters 3-4
  • Week 6: Feb 12 – Feb 16, 2018
    • Feb 12 – Class 13: Strategies for streaming data (part III). [Same as in class 10]
      • LRU’14: Chapters 3-4
    • Feb 14 – Class 14: Classification & Regression (part I). Link
      • HKP’11: Chapter 8
    • Feb 16 – Class 15: Classification & Regression (part II). Link
  • Week 7: Feb 19 – Feb 23, 2018
    • Feb 19 – Class 16: Classification & Regression (part III). Link
    • Feb 21 – Class 17: Practical aspects of classification/regression.
    • Feb 23 – Class 18: Practical aspects of classification/regression. (Demo)
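The Naive Bayes material from classes 5-10 boils down to comparing per-class log-probabilities. A minimal multinomial sketch in plain Python with add-one smoothing (an illustration only, with made-up data; this is not the course’s reference implementation):

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """Estimate log-priors and smoothed log-likelihoods from labeled docs."""
    priors = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counter in word_counts.values() for w in counter}
    model = {}
    for label in priors:
        total = sum(word_counts[label].values())
        model[label] = (
            math.log(priors[label] / len(labels)),          # log P(label)
            {w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
             for w in vocab},                               # log P(word|label)
            math.log(1 / (total + len(vocab))),             # unseen-word fallback
        )
    return model

def predict(model, doc):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    def score(label):
        log_prior, likelihoods, unseen = model[label]
        return log_prior + sum(likelihoods.get(w, unseen)
                               for w in doc.lower().split())
    return max(model, key=score)

model = train(["good great fun", "bad awful boring", "great movie"],
              ["pos", "neg", "pos"])
print(predict(model, "great fun"))  # → pos
```

Working in log-space avoids the numerical underflow you would get from multiplying many small probabilities, which matters once documents get long.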


Assignment  Description                 Released      Due
1           Naive Bayes (15%)           Jan 26, 2018  Feb 16, 2018
2           Clustering (15%)            Feb 16, 2018  Mar 9, 2018
3           Matrix factorization (15%)  Mar 9, 2018   Apr 2, 2018
            Total: 45%


  • Assignments take time; start early!
  • Individual assignments only: collaboration is okay, but reports and code have to be original.
  • Mention who you collaborated with.
  • Typeset your assignments in LaTeX; I will cut 10% of the grade if assignments are not typeset!
  • Please use the following LaTeX Template to typeset your assignments. Alternatively, you can use the following Word Template.
  • Electronic hand-in of assignments:
    • Name your submission [lastname]-HW-[number]-submit.tgz
    • Ex: Ramanathan-HW-0-submit.tgz
  • Post questions via Piazza
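For reference, a submission tarball following the naming convention above can be assembled like this (a sketch; the student name, homework number and file names are hypothetical):

```python
import pathlib
import tarfile

# Hypothetical homework layout for a student named "Ramanathan"
pathlib.Path("hw0/code").mkdir(parents=True, exist_ok=True)
pathlib.Path("hw0/code/solution.py").write_text("print('hello, spark')\n")
pathlib.Path("hw0/report.pdf").touch()

# Package as [lastname]-HW-[number]-submit.tgz
with tarfile.open("Ramanathan-HW-0-submit.tgz", "w:gz") as tar:
    tar.add("hw0/report.pdf", arcname="report.pdf")
    tar.add("hw0/code", arcname="code")
```

The shell equivalent, run from inside your homework directory, is simply `tar czf Ramanathan-HW-0-submit.tgz report.pdf code/`.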


Projects are worth 45% of your grade. We will use the following guidelines for evaluating the projects.

Deliverable                       Due date        % Grade
Initial selection of topics       Jan 26, 2018    1
Project Description and Approach  Feb 27, 2018    2
Initial Report                    Mar 19, 2018    7
Project Demonstration             Apr 9-20, 2018  15
Final Project Report (10 pages)   Apr 27, 2018    10
Poster (12-16 slides)             Apr 20, 2018    10

All projects should follow the NIPS template. I will cut 10% of the grade if the project reports are not in the right format.

Project presentations

You will have about 15-18 minutes for your presentation. The total number of slides can vary depending on how you organize your presentation; as a general guideline, a 15-18 minute talk typically has 15-18 slides.

  • Give a brief outline of the problem you are interested in solving, and explain its broader context (up to 5 min).
  • Discuss the approach you have taken, and present an overview of what others have done in this area (up to 5 min).
  • Discuss your results (5-7 min).
  • Conclusions and future work (about 1 min).

You will also evaluate your peers’ presentations. We will distribute a “grading sheet” so that everyone is evaluated on the same criteria.

Poster Presentations

For the poster presentations, you are expected to present a poster measuring 3 ft x 4 ft (or 4 ft x 3 ft) on Apr 20, 2018 between 8:00 AM and 9:30 AM. We will hold the poster session in the 4th-floor atrium of Min Kao. The session will include people from ORNL and potentially a number of faculty members from UT Knoxville. Please make sure you attend the poster session, since you will also be evaluating your peers.

Project Topics/Ideas

Data Analytic Infrastructure/Systems

Benchmarking Deep Learning for Text Comprehension: Our group has been developing a number of different text comprehension algorithms using a variety of platforms, including Torch, Theano, TensorFlow, mxnet, etc. We would like to set up large-scale benchmarks for these approaches on our existing supercomputing platforms at the Oak Ridge Leadership Computing Facility (OLCF). As part of this project, you’d be expected to implement our deep learning codes on various deep learning platforms and evaluate both their performance and their optimization options on OLCF platforms. Contacts: Arvind Ramanathan, Hong-Jun Yoon

Benchmarking Deep Learning for Molecular Dynamics Simulations: We have a large repository of long time-scale molecular dynamics simulations. Our group has been setting up different deep learning approaches for analyzing these datasets, and we would like to set up large-scale benchmarks for these approaches on our existing supercomputing platforms at the Oak Ridge Leadership Computing Facility (OLCF). As part of this project, you’d be expected to implement our deep learning codes on various deep learning platforms and evaluate both their performance and their optimization options on OLCF platforms. Contacts: Arvind Ramanathan, Debsindhu Bhowmik

Data Analytics Algorithms & Applications

Time-dependent recurrent models for Molecular Dynamics Simulations: One of the challenges of running long-timescale simulations is that we need to identify events that happen at multiple timescales simultaneously. One way to organize these large simulations is to find conformational states that share some conformational/energetic similarity. What this means is that we need to build unsupervised learning algorithms that discover these states automatically. We have done some preliminary work related to these ideas and would like to continue examining this area a bit more. Contacts: Arvind Ramanathan
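As a toy illustration of the kind of unsupervised state discovery described above, Lloyd’s k-means algorithm over hypothetical low-dimensional conformation coordinates might look like this (plain Python; real trajectories are far higher-dimensional, and a project would use scalable or time-aware variants):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    random.seed(seed)
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:  # leave a centroid in place if its cluster empties
                centroids[i] = tuple(sum(dim) / len(cluster)
                                     for dim in zip(*cluster))
    return centroids, clusters

# Two well-separated, made-up "conformational states" in 2-D
frames = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(frames, k=2)
```

On well-separated data like this, the two recovered centroids land near the two state means; the time-dependent models in the project would additionally exploit the order in which frames are visited.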