Big Data Mining

Welcome to Big Data Mining V 4.0!

This course will focus on understanding the statistical structure of large-scale (big) datasets using machine learning (ML) algorithms. We will cover the basics of ML and study scalable versions of these algorithms for implementation within distributed computing frameworks. We will pursue ML techniques such as matrix factorization, convex optimization, dimensionality reduction, clustering, classification, graph analytics and deep learning, among others. We will emphasize algorithmic development for big data mining in three different, but general scenarios: (1) when available memory is extremely large (e.g., a shared memory architecture like Cray Urika); (2) when available memory is small, but can be distributed across a cluster (e.g., cloud-like environments); and (3) when the available memory is small and data has to be analyzed “in-situ” or “online” (e.g., streaming environments). The course will be project driven (3 mini projects) with source material from a variety of real-world applications. There will be one final course project, along with a presentation. Students will be expected to design, implement and test their ML solutions in Apache Spark. Class information will be available at https://ramanathanlab.org/cosc526.

Class Logistics

  • Class timings: 8:00 – 8:50 AM, Mon/Wed/Fri
  • Location: Min Kao 406
  • Office hours: after class (9:00 – 10:00 AM on Wed)

Notes

  • If you’d like to meet, please give me a heads-up by email first.
  • Whenever possible, post your question to Piazza before coming to office hours; if you have a question, it is almost guaranteed that others do too.

Contact E-mail:

Please email me at ramanathana@ornl.gov.

Teaching Assistant

  • Name: Yongli Zhu
  • Email: yzhu16@vols.utk.edu
  • Office hours: Monday 12:00 – 1:30 PM and Wednesday 3:30 – 5:00 PM @ Min Kao 206

Announcements

  • Thank you for attending the first class!
  • The class link has been updated. Also, the instructions for installing Spark, PySpark and the Jupyter notebook integration are available here. Windows installation instructions are slightly different – the link is here. A quick way to verify your setup is sketched below.
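
For reference, here is a minimal smoke test you can run once everything is installed (a sketch that assumes PySpark was installed via pip and is run from a Jupyter notebook or with spark-submit):

    # Minimal PySpark smoke test: parallelize a small list and do a
    # trivial map/reduce round trip to confirm Spark works end to end.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "SmokeTest")
    total = sc.parallelize(range(10)) \
              .map(lambda x: x * x) \
              .reduce(lambda a, b: a + b)
    print(total)  # 285 = 0^2 + 1^2 + ... + 9^2
    sc.stop()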

Outline of Topics

The course will provide an overview of a number of topics related to data mining. In particular, we will learn how to design our algorithms to deal with large-scale datasets.

  1. Big Data Architectures: MapReduce/Hadoop, Apache Spark, an outline of the computing models used for tackling big datasets, and storing and managing such datasets (2-3 classes). A small taste of the programming model is sketched after this list.
  2. Clustering: How do we find groups of similar items? Similarity metrics, the K-means algorithm, approaches for streaming data and in-memory analyses, stochastic gradient descent, and supervised vs. unsupervised approaches (6 classes)
  3. Classification: Logistic regression, some basic introduction to deep learning theory, support vector machines, regression, and topics on how to modify these algorithms for large datasets (4-5 classes)
  4. Graph analytics: Using graph searches to find patterns in data. A graph theoretic view of data analytics. (2 classes)
  5. Dimensionality Reduction: Singular value decomposition, principal component analysis, independent component analysis, non-negative matrix factorization, and approaches to scale these algorithms to large datasets (4-5 classes)
  6. Deep Learning: Basics and theory, implementation of simple learning networks, representation of networks, deep neural nets, convolutional neural networks (3 classes)
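
To give a small taste of the programming model from topic 1, here is the canonical MapReduce example, word count, written in PySpark. This is only an illustrative sketch; the input path input.txt is a placeholder:

    # Word count, the canonical MapReduce example, in PySpark.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")
    counts = (sc.textFile("input.txt")                # one element per line
                .flatMap(lambda line: line.split())   # "map": emit words
                .map(lambda word: (word, 1))          # key-value pairs
                .reduceByKey(lambda a, b: a + b))     # "reduce": sum per key
    print(counts.take(10))                            # a few (word, count) pairs
    sc.stop()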

Lectures

There are no required textbooks for the class. However, some suggested references are listed below:

  • LRU’14: Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of Massive Datasets. Cambridge University Press (2014). Link
  • BHK’16: Avrim Blum, John Hopcroft, Ravi Kannan, Foundations of Data Science. Link
  • BBL’11: Ron Bekkerman, Mikhail Bilenko, John Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press (2011). Link
  • HKP’11: Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann (2011). Link

The lectures, suggested readings, slides and lecture notes are provided below.

  • Week 1: Jan 10 – Jan 12, 2018
    • Jan 10 – Class 1: Introduction – Part 1
      • Lecture notes. Link
      • Python Tutorials. Link
    • Jan 12 – Class 2: Introduction – Part 2
      • Lecture notes. Link
      • Installation of Apache Spark. Link
  • Week 2: Jan 15 – Jan 19, 2018
    • Jan 15 – Holiday (Martin Luther King Jr. Day)
    • Jan 17 – Class cancelled due to weather.
    • Jan 19 – Class 3: MapReduce and Spark. Link
      • Reference notes for MapReduce. Link
      • Apache Spark. Link
      • Resilient Distributed Datasets. Link
      • Other research. Link
  • Week 3: Jan 22 – Jan 26, 2018
    • Jan 22 – Class 4: MapReduce and Spark (Continued)
      • Reference material – same as Class 3 (Jan 19)
    • Jan 24 – Class 5: Naive Bayes and your first data mining approach! Link
      • Reference notes for Naive Bayes. Link
    • Jan 26 – Class 6: Naive Bayes (continued) – same as class 5
  • Week 4: Jan 29 – Feb 2, 2018
    • Jan 29 – Class 7: Naive Bayes (streaming). Link
      • LRU’14: Chapters 3-4
    • Jan 31 – Class 8: Practical aspects of Apache Spark. Link. (Demo 1)
      • Shang Gao – Data Science Ph.D. student, UTK
    • Feb 2 – Class 9: Practical aspects of Apache Spark. Link. (Demo 2)
      • Shang Gao – Data Science Ph.D. student, UTK
  • Week 5: Feb 5 – Feb 9, 2018
    • Feb 5 – Class 10: Naive Bayes (streaming). Link
      • LRU’14: Chapters 3-4
    • Feb 7 – Class 11: Strategies for streaming data (part I). [Same as in class 10]
      • LRU’14: Chapters 3-4
    • Feb 9 – Class 12: Strategies for streaming data (part II). [Same as in class 10]
      • LRU’14: Chapters 3-4
  • Week 6: Feb 12 – Feb 16, 2018
    • Feb 12 – Class 13: Strategies for streaming data (part III). [Same as in class 10]
      • LRU’14: Chapters 3-4
    • Feb 14 – Class 14: Classification & Regression (part I). Link
      • HKP’11: Chapter 8
    • Feb 16 – Class 15: Classification & Regression (part II). Link
  • Week 7: Feb 19 – Feb 23, 2018
    • Feb 19 – Class 16: Classification & Regression (part III). [Same as in Class 15]
    • Feb 21 – Class 17: Practical aspects of classification/regression. (Demo)
    • Feb 23 – Class 18: Practical aspects of classification/regression. (Demo)
  • Week 8: Feb 26 – Mar 2, 2018
    • Feb 26 – Class 19: Classification & Regression (part IV). Link
    • Feb 28 – Class 20: Classification & Regression (part V). Link
    • Mar 2 – Class 21: Classification & Regression (part VI). [Same as in class 20]
  • Week 9: Mar 5 – Mar 9, 2018
    • Mar 5 – Class 22: Classification & Regression (part VII). [Same as class 20]
    • Mar 7 – Class 23: Classification & Regression (part VIII)
    • Mar 9 – Class 24: Clustering (part I). Link
  • Week 10: Mar 19 – Mar 23, 2018
    • Mar 19 – Class 25: Clustering (part II). [Same as class 24]
    • Mar 21 – Class 26: Clustering (part III). Link
    • Mar 23 – Class 27: Clustering (part IV)
  • Week 11: Mar 26 – Mar 30, 2018
    • Mar 26 – Class 28: Clustering (part V). Link
    • Mar 28 – Class 29: Clustering (part VI)
    • Mar 30 – Holiday
  • Week 12: Apr 2 – Apr 6, 2018
    • Apr 2 – Class 30: Dimensionality Reduction (part I). Link
    • Apr 4 – Class 31: Dimensionality Reduction (part II)
    • Apr 6 – Class 32: Graph Mining (part I). Link (1). Link (2)
  • Week 13: Apr 9 – Apr 13, 2018
    • Apr 9 – Class 33: Graph Mining (part II)
    • Apr 11 – Class 34: Graph Mining (part III)
    • Apr 13 – Class 35: Deep Learning (part I). Link

Assignments

Assignment   Description                   Out            Due
1            Naive Bayes (22.5%)           Jan 26, 2018   Feb 16, 2018
2            Clustering (22.5%); dataset   Feb 28, 2018   Apr 11, 2018
Total: 45%

Notes

  • Assignments take time; start early!
  • Individual assignments only: collaboration is okay, but reports and code have to be original.
  • Mention who you collaborated with.
  • Typeset your assignments in LaTeX; I will cut 10% of the grade if assignments are not typeset!
  • Please use the following LaTeX template. Alternatively, you can use the following Word template.
  • Electronic hand-over of assignments:
    • [lastname]-HW-[number]-submit.tgz
    • [lastname]-HW-[number]-submit.zip
    • Ex: Ramanathan-HW-0-submit.tgz
  • Post questions via Piazza

Projects

Projects are worth 45% of your grade. We will use the following guidelines for evaluating the projects.

Deliverable                          Due date         % Grade
Initial selection of topics          Jan 26, 2018     1
Project Description and Approach     Feb 27, 2018     2
Initial Report                       Mar 19, 2018     7
Project Demonstration                Apr 9-20, 2018   15
Final Project Report (10 pages)      Apr 27, 2018     10
Poster (12-16 slides)                Apr 20, 2018     10

All projects should follow the NIPS template. I will cut 10% of the grade if the project reports are not in the right format.

Project presentations

You have about 15-18 minutes for your presentation. The total number of slides can vary depending on how you organize the talk; as a general guideline, a 15-18 minute presentation typically has 15-18 slides.

  • Give a brief outline of the problem you are interested in solving, and explain its broader context (up to 5 min).
  • Discuss the approach you have taken, and present an overview of what others have done in this area (up to 5 min).
  • Discuss your results (5-7 min).
  • Conclusions and future work (about 1 min).

You will also evaluate your peers’ presentations. We will distribute a “grading sheet” so that everyone is evaluated on the same criteria.

 

Project Topics/Ideas

Data Analytic Infrastructure/Systems

Benchmarking Deep Learning for Text Comprehension: Our group has been developing a number of different text comprehension algorithms using a variety of platforms, including Torch, Theano, TensorFlow, mxnet, etc. We would like to set up large-scale benchmarks for these approaches on our existing supercomputing platforms at the Oak Ridge Leadership Computing Facility (OLCF). As part of this project, you’d be expected to implement our deep learning codes on various deep learning platforms and evaluate both their performance and optimization options on OLCF platforms. Contacts: Arvind Ramanathan (ramanathana@ornl.gov), Hong-Jun Yoon (yoonh@ornl.gov).

Benchmarking Deep Learning for Molecular Dynamics Simulations: We have a large repository of long time-scale molecular dynamics simulations, and our group has been setting up different deep learning approaches for analyzing these datasets. We would like to set up large-scale benchmarks for these approaches on our existing supercomputing platforms at the Oak Ridge Leadership Computing Facility (OLCF). As part of this project, you’d be expected to implement our deep learning codes on various deep learning platforms and evaluate both their performance and optimization options on OLCF platforms. Contacts: Arvind Ramanathan (ramanathana@ornl.gov), Debsindhu Bhowmik (bhowmikd@ornl.gov).

Data Analytics Algorithms & Applications

Time-dependent recurrent models for Molecular Dynamics Simulations: One of the challenges of running long-timescale simulations is that we need to identify events that happen at multiple timescales simultaneously. One way to organize these large simulations is to find conformational states that share some conformational/energetic similarity; that is, we need to build unsupervised learning algorithms that discover these states automatically. We have done some preliminary work related to these ideas and would like to continue examining this area. A possible starting point is sketched below. Contacts: Arvind Ramanathan (ramanathana@ornl.gov).
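
As a very rough illustration of the kind of unsupervised analysis involved, the sketch below clusters per-frame feature vectors with Spark MLlib’s K-means. The input file features.csv, its format, and the choice of k = 5 are all hypothetical; they stand in for whatever featurization a project would actually use:

    # Hedged sketch: cluster simulation frames into candidate states.
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[*]", "ConformationalStates")

    # Assume each line of features.csv holds one comma-separated
    # feature vector describing a single simulation frame (hypothetical).
    frames = sc.textFile("features.csv") \
               .map(lambda line: [float(x) for x in line.split(",")])

    model = KMeans.train(frames, k=5, maxIterations=20)  # k is a free choice
    print(model.clusterCenters)  # one centroid per candidate state
    sc.stop()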

 
