Welcome to Big Data Mining V 4.0!
This course will focus on understanding the statistical structure of large-scale (big) datasets using machine learning (ML) algorithms. We will cover the basics of ML algorithms and study scalable versions of them for implementation within distributed computing frameworks. We will pursue ML techniques such as matrix factorization, convex optimization, dimensionality reduction, clustering, classification, graph analytics and deep learning, among others. We will emphasize algorithmic development for big data mining in three different but general scenarios: (1) when available memory is extremely large (e.g., a shared memory architecture like Cray Urika); (2) when available memory is small, but can be distributed across a cluster (e.g., cloud-like environments); and (3) when the available memory is small and data has to be analyzed “in-situ” or “online” (e.g., streaming environments). The course will be project driven (3 mini projects) with source material from a variety of real-world applications. There will be one final course project, along with a presentation. Students will be expected to design, implement and test their ML solutions in Apache Spark. Class information will be available at https://ramanathanlab.org/cosc526.
Time and Location
- Class Timings: 8.00 AM – 8.50 AM Mon/Wed/Fri.
- Location: Min Kao 406
- Office Hours: After class (9.00 AM – 10 AM on Wed)
- I would appreciate a heads-up if you would like to meet; please send an email beforehand.
- Please post to Piazza first before coming to office hours.
- If your question can be posted on Piazza, please post it there: if you have a question, other students almost certainly have it too.
Please email me at firstname.lastname@example.org.
- Name: Yongli Zhu
- Email: email@example.com
- Office hours: Monday 12:00 – 1:30 PM and Wednesday 3:30 – 5:00 PM @ Min Kao 206.
- Thank you for attending the first class!
- The class link has been updated. Also, the instructions for installing Spark and PySpark and integrating them with Jupyter notebooks are available here. Windows installation instructions are slightly different; the link is here.
Outline of Topics
The course will provide an overview of a number of topics related to data mining. In particular, we will learn how to design our algorithms to deal with large-scale datasets.
- Big Data Architectures: MapReduce/Hadoop, Apache Spark, some outlines of computing models used for tackling big datasets, storing and managing datasets (2-3 classes)
- Clustering: How do we find groups of items that are similar? Similarity metrics, K-means algorithm, other approaches to tackle streaming data and in-memory analyses, stochastic gradient descent, etc. Supervised/unsupervised approaches (6 classes)
- Classification: Logistic regression, some basic introduction to deep learning theory, support vector machines, regression, and topics on how to modify these algorithms for large datasets (4-5 classes)
- Graph analytics: Using graph searches to find patterns in data. A graph theoretic view of data analytics. (2 classes)
- Dimensionality Reduction: Singular value decomposition, principal component analysis, independent component analysis, non-negative matrix factorization, approaches to scale these algorithms for large datasets (4-5 classes)
- Deep Learning: Basics and theory, implementation of simple learning networks, representation of networks, deep neural nets, convolutional neural networks (3 classes)
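As a rough illustration of what designing an algorithm for large-scale data can look like, the sketch below writes one K-means update as a map step (each partition computes partial sums and counts per centroid) followed by a reduce step (partial results are merged into new centroids). It is plain Python standing in for a distributed framework, not course material: the function names (`closest`, `map_partition`, `reduce_partials`, `kmeans_step`) and the representation of partitions as lists of tuples are assumptions made for this example.

```python
def closest(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    """Map step: for one partition, accumulate per-centroid coordinate sums and counts."""
    dim = len(centroids[0])
    partial = {}
    for point in points:
        k = closest(point, centroids)
        sums, count = partial.get(k, ([0.0] * dim, 0))
        partial[k] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce step: merge the per-partition results and compute the new centroids."""
    dim = len(centroids[0])
    totals = {}
    for partial in partials:
        for k, (sums, count) in partial.items():
            acc_sums, acc_count = totals.get(k, ([0.0] * dim, 0))
            totals[k] = ([a + s for a, s in zip(acc_sums, sums)], acc_count + count)
    # A centroid with no assigned points keeps its previous position.
    return [
        [s / totals[k][1] for s in totals[k][0]] if k in totals else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans_step(partitions, centroids):
    """One K-means iteration over data that is split into partitions."""
    partials = [map_partition(points, centroids) for points in partitions]
    return reduce_partials(partials, centroids)


if __name__ == "__main__":
    # Two hypothetical partitions of 2-D points and two starting centroids.
    partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
    print(kmeans_step(partitions, [(1.0, 1.0), (9.0, 9.0)]))
```

Only the small per-centroid sums and counts cross partition boundaries, never the raw points, which is why this formulation maps naturally onto cluster settings where each worker holds part of the data.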
There are no required textbooks for the class. However, some suggested references are listed below:
- LRU’14: Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of massive datasets. Cambridge University Press (2014). Link
- BHK’16: Avrim Blum, John Hopcroft, Ravi Kannan, Foundations of data science. Link
- BBL’11: Ron Bekkerman, Mikhail Bilenko, John Langford, Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press (2011). Link
- HKP’11: Jiawei Han, Micheline Kamber, Jian Pei, Data mining: concepts and techniques. Morgan Kaufmann (2011). Link
The lectures, suggested readings, slides and lecture notes are provided below.
- Week 1: Jan 10 – Jan 13, 2018
- Week 2: Jan 15 – Jan 19, 2018
- Week 3: Jan 22 – Jan 26, 2018
- Week 4: Jan 29 – Feb 2, 2018
- Week 5: Feb 5 – Feb 9, 2018
- Feb 5 – Class 10: Naive Bayes (streaming). Link
- LRU’14: Chapters 3-4
- Feb 7 – Class 11: Strategies for streaming data (part I). [Same as in class 10]
- LRU’14: Chapters 3-4
- Feb 9 – Class 12: Strategies for streaming data (part II). [Same as in class 10]
- LRU’14: Chapters 3-4
- Week 6: Feb 12 – Feb 16, 2018
- Week 7: Feb 19 – Feb 23, 2018
- Feb 19 – Class 16: Classification & Regression (part III). Link
- Feb 21 – Class 17: Practical aspects of classification/regression.
- Feb 23 – Class 18: Practical aspects of classification/regression. (Demo)
|#||Assignment (weight)||Released||Due|
|1||Naive Bayes (15%)||Jan 26, 2018||Feb 16, 2018|
|2||Clustering (15%)||Feb 16, 2018||Mar 9, 2018|
|3||Matrix factorization (15%)||Mar 9, 2018||Apr 2, 2018|
- Assignments take time; start early!
- Individual assignments only: collaboration is okay, but reports and code have to be original.
- Mention who you collaborated with.
- Typeset your assignments in LaTeX; I will deduct 10% of the grade if assignments are not typeset!
- Please use the following LaTeX template to typeset your assignments. Alternatively, you can use the following Word template.
- Electronic hand-over of assignments:
- Ex: Ramanathan-HW-0-submit.tgz
- Post questions via Piazza
Projects are worth 45% of your grade. We will use the following guidelines for evaluating the projects.
|Milestone||Due date||Weight (% of grade)|
|Initial selection of topics||Jan 26, 2018||1|
|Project Description and Approach||Feb 27, 2018||2|
|Initial Report||Mar 19, 2018||7|
|Project Demonstration||Apr 9-20, 2018||15|
|Final Project Report (10 pages)||Apr 27, 2018||10|
|Poster (12-16 slides)||Apr 20, 2018||10|
All projects should follow the NIPS template. I will deduct 10% of the grade if the project reports are not in the right format.
You have about 15-18 minutes for your presentation. The total number of slides can vary depending on how you organize your presentation; as a general guideline, a 15-18 minute slideshow typically has 15-18 slides.
- Discuss a brief outline of the problem that you are interested in solving. Explain the broader context of the problem (up to 5 min).
- Discuss the approach you have taken. Present an overview of what others have done in this area (up to 5 min).
- Discuss your results (5-7 min).
- Conclusions and future work (about 1 min).
You will evaluate your peers in terms of the presentation. We will distribute a “grading sheet” so that everyone will be evaluated on similar criteria.
For the poster presentations, you are expected to present a poster measuring 3 ft x 4 ft (or 4 ft x 3 ft) on Apr 20, 2018 between 8.00 AM and 9.30 AM. We will hold the poster session in the 4th floor atrium of Min Kao. The session will include people from ORNL and potentially a number of faculty members from UT Knoxville. Please make sure you attend the poster session, since you will have to evaluate your peers.
Data Analytic Infrastructure/Systems
Benchmarking Deep Learning for Text Comprehension: Our group has been developing a number of different text comprehension algorithms using a variety of platforms, including Torch, Theano, TensorFlow, mxnet, etc. We would like to set up large-scale benchmarks for these approaches on our existing supercomputing platforms at Oak Ridge Leadership Computing Facility (OLCF). As part of this project, you’d be expected to implement our deep learning codes on various deep learning platforms and evaluate both their performance and their optimization options on OLCF platforms. Contacts: Arvind Ramanathan (firstname.lastname@example.org), Hong-Jun Yoon (email@example.com).
Benchmarking Deep Learning for Molecular Dynamics Simulations: We have a large repository of long time-scale molecular dynamics simulations. Our group has been setting up different deep learning approaches for analyzing these datasets. We would like to set up large-scale benchmarks for these approaches on our existing supercomputing platforms at Oak Ridge Leadership Computing Facility (OLCF). As part of this project, you’d be expected to implement our deep learning codes on various deep learning platforms and evaluate both their performance and their optimization options on OLCF platforms. Contacts: Arvind Ramanathan (firstname.lastname@example.org), Debsindhu Bhowmik (email@example.com).
Data Analytics Algorithms & Applications
Time-dependent recurrent models for Molecular Dynamics Simulations: One of the challenges of running long timescale simulations is that we need to identify events that happen at multiple timescales simultaneously. One way to organize these large simulations is to find conformational states that share some conformational/energetic similarity. This means we need to build unsupervised learning algorithms that discover these states automatically. We have done some preliminary work related to these ideas and would like to explore this area further. Contacts: Arvind Ramanathan (firstname.lastname@example.org).