COSC 526: Big Data Mining at Electrical Engineering and Computer Science Department, University of Tennessee, Knoxville
As in the previous year, this class focuses on Big Data. The goal of the class is to make students aware of recent developments in data mining, with a focus on understanding the structure of large datasets. I emphasize three settings for algorithmic development in big data mining: (1) when available memory is extremely large (e.g., a shared-memory architecture like Cray Urika); (2) when available memory is small, but can be distributed across a cluster (e.g., cloud-like environments); and (3) when the available memory is small and data has to be analyzed “in-situ” or “online” (e.g., streaming environments). I plan on using Apache Spark, GraphX and other recent platforms for the projects. Students usually do four (individual) mini-projects and a class-level project that is presented in a poster session at the end of the semester.
COSC 526: Big Data Mining at Electrical Engineering and Computer Science Department, University of Tennessee at Knoxville
The emphasis of this section will be on Big Data. Tentative topics include: (1) an introduction to big data mining paradigms using (a) distributed computing tools such as MapReduce/Hadoop and (b) multi-core tools including GPUs and heterogeneous compute resources; (2) ideas to “munge, manipulate and analyze” large volumes of data; (3) streaming data analytics; (4) randomized/probabilistic approaches to matrix decomposition and dimensionality reduction; (5) similarity search in high-dimensional datasets; (6) link detection/PageRank and applications; and (7) graph mining techniques.
Planned datasets for course projects include: (1) large volumes of social media data (over a year's worth of data collected at ORNL; >10 TB), (2) open-source claims data from the Centers for Medicare & Medicaid Services (~80-100 GB, but complex and noisy healthcare-related data), (3) the 1000 Genomes Project (>2 TB of highly complex and noisy biological data), and (4) cybersecurity data.
The course grade was based on two mini-projects (with small implementation examples), a course project (which begins within the first two weeks of class) and a final poster session.