Featured Projects List

This demo visualizes gene sequence data with the Multi-Dimensional Scaling (MDS) algorithm. It runs MDS with 200 iterations on 30K data points over 30 computer nodes. The 16 clusters are generated with the K-means clustering algorithm and labeled by different colors.

PlotViz

Large-scale high dimensional data visualization is highly valuable for scientific discovery in many fields of data mining and information retrieval. PlotViz is a 3D data point browser that visualizes large volumes of 2- or 3-dimensional data as points in a virtual space on a computer screen and enables users to explore the virtual space interactively. PlotViz was initially designed to consume outputs of dimension reduction algorithms for visualizing high-dimensional data in a lower-dimensional space, such as Multi-dimensional Scaling (MDS) and Generative Topographic Mapping (GTM). Used together with such dimension reduction algorithms, PlotViz can help users to discover intrinsic structures of high-dimensional data and browse large volumes of data points interactively and efficiently in a virtual 3D space.

banner

Harp – Hadoop and Collective Communication for Iterative Computation

Recent newly designed Big Data processing tools focus on the abstraction of data and related computation flows. For example, Hadoop MapReduce defines data as Key-Value pairs and computation as Map-Reduce tasks. Pregel/Giraph perceives data as vertices and edges in graphs and computation as BSP iterations. Spark abstracts data as RDDs with transformation operations on top of them. However, there is no abstraction of communication patterns in these tools. On the other hand, traditional distributed data processing tools, represented by MPI, have abstraction of collective communication. But this kind of abstraction is limited because it is only based on arrays and buffers. It cannot support complicated data abstractions and related communication patterns such as shuffling on Key-Values or graph communication based on edges and vertices.

To improve the expressiveness and performance in big data processing, the Harp library is introduced, which provides data abstractions and related communication abstractions and transforms map-reduce programming models into map-collective models. The word “harp” symbolizes the effort to make parallel processes cooperate together through collective communication for efficient data processing, just as strings in a harp can make concordant sound. Harp can integrate with Hadoop 1.2.1 and Hadoop 2.2.0. It supports data abstraction types such as arrays, key-values, and graphs with related collective communication operations on top of each type. Several applications are developed based on Harp framework, including K-means clustering, multi-dimensional scaling and PageRank. Being based on Hadoop, Harp has better sustainability and fault tolerance properties than Twister or Twister4Azure that inspired it.

banner

Twister – Iterative MapReduce

Twister was our original Iterative MapReduce system introduced with open source in 2010. The Harp project described earlier builds on Twister and Twister4Azure.

banner

Twister4Azure – Iterative MapReduce for Windows Azure Cloud

Twister4Azure is an iterative MapReduce framework for Azure cloud that extends the MapReduce programming model to support data-intensive iterative computations. Twister4Azure enables a wide array of large-scale iterative data mining and data analysis applications to utilize the Azure cloud platform in an easy, efficient, and fault-tolerant manner. Its architecture utilizes the scalable, distributed, and highly available Azure cloud services as the underlying building blocks and employs a decentralized control architecture that avoids single point failures. Twister4Azure takes care of almost all the Azure infrastructure (service failures, load balancing, etc.) and coordination challenges, and frees users from having to deal with the challenges of cloud services. It also implements Map-Collectives high performance communication and computation abstractions, supporting Map-AllGather, Map-AllReduce, MapReduceMergeBroadcast and Map-ReduceScatter patterns. This work was presented at CCGrid in May 2014 and has pioneered many of the Collectives ideas that we are carrying forward on Harp described below.

banner

IndexedHBase

As data intensive problems evolve, many research projects require efficient analysis of a target subset of data, rather than the whole dataset. IndexedHBase is a storage system that extends HBase with a customizable indexing framework to support fast queries and analysis of interesting data subsets. By building index structures that are specially customized for the actual applications, IndexedHBase can achieve a query evaluation speed that is significantly faster (by one to two orders of magnitude) than using the existing indexing techniques provided by commercial NoSQL databases such as Riak. Leveraging an architecture based on YARN, IndexedHBase can be integrated with various parallel computing platforms, such as Hadoop MapReduce, Harp and Twister, to complete efficient analysis of the query results.

banner

CloudMOOC course

Funded by Google, we have developed a Cloud Computing online course (CloudMOOC) using their Course Builder technology. CloudMOOC offers a repository of 10-20 minute lessons (‘songs’) drawn from the Course Builder library, which are assembled as a playlist into ‘albums’ (units, modules, courses). This is similar to the ‘Playlist’ that is a popular feature in YouTube and web-based music repositories. The customization will be explored using material from Indiana University to form a repository of lessons from which courses can be prepared. Our initial efforts will be aimed at online data science programs in an effort to help the U.S. meet a projected shortage of data analysts, data solutions architects and other data science positions of nearly 2 million workers by 2018.

The content of the Cloud Computing online course is hosted in the Google App Engine infrastructure and supports computing laboratories associated with MOOCs. Course projects are packaged in a sandbox VM along with the required environment setup or VM cluster support of massive online courses for hands-on lab experience using NSF FutureGrid testbed as a cloud backend. The MOOCs produce custom modules for courses optimized for data mining and data analysis applications that fit the desired research, education and training requirements. The CloudMOOC project has improved the basic course builder code by supporting a look and feel adapting to laptop, PC, or portable device.

We will explore community interest in building and using such MOOCs in a new X-MOOC repository, including open source learning management software from Google and edX. This will address a critical need faced by an increasing number of researchers and students for collaborative and reusable work in the Big Data area.