Petascale : View Position

Project Title	Applying Deep Learning on Time Series Astronomical Data
Summary	In this project, we propose to apply advanced machine learning techniques to time series data in order to improve classification accuracy. As a result of synoptic sky surveys, extremely large amounts of astronomical data are collected on a daily basis. It has become increasingly necessary to develop methods that can process these large data sets quickly, accurately, and efficiently. With advanced machine learning techniques, these data sets can be processed automatically, without constant manual intervention. In addition, these methods will need to work on data that is noisy and irregular. This work has the potential to revolutionize the analysis of massive time series data sets by allowing the data to be classified and processed more quickly, more specifically, and more accurately. The algorithms and methods used can also be adapted to other fields that involve large data sets and/or classification. Finally, this project will allow the student to further develop her high performance computing skills, gain experience in machine learning and classification, and work on a project that encompasses both parts of her major.
Job Description	The student’s first task will be to participate in the Blue Waters Intern training program at NCSA. Second, the student will gain an in-depth understanding of the currently available synoptic data sets, such as the Catalina all-sky survey, with nearly half a billion sources observed over hundreds of epochs. Third, the student will focus on data pre-processing and on identifying anomalies and missing data. This will include visualizing the observations and their error distributions. Fourth, the student will use standard machine learning algorithms implemented in the scikit-learn and Keras (or possibly Lasagne) toolkits to perform non-parametric regression on selected physical attributes of these time domain data sets. Fifth, based on this analysis, the student will identify algorithms that do not scale well with the data volumes, and work with the PI and others in the Laboratory for Cosmological Data Mining to accelerate these algorithms by using C++ machine learning libraries. Sixth, the student will develop and organize a working pipeline to process the data from a target survey and develop a model representation for the different types of sources that might be present in the data (e.g., variable stars, AGN, blazars, transient phenomena). Seventh, the student will explore innovative visualizations of the machine learning models and synoptic data. Finally, the student will extend this work to new machine learning models and to other synoptic survey data, in order to understand how different surveys' characteristics (e.g., cadence) affect the performance of various machine learning methods.
Conditions/Qualifications	The student needs to understand synoptic astrophysical data sets, machine learning, and supercomputing applications.
Start Date	05/16/2016
End Date	05/15/2017
Location	University of Illinois, Urbana-Champaign
Interns	Sushma Adari

Shodor

a national resource for computational science education