Detecting the Unexpected: Day One
Today was the first day of Detecting the Unexpected, a conference all about what we don’t know about. How do we find new things in data sets that are too big to look at? What new tools will we use with the next generation of surveys? As conference chair Josh Peek pointed out, these are questions that have been around for a while – at least since the big data projects of the turn of the last century:
Lovely shoutout to the “founding mothers” of astronomical data science by @jegpeek during the opening remarks of #dtu17 #womenswork pic.twitter.com/BmUVe2iNlN
— Kelle Cruz (@kellecruz) February 27, 2017
We had a great set of talks today, starting with Umaa Rebbapragada, who used machine learning to help identify transient objects in the Palomar Transient Factory. I thought this was a great slide on some of the issues that pop up when dealing with machine learning:
Rebbapragada discusses how to overcome training set and domain set having different underlying distributions #DtU17 pic.twitter.com/Jxfjw3CoOi
— Molly Peeples (@astronomolly) February 27, 2017
This was followed by Ashish Mahabal discussing some of the issues the LSST survey will face in detecting transients, and then Casey Law describing the challenges of detecting fast radio bursts. These three talks raised a few points that we would see throughout much of the day:
- Domain adaptation: Can networks trained on one data set be applied to another data set? If not, how much re-training will they require? (See the sketch after this list.)
- Labels, labels, labels: What training set do you need in order to teach your machines?
- Compressing data: How do we boil the big data streams down to the things we actually want to know?
- Errors: How do we incorporate the things that aren’t common in ML codes, like measurement errors, probability distributions, incompleteness, etc.?
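
That distribution-mismatch worry in the first bullet has a quick practical check (not something from the talks, just a generic trick sometimes called a classifier two-sample test or "adversarial validation"): train a classifier to tell the two surveys apart. If it can, your training and target sets don’t share a distribution, and re-labelling or domain adaptation is in your future. A minimal sketch with scikit-learn, using made-up stand-in features:

```python
# Sketch: check for domain shift between a labelled training survey and a new
# survey by asking a classifier to tell them apart. If it can, their feature
# distributions differ and a model trained on one may not transfer to the other.
# The feature tables below are random placeholders, not real survey data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder feature tables: rows are detections, columns could be magnitude,
# FWHM, ellipticity -- whatever the real pipeline measures.
train_features = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))   # old survey
target_features = rng.normal(loc=0.3, scale=1.2, size=(5000, 3))  # new survey

# Label each row by which survey it came from, then try to predict that label.
X = np.vstack([train_features, target_features])
y = np.concatenate([np.zeros(len(train_features), dtype=int),
                    np.ones(len(target_features), dtype=int)])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# AUC near 0.5 -> the surveys look alike; AUC near 1.0 -> serious domain shift,
# so expect to re-label and re-train before trusting the old model.
print(f"survey-vs-survey AUC: {auc:.2f}")
```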
The second half of the morning session focused on deep learning, with an overview by Minia Manteiga followed by a number of deep learning applications. Likely because it was the most visual, I found the use of deep learning to find lensing candidates by Brian Nord and Colin Jacobs to be the most striking, but the most provocative was Marc Huertas-Company and Fernando Caro fitting simulations and comparing them to data. There was already chatter about an unconference session on that.
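
For anyone curious what a CNN lens finder looks like under the hood, here is a toy sketch of the general idea: a small convolutional network scoring image cutouts as lens or not-lens. To be clear, this is not the architecture from either talk; the cutout size, band count, and layer widths are all invented for illustration.

```python
# Toy sketch of the kind of model behind CNN lens finders: a small binary
# classifier over image cutouts. NOT the speakers' actual architecture --
# cutout size, band count, and layer sizes are invented for this example.
import torch
import torch.nn as nn

class TinyLensFinder(nn.Module):
    def __init__(self, n_bands: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_bands, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, 1),                     # logit for "this is a lens"
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A batch of eight fake single-band 64x64 cutouts, just to show the shapes.
cutouts = torch.randn(8, 1, 64, 64)
probs = torch.sigmoid(TinyLensFinder()(cutouts))  # per-cutout lens probability
print(probs.shape)                                # torch.Size([8, 1])
```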
Arjun Dey’s talk was highly entertaining and informative about maximizing serendipitous discoveries, and his was also one of the first talks to introduce large data sets ready for mining. Other data sets that are available include Pan-STARRS DR1 (Armin Rest and Branimir Sesar) and the Hubble Source Catalog (Rick White). There is so much data out there, but the big challenge is organizing it in a way that makes it useful.
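
If you want to poke at those catalogs yourself, astroquery’s MAST module can run cone searches against both the Hubble Source Catalog and Pan-STARRS. A minimal sketch follows; the position and radius are arbitrary, and the exact keyword options may depend on your astroquery version.

```python
# Minimal sketch of pulling rows from the archives mentioned above via
# astroquery's MAST interface. The target position and search radius are
# arbitrary examples; available columns depend on the catalog release.
from astroquery.mast import Catalogs

position = "210.80227 54.34895"   # RA, Dec in degrees (arbitrary example)

# Cone search of the Hubble Source Catalog.
hsc = Catalogs.query_region(position, radius="0.02 deg", catalog="HSC")
print(len(hsc), "Hubble Source Catalog matches")

# Same position in the Pan-STARRS DR1 mean-object table.
ps1 = Catalogs.query_region(position, radius="0.02 deg",
                            catalog="Panstarrs", data_release="dr1",
                            table="mean")
print(len(ps1), "Pan-STARRS DR1 matches")
```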
The afternoon definitely had some great talks on outliers, including finding the 400 weirdest spectra in SDSS by Dalya Baron and DEMUD (Discovery via Eigenbasis Modeling of Uninteresting Data) by Kiri Wagstaff. And a critical point was made by Stephanie Juneau during her talk on AGN and their galaxy hosts: understanding the completeness and systematics is absolutely essential for determining the underlying relationships observed in the data.
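
DEMUD is worth a sketch because the idea is so simple to state: keep a low-rank eigenbasis model of the “uninteresting” data you have already looked at, and always go look at whatever that model reconstructs worst. The toy version below is my own simplification for illustration, not the reference implementation.

```python
# Rough sketch of the DEMUD idea: model previously-selected items with a
# low-rank eigenbasis (SVD), pick the item with the largest reconstruction
# error as the next "discovery", fold it into the model, and repeat.
# Simplified illustration only -- not Wagstaff et al.'s implementation.
import numpy as np

def demud_like_ranking(X, n_select=10, k=5):
    """Return indices of items selected in order of decreasing novelty."""
    selected = []
    remaining = list(range(len(X)))

    # Seed with the item farthest from the overall mean.
    mean = X.mean(axis=0)
    first = max(remaining, key=lambda i: np.linalg.norm(X[i] - mean))
    selected.append(first)
    remaining.remove(first)

    while remaining and len(selected) < n_select:
        S = X[selected]
        mu = S.mean(axis=0)
        # Rank-k eigenbasis of everything seen so far.
        _, _, Vt = np.linalg.svd(S - mu, full_matrices=False)
        U = Vt[:k].T                                  # (n_features, <=k)
        centered = X[remaining] - mu
        resid = centered - centered @ U @ U.T         # what the model misses
        scores = np.linalg.norm(resid, axis=1)        # reconstruction error
        pick = remaining[int(np.argmax(scores))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Toy data: one Gaussian blob plus a few shifted "weird" rows at the start.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
X[:5] += 6.0                      # planted oddballs
print(demud_like_ranking(X, n_select=5))
```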
It was a great day filled with so much detail that I’m not even coming close to capturing all of it, let alone what happened during the unconference sessions! Unfortunately, I’ve only mentioned a few highlights in very broad strokes. Fortunately, STScI records the talks and the webcasts are available here, and there are the running notes as well.