Large-scale analytics on scientific image data
Scientific discoveries are increasingly driven by the analysis of large volumes of data, and a growing share of this data is in the form of images. However, systems support for large-scale image analytics and machine learning is still scarce. This project leverages decades of DBMS research to build systems that support domain scientists working with image data, allowing them to focus on their research rather than worrying about storing, managing, and comparing the data, models, and visualizations behind it.
Comparative Evaluation of Big-Data Systems on Scientific Image Analytics Workloads, VLDB 2017
In this first investigation, we evaluate five big data systems for parallel data processing: a domain-specific DBMS for multidimensional array data (SciDB), a general-purpose cluster computing library with persistence capabilities (Spark), a traditional parallel general-purpose DBMS (Myria), and two parallel programming libraries, one general-purpose (Dask) and one domain-specific (TensorFlow). To evaluate these systems, we implement two representative end-to-end image analytics pipelines from astronomy and neuroscience.
Multilabel multiclass classification of OCT images augmented with age, gender and visual acuity data, under submission.
This work is supported in part by NSF grant AITF 1535565 and a gift from Intel.