Processing and analysis of large amounts of astronomical data on Microsoft Azure HDInsight

S.V. Gerasimov, A.V. Mesheryakov


Machine learning provides effective techniques for accurately measuring photometric redshifts (photo-z) of extragalactic astronomical objects, which allows researchers to build maps of the Large Scale Structure of the Universe. These maps are widely used in various fundamental research fields of extragalactic astrophysics and observational cosmology. However, applying these models to the huge number of objects in astronomical catalogs that contain broad-band photometry over the whole sky is a challenging task that requires significant computational resources. In this article we test the horizontally scalable Apache Spark framework, deployed in the Microsoft Azure cloud, on the task of photo-z measurement for galaxies from the large photometric dataset of the Sloan Digital Sky Survey.

Full Text:

PDF (Russian)





ISSN: 2307-8162