Machine Learning in Astronomy:

So our first question is: what is Machine Learning?
Well according to Wikipedia, Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
In simple words: machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.
Machine Learning is mostly divided into three main categories:
1) Supervised learning.
2) Unsupervised learning.
3) Reinforcement learning.

Now the next question is: what is a quasar, and how is it different from a star?

Introduction: A quasar is a quasi-stellar radio source; the first quasars were discovered around 1960. They emit electromagnetic radiation in frequency bands ranging from radio waves through infrared, visible and ultraviolet light to X-rays and gamma rays. They are billions of light-years away from the Earth, so the radiation from a quasar can take billions of years to reach us and may carry signatures of the early stages of the universe. Because they are so far away and appear point-like, they are very difficult to distinguish from ordinary stars through a telescope.
>>So we use automated or semi-automated techniques to distinguish quasars from stars.<<


**Note: Most of the information here is taken from the paper published by Mohammed Viquar and his team.**
In this case study we are going to use a few supervised machine learning algorithms, as follows:
1) Support vector machines (SVM).
2) SVM-K-nearest-neighbour hybrid (SVM-KNN).
3) AdaBoost.
4) Asymmetric AdaBoost.
We are using the Sloan Digital Sky Survey (SDSS) database to train all the models.
Procedure review:
>>We used SVM for classifying stars, galaxies, and quasars.
>>We used an SVM-KNN method which is a combination of SVM and KNN. SVM-KNN improves the performance of SVM by using KNN to better classify the samples which occur near the boundary (hyperplane) constructed by the SVM learner.
>>Decision tree classifiers are also used for star-galaxy separation.
>>We performed artificial balancing of the data to counter the effects of class bias.
>>We used asymmetric AdaBoost, which is a method designed to handle imbalanced datasets.
Where did we gather the data?
The Sloan Digital Sky Survey (SDSS) is the most extensive redshift survey of the Universe; its data collection began in 1998. The data comprise complete imaging of the northern Galactic cap. For this case study, about 287 million objects are registered over 9,583 deg², and more than 1.27 million spectra are available from the survey in the u, g, r, i and z bands. DR7, released in 2009, covers 11,663 deg² of the sky and marked the end of the SDSS-II phase. This catalogue contains the same five bands of data as DR6, but for 357 million distinct objects. All of the data released by SDSS are made available over the Internet. The SkyServer provides interfaces for querying and obtaining data as per a user's needs; using these interfaces, spectral data as well as images can be obtained. The data are freely available for non-commercial use, without written permission. From this data, we make use of the quasar and star classes.
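If you want to assemble a similar sample yourself, here is a minimal sketch using the astroquery package. The table and column names in the SQL (SpecObj, PhotoObj, psfMag_*) follow the SDSS schema but should be checked against the data release you actually use, and the row limit is an arbitrary illustrative choice.

```python
# Rough sketch: pull the five photometric colours for spectroscopically
# confirmed stars and quasars from the SDSS SkyServer via astroquery.
# Table/column names and the TOP limit are illustrative assumptions.
from astroquery.sdss import SDSS

query = """
SELECT TOP 10000
    p.psfMag_u - p.psfMag_g AS u_g,
    p.psfMag_g - p.psfMag_r AS g_r,
    p.psfMag_r - p.psfMag_i AS r_i,
    p.psfMag_i - p.psfMag_z AS i_z,
    p.psfMag_z AS z,
    s.class
FROM SpecObj AS s
JOIN PhotoObj AS p ON s.bestObjID = p.objID
WHERE s.class IN ('STAR', 'QSO')
"""

table = SDSS.query_sql(query)   # returns an astropy Table
print(table[:5])                # peek at the first few rows
```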
Methods Used:
1) Artificial balancing of the data needs to be performed so that the classes present in the training dataset do not present a bias to the learning algorithm. The two classes occur in a ratio of about 7:1, and hence they are not equally represented to the classifier: in quasar-star classification, the stars' class dominates the quasars' class. This increases the influence of the stars' class on the learning algorithm and inflates the apparent accuracy of classification. In artificial balancing, an equal number of samples from both classes is taken for training the classifier, which eliminates the class bias and the data imbalance.
Without such balancing, the voting for the stars' class was found to be 99.41%, which is higher than the voting for the quasars' class at 98.19%; accuracy figures claimed under these conditions are doubtful because data imbalance and class bias are prevalent.
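To make this step concrete, below is a minimal Python sketch (not the paper's code) of artificial balancing by randomly under-sampling the majority class; `balance_classes` is just a hypothetical helper name.

```python
# Minimal sketch of artificial balancing: randomly under-sample the majority
# class so that both classes are presented to the learner in equal numbers.
# X and y are assumed to be NumPy arrays (features and class labels).
import numpy as np

def balance_classes(X, y, seed=0):
    """Randomly under-sample the majority class to the minority class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]
```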
2)A separability test is used to determine the nature of the separability of data. In particular, if the data are not linearly separable, certain classifiers may not work well or may not be appropriate.
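The exact separability test is not reproduced here; as a crude illustrative stand-in, one can check whether a linear SVM with a very large penalty fits the training data perfectly, which suggests the classes are (nearly) linearly separable in the chosen features.

```python
# Crude linear-separability check (an illustrative assumption, not the
# paper's test): a hard-margin-like linear SVM that reaches 100% training
# accuracy indicates the classes are close to linearly separable.
from sklearn.svm import LinearSVC

def looks_linearly_separable(X, y):
    clf = LinearSVC(C=1e6, max_iter=100_000)
    clf.fit(X, y)
    return clf.score(X, y) == 1.0
```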

3) An SVM classifier requires the data to be separable so that it is possible to construct a hyperplane separating the two classes. Consider a set of n samples from the dataset and two classes, C1 and C2, corresponding to quasars and stars (or vice versa).
Let X be the input matrix with n rows corresponding to the n data points, and let y be an array with n elements, where the j-th element of y is the class label of the j-th row of X. From the n points, a pair of points, one taken from each class, is used to initialize the support vector set S, and points are then added to S one at a time. The positions of samples from both classes are determined in a five-dimensional feature space (the five dimensions being u−g, g−r, r−i, i−z and z); any points that lie on the wrong side of the hyperplane, given the class they belong to, are added to a set V such that S = S ∪ V. If any coefficients become negative due to the addition of V to S, those points are pruned.
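For comparison, here is a minimal sketch of a standard SVM on the same five colour features using scikit-learn's SVC, rather than the incremental support-vector construction described above; the synthetic stand-in data, RBF kernel and 70/30 split are assumptions for illustration.

```python
# Minimal SVM sketch on five colour-like features (u-g, g-r, r-i, i-z, z).
# make_classification is only a stand-in for the real SDSS colours; the
# class weights roughly mimic the ~7:1 imbalance mentioned above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=5, n_informative=4,
                           n_redundant=0, weights=[0.875, 0.125],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))
```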
4) The K-nearest-neighbour (KNN) classifier is a simple method for algorithmic classification based on the geometric similarity of the K closest training samples in the feature space. When a previously unobserved sample is fed to the KNN classifier, it searches the feature space for the K samples that are closest to the test sample. These K closest samples may belong to different classes; the learning algorithm selects the class to which the majority of them belong and assigns the test sample to that class. The parameter K needs to be supplied as an input and often depends on the data being explored; in practice, a value of K between 7 and 11 usually works well.
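The SVM-KNN hybrid mentioned in the procedure review can be sketched as follows: predict with the SVM, then hand only the samples that fall near its decision boundary to a KNN classifier trained on the same data. The margin threshold of 0.5 and K = 9 are illustrative choices, not values from the paper.

```python
# Minimal sketch of the SVM-KNN idea: let KNN re-classify only the samples
# that lie close to the SVM's decision boundary.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def svm_knn_predict(X_train, y_train, X_test, margin=0.5, k=9):
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

    y_pred = svm.predict(X_test)
    near_boundary = np.abs(svm.decision_function(X_test)) < margin
    # Re-classify only the ambiguous, near-boundary samples with KNN.
    if near_boundary.any():
        y_pred[near_boundary] = knn.predict(X_test[near_boundary])
    return y_pred
```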
5) Adaptive Boosting, or AdaBoost, is a general ensemble learning approach that combines the results of multiple weak learners to make a strong prediction. AdaBoost works in multiple rounds by incrementally training weak learners, where each successive weak learner concentrates on the samples misclassified by the previous learners, with increased weights on those misclassified samples. AdaBoost can be used with any learning algorithm, but the most popular weak learners are short decision trees or decision stumps. In the current study, the weak learners over which AdaBoost was applied are decision trees with one level.
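A minimal sketch of AdaBoost over decision stumps, reusing the training and test splits from the SVM sketch above; the number of estimators is an assumption. (In scikit-learn versions before 1.2 the keyword is `base_estimator` rather than `estimator`.)

```python
# AdaBoost over one-level decision trees (decision stumps).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

stump = DecisionTreeClassifier(max_depth=1)        # a decision stump
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```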
6) Asymmetric AdaBoost: handling the data imbalance problem.
The asymmetric AdaBoost algorithm incorporates initial misclassification costs in order to make AdaBoost more sensitive to class biases.
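One simple way to approximate this idea, assuming scikit-learn, is to start boosting from unequal per-class sample weights so that misclassifying a quasar costs more than misclassifying a star. The 0.10/1.0 weights mirror the values quoted in the results below, but passing them as initial sample weights is only a rough approximation of the paper's algorithm.

```python
# Rough sketch of asymmetric AdaBoost: unequal initial per-class weights.
# Assumed label convention: 1 = quasar, 0 = star.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

init_w = np.where(y_train == 1, 1.0, 0.10)   # down-weight the stars' class
init_w = init_w / init_w.sum()

asym_ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=1000, random_state=42)
asym_ada.fit(X_train, y_train, sample_weight=init_w)
print("Asymmetric AdaBoost accuracy:", asym_ada.score(X_test, y_test))
```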
Results Obtained:
a) Results Obtained Using the Unbalanced Dataset:
>>The ROC curves of SVM, SVM-KNN, and AdaBoost on the unbalanced dataset are shown in Figures 2(a), 2(b), and 2(c) respectively. The accuracies of these methods are 98.6%, 98.86%, and 97.2% respectively, as shown in Table 1. Notably, the difference between the sensitivity and the specificity of SVM and of SVM-KNN is approximately 9%.
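For reference, the metrics quoted here relate to the confusion matrix as in the small helper below; treating quasars as the positive class is an assumption, and `report_metrics` is just an illustrative name.

```python
# Sensitivity (true-positive rate), specificity (true-negative rate) and
# F-score computed from the binary confusion matrix.
from sklearn.metrics import confusion_matrix, f1_score

def report_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # recall for the positive (quasar) class
    specificity = tn / (tn + fp)   # recall for the negative (star) class
    print(f"sensitivity={sensitivity:.4f}  specificity={specificity:.4f}  "
          f"F-score={f1_score(y_true, y_pred):.4f}")
```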

b) Results Obtained Using the Artificially Balanced Dataset:
>>The ROC curves of SVM, SVM-KNN, and AdaBoost after artificial balancing are shown in Figures 2(d), 2(e), and 2(f) respectively. The accuracies of these methods are 96.92%, 97.87%, and 96.54% respectively, as shown in Table 2. The difference between the sensitivity and the specificity of all the models is negligible.
In the case of AdaBoost, both the sensitivity and the specificity are about 5% higher than the values attained with the unbalanced dataset. In this case there is no need to report the F-score, since it is a metric meant for unbalanced or biased datasets. The method of artificial balancing does well to reduce the effects of bias, as seen from the small difference between sensitivity and specificity.

c) Results of the Asymmetric AdaBoost Classifier:
>>The entire dataset was split into training and testing sets. Weights were assigned to both classes: the stars' class was assigned a weight of 0.10, while the weight of the quasars' class was kept constant and equal to 1. The mean accuracy of classification was 99.9995% after running the asymmetric AdaBoost classifier for 1000 iterations. The ROC curve of this method is shown in Figure 2(g). Simply put, an appropriate weight initialization arrives at the best weight distribution for a given number of estimators faster than equal initial weights do. Asymmetric AdaBoost tends to classify positive samples more carefully than negative samples, as it corrects their misclassification. Its precision and recall values are found to be equal to 1, and the F-score is also equal to 1, as shown in Table 1. In the design of any experiment there is an inherent trade-off between good results and execution time; using asymmetric AdaBoost improves the execution time while preserving accuracy.



Conclusion:
Asymmetric AdaBoost offers greater computational efficiency than SVM. Given its high accuracy, fast execution, and easy tuning of parameters in contrast to SVM, asymmetric AdaBoost is a good choice of classifier, as seen from Tables 1 and 2. These classifiers can be used to classify multi-wavelength astronomical data sources and to pre-select quasar candidates for large surveys. The case study remains firmly focused on scientific correctness and algorithmic relevance.