Machine Learning in Astronomy:

So our first question is: what is Machine Learning?
Well according to Wikipedia, Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
In simple words: machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.
Machine Learning is mostly divided into three main categories:
1) Supervised learning.
2) Unsupervised learning.
3) Reinforcement learning.

Now the next question is: what is a quasar, and how is it different from a star?

Introduction: A quasar is a quasi-stellar radio source; the first quasars were discovered around 1960. They emit electromagnetic radiation in frequency bands ranging from radio waves through infrared, visible and ultraviolet light to X-rays and gamma rays. They are billions of light-years away from the Earth, so the radiation from a quasar can take billions of years to reach us and may carry signatures of the early stages of the universe. Because they are so far away and appear point-like, they are very difficult to distinguish from ordinary stars through a telescope.
>>So we use automated or semi-automated techniques to distinguish quasars from stars.<<


**Note: Most of the information here is taken from the paper published by Mohammed Viquar and his team.**
In this case study we are going to use a few supervised machine learning algorithms, as follows:
1) Support vector machines (SVM).
2) SVM-K-nearest-neighbour hybrid (SVM-KNN).
3) AdaBoost.
4) Asymmetric AdaBoost.
We are using the Sloan Digital Sky Survey (SDSS) database to train all the models.
Procedure review:
>>We used SVM for classifying stars, galaxies, and quasars.
>>We used an SVM-KNN method which is a combination of SVM and KNN. SVM-KNN improves the performance of SVM by using KNN to better classify the samples which occur near the boundary (hyperplane) constructed by the SVM learner.
>>Decision tree classifiers are also used for star-galaxy separation.
>>We performed artificial balancing of the data to counter the effects of class bias.
>>We used asymmetric AdaBoost, which is a method designed to handle imbalanced datasets.
Where did we gather the data?
The Sloan Digital Sky Survey (SDSS) is the most extensive redshift survey of the Universe; its data collection began in 1998. The data comprise complete imaging of the northern Galactic cap. For this case study, about 287 million objects are registered over 9,583 deg², and more than 1.27 million spectra are available from the survey in the u, g, r, i and z bands. DR7, released in 2009, covers 11,663 deg² of the sky and marked the end of the SDSS-II phase. This catalogue contains the same five bands of data as DR6, but for 357 million distinct objects. All of the data released by SDSS are made available over the Internet. The SkyServer provides interfaces for querying and obtaining data as per a user's needs; using these interfaces, spectral data as well as images can be obtained. The data are freely available for non-commercial use, without written permission. From this data, we make use of the quasar and star classes.
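If you want to assemble a similar sample yourself, here is a minimal sketch using the astroquery package. The table and column names in the SQL (SpecObj, PhotoObj, psfMag_*) follow the SDSS schema but should be checked against the data release you actually use, and the row limit is an arbitrary illustrative choice.

```python
# Rough sketch: pull the five photometric colours for spectroscopically
# confirmed stars and quasars from the SDSS SkyServer via astroquery.
# Table/column names and the TOP limit are illustrative assumptions.
from astroquery.sdss import SDSS

query = """
SELECT TOP 10000
    p.psfMag_u - p.psfMag_g AS u_g,
    p.psfMag_g - p.psfMag_r AS g_r,
    p.psfMag_r - p.psfMag_i AS r_i,
    p.psfMag_i - p.psfMag_z AS i_z,
    p.psfMag_z AS z,
    s.class
FROM SpecObj AS s
JOIN PhotoObj AS p ON s.bestObjID = p.objID
WHERE s.class IN ('STAR', 'QSO')
"""

table = SDSS.query_sql(query)   # returns an astropy Table
print(table[:5])                # peek at the first few rows
```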
Methods Used:
1) Artificial balancing of the data needs to be performed so that the classes present in the training dataset do not present a bias to the learning algorithm. The two classes occur in a ratio of about 7:1, and hence they are not equally represented to the classifier: in quasar-star classification, the stars' class dominates the quasars' class. This increases the influence of the stars' class on the learning algorithm and inflates the apparent accuracy of classification. In artificial balancing, an equal number of samples from both classes is taken for training the classifier, which eliminates the class bias and the data imbalance.
Without such balancing, the voting for the stars' class was found to be 99.41%, which is higher than the voting for the quasars' class at 98.19%; accuracy figures claimed under these conditions are doubtful because data imbalance and class bias are prevalent.
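To make this step concrete, below is a minimal Python sketch (not the paper's code) of artificial balancing by randomly under-sampling the majority class; `balance_classes` is just a hypothetical helper name.

```python
# Minimal sketch of artificial balancing: randomly under-sample the majority
# class so that both classes are presented to the learner in equal numbers.
# X and y are assumed to be NumPy arrays (features and class labels).
import numpy as np

def balance_classes(X, y, seed=0):
    """Randomly under-sample the majority class to the minority class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]
```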
2)A separability test is used to determine the nature of the separability of data. In particular, if the data are not linearly separable, certain classifiers may not work well or may not be appropriate.
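The exact separability test is not reproduced here; as a crude illustrative stand-in, one can check whether a linear SVM with a very large penalty fits the training data perfectly, which suggests the classes are (nearly) linearly separable in the chosen features.

```python
# Crude linear-separability check (an illustrative assumption, not the
# paper's test): a hard-margin-like linear SVM that reaches 100% training
# accuracy indicates the classes are close to linearly separable.
from sklearn.svm import LinearSVC

def looks_linearly_separable(X, y):
    clf = LinearSVC(C=1e6, max_iter=100_000)
    clf.fit(X, y)
    return clf.score(X, y) == 1.0
```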

3) An SVM classifier requires the data to be separable so that it is possible to construct a hyperplane separating the two classes. Consider a set of n samples from the dataset and two classes, C1 and C2, corresponding to quasars and stars (or vice versa).
Let X be the input matrix with n rows corresponding to the n data points, and let y be an array with n elements, where the j-th element of y is the class label of the j-th row of X. From the n points, a pair of points, one taken from each class, is used to initialize the support vector set S, and points are then added to S one at a time. The positions of samples from both classes are determined in a five-dimensional feature space (the five dimensions being u−g, g−r, r−i, i−z and z); any points that lie on the wrong side of the hyperplane, given the class they belong to, are added to a set V such that S = S ∪ V. If any coefficients become negative due to the addition of V to S, those points are pruned.
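For comparison, here is a minimal sketch of a standard SVM on the same five colour features using scikit-learn's SVC, rather than the incremental support-vector construction described above; the synthetic stand-in data, RBF kernel and 70/30 split are assumptions for illustration.

```python
# Minimal SVM sketch on five colour-like features (u-g, g-r, r-i, i-z, z).
# make_classification is only a stand-in for the real SDSS colours; the
# class weights roughly mimic the ~7:1 imbalance mentioned above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=5, n_informative=4,
                           n_redundant=0, weights=[0.875, 0.125],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))
```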
4) The K-nearest-neighbour (KNN) classifier is a simple method for algorithmic classification based on the geometric similarity of the K closest training samples in the feature space. When a previously unobserved sample is fed to the KNN classifier, it searches the feature space for the K samples that are closest to the test sample. These K closest samples may belong to different classes; the learning algorithm selects the class to which the majority of them belong and assigns the test sample to that class. The parameter K needs to be supplied as an input and often depends on the data being explored; in practice, a value of K between 7 and 11 usually works well.
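The SVM-KNN hybrid mentioned in the procedure review can be sketched as follows: predict with the SVM, then hand only the samples that fall near its decision boundary to a KNN classifier trained on the same data. The margin threshold of 0.5 and K = 9 are illustrative choices, not values from the paper.

```python
# Minimal sketch of the SVM-KNN idea: let KNN re-classify only the samples
# that lie close to the SVM's decision boundary.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def svm_knn_predict(X_train, y_train, X_test, margin=0.5, k=9):
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

    y_pred = svm.predict(X_test)
    near_boundary = np.abs(svm.decision_function(X_test)) < margin
    # Re-classify only the ambiguous, near-boundary samples with KNN.
    if near_boundary.any():
        y_pred[near_boundary] = knn.predict(X_test[near_boundary])
    return y_pred
```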
5) Adaptive Boosting, or AdaBoost, is a general ensemble learning approach that combines the results of multiple weak learners to make a strong prediction. AdaBoost works in multiple rounds by incrementally training weak learners, where each successive weak learner concentrates on the samples misclassified by the previous learners, with increased weights on those misclassified samples. AdaBoost can be used with any learning algorithm, but the most popular weak learners are short decision trees or decision stumps. In the current study, the weak learners over which AdaBoost was applied are decision trees with one level.
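A minimal sketch of AdaBoost over decision stumps, reusing the training and test splits from the SVM sketch above; the number of estimators is an assumption. (In scikit-learn versions before 1.2 the keyword is `base_estimator` rather than `estimator`.)

```python
# AdaBoost over one-level decision trees (decision stumps).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

stump = DecisionTreeClassifier(max_depth=1)        # a decision stump
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```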
6) Asymmetric AdaBoost: handling the data imbalance problem.
The asymmetric AdaBoost algorithm incorporates initial misclassification costs in order to make AdaBoost more sensitive to class biases.
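One simple way to approximate this idea, assuming scikit-learn, is to start boosting from unequal per-class sample weights so that misclassifying a quasar costs more than misclassifying a star. The 0.10/1.0 weights mirror the values quoted in the results below, but passing them as initial sample weights is only a rough approximation of the paper's algorithm.

```python
# Rough sketch of asymmetric AdaBoost: unequal initial per-class weights.
# Assumed label convention: 1 = quasar, 0 = star.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

init_w = np.where(y_train == 1, 1.0, 0.10)   # down-weight the stars' class
init_w = init_w / init_w.sum()

asym_ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=1000, random_state=42)
asym_ada.fit(X_train, y_train, sample_weight=init_w)
print("Asymmetric AdaBoost accuracy:", asym_ada.score(X_test, y_test))
```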
Results Obtained:
a) Results Obtained Using the Unbalanced Dataset:
>>The ROC curves of SVM, SVM-KNN, and AdaBoost on the unbalanced dataset are shown in Figures 2(a), 2(b), and 2(c) respectively. The accuracies of these methods are 98.6%, 98.86%, and 97.2% respectively, as shown in Table 1. Notably, the difference between the sensitivity and the specificity of SVM and of SVM-KNN is approximately 9%.
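For reference, the metrics quoted here relate to the confusion matrix as in the small helper below; treating quasars as the positive class is an assumption, and `report_metrics` is just an illustrative name.

```python
# Sensitivity (true-positive rate), specificity (true-negative rate) and
# F-score computed from the binary confusion matrix.
from sklearn.metrics import confusion_matrix, f1_score

def report_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # recall for the positive (quasar) class
    specificity = tn / (tn + fp)   # recall for the negative (star) class
    print(f"sensitivity={sensitivity:.4f}  specificity={specificity:.4f}  "
          f"F-score={f1_score(y_true, y_pred):.4f}")
```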

b) Results Obtained Using the Artificially Balanced Dataset:
>>The ROC curves of SVM, SVM-KNN, and AdaBoost after artificial balancing are shown in Figures 2(d), 2(e), and 2(f) respectively. The accuracies of these methods are 96.92%, 97.87%, and 96.54% respectively, as shown in Table 2. The difference between the sensitivity and the specificity of all the models is negligible.
In the case of AdaBoost, both the sensitivity and the specificity are about 5% higher than the values attained with the unbalanced dataset. In this case there is no need to report the F-score, since it is a metric meant for unbalanced or biased datasets. The method of artificial balancing does well to reduce the effects of bias, as seen from the small difference between sensitivity and specificity.

c) Results of the Asymmetric AdaBoost Classifier:
>>The entire dataset was split into training and testing sets. Weights were assigned to both classes: the stars' class was assigned a weight of 0.10, while the weight of the quasars' class was kept constant and equal to 1. The mean accuracy of classification was 99.9995% after running the asymmetric AdaBoost classifier for 1000 iterations. The ROC curve of this method is shown in Figure 2(g). Simply put, an appropriate weight initialization arrives at the best weight distribution for a given number of estimators faster than equal initial weights do. Asymmetric AdaBoost tends to classify positive samples more carefully than negative samples, as it corrects their misclassification. Its precision and recall values are found to be equal to 1, and the F-score is also equal to 1, as shown in Table 1. In the design of any experiment there is an inherent trade-off between good results and execution time; using asymmetric AdaBoost improves the execution time while preserving accuracy.



Conclusion:
Asymmetric AdaBoost offers greater computational efficiency than SVM. Given its high accuracy, fast execution, and easy tuning of parameters in contrast to SVM, asymmetric AdaBoost is a good choice of classifier, as seen from Tables 1 and 2. These classifiers can be used to classify multi-wavelength astronomical data sources and to pre-select quasar candidates for large surveys. The case study remains firmly focused on scientific correctness and algorithmic relevance.