CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
In recent years, information and its transformation into knowledge have become crucial, as more and more data are generated in real-world situations. This growth is changing how services are provided, because predictive analytics and other advanced methods are now used to extract value from such data, with the emphasis on the methods rather than on any particular size of data set. Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using it to make predictions or decisions, rather than following strictly static program instructions. Machine learning has become one of the mainstays of information technology and, with that, a rather central, albeit usually hidden, part of our lives. With ever-increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress.
With this rapid growth, several difficult real-world machine learning problems have emerged. Many of these problems are characterized by imbalanced learning data, where at least one class is under-represented relative to the others. Examples include (but are not limited to) fraud and intrusion detection, medical diagnosis and monitoring, bioinformatics, and text categorization. The imbalanced learning problem has drawn a significant amount of interest from academia, industry, and government funding agencies. The fundamental issue is the ability of imbalanced data to significantly compromise the performance of most standard learning algorithms. Most standard algorithms assume or expect balanced class distributions or equal misclassification costs. Therefore, when presented with complex imbalanced data sets, these algorithms fail to properly represent the distributive characteristics of the data and consequently provide unfavorable accuracies across the classes of the data. When translated to real-world domains, the imbalanced learning problem is a recurring problem of high importance with wide-ranging implications, warranting increasing exploration.
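The effect described above can be illustrated with a minimal sketch. The data below are hypothetical (990 legitimate and 10 fraudulent transactions), and the helper names are ours, not part of any library. A "classifier" that always predicts the majority class reaches a high overall accuracy while detecting none of the minority class.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label (a classifier that always predicts it)."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of the instances of class `positive` that were predicted as such."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical two-class data set: 990 majority and 10 minority instances.
y_true = ["legit"] * 990 + ["fraud"] * 10
guess = majority_baseline(y_true)
y_pred = [guess] * len(y_true)

print(accuracy(y_true, y_pred))         # 0.99
print(recall(y_true, y_pred, "fraud"))  # 0.0
```

The overall accuracy of 0.99 hides the fact that every minority instance is misclassified, which is why accuracy alone is a poor guide on imbalanced data.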
On this basis, this project seeks to provide a detailed comparative study of the current understanding of the imbalanced learning problem and of the state-of-the-art solutions created to address it. It covers ensemble methods for addressing class imbalance and the assessment metrics for imbalanced learning, and it highlights the major opportunities and challenges for learning from imbalanced data.
1.2 Statement of the Problem
In recent years, the problem of imbalanced data has been recognized as a crucial problem in data mining and machine learning. The problem occurs when there are significantly fewer training instances of one class than of another, and it is often associated with asymmetric costs of misclassifying elements of the different classes. Additionally, the distribution of the test data may differ from that of the learning sample, and the true misclassification costs may be unknown at learning time. The difficulty with class imbalance is that standard learners are often biased towards the majority class, because these classifiers attempt to reduce global quantities such as the error rate without taking the data distribution into consideration. Although much awareness of the issues related to data imbalance has been raised, many of the key problems remain open and are in fact encountered more often, especially with massive data sets. In this project, we concentrate on the two-class case.
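To make the two-class case concrete, the following sketch computes the imbalance ratio (majority count divided by minority count) and compares two hypothetical classifiers under the plain error rate and under asymmetric misclassification costs. The class counts, the confusion counts, and the cost values (20 for a missed minority instance, 1 for a false alarm) are illustrative assumptions, not results from this study.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority class size to the minority class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def error_rate(fn, fp, n):
    """Global error rate: every mistake counts the same."""
    return (fn + fp) / n

def total_cost(fn, fp, cost_fn, cost_fp):
    """Total misclassification cost with separate costs per error type."""
    return fn * cost_fn + fp * cost_fp

# Hypothetical data set: 950 negative (majority) and 50 positive (minority).
labels = [0] * 950 + [1] * 50
n = len(labels)
print(imbalance_ratio(labels))  # 19.0

# Classifier A always predicts the majority class: misses all 50 positives.
# Classifier B finds 45 of the 50 positives but raises 80 false alarms.
a = {"fn": 50, "fp": 0}
b = {"fn": 5, "fp": 80}

print(error_rate(a["fn"], a["fp"], n), error_rate(b["fn"], b["fp"], n))
print(total_cost(a["fn"], a["fp"], 20, 1), total_cost(b["fn"], b["fp"], 20, 1))
```

Under the error rate, classifier A looks better; once missing a minority instance is assumed to cost more than a false alarm, classifier B is clearly preferable. This is the bias towards the majority class that a learner minimizing the error rate inherits.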
1.3 Objectives of the study
In this project, we seek to:
Provide a survey of the current understanding of the imbalanced learning problem and the state-of-the-art solutions created to address this problem.
Recognize and state crucial real-world problems with imbalanced data.
Provide strategies for dealing with imbalanced data sets.
Provide a critical review of the innovative research developments targeting the imbalanced learning problem.
Stimulate future research in this field by highlighting the major opportunities and challenges for learning from imbalanced data.
Comparatively study and determine the most efficient algorithm for learning from imbalanced data.
Present suggested methods for comparing and evaluating the performance of different imbalanced learning algorithms.
1.4 Significance of the study
With the constant expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, the Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data in order to support decision-making processes. Given the great influx of attention devoted to the imbalanced learning problem and the high level of activity in this field, remaining knowledgeable of all current developments can be an overwhelming task. Because the field is relatively young and is expanding rapidly, consistent assessments of past and current work, together with projections for future research, are essential for its long-term development. In this work, we analyze the imbalanced learning problem, which is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews, and we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potentially important research directions, for learning from imbalanced data.
1.5 Scope of the study
The study is restricted to the nature of imbalanced data and provides a comparative study of schemes for learning from such data. In broad terms, the scope of the study covers the following areas:
Machine learning algorithmic approaches to learning from imbalanced data, such as decision trees (the Naïve Bayes Tree) and artificial neural networks (the Multilayer Perceptron).
Machine learning performance evaluation measures.
Performance and monitoring measures used in evaluating imbalanced data learning.
Creation of models to be used for learning from imbalanced data.
1.6 Project Management
The work involved in developing this project has been broken down into several steps and allocated accordingly. The details are given in Appendix A.
1.7 Organization of the study
This study consists of the following chapters:
Chapter 1 – Introduction
This chapter introduces the entire report. It presents the historical background of the study, the rationale behind the work, and imbalanced data and learning from such data, and it gives the problem definition and the aims and objectives of the study.
Chapter 2 – Literature Review
In this chapter, a detailed review of related studies is carried out in order to establish the theoretical framework upon which this research is built.
Chapter 3 – Research Methodology and Application
In this chapter, we consider a few methodologies used in the analysis of imbalanced data, focusing on algorithms for learning from imbalanced data. The data sets are taken from the KEEL repository and have different imbalance ratios (IRs).
Chapter 4 – Implementation and Evaluation
In this chapter, two machine learning algorithms, the Naïve Bayes Tree and the Multilayer Perceptron, are implemented and evaluated on imbalanced data sets, using evaluation metrics suited to the imbalanced classification problem. We present the experimental study carried out on the behavior of these algorithms and examine the use of non-parametric tests for statistical comparison of the classifiers' results. We also analyze the behavior of the best combination of components under different IR levels.
Chapter 5 – Discussion, Evaluation and Conclusion
This chapter gives a detailed summary of the results, draws conclusions and makes recommendations based on the findings, and offers suggestions for future research by other investigators working in this or a related field.
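As an illustration of the kind of evaluation metrics referred to in the Chapter 4 outline, the sketch below derives several measures that are commonly used for imbalanced two-class problems from the four cells of a confusion matrix. Whether the later chapters use exactly this set of measures is an assumption, and the counts in the example are hypothetical.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Compute common two-class metrics from confusion-matrix counts.

    tp/fn refer to the minority (positive) class, fp/tn to the majority class.
    """
    tpr = tp / (tp + fn)            # sensitivity (recall on the minority class)
    tnr = tn / (tn + fp)            # specificity (recall on the majority class)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (2 * precision * tpr / (precision + tpr)
                 if (precision + tpr) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tpr,
        "specificity": tnr,
        "g_mean": math.sqrt(tpr * tnr),   # geometric mean of the two recalls
        "f_measure": f_measure,
    }

# Hypothetical result on 50 positive and 950 negative test instances.
for name, value in imbalance_metrics(tp=40, fn=10, fp=60, tn=890).items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the geometric mean drops to zero whenever one of the two classes is never recognized, so it penalizes a classifier that ignores the minority class.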
1.8 Operational Definition
1.8.1 Concepts
Algorithm – a finite, step-by-step sequence of well-defined instructions used to solve a problem on a computer; a computational procedure that takes values as input and produces values as output in order to solve a well-defined computational problem.
Data – numbers, characters, images, or other recordings in a form that can be assessed by a human or (especially) input into a computer, stored and processed there, or transmitted on some digital channel.
Data Mining – an analytic process designed to explore data (usually large amounts of data, typically business or market related, also known as “big data”) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
Imbalanced Dataset – a dataset in which the classification categories (classes) are not approximately equally represented.
Learning – the act of acquiring new knowledge, behaviors, skills, values, or preferences, or of modifying and reinforcing existing ones; it may involve synthesizing different types of information.
Machine – an apparatus using mechanical power and having several parts, each with a definite function and together performing a particular task.
Machine Learning – a scientific discipline that explores the construction and study of algorithms that can learn from data and make decisions on unseen data based on what they have learned from previous data.
Mining – a term describing the process of finding a small set of precious patterns in a great deal of raw material (big data).
Comparative – a comparative study is a research methodology that aims to make comparisons across different subjects, in this case the algorithms used in learning from imbalanced data.
Attribute – a piece of information which determines the properties of a field or tag in a database or of a string of characters in a display.
1.8.2 Technology
Decision Tree – a predictive model which maps observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning.
Cross Validation – cross validation, sometimes called rotation estimation, is a model validation technique for assessing how well the results of a statistical analysis will generalize to an independent data set.
Artificial Neural Network – a family of statistical learning algorithms inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are generally presented as systems of interconnected “neurons” which can compute values from inputs and are capable of machine learning as well as pattern recognition. What makes them interesting is their adaptive nature.
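The cross validation procedure defined above can be sketched as follows. The sketch uses a stratified variant, in which each fold keeps roughly the same class proportions as the whole data set; stratification is our assumption here (it is a common choice for imbalanced data) and is not stated in the definition. The function names are hypothetical, and no shuffling is done so that the result is deterministic.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign every instance index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def cross_validation_splits(labels, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    for test_fold in stratified_folds(labels, k):
        held_out = set(test_fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test_fold)

# Hypothetical data set: 90 majority (0) and 10 minority (1) instances.
labels = [0] * 90 + [1] * 10
for train, test in cross_validation_splits(labels, k=5):
    minority_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), minority_in_test)
```

Each instance is used for testing exactly once and for training in the remaining folds. With plain (unstratified) splitting, a fold could end up with no minority instances at all, which would make minority-class measures undefined for that fold.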
1.8.3 Tools
KEEL (Knowledge Extraction based on Evolutionary Learning) – an open-source (GPLv3) Java software tool that enables the user to assess the behavior of evolutionary learning and soft computing based techniques for different kinds of data mining problems: regression, classification, clustering, pattern mining, and so on.
Datasets – a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set.
WEKA (Waikato Environment for Knowledge Analysis) – WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
1.9 Conclusion
Machine learning is growing and expanding at a very rapid pace. Its importance and rapid growth come from combining sophisticated pattern recognition with intelligent, self-modifying, and self-learning decision making, which has brought added flexibility and power to computing. This chapter has given an overview of, and a preliminary introduction to, the study of learning from imbalanced data sets and the evaluation of the algorithms used in the learning process.