Machine learning / Artificial intelligence will play an increasing role in software development; in the foreseeable future, there will probably be only a few software projects that can do without machine learning. According to Google/Alphabet CEO Sundar Pichai: “We will move from a mobile-first world to an AI-first world”, where the computer becomes a language assistant equipped with artificial intelligence (AI). Not every software developer or product owner will therefore need to become a certified data scientist or machine learning expert. At the very least, however, they should have a basic understanding of machine learning. This blog describes my learning experience in an online machine learning course – you’ll be part of the learning journey.
The course I’m talking about can be found on Udemy. Course title: “Machine Learning A-Z™: Hands-On Python & R In Data Science”. The list price of the course was about 400 EUR; I actually booked it for a mere 15 EUR. The course includes about 40 hours of video training and various teaching material. The course promises: teaching the basics of machine learning, building robust models for machine learning, conception and execution of powerful data analyses, and dealing with specific topics like Natural Language Processing or Deep Learning. The course definitely keeps these promises.
Here’s Part 1 and Part 2 of the Online Course:
- Machine Learning for Product Owner, IT Project Manager (Crash Course Part 1): Introduction and Data Preparation
- Machine Learning for Product Owner, IT Project Manager (Crash Course Part 2) – Regression Methods
The regression methods (compare Part 2) were concerned with predicting a numerical value. For example: the salary (dependent variable) as a function of age and qualification (independent variables). Part 3 is dedicated to those (mathematical) methods in machine learning that are used for classification / categorization. That is: a category is to be predicted. For example: “buyer” or “non-buyer”. The starting point could be a data set of socio-economic characteristics of persons who have seen an online ad. It is also known which of these target persons bought the product (“buyer”) and who did not (“non-buyer”). That makes two categories: buyers and non-buyers. Classification methods, however, are not restricted to two categories; they also work for more than two categories, e.g. market segments.
Machine Learning / Artificial Intelligence – Logistic Regression
The first method to be introduced is a linear method: Logistic Regression (also: Logit-Model). It is a method that is applied in practice to so-called linear problems.
So, what’s a linear problem? – Take a two-dimensional coordinate system with numerous data points of two different categories (blue, red). If you can place a LINE in the coordinate system in such a way that the red points are (almost completely) separated from the blue points, then you have a linear problem. In three-dimensional space, it must be possible to separate areas from each other by means of a plane.
For the introduction of logistic regression we consider a simple starting situation: There are observation points that indicate whether a product was bought by a target person or not (“buyer” vs. “non-buyer”). The age of the persons is known. If this information is plotted in a coordinate system (cf. the following figure), it becomes clear: The higher the age of a person, the more likely it is that the person will buy the product (red crosses on a dotted line): The dotted line corresponds to a purchase probability of 100% (0% = product is not bought, 100% = product is bought for sure).
Fig.: Logistic Regression based on Sigmoid Function
An S-shaped curve approximates this probability distribution: where you have the bend (or: curvature), the probability that the product will be purchased increases sharply. How can this be expressed mathematically? – With the so-called sigmoid function! “P” stands for the probability, “ln” for the natural logarithm, and there are also two coefficients (b0, b1). In short: the sigmoid function is the mathematical basis of logistic regression.
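Written out – using the two coefficients b0 and b1 mentioned above – this corresponds to the standard logit/sigmoid relationship (a textbook form, not copied from the course slides):

```latex
\ln\!\left(\frac{P}{1-P}\right) = b_0 + b_1 x
\quad\Longleftrightarrow\quad
P = \frac{1}{1 + e^{-(b_0 + b_1 x)}}
```

The left-hand form shows the “ln” mentioned above; solving it for P yields the S-shaped curve plotted in the figure.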
Let’s apply this method of Logistic Regression to a data set (which you can download once you’re registered for the course). The data refers to the scenario described above: an ad is shown to persons, and it is known whether a target person has bought the product (“1”) or not (“0”). There are two independent variables, i.e. two socio-economic characteristics: age and (estimated) salary:
Fig.: Data Set used for showcase of Classification method in Python
So, let’s implement the Python code as per the course instructions: import the relevant class from scikit-learn, create a classifier object, and apply the “fit” method to the data. Finally, you’ll get the plotted results of the classification method shown below.
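Here is a minimal sketch of those steps – my own condensed version, not the course’s exact script; the file name Social_Network_Ads.csv and the column names Age, EstimatedSalary and Purchased are assumptions based on the scenario described above:

```python
# Minimal sketch of the classification workflow with scikit-learn.
# File name and column names are assumptions based on the scenario above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the data set: age, estimated salary, purchased (0/1)
dataset = pd.read_csv("Social_Network_Ads.csv")
X = dataset[["Age", "EstimatedSalary"]].values
y = dataset["Purchased"].values

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling (age and salary are on very different scales)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Create the classifier object and apply the "fit" method
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predict the test set and check the accuracy rate
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

The later sketches in this article reuse the scaled X_train, X_test, y_train and y_test variables defined here.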
Fig.: Plotted graph based on classification with Logistic Regression
First of all, this plotted graph shows very clearly that logistic regression is a linear method: each data point corresponds to a combination of the two properties / independent variables “(estimated) salary” (Y-axis) and “age” (X-axis). The data points for the “buyer” category (“green”) are separated from the data points for the “non-buyer” category (“red”) by a straight line. Further down in this article, we’ll also look at non-linear methods; you’ll then see the difference between the plotted graph for a linear and a non-linear method very easily.
In the plotted graph (created in Python) you can also see where the actual observation points from the initial data set are located, i.e. the points that were fed into the machine learning algorithm. Green points (“buyers”) and red points (“non-buyers”) can be recognized. The prediction accuracy (“accuracy rate”) of this model is slightly below 90% – which is quite good, by the way.
Machine Learning / Artificial Intelligence – K-Nearest Neighbour
Now, let’s take a look at a non-linear method: K-Nearest Neighbour (K-NN). The principle is amazingly simple. Let us start from a two-dimensional coordinate system in which the properties (i.e. independent variables) of the data points are plotted (see the following figure). Now we enter a new data point for which we want to make a prediction (“buy” or “not buy”). We take a simple approach: first, we draw a circle around this point and … second, simply count the neighbours that lie within that circle. How many data points of type 1 (“red”)? And how many data points of type 2 (“green”)? The next step is a no-brainer …
Fig.: The idea behind the K-Nearest Neighbor method for classification
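A hedged scikit-learn sketch of this method, reusing the scaled training data from the logistic regression example above (the parameter values are common defaults, not necessarily the course’s exact settings):

```python
from sklearn.neighbors import KNeighborsClassifier

# Strictly speaking, K-NN counts the k closest points rather than the
# points inside a circle of fixed radius; here the 5 nearest neighbours
# decide the category. metric="minkowski" with p=2 is the ordinary
# Euclidean distance.
classifier = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
```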
The plotted graph for the result of this classification method looks as follows. Keep in mind: We use the same data for all classification methods that are explained in this blog post; that makes a comparison between these methods very simple.
Fig.: Plotted graph based on classification method K-Nearest Neighbor
It is clearly visible that the separation between the green area (prediction: “buyer”) and the red area is non-linear.
Machine Learning / Artificial Intelligence – Support Vector Machine (SVM)
Let’s move on to another linear method, namely the Support Vector Machine (SVM). This time we’ll start by looking at the result in the plotted graph:
Fig.: Plotted graph based on classification method Support Vector Machine (SVM)
At first glance, the plotted graph looks similar to the graph we saw in the chapter on logistic regression: a straight line dividing the green and the red area. At second glance, however, you’ll see that the starting point, end point and inclination of the two straight lines are not the same. That’s no surprise: the underlying mathematical methods used to arrive at the graphs are different, so the results are different. Which algorithm (machine learning method) fits a given problem best is something you find out with a lot of experience or by trial and error.
Let’s take a closer look at the Support Vector Machine method (compare also the following figure). Are you ready? It’s beyond high school mathematics. The method lays a straight line (called: “Maximum Margin Hyperplane”) between two points (called: “Support Vectors”). These two points are located where the “clouds” of “green” and “red” data points meet. Next, two parallel straight lines (here: dashed lines) are added, which just touch the two points (“Support Vectors”). The straight line is now positioned in such a way (by adjusting its inclination) that the distance between these dashed lines (called: “Maximum Margin”) is maximized.
Fig.: Explaining the method Support Vector Machine (SVM)
The special feature of this algorithm is that the position / inclination of the straight line is determined ONLY by those data points in the “border area”, that is: where the “green” and the “red” clouds of data points come close to each other. The other points are not taken into account by this algorithm. This approach comes with some advantages that are easiest explained with an analogy: let’s assume that a machine learning algorithm is supposed to distinguish between tables and chairs. It’s easy to describe the archetypical table: it has 4 table legs, a table top (100 cm x 200 cm) and a table height (75 cm). In the same way the archetypical chair is described by its characteristics. An object is then assigned to one of these archetypical models of a table and a chair.
The Support Vector Machine does NOT start from the archetypical model of a table or a chair, but compares tables and chairs in the “border area”, where we have a lot of similarities between real-life samples of chairs and tables. A table in the “border area”, for example, may be very low and have a table top with a small surface (e.g. a coffee table). A chair in the “border area” has no backrest (e.g. a stool). I think you get the idea behind that approach … This is why SVM also plays a role in machine learning.
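For completeness, a minimal sketch of a linear SVM in scikit-learn, again assuming the scaled X_train / y_train from the first sketch (default parameters, not a tuned configuration):

```python
from sklearn.svm import SVC

# Linear kernel: the decision boundary is a straight line whose position
# is determined only by the support vectors in the "border area".
classifier = SVC(kernel="linear", random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
```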
Machine Learning / Artificial Intelligence – Kernel SVM
Next is a (non-linear) method that really thrilled me: Kernel SVM. The (mathematical) idea behind this method is brilliant, almost ingenious. The starting point is a distribution of data points that cannot be separated by a linear method (see the right graph in the following figure): there is no straight line that separates the “green” from the “red” points.
Fig.: Machine learning algorithm Kernel SVM – Step 1
The ingenious approach is to transform this “non-linear” problem into a “linear” problem. How is this done? – By projecting the data points into the next higher dimension. This sounds complicated, but in practice it can be done rather easily. For illustration purposes, let’s start with the following situation, namely data points in a one-dimensional space (i.e. on a line). By definition, a problem is “linear” in a one-dimensional space if the “red” points can be separated from the “green” points by a single point on the axis. A quick look is enough to see that this is not possible here:
Fig.: Machine learning algorithm Kernel SVM – Step 2
In the next step, these data points are “projected into the next higher dimension”. That is: the one-dimensional space is transformed into a two-dimensional space. For that purpose we use a parabola function: we insert the X-values on the axis into the parabola function and get the Y-values in the two-dimensional space. What you can see now: the points in the two-dimensional space lie on a parabola in such a way that the “red” and the “green” points can be separated from each other by means of a straight line. We now have a “linear” problem!
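One simple way to write this projection down (an illustrative choice of parabola on my part, not the only possible one) is the mapping

```latex
\phi(x) = \bigl(x,\; (x - c)^2\bigr)
```

where c is roughly the centre of the “red” points: the “red” points then land near the bottom of the parabola and the “green” points further up, so a horizontal straight line separates the two categories.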
Fig.: Machine learning algorithm Kernel SVM – Step 3
Let’s take a look at how data points are projected from a two-dimensional space into a three-dimensional space (compare the following figure). The “green” points lie on the “cone-shaped” three-dimensional structure in such a way that they can be separated from the “red” points with a plane.
By the way, in machine learning this is also called the kernel trick.
Fig.: Machine learning algorithm Kernel SVM – Step 4
The mathematical function for the “Kernel Trick” is based on an e-function (exponential function). It requires a so-called “landmark” (corresponding to the small “l” in the mathematical equation), which marks the highest point of the cone. The further away “x” is from this “landmark”, the further below the cone’s apex the point lies in three-dimensional space. The diameter of the cone can be influenced via “sigma” (in the denominator of the fraction).
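The function described here is the standard Gaussian (RBF) kernel; with a landmark l and the parameter sigma it is usually written as

```latex
K(\vec{x}, \vec{l}) = e^{-\frac{\lVert \vec{x} - \vec{l} \rVert^{2}}{2\sigma^{2}}}
```

The value is exactly 1 at the landmark (the apex of the cone) and falls towards 0 the further x moves away from l; a larger sigma makes the cone wider.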
Fig.: Machine learning algorithm Kernel SVM – Step 5
Of course, you can also merge multiple “cone structures” in order to transform distribution patterns such as the following into a linear problem:
Fig.: Machine learning algorithm Kernel SVM – Step 6
In practice you have a portfolio of different kernels that can be applied. It is up to the data scientist to decide which kernel fits best for a given distribution pattern of data points. The data scientist can choose, for example, among kernels based on the Gaussian distribution, on the sigmoid function, or polynomial kernels.
Fig.: Machine learning algorithm Kernel SVM – Step 7
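In scikit-learn the kernel is simply a parameter of the SVC class. A hedged sketch, again reusing the scaled training data from the first example (default parameter values rather than a tuned configuration):

```python
from sklearn.svm import SVC

# Gaussian / RBF kernel ("kernel trick"); alternatives include
# kernel="poly" (polynomial) and kernel="sigmoid".
classifier = SVC(kernel="rbf", random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
```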
When you apply the Kernel SVM method to our data set, this results in the following graphic:
Fig.: Plotted graph for the machine learning algorithm Kernel SVM
Machine Learning / Artificial Intelligence – Naive Bayes Theorem
The theorem of Bayes is a mathematical / statistical theorem that provides a formula for calculating conditional probabilities. It is actually one of the basics of statistics.
I’ll explain it using an example: you have two machines that produce identical parts. The production manager wants to know the probability of a part being defective if (i.e. under the precondition that) it was produced on machine 2. This is called a conditional probability (in our example, the precondition being that the part is an output of machine 2). The mathematical notation for the precondition is the pipe (“|”).
This can be calculated as follows. There are two mathematical expressions in the numerator: P(Mach2|Defect), i.e. the probability that a part comes from machine 2 if I look ONLY at the defective parts (in other words: of all rejected parts in the QA department, I determine the share that comes from machine 2). This term is multiplied by another probability, namely the overall probability that a part is defective: P(Defect). In the denominator we have the probability that a part from the entire production output was manufactured on machine 2: P(Mach2).
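Written as a formula, the description above is just the standard form of Bayes’ theorem:

```latex
P(\text{Defect} \mid \text{Mach2}) =
\frac{P(\text{Mach2} \mid \text{Defect}) \cdot P(\text{Defect})}{P(\text{Mach2})}
```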
Fig.: The Naïve Bayes Theorem explained
A different mathematical notation looks as follows:
Fig.: The Naïve Bayes Theorem explained
How can this approach be applied to machine learning? – We have the following starting point: There are a number of observation points, namely employees of a certain age and salary. Some of the employees come to work on foot, while others reach the office by car. Next, we add a new data point in the coordinate system.
Fig.: The Naïve Bayes Theorem explained
The task: what are the odds (the probability) that this new employee comes to work on foot? If we want to apply the theorem of Bayes we need three probabilities. The first is P(Walks), i.e. the probability that an employee walks to work. How do we get that probability? Actually, we can use the approach of “K-Nearest Neighbour” here: we draw a circle around the new data point. Within this circle we have a group of people with homogeneous characteristics (certain age, salary):
Fig.: The Naïve Bayes Theorem explained
The next step is simple: you count how many employees within this circle come to work on foot. This allows us to calculate P(X|Walks).
Fig.: The Naïve Bayes Theorem explained
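Putting the pieces together, the probability we are after is again obtained with Bayes’ theorem, where X stands for the characteristics (age, salary) of the new data point:

```latex
P(\text{Walks} \mid X) =
\frac{P(X \mid \text{Walks}) \cdot P(\text{Walks})}{P(X)}
```

The third ingredient, P(X), is simply the share of all employees that fall inside the circle, i.e. that have characteristics similar to the new data point.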
I think the application of Naïve Bayes Theorem should be clear now.
At the end of this chapter I would like to apply the Naïve Bayes classification method to the initial data set (“buyer”, “non-buyer”, age, estimated salary). Please find below the plotted graph that we get by applying the Naïve Bayes method to our problem. As you can see from the plotted graph, Naïve Bayes is also a non-linear method.
It looks roughly similar to the plotted graph in the chapter on K-Nearest Neighbour. The difference can be explained (to a large part) by the fact that a different radius is applied to determine the data points that are considered when calculating the probabilities.
Fig.: Plotted Graph after applying the classification method Naive Bayes
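The corresponding scikit-learn sketch is short. GaussianNB – the Naïve Bayes variant for numerical features such as age and salary – is used here as a plausible choice; again the scaled training data from the first sketch is assumed:

```python
from sklearn.naive_bayes import GaussianNB

# Naive Bayes classifier for continuous features (age, estimated salary).
classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
```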
Machine Learning / Artificial Intelligence – Decision Tree, Random Forest
Do you still remember the regression method Decision Tree from Part 2 of this course? The same principle is used for classification.
The method is easy to understand, as the following example will show: we have a set of data points (in a two-dimensional data space) as shown below. X1 corresponds to one variable, X2 to another variable. “Red” corresponds to one category, “green” to another.
Fig.: Classification Method based on Decision Tree – Step 1
We take an iterative approach in order to create segments: first segmentation 1 (split 1), then split 2, and so on.
Fig.: Classification Method based on Decision Tree – Step 2
In the next step we “translate” the segmentation process (step by step) into a decision tree. The resulting decision tree looks as follows:
Fig.: Classification Method based on Decision Tree – Step 3
The segmentation is carried out in such a way that the newly created segments have a higher homogeneity than the segment before the split. Here’s an example: the distribution of “red”/“green” data points before the split is 45:55. After the split it is 25:5 and 20:50. The two new (smaller) segments each have a higher homogeneity than the segment before the split. Mathematically, the informational entropy – an indicator of homogeneity or heterogeneity – is minimized.
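Using the standard Shannon entropy for a two-category segment (the following calculation is my own illustration of the numbers above, not taken from the course):

```latex
H(p) = -p \log_2 p - (1-p)\log_2(1-p)
```

Before the split (45:55) the entropy is H(0.45) ≈ 0.99; after the split the weighted average is 0.30·H(25/30) + 0.70·H(20/70) ≈ 0.30·0.65 + 0.70·0.86 ≈ 0.80, i.e. the two new segments are indeed more homogeneous on average.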
The method creates segments that are split by vertical or horizontal lines. Applying this machine learning method to the example data set thus results in the following plotted graph:
Fig.: Plotted graph for the classification method of Decision Tree
By the way, this graph illustrates the effect of overfitting well. You have very small (too small) segments: a separate segment is created for (almost) every data point of the training data. The classification is over-adapted to the training data; the classification method hasn’t properly identified the correlation between the dependent and independent variables. Specifically, the two small “red” segments in the upper right-hand area result from overfitting: a separate segment was created for these two outliers, which limits the predictive power of the model in precisely these areas.
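A hedged scikit-learn sketch: criterion="entropy" matches the entropy-based splitting described above, and limiting the tree depth is an illustrative way of counteracting the overfitting just mentioned (the value 3 is my own choice, not prescribed by the course):

```python
from sklearn.tree import DecisionTreeClassifier

# Split criterion "entropy" corresponds to minimizing informational entropy.
classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(X_train, y_train)

# Limiting the depth (e.g. max_depth=3) is one simple way to avoid
# creating a separate tiny segment for every outlier.
pruned = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                random_state=0)
pruned.fit(X_train, y_train)
```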
At this point, it is worth recalling the Random Forest method (compare Part 2 of the online course). If we apply the Random Forest method, we get the following plotted graph; since Random Forest is based on the Decision Tree method, it is not surprising that the plotted graph looks similar:
Fig.: Plotted Graph for classification method Random Forest
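The Random Forest counterpart in scikit-learn looks like this; the number of trees (100) is a common illustrative value, not a setting taken from the course:

```python
from sklearn.ensemble import RandomForestClassifier

# An ensemble of 100 decision trees: each tree is trained on a random
# sample of the training data, and the trees vote on the predicted category.
classifier = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                    random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
```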
Machine Learning / Artificial Intelligence – Evaluation of Machine Learning Models
Part 3 of the Udemy course introduces methods to evaluate the quality of a machine learning model. A simple tool, first of all, is the Confusion Matrix:
Fig.: Confusion Matrix – Evaluation of the quality of a machine learning model
The light gray fields indicate where a prediction of the model is correct: it is correct if the machine learning model predicted that the event would not occur (“0”) and it did NOT occur. And vice versa: the machine learning model predicted that the event would occur (“1”) and it DID occur.
On the other hand you have the false predictions. These are divided into two types. False Positive means: the occurrence of an event is predicted, but in reality it did NOT occur. Example: a virus detection software identifies a file as malware, but in reality the file is clean. False Negative means: the prediction says the event won’t occur (example: no malware), but in reality the event does occur (example: the malware causes damage). This example shows why, in machine learning, a False Negative is considered much more critical than a False Positive.
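In scikit-learn the confusion matrix and the accuracy rate can be obtained in a few lines (a sketch, assuming y_test and y_pred from one of the classifiers above):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows: actual category, columns: predicted category.
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```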
Let us consider another metric to evaluate the quality of a machine learning model: The so-called CAP curve, Cumulative Accuracy Profile curve:
Fig.: The Cumulative Accuracy Profile (CAP) used to evaluate the quality of a machine learning model – Step 1
How can this curve be interpreted? – Let us assume the following starting situation: you send an advertising brochure to all known target persons (this corresponds to 100% on the X-axis). It is known that 10% of these people will buy the product (this also means: if 10% of all target persons have bought the product, 100% of the sales potential is reached). The coordinate system (and the curves shown in it) thus shows the relationship between the “percentage of contacted target persons” (X-axis) and the “percentage of realised purchases out of the potentially possible purchases” (Y-axis).
If you select the target persons for the mailing randomly, you can expect that 10% of the contacted target persons will buy the product. If you have a perfect model (“crystal ball”) for predicting the probability of purchase (e.g. based on socio-economic data), then you would only send advertising brochures to those (10% of) target persons who will actually buy. This scenario corresponds to the graph “Crystal Ball”. In practice, prediction models will lie between the curve for the “perfect prediction model” and the curve for the “random selection” of the persons written to. You may have good models (closer to the “crystal ball” curve), and you will have bad models (closer to the “random selection” curve).
Fig.: The Cumulative Accuracy Profile (CAP) used to evaluate the quality of a machine learning model – Step 2
The CAP quality metric is based on these considerations: you take the area (in maths: the integral) between the curve for the machine learning model and the curve for “random selection”. That’s the numerator. The denominator is the area between the curve for the “perfect model” and the curve for “random selection”. This ratio can be determined using statistical programs; Python libraries are available for such a calculation.
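As far as I know there is no ready-made CAP function in scikit-learn, but the ratio of the two areas can be approximated with a few lines of NumPy. The following is my own rough sketch, not code from the course:

```python
import numpy as np

def cap_accuracy_ratio(y_true, y_scores):
    """Rough CAP accuracy ratio: area between the model curve and the
    random-selection line, divided by the area between the perfect-model
    curve and the random-selection line."""
    y_true = np.asarray(y_true)
    total_hits = y_true.sum()
    n = len(y_true)

    # Contact the most promising target persons first
    order = np.argsort(-np.asarray(y_scores))
    model_curve = np.cumsum(y_true[order]) / total_hits    # share of realised purchases
    x = np.arange(1, n + 1) / n                             # share of contacted persons
    random_curve = x
    perfect_curve = np.minimum(x * n / total_hits, 1.0)     # "crystal ball"

    # Approximate the integrals by averaging over the evenly spaced x values
    area_model = np.mean(model_curve - random_curve)
    area_perfect = np.mean(perfect_curve - random_curve)
    return area_model / area_perfect

# Example (hypothetical): scores from the logistic regression model above
# ratio = cap_accuracy_ratio(y_test, classifier.predict_proba(X_test)[:, 1])
```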
Machine Learning / Artificial Intelligence – Outlook on Part 4 of the Machine Learning Course
Part 4 of the course introduces you to clustering, that is: Identification of structures or segments in the data. The course covers machine learning methods such as K-Means Clustering and Hierarchical Clustering.