Machine Learning for Product Owner, IT Project Manager (Crash Course Part 3) – Classification Methods

Machine learning and artificial intelligence will play an increasing role in software development; in the foreseeable future, there will probably be only a few software projects that can do without machine learning. According to Google/Alphabet boss Sundar Pichai: “We will move from a mobile-first world to an AI-first world”, where the computer will be a language assistant equipped with artificial intelligence (AI). Not every software developer or product owner will need to become a certified data scientist or machine learning expert. At the very least, however, they should have a basic understanding of machine learning. This blog post describes my learning experience in an online machine learning course; you’ll be part of the learning journey.

The course I’m talking about can be found on Udemy. Course title: “Machine Learning A-Z™: Hands-On Python & R In Data Science”. The list price of the course was about 400 EUR; I actually booked it for a mere 15 EUR. The course includes about 40 hours of video training and various teaching materials. The course promises to teach the basics of machine learning, how to build robust machine learning models and how to design and carry out powerful data analyses, and to cover specific topics such as Natural Language Processing and Deep Learning. The course definitely keeps these promises.

Here’s Part 1 and Part 2 of the Online Course:

The regression methods (see Part 2) were concerned with predicting a numerical value. For example: the salary (dependent variable) is a function of age and qualification (independent variables). Part 3 is dedicated to those (mathematical) methods in machine learning that are used for classification / categorization. That is: a category is to be predicted, for example “buyer” or “non-buyer”. The starting point could be a data set of socio-economic characteristics of persons who have seen an online ad. It is also known which of these target persons bought the product (“buyer”) and which did not (“non-buyer”). This makes two categories: buyers and non-buyers. Classification methods, however, are not restricted to two categories; they also work for more than two, e.g. market segments.

Machine Learning / Artificial Intelligence – Logistic Regression

The first method to be introduced is a linear method: Logistic Regression (also: Logit-Model). It is a method that is applied in practice to so-called linear problems.

So, what’s a linear problem? – Take a two-dimensional coordinate system with numerous data points of two different categories (blue, red). If you can place a LINE in the coordinate system in such a way that the red points are (almost completely) separated from the blue points, then you have a linear problem. In three-dimensional space, it must be possible to separate areas from each other by means of a plane.

For the introduction of logistic regression we consider a simple starting situation: There are observation points that indicate whether a product was bought by a target person or not (“buyer” vs. “non-buyer”). The age of the persons is known. If this information is plotted in a coordinate system (cf. the following figure), it becomes clear: The higher the age of a person, the more likely it is that the person will buy the product. The buyers appear as red crosses on a dotted line, which corresponds to a purchase probability of 100% (0% = product is not bought, 100% = product is bought for sure).

Fig.: Logistic Regression based on Sigmoid Function

An S-shaped curve approximates this probability distribution: Where you have the bend (or: curvature), the probability that the product will be purchased increases sharply. How can this be expressed mathematically? – With the so-called sigmoid function! In the formula, “P” means probability, “ln” stands for the natural logarithm, and there are also two coefficients (b0, b1). In short: the sigmoid function is the mathematical basis of logistic regression.
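To make the S-shape tangible, here is a minimal sketch of the sigmoid function in plain Python. It assumes the usual one-variable form of the model, ln(P / (1 - P)) = b0 + b1 * age, which is equivalent to P = 1 / (1 + e^-(b0 + b1 * age)). The coefficient values and the function names are hypothetical and picked only for illustration; they are not fitted to the course data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def purchase_probability(age: float, b0: float, b1: float) -> float:
    """P(buy | age) for a logistic model with a single independent variable."""
    return sigmoid(b0 + b1 * age)


# Hypothetical coefficients (assumption, not taken from the course).
B0, B1 = -8.0, 0.2

for age in (20, 40, 60):
    print(age, round(purchase_probability(age, B0, B1), 3))
```

With these made-up coefficients the bend of the curve sits at age 40, where the probability is exactly 50%; well below that age it approaches 0%, well above it approaches 100%.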

Let’s apply this method of Logistic Regression to a data set (which you can download once you’re registered for the course). The data refers to the scenario described above: An ad is played out to persons, and it is known whether a target person has bought the product (“1”) or not (“0”). There are two independent variables, i.e. two socio-economic characteristics: age and (estimated) salary:

Fig.: Data Set used for showcase of Classification method in Python

So, let’s implement the Python code as per the course instructions: import the relevant class from scikit-learn, create an object of that class, and apply the “fit” method to the data. Finally, you’ll get the following results of the classification method:

Fig.: Plotted graph based on classification with Logistic Regression

First of all, this plotted graph shows very clearly that the method of logistic regression is a linear method: Each data point corresponds to a combination of the two properties / independent variables “(estimated) salary” (Y axis) and “age” (X axis). The data points for the “buyer” properties (“green”) are separated from the data points for the “non-buyer” properties (“red”) by a line. Further down in this article, we’ll also look at non-linear methods; you’ll then see the difference between the plotted graphs for a linear and a non-linear method very easily.

In the plotted graph (created in Python) you can also see where the actual observation points from the initial data set are located, which were fed into the machine learning algorithm. Green points (“buyers”) and red points (“non-buyers”) can be recognized. The prediction accuracy (“Accuracy Rate”) of this model for this method is slightly below 90% – which is quite good by the way.
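The steps described above (import, create the class object, call “fit”) can be sketched with scikit-learn roughly as follows. This is not the course’s code: the course data set is not reproduced here, so the sketch generates a synthetic stand-in with the same three columns (age, estimated salary, bought yes/no), and the rule that generates the labels is a pure assumption. The feature scaling step is also my own addition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the course data (assumption): age, salary, bought (0/1).
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(18, 60, n)
salary = rng.uniform(15_000, 150_000, n)
# Hypothetical rule: older and better-paid persons buy more often, plus noise.
score = 0.12 * (age - 40) + 0.00004 * (salary - 80_000) + rng.normal(0, 1, n)
bought = (score > 0).astype(int)

X = np.column_stack([age, salary])
X_train, X_test, y_train, y_test = train_test_split(
    X, bought, test_size=0.25, random_state=0
)

# Bring age and salary onto comparable ranges before fitting.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = LogisticRegression(random_state=0)  # create the class object
classifier.fit(X_train, y_train)                 # fit it to the training data
accuracy = classifier.score(X_test, y_test)      # share of correct predictions
print(f"accuracy: {accuracy:.2f}")
```

The accuracy printed here belongs to the synthetic data and says nothing about the value of just under 90% mentioned above.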

Machine Learning / Artificial Intelligence – K-Nearest Neighbour

Now, let’s take a look at a non-linear method: K-Nearest Neighbour (K-NN). The principle is amazingly simple. Let us start out from a two-dimensional coordinate system in which the properties (i.e. independent variables) of a data point are plotted (see the following figure). Now we enter a new data point for which we want to make a prediction (“buy” or “not buy”). We take a simple approach: First, we draw a circle around this point, just large enough to contain its K nearest neighbours. Second, we simply count the neighbours that lie within that circle: How many data points of type 1 (“red”)? And how many data points of type 2 (“green”)? The next step is a no-brainer: the new point is assigned to the category with the most neighbours inside the circle.

Fig.: The idea behind the K-Nearest Neighbor method for classification
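The counting idea is simple enough to write down without any library. The following is a minimal sketch, assuming Euclidean distance and a plain majority vote among the K nearest points; the function name `knn_predict` and the toy data are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(points, labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every known data point.
    distances = [
        (math.dist(p, new_point), label) for p, label in zip(points, labels)
    ]
    # Keep the k closest points and count their categories.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: (age, salary in thousand EUR) and the known category.
points = [(25, 30), (28, 35), (30, 32), (45, 90), (50, 95), (48, 85)]
labels = ["red", "red", "red", "green", "green", "green"]

print(knn_predict(points, labels, (47, 88), k=3))  # prints "green"
```

In practice you would scale the variables first, because otherwise the variable with the larger numbers dominates the distance.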

The plotted graph for the result of this classification method looks as follows. Keep in mind: We use the same data for all classification methods that are explained in this blog post; that makes a comparison between these methods very simple.

Fig.: Plotted graph based on classification method K-Nearest Neighbor

It is clearly visible that the separation between the green area (prediction: “buyer”) and the red area is non-linear.

Machine Learning / Artificial Intelligence – Support Vector Machine (SVM)

Let’s move on to another linear method, namely the Support Vector Machine (SVM). This time we’ll start by looking at the result in the plotted graph:

Fig.: Plotted graph based on classification method Support Vector Machine (SVM)

At first glance, the plotted graph looks similar to the graph we saw in the chapter on logistic regression: A straight line dividing the green and the red area. However, at second glance you’ll see that the starting point, end point and inclination of these two straight lines are not the same. That’s no surprise: The underlying mathematical methods used to arrive at the graphs are different, so the results are different. Which algorithm (machine learning method) fits a given problem best is found out with a lot of experience or by trial and error.

Let’s take a closer look at the Support Vector Machine method (see also the following figure). Are you ready? It’s beyond high school mathematics. The method lays a straight line (called the “Maximum Margin Hyperplane”) between two points (called “Support Vectors”). These two points are located where the “data point clouds” of “green” and “red” data points meet. Next, two parallel straight lines (here: dashed lines) are added, which just touch the two points (“Support Vectors”). The straight line is now positioned in such a way (by adjusting its inclination) that the distance between these dashed lines (called the “Maximum Margin”) is maximized.

Fig.: Explaining the method Support Vector Machine (SVM)
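The margin idea can be tried out on a tiny example. The sketch below uses scikit-learn’s `SVC` with a linear kernel on eight made-up points; the data and the large value for `C` (which approximates a hard margin) are assumptions for illustration, not the course’s setup.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up data set: two well-separated clouds in two dimensions.
X = np.array(
    [[1, 1], [2, 1], [1, 2], [2, 2], [5, 5], [6, 5], [5, 6], [6, 6]],
    dtype=float,
)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

svm = SVC(kernel="linear", C=1000.0)  # large C: (almost) no points inside the margin
svm.fit(X, y)

# Only the points closest to the other cloud end up as support vectors.
print("support vectors:")
print(svm.support_vectors_)

# For a linear kernel the separating line is w . x + b = 0,
# and the distance between the two dashed lines is 2 / ||w||.
w, b = svm.coef_[0], svm.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)
print("margin width:", round(margin_width, 3))
```

Removing any point that is not a support vector and fitting again would leave the line where it is, since only the support vectors determine its position.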

The special feature of this algorithm is that the position / inclination of the straight line is determined ONLY by those data points in the “border area”, that is: where the “green” and the “red” clouds of data “come close” to each other. The other points are not taken into account in this algorithm. This approach comes with some advantages that are most easily explained with an analogy: Let’s assume that a machine learning algorithm is supposed to distinguish between tables and chairs. It’s easy to describe the archetypical table: It has 4 table legs, a table top (100 cm x 200 cm) and a table height (75 cm). The archetypical chair is described by its characteristics in the same way. An object is then assigned to whichever of these archetypical models of a table and a chair it matches best.

The procedure of the Support Vector Machine does NOT start from the archetypical model of a table or a chair, but compares a table and a chair in the “border area”, where we have a lot of similarities between real-life samples of chairs and tables. A table, for example, may be very low and have a table top with a small surface (e.g. a coffee table). A chair in the “border area” has no backrest (e.g. a stool). I think you get the idea behind that approach … This is why the SVM procedure has its place in machine learning.

Machine Learning / Artificial Intelligence – Kernel SVM

Next is a (non-linear) method that really thrilled me: Kernel SVM. The (mathematical) idea behind this method is brilliant, almost ingenious.

The starting point is a distribution of data points that cannot be separated by a linear method (see right graph in the following figure): There is no straight line to separate the “green” from the “red” points.

Fig.: Machine learning algorithm Kernel SVM – Step 1

The ingenious approach is to transform this “non-linear” problem into a “linear” problem. How is this done? – By projecting the data points into the next higher data dimension. This sounds complicated, but it can be done rather easily in practice. Let’s start with the following situation for illustration purposes, namely data points in a one-dimensional space (i.e.: a line). By definition a problem is “linear” in a one-dimensional space, if the “red” points can be separated from the “green” points with a dot (point) on the axis. A quick look is enough to see that this is not possible:

Fig.: Machine learning algorithm Kernel SVM – Step 2

In the next step, these data points are “projected into the next higher dimension”. That is: The one-dimensional space is transformed into a two-dimensional space. For that purpose we use the parabola function. We insert the X-values on the axis into the parabola function, and we’ll get the Y-values in the two-dimensional space. What you can see now: The points in the two-dimensional space lie on a parabola in such a way that the “red” and the “green” points can be separated from each other by means of a straight line. We now have a “linear” problem!

Fig.: Machine learning algorithm Kernel SVM – Step 3
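The projection from one to two dimensions takes only a few lines of plain Python. In this sketch the data points, the position of the parabola (`center`) and the separating `threshold` are all made up for illustration.

```python
# Made-up one-dimensional data: "green" points in the middle, "red" on both sides.
xs = [-4, -3, 3, 4, -1, 0, 1]
labels = ["red", "red", "red", "red", "green", "green", "green"]


def project(x, center=0.0):
    """Lift a 1-D point onto a parabola: x -> (x, (x - center)^2)."""
    return (x, (x - center) ** 2)


points_2d = [project(x) for x in xs]

# In two dimensions the horizontal line y = threshold separates the categories.
threshold = 5.0  # hypothetical value, picked by eye for this toy data
predicted = ["red" if y > threshold else "green" for _, y in points_2d]

print(predicted == labels)  # prints True
```

On the original axis no single point separates “red” from “green”; after the projection a single straight line does.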

Let’s take a look at how to project data points from a two-dimensional space into a three-dimensional space (see the following figure). The “green” points lie on the “cone-shaped” three-dimensional structure in such a way that the “green” points can be separated from the “red” points with a plane.

By the way, in machine learning this is also called the kernel trick.

Fig.: Machine learning algorithm Kernel SVM – Step 4

The mathematical function for the “Kernel Trick” is based on an exponential function (e function). It requires the definition of a so-called “landmark” (the small “l” in the mathematical equation), which marks the highest point of the cone. The further away “x” is from this “landmark”, the further below the cone’s apex the point lies in three-dimensional space. The diameter of the cone can be influenced via “sigma” (in the denominator of the fraction).

Fig.: Machine learning algorithm Kernel SVM – Step 5
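The description above matches the Gaussian (RBF) kernel, which is commonly written as K(x, l) = exp(-||x - l||² / (2 · sigma²)). I am assuming that this is the formula shown in the figure. A minimal sketch in plain Python, with a hypothetical landmark at the origin:

```python
import math


def gaussian_rbf(x, landmark, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - landmark||^2 / (2 * sigma^2))."""
    squared_distance = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-squared_distance / (2.0 * sigma**2))


landmark = (0.0, 0.0)  # hypothetical landmark, chosen for illustration

at_landmark = gaussian_rbf((0.0, 0.0), landmark)          # apex of the cone: 1.0
narrow = gaussian_rbf((3.0, 0.0), landmark, sigma=1.0)    # far away, narrow cone
wide = gaussian_rbf((3.0, 0.0), landmark, sigma=3.0)      # same point, wider cone

print(at_landmark, narrow, wide)
```

The same point lies much higher on the cone when sigma is larger, which is what “influencing the diameter of the cone” means.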

Of course, you can also merge multiple “cone structures” in order to transform distribution patterns such as the following into a linear problem:

Fig.: Machine learning algorithm Kernel SVM – Step 6

In practice you have a portfolio of different kernels that can be applied. It is up to the data scientist to decide which kernel fits best for a given distribution pattern of data points. The data scientist can choose among kernels based on the Gaussian distribution, on the sigmoid function, or polynomial kernels.

Fig.: Machine learning algorithm Kernel SVM – Step 7

When you apply the Kernel SVM method to our data set, this results in the following graphic:

Fig.: Plotted graph for the machine learning algorithm Kernel SVM
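If you want to compare kernels yourself, scikit-learn’s `SVC` lets you switch them via the `kernel` parameter. The sketch below does not use the course data. It assumes a synthetic ring-shaped data set from `make_circles`, i.e. a pattern that no straight line can separate, and all parameter values are picked for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (assumption): one category forms a ring around the other.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for kernel in ("linear", "poly", "sigmoid", "rbf"):
    model = SVC(kernel=kernel, random_state=0)
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)  # accuracy on unseen points
    print(f"{kernel:8s} accuracy: {scores[kernel]:.2f}")
```

On this kind of pattern the Gaussian kernel (“rbf”) should clearly beat the linear kernel; on other distributions a different kernel may come out ahead.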

Machine Learning / Artificial Intelligence – Naive Bayes Theorem

Bayes’ theorem is a mathematical / statistical theorem that provides a formula for the calculation of conditional probabilities. It is actually one of the basics of statistics.

I’ll explain it by using an example: You have two machines that produce identical parts. The production manager wants to know the probability of a part being defective if (i.e.: under the precondition that) it was produced on machine 2. This is called a conditional probability (for our example: The precondition being that the part is an output from machine 2). The mathematical notation for the preconditions is the pipe (“|”).

This can be calculated as follows: P(Defect|Mach2) = P(Mach2|Defect) × P(Defect) / P(Mach2). There are two mathematical expressions in the numerator. The first is P(Mach2|Defect), i.e. the probability that a part is an output from machine 2 if I look ONLY at the defective parts (i.e. of all rejected parts in the QA department, I determine the share that comes from machine 2). This term is multiplied by another probability, namely the overall probability that a part is defective: P(Defect). In the denominator we have the probability that a part from the entire production output was manufactured on machine 2: P(Mach2).
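A short worked example makes the calculation concrete. The three input probabilities below are invented for illustration only; the function simply applies Bayes’ theorem in the form P(A|B) = P(B|A) × P(A) / P(B).

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical figures, invented for illustration only.
p_defect = 0.01              # 1% of all parts are defective
p_mach2 = 0.40               # 40% of all parts come from machine 2
p_mach2_given_defect = 0.50  # 50% of the defective parts come from machine 2

p_defect_given_mach2 = bayes(p_mach2_given_defect, p_defect, p_mach2)
print(f"P(Defect|Mach2) = {p_defect_given_mach2:.4f}")
```

With these made-up numbers, 1.25% of the parts produced on machine 2 are defective, slightly more than the 1% across the entire production.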