**Machine learning** / **artificial intelligence** will play an increasing role in software development; in the foreseeable future, there will probably be few software projects that can do without machine learning. According to *Google/Alphabet* CEO *Sundar Pichai*: *“We will move from a mobile-first world to an AI-first world”*, in which the computer becomes a language assistant equipped with artificial intelligence (AI). Not every software developer or product owner will therefore need to become a certified data scientist or machine learning expert. At the very least, however, they should have a **basic understanding of machine learning**. This blog describes the learning experience in an online machine learning course; you’ll be part of the learning journey.

The course I’m talking about can be found at the *Udemy Academy*. Course title: **“Machine Learning A-Z™: Hands-On Python & R In Data Science”**. The regular price of the course was about 400 EUR; I actually booked it for a mere 15 EUR. The course includes **about 40 hours of video training and various teaching materials**. The course promises: teaching the basics of machine learning, building robust machine learning models, designing and performing powerful data analyses, and covering specific topics such as Natural Language Processing or Deep Learning. The course definitely keeps these promises.

Here’s PART 1 of the series, which deals with data preprocessing for machine learning: *Machine Learning for Product Owner, IT Project Manager (Part 1): Introduction and data preparation*

The second part of the online course introduces a variety of linear and non-linear regression methods in order to analyze **correlations** within a given data set; the so-called **“matrix of features”** contains the independent variables, while a separate vector holds the dependent variable. The tutors of the course start with simple linear regression. This is based on mathematics that can be easily understood; it’s simple high school mathematics. The magic is done with the *method of least squares*. For the more sophisticated regression methods, the course aims to establish an intuitive understanding; after all, you cannot explain something like *information entropy* in a few minutes. However, this intuitive understanding is sufficient to apply these methods in practice. And that’s precisely the key focus of this course: applying the methods in practice.

*Note: The source of all following illustrations is the course itself.*

### Machine learning / artificial intelligence: Simple Linear Regression

The second part begins with the simplest regression method, namely **Simple Linear Regression**. There is a single independent variable and a single dependent variable. These data points can therefore simply be plotted in a two-dimensional coordinate system (x-axis, y-axis). The correlation between the data is represented by a straight line (y = constant + coefficient * x). The straight line is placed in such a way that the distance to the observation points is minimal. For this purpose, the **method of least squares** is applied: take the distance between each data point and the straight line, square this value, then sum up these values for all data points. To get the best-fitting regression line, that sum must be minimized.

Fig.: Least squares Method

The implementation in *Python code* follows a scheme that is comparable for all regression methods: (a) the required library and class are loaded, (b) an object is created, (c) the *fit* method (or *fit and transform*) is used to fit the object to the data set, i.e. to apply the mathematical algorithm to the data.
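The three-step scheme above can be sketched in a few lines. This is a minimal illustration using *scikit-learn* (the library family used in the course); the data values are invented for the example:

```python
# (a) load the required library and class
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration:
# e.g. years of experience vs. salary in kEUR
X = [[1], [2], [3], [4], [5]]   # independent variable (matrix of features)
y = [30, 35, 41, 44, 50]        # dependent variable

# (b) create an object
regressor = LinearRegression()

# (c) fit the object to the data set (least-squares fit)
regressor.fit(X, y)

# The fitted line can now be used to predict unseen values
prediction = regressor.predict([[6]])
```

The fitted object exposes the line’s coefficient and intercept, so the straight line y = constant + coefficient * x can be read off directly.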

### Machine learning / artificial intelligence: Backward Elimination

If you move on to implementing a machine learning model based on **Multiple Regression**, you come across a general challenge in the design of machine learning models. So, what’s that? Let’s assume there is a data set with the socio-economic profile of buyers of a product, consisting of 80 characteristics: salary, gender, place of residence, age, etc. A data scientist could now set up a machine learning model in which ALL properties enter into a function according to the following scheme:

Fig.: Mathematical Function for Multiple Regression

If I consider ALL properties in the machine learning model, then I make the assumption that ALL properties have statistical relevance. In short: I assume that each property influences the actual behavior of a person. But is that true, is that a good assumption? Actually, a good data scientist will consider all statistically relevant independent variables while leaving out those variables that are statistically insignificant. But how does he/she do that?

The data scientist determines the **statistical relevance of each individual property (of each independent variable)** by using the so-called **P-value**. This value returns the **probability that the observed values occur if the so-called null hypothesis is valid**. The null hypothesis: “The independent variable has NO statistically significant influence on the result”. A **level of significance** is defined as the decision criterion (*0.05* is used by default). If the P-value (i.e. *the probability that, if the null hypothesis is valid, the observed values occur*) is above the significance level, then the null hypothesis is (still) considered valid; if the P-value is below the significance level, then the null hypothesis is rejected, i.e. it is assumed that the independent variable has a statistically relevant influence on the result (in other words: there is a statistically significant correlation between the independent variable and the dependent variable).

A common procedure for eliminating statistically insignificant variables is called **Backward Elimination**: you start with a model containing ALL variables and eliminate, step by step, those variables whose P-value is ABOVE the significance level (by default: above 0.05). The calculation of the *P-value* is done with the help of a *Python* library (*statsmodels*, whose regression summary reports a P-value per variable). This has been applied to the machine learning model based on *Multiple Regression*; check the following screenshot of the *Python* IDE. You’ll find the relevant statistical information in the console window (right half of the screen, below):

Screenshot: Display of statistical values per independent variable in a machine learning model

### Machine learning / artificial intelligence: Polynomial Regression

Another regression method is **polynomial regression**, which is a non-linear regression. The mathematical expression looks as follows (for degree two, it is the function that produces parabolas):

Fig.: Polynomial Formula

In the online course of the *Udemy Academy* you will learn how to generate graphics to display the results. Actually, only a few lines of code are required to generate graphs that represent a polynomial function of second degree, third degree, fourth degree, and so on. When comparing the graphs of different polynomial functions, it’s quite obvious: the higher the degree of the polynomial function, the better the regression fits the given data points:

Fig.: Graph of a polynomial function of second degree

Fig.: Graph of a polynomial function of third degree

Fig.: Graph of a polynomial function of fourth degree
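In code, polynomial regression reduces to the same scheme as before: the independent variable is expanded into polynomial terms (x, x², …), then an ordinary linear regression is fitted on those terms. A minimal sketch with *scikit-learn* and invented, perfectly quadratic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Invented data with an exactly quadratic relationship
X = np.arange(1, 11).reshape(-1, 1)   # e.g. position levels 1..10
y = X.ravel() ** 2 + 3.0

# Expand x into polynomial terms: columns 1, x, x^2
poly = PolynomialFeatures(degree=2)    # try degree=3, 4, ... to compare graphs
X_poly = poly.fit_transform(X)

# Fit an ordinary linear regression on the expanded terms
regressor = LinearRegression()
regressor.fit(X_poly, y)

# Predict for a new value (x = 11)
prediction = regressor.predict(poly.transform([[11]]))[0]
```

Raising `degree` and re-plotting the fitted curve reproduces the series of graphs shown above.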

### Machine learning / artificial intelligence: Support Vector Regression

The next regression procedure for machine learning actually goes beyond *high school mathematics*: **Support Vector Regression**. The online course from the *Udemy Academy* must inevitably refrain from teaching the sophisticated mathematics behind this regression method; that would simply exceed the time frame.

Fig.: Correlation Matrix for Support Vector Regression

However, the machine learning course does provide an intuitive understanding of the Support Vector regression method. In fact, this is sufficient to produce results with the help of the introduced libraries. In a first step, the user simply adopts the standard settings for the various parameters of this regression method (see the following screenshot of the parameters such as “epsilon=0.1” or “kernel=rbf” in the console window); those who wish to deepen their knowledge of this regression method can of course gain a deeper understanding of the various parameters through Internet research.

Fig.: Machine learning model in Python for Support Vector Regression
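Applying the standard settings mentioned above looks roughly like this. A hedged sketch with *scikit-learn*, using invented salary-style data; note that SVR is sensitive to the scale of the data, which is why feature scaling is applied first:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Invented data, e.g. position level vs. salary (in kEUR)
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45., 50., 60., 80., 110., 150., 200., 300., 500., 1000.])

# SVR needs scaled data; scale both X and y
sc_X = StandardScaler()
sc_y = StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

# The default settings mentioned above: kernel='rbf', epsilon=0.1
regressor = SVR(kernel='rbf', epsilon=0.1)
regressor.fit(X_scaled, y_scaled)

# Predict for a new value and transform back to the original scale
pred_scaled = regressor.predict(sc_X.transform([[6.5]]))
prediction = sc_y.inverse_transform(pred_scaled.reshape(-1, 1))[0, 0]
```

The user who wants to go deeper can then experiment with the parameters (`kernel`, `epsilon`, `C`) one by one.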

The course follows a pragmatic approach to machine learning: a deepening of the mathematical basics is not always possible or even required; the student, however, receives sufficient knowledge to put the methods of machine learning into practice. This pragmatic approach is maintained throughout the rest of the course.

### Machine learning / artificial intelligence: Regression Tree

A procedure that can be intuitively understood very easily (and is mathematically based on the concept of *information entropy*) is the **regression tree**, a method belonging to the family of **decision trees**. Let’s take the example of two independent variables (and one dependent variable): the data points (or observation points) are plotted in a two-dimensional coordinate system as follows:

Fig.: Regression Tree, data points in a two-dimensional coordinate system

By applying the regression tree method, you obtain **segments**. Within these segments the data points have a high “proximity” to each other. The quality of the segmentation is measured by the **MSE (Mean Squared Error)**. *MSE* plays the same role for the regression tree as the *least squares method* does for simple linear regression.

It is now important that this segmentation is done iteratively: First, “Split 1” (or: segmentation 1) is performed. Then “Split 2”, and so on. Following this **iterative procedure** the **Regression tree** is finally generated:

Fig.: Machine Learning – Regression Tree

Such a regression tree procedure could be applied, for example, to salary data. Let’s take a simple example: we have two data dimensions, “professional experience in years” and “qualification”. You would obtain a decision tree which, after various ramifications, could show the following segments: “technical university degree, professional experience 8 to 15 years”, “master craftsman’s examination in a technical profession, 15 to 23 years”, and so on. The result for the dependent variable (in our example: “salary”) is then the average of the dependent variable over those observation points that fall within the corresponding segment. The regression tree looks as follows:

Fig.: Machine learning regression tree. Average values for the dependent variable

By the way, this regression method for machine learning is the first method that is **non-continuous**. You can see that clearly when you look at the function graph (which can be easily generated with a few lines in *Python*):

Fig.: Regression tree – Non-linear, non-continuous.

There are steps in the function graph. This can be traced back to the regression method, in which the *average* of the dependent variable is formed for all data points within a segment. This averaging leads to the staircase shape of the function graph.
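The averaging behavior can be seen directly in code. A minimal sketch with *scikit-learn* and invented data that falls into three obvious clusters; all points within a segment receive the same predicted value, the average of that segment:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Invented data in three clusters: values around 11, around 31, around 61
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([10., 12., 11., 30., 32., 31., 60., 62., 61., 63.])

# A shallow tree, so the segments remain visible
regressor = DecisionTreeRegressor(max_depth=2, random_state=0)
regressor.fit(X, y)

# Every point inside a segment gets that segment's AVERAGE as prediction,
# which produces the staircase-shaped function graph
pred_low = regressor.predict([[2.0]])[0]   # segment of the first cluster
pred_mid = regressor.predict([[5.0]])[0]   # segment of the second cluster
```

Plotting `regressor.predict` over a fine grid of x values reproduces the step function shown above.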

### Machine learning / artificial intelligence: Random Forest Regression

The last regression method presented in this part of the course is the so-called **Random Forest Regression**. This method uses a principle that has, in general, high practical significance in machine learning. To put it simply: **the more different machine learning methods you combine, the better the quality of the results, and the closer you come to a good correlation between the data.**

This general principle is used in *Random Forest Regression* in a modified way: it is not different *methods* that are combined here; instead, the regression tree method is applied repeatedly to different random subsets of the data (random selection). For each subset of data, a regression tree is built. Each iteration with a different subset yields a result for a given combination of independent variables (i.e. a tuple), that is, a “tree”. If one looks at the results of many iterations, one gets several trees, in other words a “forest”. Finally, the average of all tree predictions is calculated.
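The whole procedure (random subsets, one tree per subset, averaged predictions) is wrapped in a single class in *scikit-learn*. A minimal sketch with invented data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented data in three clusters (same toy data as for a single tree)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([10., 12., 11., 30., 32., 31., 60., 62., 61., 63.])

# 100 regression trees, each fitted on a random subset of the data
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X, y)

# The prediction is the average over all 100 trees
prediction = regressor.predict([[5.0]])[0]
```

Increasing `n_estimators` grows the “forest”; the averaged prediction typically becomes more stable than that of a single tree.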

### Machine learning / artificial intelligence: Coefficient of determination

Part 2 of the *Udemy Academy* course concludes with reflections on building models for machine learning in general. It is mainly concerned with the question: How can the quality of models be determined? For this purpose, the **coefficient of determination** is used.

Even if the mathematics/statistics lessons were a while ago, it is easy to understand: on the one hand, we have the *sum of squared distances to the regression line*, which was already introduced above (the quantity minimized when calculating a simple linear regression). Using the same scheme, another sum is formed: we average the values of the dependent variable and sum the squared distances between the observed values and this average. If the simple linear regression was applied properly, the distance of the data points to the regression line should be (significantly) smaller than the distance to a horizontal line that simply shows the average value. And this is precisely how the quality of the machine learning model is determined: by comparing these two sums of squared distances. The result is the **coefficient of determination**, in English: **R squared**.

Fig: R squared
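The comparison of the two sums can be computed directly. A minimal sketch with invented observed and predicted values: the residual sum (distances to the regression line) is divided by the total sum (distances to the average line), and R squared is one minus that ratio:

```python
import numpy as np

# Invented values: observations and the corresponding points on the regression line
y = np.array([30., 35., 41., 44., 50.])            # observed values
y_pred = np.array([30.2, 35.1, 40.0, 45.1, 49.6])  # values on the regression line

# Sum of squared distances to the regression line
ss_res = np.sum((y - y_pred) ** 2)
# Sum of squared distances to the plain average line
ss_tot = np.sum((y - y.mean()) ** 2)

# R squared: the closer to 1, the better the regression explains the data
r_squared = 1 - ss_res / ss_tot
```

If the regression line were no better than the average line, the ratio would be 1 and R squared would be 0.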

For a data scientist, however, the *coefficient of determination* alone is not yet practicable for the development of machine learning models: even if there is no statistically significant relationship between an independent variable and the dependent variable, extending a mathematical model by such a statistically insignificant variable still increases the coefficient of determination; the regression algorithm finds some (albeit purely random) relationship, resulting in a tiny coefficient such as 0.000001 for that independent variable. In the worst case (if there is no correlation at all, not even a purely random one) the coefficient is set to “0”; **in no case does an additional independent variable make the mathematical model worse**.

Fig.: Increase of “R squared” with each additional independent variable

The (simple) *coefficient of determination* is therefore only of limited use as a measure of the quality of a machine learning model. What’s used instead is the so-called **adjusted coefficient of determination (“Adjusted R squared”)**. The formula for the coefficient of determination is extended by a term that reduces the *adjusted coefficient of determination* with each increase in the number of independent variables in the model. Thus, it only makes sense to introduce a new independent variable into the model if the regression improves significantly with the help of the new variable. Significantly means: the regression improves in such a way that this improvement overcompensates for the reduction effect from the increased number of variables. In other words, it comes down to this: it only makes sense to include the statistically significant variables in the model.

Fig.: Adjusted R squared
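The penalty mechanism can be sketched with the standard formula, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of independent variables. The example numbers are invented: a third variable improves R squared only marginally, so the adjusted value actually drops:

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted coefficient of determination.

    n: number of observations, p: number of independent variables.
    The term (n - 1) / (n - p - 1) grows with p and penalizes
    every additional variable.
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Invented example: adding a third, insignificant variable raises
# plain R squared only from 0.900 to 0.901 ...
base = adjusted_r_squared(0.900, n=50, p=2)
extended = adjusted_r_squared(0.901, n=50, p=3)
# ... which is NOT enough to overcompensate the penalty,
# so the adjusted value decreases.
```

Only a variable whose contribution overcompensates the penalty term raises the adjusted coefficient, which is exactly the filter for statistically significant variables described above.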

### Machine learning / artificial intelligence: Part 3

Part 3 of the online course introduces several methods for classification: *Machine Learning for Product Owner, IT Project Manager (Part 3): Classification Methods*