Machine learning and artificial intelligence will play an increasing role in software development; in the foreseeable future, there will probably be only a few software projects that can do without machine learning. According to Google/Alphabet CEO Sundar Pichai: “We will move from a mobile-first world to an AI-first world”, in which the computer becomes a language assistant equipped with artificial intelligence (AI). Not every software developer or product owner will therefore need to become a certified data scientist or machine learning expert. At the very least, however, they should have a basic understanding of machine learning. This blog describes the learning experience in an online machine learning course; you’ll be part of the learning journey.
The course I’m talking about can be found on Udemy. Course title: “Machine Learning A-Z™: Hands-On Python & R In Data Science”. The regular price of the course was about 400 EUR; I actually booked it for a mere 15 EUR. The course includes about 40 hours of video training plus various teaching materials. The course promises: teaching the basics of machine learning, building robust machine learning models, conceiving and performing powerful data analyses, and dealing with specific topics such as Natural Language Processing and Deep Learning. The course definitely keeps these promises.
Here’s PART 1 of the series, which deals with data preprocessing for machine learning: Machine Learning for Product Owner, IT Project Manager (Part 1): Introduction and data preparation
The second part of the online course introduces a variety of linear and non-linear regression methods in order to analyze correlations within a given data set; the so-called “matrix of features” contains the independent variables, while the dependent variable forms a separate vector. The tutors of the course start with simple linear regression. This is based on mathematics that can be easily understood: simple high-school mathematics. The magic is done with the method of least squares. For the more sophisticated regression methods, the course aims to establish an intuitive understanding; after all, you cannot explain something like information entropy in a few minutes. However, this intuitive understanding is sufficient to apply the methods in practice. And that’s precisely the key focus of this course: applying the methods in practice.
Note: The source of all following illustrations is the course itself.
Machine learning / artificial intelligence: Simple Linear Regression
The second part begins with the simplest regression method, namely Simple Linear Regression. There is a single independent variable and a single dependent variable, so the data points can simply be plotted in a two-dimensional coordinate system (x-axis, y-axis). The correlation between the data is represented by a straight line (y = constant + coefficient*x). The straight line is placed in such a way that its distance to the observation points is minimal. For this purpose, the method of least squares is applied: take the vertical distance between each data point and the straight line, square this value, then sum these squared values over all data points. The best-fitting regression line is the one that minimizes this sum.
Fig.: Least squares Method
The implementation in Python code follows a scheme that is comparable across all regression methods: (a) the required library and class are imported, (b) an object is created, (c) the object’s fit method is called to fit the model to the data set, i.e. to apply the mathematical algorithm to the data; predictions are then obtained with predict.
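The three-step scheme can be sketched in a few lines with scikit-learn. The tiny data set below (years of experience vs. salary) is made up for illustration; it is not the course's data set.

```python
# Minimal sketch of the (a)-(b)-(c) scheme, using scikit-learn
# and a small hypothetical data set.
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical training data: one independent, one dependent variable
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # e.g. years of experience
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])        # e.g. salary in thousands

# (a) class imported above; (b) create the regressor object
regressor = LinearRegression()

# (c) fit the least-squares line to the data
regressor.fit(X, y)

# the fitted line: y = intercept + coefficient * x
print(regressor.intercept_, regressor.coef_[0])
print(regressor.predict(np.array([[6.0]])))         # prediction for a new point
```

The `fit` call performs exactly the least-squares minimization described above; `intercept_` and `coef_` hold the constant and the coefficient of the resulting line.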
Machine learning / artificial intelligence: Backward Elimination
If you move on to implementing a machine learning model based on Multiple Regression, you come across a general challenge in the design of machine learning models. So, what’s that? Let’s assume there is a data set with the socio-economic profiles of buyers of a product, consisting of 80 characteristics: salary, gender, place of residence, age, etc. A data scientist could now set up a machine learning model in which ALL properties are introduced in a function according to the following scheme:
Fig.: Mathematical Function for Multiple Regression
If I consider ALL properties in the machine learning model, then I make the assumption that ALL properties are statistically relevant. In short: I assume that each property influences the actual behavior of a person. But is that true, is that a good assumption? Actually, a good data scientist will consider all statistically relevant independent variables while leaving out those variables that are statistically insignificant. But how does he or she do that?
The data scientist determines the statistical relevance of each individual property (of each independent variable) by using the so-called P-value. This value is the probability of obtaining the observed values (or more extreme ones) if the so-called null hypothesis is valid. The null hypothesis here: “The independent variable has NO statistically significant influence on the result”. A significance level is defined as the decision criterion (0.05 is used by default). If the P-value is above the significance level, then the null hypothesis is (still) considered valid; if the P-value is below the significance level, then the null hypothesis is rejected, i.e. it is assumed that the independent variable has a statistically relevant influence on the result (in other words: there is a statistically significant correlation between the independent variable and the dependent variable).
A common procedure for the elimination of statistically insignificant variables is called Backward Elimination: you start with a model containing ALL variables and eliminate step by step those variables whose P-value is ABOVE the significance level (by default: above 0.05). The calculation of the P-values is done with the help of a Python library (statsmodels; scikit-learn’s regression classes do not report P-values). This has been applied to the machine learning model based on Multiple Regression; check the following screenshot of the Python IDE. You’ll find the relevant statistical information in the console window (right half of the screen, below):
Screenshot: Display of statistical values per independent variable in a machine learning model
Machine learning / artificial intelligence: Polynomial Regression
Another regression method is polynomial regression, which is a non-linear regression. The mathematical expression looks as follows (for degree two, this is the function that produces parabolas):
Fig.: Polynomial Formula
In the Udemy online course you will learn how to generate graphics to display the results. Actually, only a few lines of code are required to generate graphs that represent a polynomial function of second degree, third degree, fourth degree, and so on. When comparing the graphs of the different polynomial functions, it’s quite obvious: the higher the degree of the polynomial function, the better the regression fits the training data (beyond a certain degree, however, this turns into overfitting):
Fig.: Graph of a polynomial function of second degree
Fig.: Graph of a polynomial function of third degree
Fig.: Graph of a polynomial function of fourth degree
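The degree comparison above can be reproduced with scikit-learn: `PolynomialFeatures` expands x into the columns 1, x, x², … and an ordinary linear regression is then fitted on them. The data set here is a made-up cubic curve, chosen so the effect of the degree is visible.

```python
# Sketch of polynomial regression for degrees 2, 3 and 4 on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 11, dtype=float).reshape(-1, 1)    # e.g. position levels 1..10
y = X.ravel() ** 3 + 5.0                            # hypothetical non-linear target

scores = {}
for degree in (2, 3, 4):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)                  # adds x^2, x^3, ... columns
    model = LinearRegression().fit(X_poly, y)
    scores[degree] = model.score(X_poly, y)         # R^2 on the training data
    print(degree, scores[degree])
```

On this data the training-set R² rises with the degree, matching the graphs above; on real data, a high degree may simply be overfitting.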
Machine learning / artificial intelligence: Support Vector Regression
The next regression procedure for machine learning actually goes beyond high-school mathematics: Support Vector Regression. The online course from Udemy must inevitably refrain from teaching the sophisticated mathematics behind this regression method; that would simply exceed the time frame.
Fig.: Correlation Matrix for Support Vector Regression
However, the machine learning course does provide an intuitive understanding of the Support Vector Regression method. In fact, this is sufficient to produce results with the help of the introduced libraries. In a first step, the user simply adopts the default settings for the various parameters of this regression method (see parameters such as “epsilon=0.1” or “kernel=rbf” in the console window of the following screenshot); those who wish to deepen their knowledge of this regression method can of course study the various parameters through further research.
Fig.: Machine learning model in Python for Support Vector Regression
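A minimal SVR sketch with scikit-learn, using the default parameters mentioned above, might look as follows. The data set is invented; one practical point worth noting is that, unlike the linear regression classes, SVR does not scale the data internally, so feature scaling is applied explicitly here.

```python
# Sketch of Support Vector Regression with default parameters
# (kernel='rbf', epsilon=0.1) on a small made-up data set.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

X = np.linspace(1, 10, 10).reshape(-1, 1)
y = np.array([45., 50., 60., 80., 110., 150., 200., 300., 500., 1000.])

# SVR is sensitive to feature scaling, so both X and y are standardized
sc_X, sc_y = StandardScaler(), StandardScaler()
X_s = sc_X.fit_transform(X)
y_s = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

regressor = SVR(kernel="rbf", epsilon=0.1)      # the default settings from the course
regressor.fit(X_s, y_s)

# predict for x = 6.5 and transform the result back to the original scale
pred_s = regressor.predict(sc_X.transform([[6.5]]))
pred = sc_y.inverse_transform(pred_s.reshape(-1, 1))[0, 0]
print(pred)
```

The structure is the same (a)-(b)-(c) scheme as before; only the scaling step and the back-transformation of the prediction are new.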
The course follows a pragmatic approach to machine learning: a deep dive into the mathematical foundations is not always possible or even required; the student, however, receives sufficient knowledge to put the methods of machine learning into practice. This pragmatic approach is maintained throughout the rest of the course.
Machine learning / artificial intelligence: Regression Tree
A procedure that can be understood intuitively very easily (and is mathematically based on the concept of information entropy) is the regression tree, a method belonging to the family of decision trees. Let’s take the example of two independent variables (and one dependent variable): the data points (or: observation points) are plotted in a two-dimensional coordinate system as follows:
Fig.: Regression Tree, data points in a two-dimensional coordinate system
By applying the regression tree method, you’ll get segments. Within these segments, the data points have a high “proximity” to each other. The quality of a segmentation is measured with the MSE (Mean Squared Error). MSE plays the same role for the regression tree as the least-squares criterion does for simple linear regression.
Importantly, this segmentation is done iteratively: first “Split 1” (or: segmentation 1) is performed, then “Split 2”, and so on. Following this iterative procedure, the regression tree is finally generated:
Fig.: Machine Learning – Regression Tree
Such a regression tree could be applied, for example, to salary data. Let’s take a simple example: we have two data dimensions, “professional experience in years” and “qualification”. You would obtain a decision tree which, after various ramifications, could show the following segments: “technical university degree, professional experience 8 to 15 years”, “master craftsman’s examination in a technical profession, 15 to 23 years”, and so on. The prediction for the dependent variable (in our example: “salary”) is then the average of the dependent variable over those observation points that fall within the corresponding segment. The regression tree looks as follows:
Fig.: Machine learning regression tree. Average values for dependent variable
By the way, this regression method for machine learning is the first method presented that is non-continuous. You can see that clearly when you look at the function graph (which can be generated with a few lines of Python):
Fig.: Regression tree – Non-linear, non-continuous.
There are steps in the function graph. This can be traced back to the regression method: the average of the dependent variable is formed over all data points in a segment, and this averaging leads to the staircase shape of the function graph.
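The staircase behavior is easy to demonstrate with scikit-learn's `DecisionTreeRegressor`. The tiny salary-like data set below is invented; it contains three rough plateaus, and every x that falls into the same segment receives the same prediction, namely the segment average.

```python
# Sketch of the regression tree's piecewise-constant ("staircase")
# prediction on a made-up data set with three plateaus.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.], [2.], [3.], [4.], [5.], [6.], [7.], [8.]])
y = np.array([30., 32., 31., 60., 62., 61., 90., 92.])

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)        # splits are chosen to minimize the MSE within segments

# points inside the same segment all get that segment's average value
print(tree.predict([[1.5]]))
print(tree.predict([[4.5]]))
print(tree.predict([[7.5]]))
```

Plotting `tree.predict` over a fine grid of x values produces exactly the step function shown in the figure above.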
Machine learning / artificial intelligence: Random Forest Regression
The last regression method presented in this part of the course is the so-called Random Forest Regression. This method uses a principle that has high practical significance in machine learning in general. To put it simply: the more (different) predictors you combine, the better the quality of the results, and the closer you come to the true correlation within the data.
This general principle is used in Random Forest Regression in a modified way: it’s not different methods that are combined here; instead, the Regression Tree method is applied repeatedly to different random subsets of the data (random selection). Each iteration with a different subset yields its own result for the combination of independent variables, that is: a “tree”. Looking at the results of many iterations, one gets many trees, i.e. a “forest”. Finally, the average over all trees is calculated.
Machine learning / artificial intelligence: Coefficient of determination
Part 2 of the Udemy course concludes with general reflections on building models for machine learning. It is mainly concerned with the question: how can the quality of such models be determined? For this purpose, the coefficient of determination is used.
Even if your mathematics/statistics lessons were a while ago, it is easy to understand: on the one hand, we have the sum of squared residuals, which was already introduced above (the quantity minimized when calculating a simple linear regression). On the other hand, another sum is formed in the same way: we take the average of the dependent variable and sum up the squared distances between the observed values and this average. If the simple linear regression was fitted properly, the distance of the data points to the regression line should be (significantly) smaller than their distance to a horizontal line that simply shows the average value. And this is precisely how the quality of the machine learning model is determined: by comparing these two sums of squared distances. The result is the coefficient of determination, in English: R squared.
Fig: R squared
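The comparison of the two sums described above corresponds to the standard definition of the coefficient of determination, which can be written as:

```latex
R^2 \;=\; 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
     \;=\; 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}
```

Here $y_i$ are the observed values, $\hat{y}_i$ the predictions of the regression line, and $\bar{y}$ the average of the dependent variable; the closer the residual sum $SS_{\mathrm{res}}$ is to zero relative to $SS_{\mathrm{tot}}$, the closer $R^2$ is to 1.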
For a data scientist, however, the plain coefficient of determination is not yet practicable for the development of machine learning models: even if there is no statistically significant relationship between an independent variable and the dependent variable, extending a mathematical model by such a statistically insignificant variable still increases the coefficient of determination; the regression algorithm finds some purely random relationship, resulting in a tiny coefficient (say, 0.000001) for such an independent variable. In the worst case (if there is no correlation at all, not even a purely random one), the coefficient is set to 0; in no case does an additional independent variable make the mathematical model worse on the training data.
Fig.: Increase of “R squared” with each additional independent variable
The (simple) coefficient of determination is therefore only of limited use as a measure of the quality of a machine learning model. What’s used instead: the so-called adjusted coefficient of determination (“Adjusted R squared”). The formula for the coefficient of determination is extended by a term that reduces the adjusted coefficient with each increase in the number of independent variables in the model. Thus, it only makes sense to introduce a new independent variable into the model if the regression improves significantly with its help. Significantly means: the improvement overcompensates for the penalty from the increase in the number of variables. In other words: it only makes sense to include the statistically significant variables in the model.
Fig.: Adjusted R squared
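The contrast between the two measures can be demonstrated in a few lines. The example below uses synthetic data (one truly relevant variable, one pure noise variable) and the standard formula adjusted R² = 1 − (1 − R²)·(n − 1)/(n − p − 1), where n is the number of observations and p the number of independent variables.

```python
# Sketch: plain R^2 never decreases when a variable is added,
# while adjusted R^2 penalizes extra variables; data is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 50
x_useful = rng.normal(size=(n, 1))                  # truly relevant variable
y = 2.0 * x_useful.ravel() + rng.normal(scale=0.5, size=n)
x_noise = rng.normal(size=(n, 1))                   # statistically irrelevant variable

def r2_and_adjusted(X, y):
    p = X.shape[1]
    r2 = LinearRegression().fit(X, y).score(X, y)
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

r2_small, adj_small = r2_and_adjusted(x_useful, y)
r2_big, adj_big = r2_and_adjusted(np.hstack([x_useful, x_noise]), y)

print("R^2:          ", r2_small, "->", r2_big)     # never decreases
print("adjusted R^2: ", adj_small, "->", adj_big)   # penalized for the extra variable
```

Adding the noise variable nudges the plain R² up (the algorithm finds some purely random correlation), while the adjusted R² subtracts a penalty for the extra variable, which is exactly the behavior described above.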
Machine learning / artificial intelligence: Part 3
Part 3 of the Online Course introduces several methods for classification: Machine Learning for Product Owner, IT Project Manager (Part 3): Classification Methods