Machine learning and artificial intelligence will play an increasing role in software development; in the foreseeable future, there will probably be only a few software projects that can do without machine learning. According to Google/Alphabet boss Sundar Pichai: “We will move from a mobile-first world to an AI-first world”, where the computer will be a language assistant equipped with artificial intelligence (AI). Not every software developer or product owner will need to become a certified data scientist or machine learning expert. At the very least, however, they should have a basic understanding of machine learning. This blog describes my learning experience in an online machine learning course; you’ll be part of the learning journey.
The course I’m talking about can be found on the Udemy Academy. Course title: “Machine Learning A-Z™: Hands-On Python & R In Data Science”. The initial price of the course was about 400 EUR; I actually booked it for a mere 15 EUR. The course includes about 40 hours of video training and various teaching material. The course promises to teach the basics of machine learning, how to build robust machine learning models, how to design and perform powerful data analyses, and how to deal with specific topics such as Natural Language Processing or Deep Learning. The course definitely keeps these promises.
My take on the course: it is didactically very well structured, and the contents are quite easy to understand across all learning units, which makes for a very good introduction to machine learning.
Machine Learning / Artificial Intelligence – Introduction
In this series of articles about the Udemy Academy online course I will provide some knowledge nuggets, which give a good understanding of the content of the online course. Of course, this overview cannot replace the online course; the knowledge nuggets are intended as an appetizer for a long, exciting learning experience.
Let’s get started! Let me first give you an overview of where machine learning is already being used today. Applications range from face recognition, virtual reality glasses, text-to-speech, speech-to-text and voice assistants to walking robot dogs (keyword: “reinforcement learning”) and the recommendation algorithms on Amazon and other online retailers. It is also worth mentioning that machine learning will help us analyze the huge amounts of data generated in today’s data economy. To put this into perspective: mankind produced a TOTAL of 130 exabytes of data up to the year 2005 (all books, all songs, all speeches, etc.). In the following 10 years alone, however, the amount of data generated by mankind was about 60 times (!) that figure: roughly 7,900 exabytes.
Do you know what an exabyte really means? A letter (that is, a character) takes up one byte. A book page has about 1,024 characters, which is equivalent to one kilobyte (strictly speaking: 1 kilobyte = 1,024 bytes). A book with 1,024 pages takes up one megabyte. One million megabytes are one terabyte. If you film your entire life with an HD camera (every second!), then you get about one terabyte. And a million terabytes is equal to one exabyte.
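To make the orders of magnitude tangible, here is a small sketch in plain Python that follows the unit chain exactly as described above (a mix of factors of 1,024 and one million). The constant and function names are my own and are not taken from the course.

```python
# Unit sizes exactly as used in the paragraph above.
BYTE = 1
KILOBYTE = 1_024 * BYTE          # one book page of about 1,024 characters
MEGABYTE = 1_024 * KILOBYTE      # one book of 1,024 pages
TERABYTE = 1_000_000 * MEGABYTE  # one million megabytes
EXABYTE = 1_000_000 * TERABYTE   # one million terabytes


def books_per_exabyte() -> int:
    """Number of 1,024-page books that fit into one exabyte."""
    return EXABYTE // MEGABYTE


# Figures quoted in the text, in exabytes.
data_until_2005 = 130 * EXABYTE
data_2005_to_2015 = 7_900 * EXABYTE
growth_factor = data_2005_to_2015 / data_until_2005

print(f"1 exabyte = {EXABYTE:,} bytes")
print(f"Books per exabyte: {books_per_exabyte():,}")
print(f"Growth factor 2005 to 2015: about {growth_factor:.0f}x")
```

With these definitions, one exabyte corresponds to a trillion 1,024-page books, and the two data volumes quoted above differ by a factor of roughly 60.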
Machine Learning / Artificial Intelligence – Setting up the IDE
After the introduction, we (= participants of the online course) download the freemium open source distribution package ANACONDA, which contains the development environment (IDE) Spyder for the programming language Python, plus several libraries. It is amazingly simple. In the next step, the download and installation of the development environment for the programming language R is also explained (www.rstudio.com), but I am not following that step. The course offers machine learning tutorials for both programming languages (Python, R); I decided to follow only the implementation in Python.
During the course we will use various libraries that provide the required functions. The most important of these are numpy (various mathematical functions), matplotlib.pyplot (pyplot is a sub-library of Matplotlib, with which you can easily create charts) and finally pandas for data import and data management. We start the machine learning course by importing a simple data set as a data table (or matrix). The data comes from a (fictitious) online dealer. For each record (i.e. each row in the data table) there are, on the one hand, independent variables (here: customer parameters such as age, salary, gender) and, on the other hand, a dependent variable (here: buying behaviour).
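The import step might look roughly like the following sketch. The column names and values are invented for illustration and do not come from the course files; to keep the example self-contained, the CSV content is read from a string instead of from a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the online dealer's CSV file (invented values).
csv_text = """Age,Salary,Gender,Purchased
44,72000,Female,No
27,48000,Male,Yes
30,54000,Female,No
38,61000,Male,No
35,58000,Female,Yes
"""

# pd.read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

# Independent variables: every column except the last one.
X = dataset.iloc[:, :-1].values
# Dependent variable: the last column (buying behaviour).
y = dataset.iloc[:, -1].values

print(X.shape)
print(y)
```

Splitting the table into a feature matrix `X` and a target vector `y` is a common convention; most machine learning libraries expect their input in this shape.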
Machine Learning / Artificial Intelligence – Basic Data Preparation
A first step in data preparation is to find an answer to a typical challenge: missing data. For example, the age of a buyer is missing from a record, or the salary information for a particular buyer is missing. One could simply delete such incomplete records (not recommended!), or one could fill these gaps with plausible values: the average, the median, or the most frequently occurring value of the respective column. For this we use a library, namely scikit-learn (“Simple and efficient tools for data mining and data analysis”).
Machine learning uses mathematical algorithms, and these only process numerical data. So-called categories (such as the countries of origin of buyers, like “Germany”, “France” or “Spain”) must therefore be converted into numerical data. You may be tempted (as is common in relational data models) to simply assign a numerical value to each category. For example: Germany = “1”, France = “2”, Spain = “3” and so on. However, such an approach is unsuitable for the mathematics of machine learning. Why? It would impose a hierarchy or ordinal scale (3 is greater than 2, 2 is greater than 1) that does not exist between the categories. Instead, in machine learning, category data is transformed as follows: a separate column is generated for each category value. We then have three columns, one each for “Germany”, “France” and “Spain”. For each record, a “0” or “1” in each column provides the relevant information. If the buyer is from Spain, the values for the three columns are “0” (column: Germany), “0” (column: France), “1” (column: Spain).
Another important step in data preparation: the data is divided into a dataset for “training” (i.e. the learning process during machine learning) and a dataset for “testing” (checking whether the result of machine learning is good). In practice, a typical split is 75:25 (training:testing). This testing (and the subsequent optimization of the machine learning process) is particularly important because phenomena such as overfitting must be avoided. Overfitting refers to a situation in which the “machine” has simply “memorized” the individual records of the training data, but has not understood the crucial correlations between the data (the “logic of correlations”), and therefore performs poorly on data it has not seen before.
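A minimal sketch of filling such gaps with scikit-learn's `SimpleImputer`, assuming a small invented table with the columns age and salary. The strategy shown here is the column average; `"median"` and `"most_frequent"` are the other options mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: columns are age and salary; np.nan marks a missing entry.
X = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [np.nan, 54000.0],   # age is missing
    [38.0, np.nan],      # salary is missing
    [35.0, 58000.0],
])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

In this example the missing age is replaced by 36 (the average of the four known ages) and the missing salary by 58,000.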
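This transformation is often called one-hot encoding or creating dummy variables. One possible way to do it is with `pandas.get_dummies`, shown in the sketch below on an invented country column; the course itself may use a different tool for this step.

```python
import pandas as pd

# Hypothetical buyers with their country of origin.
buyers = pd.DataFrame({"Country": ["Germany", "France", "Spain", "Spain"]})

# One column per category value, filled with 0 or 1.
dummies = pd.get_dummies(buyers["Country"], dtype=int)

# get_dummies sorts the columns alphabetically; reorder to match the text.
dummies = dummies[["Germany", "France", "Spain"]]

print(dummies)
```

A buyer from Spain ends up as the row `0, 0, 1`, and every row contains exactly one `1`.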
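A 75:25 split can be sketched with scikit-learn's `train_test_split`. The data below is a made-up placeholder with 20 rows; `random_state` is only set so that the shuffle is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 rows with 2 features each, plus a binary target.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# test_size=0.25 reserves 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test))
```

The model is then fitted on `X_train` and `y_train` only; its quality is judged on `X_test` and `y_test`, which it has never seen during training.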
Machine Learning / Artificial Intelligence – Feature Scaling
Let’s take a look at the last step in data preparation. For that purpose we need to understand a basic characteristic of machine learning: many of the algorithms for determining correlations between the data are based on the so-called Euclidean distance.
Fig.: Why Feature Scaling? – The Euclidean distance in mathematical models for Machine Learning
For example, if we have data from shoppers on age and salary, then the ages are typically between 15 and 100, while the annual salaries are between 20,000 and 200,000. A “distance” of 50 in age (the age difference between two shoppers) is a lot; a distance of 50 in annual salary is next to nothing. Without going into deeper mathematics, it should be intuitively clear that comparing the data dimensions “salary” and “age” requires a uniform scale; otherwise the salary would dominate the distance simply because its numbers are larger. One therefore scales the values so that all features lie in a comparable range, for example roughly between “-1” and “+1”.
There are two procedures for this type of uniform scaling: standardisation (scaling using the mean and the standard deviation) and normalisation (scaling using the minimum and the maximum):
Fig.: Two methods for Feature Scaling
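The effect can be checked with a few lines of plain Python. The two shoppers are invented; the rescaling to the range 0 to 1 uses the value ranges quoted above and is just one possible choice of uniform scale.

```python
import math

# Two hypothetical shoppers as (age, annual salary).
shopper_a = (25, 50_000)
shopper_b = (75, 51_000)

# Unscaled Euclidean distance: the salary difference of 1,000 dominates,
# although the age difference of 50 is far more significant.
raw_distance = math.dist(shopper_a, shopper_b)


def rescale(value, low, high):
    """Map a value from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)


def scaled(shopper):
    """Rescale age (15 to 100) and salary (20,000 to 200,000) to [0, 1]."""
    age, salary = shopper
    return (rescale(age, 15, 100), rescale(salary, 20_000, 200_000))


scaled_distance = math.dist(scaled(shopper_a), scaled(shopper_b))

print(f"raw distance:    {raw_distance:.1f}")
print(f"scaled distance: {scaled_distance:.3f}")
```

Before scaling, the distance of about 1,001 is almost entirely due to the small salary difference. After scaling, the distance of about 0.59 is driven by the large age difference.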
Of course there is also a library for this task. You’ll be guided in the course to complete that task successfully.
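As a sketch, both procedures are available in scikit-learn as `StandardScaler` and `MinMaxScaler`. The data is invented, and I am assuming the usual textbook definitions of the two methods; the formulas are given in the comments.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical shoppers: columns are age and annual salary.
X = np.array([
    [25.0, 50000.0],
    [35.0, 60000.0],
    [45.0, 80000.0],
    [55.0, 110000.0],
])

# Standardisation: x' = (x - mean(x)) / standard_deviation(x)
# Each column ends up with mean 0 and standard deviation 1.
X_standardised = StandardScaler().fit_transform(X)

# Normalisation: x' = (x - min(x)) / (max(x) - min(x))
# Each column ends up in the range 0 to 1.
X_normalised = MinMaxScaler().fit_transform(X)

print(X_standardised)
print(X_normalised)
```

When the data has already been split into training and test sets, the scaler is typically fitted on the training data only (`fit_transform`) and then applied unchanged to the test data (`transform`).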
Machine Learning / Artificial Intelligence – Part 2: Regression Methods
Part 2 of the online course guides us in applying regression methods, from simple linear regression to “Random Forest Regression”: Machine Learning for Product Owner, IT Project Manager (Part 2): Regression Methods