Introduction to Machine Learning in Python

Data sets and Visualization

What is Machine Learning?

Machine learning is a subfield of Artificial Intelligence (AI). So what is Artificial Intelligence?

Andrew Moore, former Dean of the School of Computer Science at Carnegie Mellon University, defined it as follows: "Artificial intelligence is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence."

The question "What is artificial intelligence?" depends on the answer to a more general question: "What is intelligence?"

It turns out that this question is extremely hard to answer.

To get closer to an answer, we can divide AI into two categories:

weak AI and strong AI

weak AI:

  • deals with specific application problems
  • supports human thinking in certain areas
  • capable of learning in sub-areas
  • no awareness

strong AI:

  • "general intelligence" (reason, logical thinking, use strategy, solve puzzles, and make judgments under uncertainty)
  • comparable to human intelligence, but not necessarily identical to it
  • making plans
  • generally capable of learning
  • Communication skills, natural language
  • Awareness?
  • sentience, emotions?
  • self-perception?

We now know about Artificial Intelligence and about weak and strong AI, but what about Machine Learning?

Let's start with a very "old" attempt at a definition by Arthur Samuel, an IBM pioneer:

"Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed."

A good attempt, but many questions remain unanswered. Almost 40 years later, in 1998, Tom Mitchell formulated a "well-posed learning problem" as follows:

"Well posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

Annotation: A mathematical problem is called well-posed (also correctly posed or properly posed) if the following conditions are met:

  • The problem has a solution (existence).
  • The solution is unique (uniqueness).
  • The solution depends continuously on the input data (stability).
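Returning to Mitchell's definition, the roles of E, T and P can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the task T (deciding whether an integer counts as "large"), the hidden rule x >= 50, the six labelled examples that form the experience E, and the helper names learn_threshold and accuracy. The performance measure P is the fraction of correctly classified test points.

```python
# Task T: decide whether an integer x is "large" (label 1) or "small" (label 0).
# The hidden rule x >= 50 is made up for this illustration.
def true_label(x):
    return 1 if x >= 50 else 0


def learn_threshold(examples):
    """Place the threshold halfway between the largest 'small' example
    and the smallest 'large' example seen so far."""
    largest_small = max(x for x, label in examples if label == 0)
    smallest_large = min(x for x, label in examples if label == 1)
    return (largest_small + smallest_large) / 2


def accuracy(threshold, test_points):
    """Performance measure P: fraction of test points classified correctly."""
    correct = sum(
        (1 if x >= threshold else 0) == true_label(x) for x in test_points
    )
    return correct / len(test_points)


# Experience E: labelled examples that arrive one after another (hypothetical data).
experience = [(5, 0), (70, 1), (30, 0), (60, 1), (45, 0), (52, 1)]
test_points = range(100)

accuracies = []
for n in (2, 4, 6):
    threshold = learn_threshold(experience[:n])
    accuracies.append(accuracy(threshold, test_points))
    print(f"{n} examples -> threshold {threshold}, accuracy {accuracies[-1]:.2f}")
```

With these made-up examples the accuracy rises from 0.88 to 0.95 to 0.99 as more examples are used, which is the sense in which the program "learns" according to the definition.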

Machine Learning:

Machine learning means that an algorithm (the machine) learns automatically: it is capable of extracting the necessary knowledge from given data on its own. The goal is to make predictions on new, unseen data. There is another way of putting it: In traditional heuristic decision-making algorithms, the programmers set the rules according to which the decisions are made. With machine learning, the program derives these rules from the data by itself, without interference from human beings!



Machine learning taxonomy

There are two main approaches to Machine Learning:

  • Unsupervised Learning
  • Supervised Learning

We will solely cover "supervised learning" in this tutorial.

Examples of machine learning applications:

  • spam filter: the algorithm learns a predictive model from data labelled as "spam" and "not spam" (ham). After training, it can predict whether new emails are spam or not.
  • character recognition
  • object recognition in images
  • and many more

As already mentioned, a spam filter could be implemented using a classifier based on machine learning.

At the heart of machine learning is the concept of automating decision making from data without the user specifying explicit rules on how to make that decision. In the case of emails, the user does not provide a list of words or features that make an email spam. Instead, the user provides examples of spam and non-spam emails that are labelled as such. This is the so-called learning set.

The goal of a machine learning model is to make predictions on new, previously unseen data. In a real application, we are not interested in classifying an email that has already been labelled as spam or not. Instead, we want to make life easier for users by automatically classifying new incoming emails.
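The idea of learning from labelled examples instead of hand-written rules can be sketched in a few lines. This is only a toy illustration, not a realistic spam filter: the emails, the word-counting scheme and the function names train and predict are all assumptions invented for this example.

```python
from collections import Counter

# Learning set: (text, label) pairs. The emails are made up for illustration.
learning_set = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("win a free prize now", "spam"),
    ("meeting at noon", "ham"),
    ("lunch with the team at noon", "ham"),
    ("project meeting notes", "ham"),
]


def train(examples):
    """Count how often each word occurs in spam and in ham emails."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def predict(counts, text):
    """Label an email by the class that has seen its words more often."""
    words = text.split()
    spam_score = sum(counts["spam"][word] for word in words)
    ham_score = sum(counts["ham"][word] for word in words)
    return "spam" if spam_score > ham_score else "ham"


model = train(learning_set)

# Two new emails that are not part of the learning set.
print(predict(model, "free money prize"))
print(predict(model, "notes from the team meeting"))
```

Note that no list of "spam words" was written by hand. The word counts come entirely from the labelled examples, and the two emails at the end are new to the model.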

The algorithm is then trained on these examples.



After the learning phase, we have to evaluate the classifier. We test it both on the labelled learning data and on labelled test data that was not used for learning.
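A minimal sketch of this evaluation step follows. It assumes a 1-nearest-neighbour classifier, which is chosen here only because it is short to write down. The two-dimensional data points, the labels and the helper names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predict(train_samples, point):
    """1-nearest-neighbour: return the label of the closest training sample."""
    _, nearest_label = min(
        train_samples, key=lambda sample: squared_distance(sample[0], point)
    )
    return nearest_label


def accuracy(train_samples, samples):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(
        predict(train_samples, features) == label for features, label in samples
    )
    return correct / len(samples)


# Labelled learning data (made up): (feature vector, class label).
train_data = [
    ((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0),
    ((6.0, 5.0), 1), ((7.0, 6.0), 1), ((6.5, 7.0), 1),
]
# Labelled test data that the classifier has never seen (also made up).
test_data = [
    ((1.2, 1.5), 0), ((2.5, 2.0), 0), ((6.2, 6.0), 1), ((4.5, 4.0), 0),
]

train_accuracy = accuracy(train_data, train_data)
test_accuracy = accuracy(train_data, test_data)
print("accuracy on learning data:", train_accuracy)
print("accuracy on test data:", test_accuracy)
```

In this sketch the classifier is perfect on the learning data, because every training point is its own nearest neighbour, but it misclassifies one of the four test points. This is why the result on unseen test data is the more honest estimate of how the classifier will behave on new data.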



If we are satisfied with the results, the classifier is ready to classify completely new documents.



The data is usually presented to the algorithm as a two-dimensional array (or matrix) of numbers. Each data point (also known as a sample or training instance) that we want to either learn from or make a decision on is represented as a list of numbers, a so-called feature vector. The features it contains represent the properties of this data point.
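As a small illustration of this layout, the following sketch stores four samples with three features each. The feature names and all numbers are invented for the example, and plain nested lists are used to keep it free of dependencies. In practice an array library such as NumPy is a common choice for this kind of data.

```python
# Hypothetical features describing an email.
feature_names = ["num_words", "num_links", "num_exclamation_marks"]

# Each row is one sample (a feature vector), each column is one feature.
X = [
    [120, 0, 1],
    [15, 3, 5],
    [300, 1, 0],
    [22, 4, 7],
]
# One label per row of X, as needed for supervised learning.
y = ["ham", "spam", "ham", "spam"]

n_samples = len(X)
n_features = len(X[0])
print("shape:", (n_samples, n_features))

# The feature vector of the first sample, with its label.
print(dict(zip(feature_names, X[0])), "->", y[0])

# One feature across all samples: the "num_links" column.
link_column = [row[feature_names.index("num_links")] for row in X]
print("num_links column:", link_column)
```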