We’re going to start with one of the most commonly used families of classifiers, the linear classifier, and in particular with logistic regression, the most commonly used linear classifier and one of the most useful. These concepts are not limited to linear classifiers, though; they are core ideas in classification.
Our example in this part is a restaurant review system, which we’d like to use to choose a great Japanese restaurant that has awesome sushi. There are many aspects to a restaurant, but what I really care about is amazing food and ambience. Just reading a review might not give me what I want: it may tell me about the location, service, parking and so on, which are not important to me, or about dishes other than sushi, which I don’t care about at all. The only thing I really care about is the sushi, so I only want the parts of the reviews that talk about it.
To keep it simple for now, we’re going to send every sentence to a classifier, and it will tell us whether the sentence is positive or negative.

A linear classifier takes some input x (in our case a sentence from a review), feeds it to its model, and predicts the output label, which in our example is either positive or negative.
To do so, a linear classifier assigns a weight to each word, which defines how positive or negative an influence that word has on the review. These weights, or coefficients, are learned from the training data.

Note: The model is called a linear classifier because the output is the weighted sum of the inputs.

Now let’s have an example and calculate the score. Imagine we have this review:
Sushi was great, the food was awesome, but the service was terrible.
Since only three words in this review have non-zero coefficients, the score is:

Score = 1.2 + 1.7 - 2.1 = 0.8
Because the score is greater than zero, the review is considered positive.

So in general, our simple linear classifier computes the score of the input from the coefficients of its words; if the score is greater than zero the prediction is positive, and otherwise it is negative.
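The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the post gives the three non-zero values 1.2, 1.7 and -2.1, but which word carries which weight is my assumption here, and the function names are made up for the example.

```python
# Assumed mapping of the three non-zero coefficients to words.
weights = {"great": 1.2, "awesome": 1.7, "terrible": -2.1}

def score(sentence, weights):
    """Sum the weight of every word; words without a weight count as 0."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(weights.get(word, 0.0) for word in words)

def predict(sentence, weights):
    """Return +1 for a score greater than zero and -1 otherwise."""
    return 1 if score(sentence, weights) > 0 else -1

review = "Sushi was great, the food was awesome, but the service was terrible."
print(round(score(review, weights), 2))  # 0.8
print(predict(review, weights))          # 1
```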

Now let's look at the learning box and how we can calculate the coefficient for each word.
To train the weights of our linear classifier we use training data: reviews labeled as either positive or negative. We split them into a training set and a validation set, then feed the training set to a learning algorithm that learns the weight for each word. After we learn our classifier, we go back and evaluate its accuracy on the validation set.
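Here is a minimal sketch of the split-and-evaluate part of that workflow. The sentences, the weights and the helper names are all hypothetical, and the learning step itself is not shown; the fixed `weights` dictionary simply stands in for whatever the learning algorithm would produce.

```python
import random

# Hypothetical labeled sentences (+1 = positive, -1 = negative).
reviews = [
    ("the sushi was awesome", 1),
    ("great food and great ambience", 1),
    ("the service was terrible", -1),
    ("awful sushi", -1),
    ("awesome place", 1),
    ("terrible and awful", -1),
]

# Hypothetical weights standing in for the output of the learning algorithm.
weights = {"awesome": 1.7, "great": 1.2, "terrible": -2.1, "awful": -1.5}

def classify(sentence):
    """Predict +1 or -1 from the sign of the weighted word score."""
    total = sum(weights.get(word, 0.0) for word in sentence.lower().split())
    return 1 if total > 0 else -1

def split(examples, n_validation, seed=0):
    """Shuffle a copy of the examples and hold out n_validation of them."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]

def accuracy(classify, examples):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for sentence, label in examples if classify(sentence) == label)
    return correct / len(examples)

training_set, validation_set = split(reviews, n_validation=2)
# A real learning algorithm would fit `weights` on training_set here.
print(len(training_set), len(validation_set))  # 4 2
print(accuracy(classify, validation_set))      # 1.0
```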

To understand the coefficients, let's review the notion of a decision boundary, which is the boundary between positive and negative predictions.
Imagine that every word has a weight of zero except "awesome" and "awful", which have +1 and -1.5 respectively. Let's plot this on a graph and see what happens.
For example, if we plot the following sentence in the space of these two features:
The sushi was awesome, the food was awesome, but the service was awful.
it lands at the point (2, 1), since it contains two "awesome" and one "awful". If we plot some other points as well, we end up with a plot like the one below:

The decision boundary for this classifier, with coefficients +1 and -1.5, is a line. With #awesome on the horizontal axis and #awful on the vertical axis, everything below this line is positive and everything above it is negative. On the line itself, the number of "awesome" times +1 plus the number of "awful" times -1.5 equals zero (this is the line's equation).
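To make the line equation concrete, here is a small sketch using the two coefficients from the example. The function names are mine, not part of any library.

```python
W_AWESOME, W_AWFUL = 1.0, -1.5  # coefficients from the example

def score(n_awesome, n_awful):
    """Weighted sum of the two word counts."""
    return W_AWESOME * n_awesome + W_AWFUL * n_awful

def boundary_awful(n_awesome):
    """#awful value that puts a point exactly on the boundary (score == 0)."""
    return -W_AWESOME * n_awesome / W_AWFUL

print(score(2, 1))        # 0.5, so the point (2, 1) is on the positive side
print(boundary_awful(3))  # 2.0, so the point (3, 2) lies on the line
print(score(3, 2))        # 0.0
```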

Note: With 2 features the decision boundary is a LINE, with 3 features it is a PLANE, and with more than 3 features it is a HYPERPLANE.

So in a three-dimensional space we have a plane, and if we write the plane equation, or in other words our score function, it looks like this:
\(Score(x) = w_0 + w_1 \#awesome + w_2 \#awful + w_3 \#great\)

Note: Our general notation is as follows:

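Our model then takes the sign of the score as its prediction, positive or negative (writing \(x_i[0] = 1\) for the constant feature, so that the sum can be written as \(w^T x_i\)):
$$Score(x_i) = w_0 + w_1 x_i[1] + w_2 x_i[2] + w_3 x_i[3] + ... + w_d x_i[d] = w^T x_i$$
$$Model : \hat y_i = sign(Score(x_i))$$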

Effect of Changing Coefficients:
Note that if we change our coefficients, our decision boundary (line, plane or hyperplane) changes too. For example, if we change the coefficients in the previous example, we end up with a different decision boundary like this:

We also do some feature extraction and feature engineering on our input data x and use the results instead of the raw inputs. For example, instead of using \(x_i[1]\) we can use \(h_1(x_i)\), where \(h_1(x_i)\) could be \(\log(x_i[1])\) or \(\sqrt{x_i[1]}\) or, even better, its TF-IDF value.
Thus we rewrite our score function like this:

$$ Score(x_i) = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) = \sum_{j=0}^D w_j h_j(x_i) = w^T h(x_i) \\
Model: \hat y_i = sign(Score(x_i))$$
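As a rough sketch of the score with feature functions, the snippet below builds \(h(x_i)\) from raw word counts and takes the sign of \(w^T h(x_i)\). The particular feature functions and weights are assumptions for illustration; note that I use \(\log(1 + x)\) rather than \(\log(x)\) so that a count of zero does not break the logarithm.

```python
import math

# Hypothetical feature functions h_j; x is [#awesome, #awful].
feature_functions = [
    lambda x: 1.0,                  # h_0: constant feature paired with w_0
    lambda x: math.log(1 + x[0]),   # h_1: damped count of "awesome"
    lambda x: math.sqrt(x[1]),      # h_2: damped count of "awful"
]

def h(x):
    """Map raw counts to the feature vector [h_0(x), h_1(x), h_2(x)]."""
    return [h_j(x) for h_j in feature_functions]

def score(w, x):
    """Compute w^T h(x) as a sum over j = 0..D."""
    return sum(w_j * h_j for w_j, h_j in zip(w, h(x)))

def sign(value):
    return 1 if value > 0 else -1

w = [0.5, 1.0, -1.5]  # hypothetical coefficients
x = [3, 1]            # 3 "awesome", 1 "awful"
print(sign(score(w, x)))  # 1
```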

Logistic regression doesn't just tell us whether a review is positive or negative; it also tells us how sure we can be about that prediction, using the notion of probability.

If we say that the probability of the output being positive is 0.7:
$$P(y=+1) = 0.7 $$
it means that in our dataset of reviews 70 percent of the reviews are positive.

Probabilities are always between 0 and 1, and the probabilities of all possible outcomes sum to 1.
Conditional probabilities are the probability of something given an input. For example, the following is the probability that the output label is +1 given that the input sentence contains 3 "awesome" and 1 "awful":
$$P(y=+1 \mid \#awesome = 3, \#awful = 1)$$

Now, how can we map our score, which ranges from \(-\infty\) to \(+\infty\), to a probability between 0 and 1?
Here we use a generalized linear model, which uses a link function to connect the score to the probability. It squeezes the range from \(-\infty\) to \(+\infty\) into the interval between 0 and 1.

For logistic regression this function is called the logistic function or sigmoid (it is the inverse of the logit function), and its formula is:
$$\frac {1}{1+e^{-Score}} $$

As the score goes to \(-\infty\) it approaches 0, as the score goes to \(+\infty\) it approaches 1, and for a score of 0 it is 0.5, which is exactly what we want.
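These three properties are easy to check numerically. Below is a minimal sketch of the sigmoid; the two-branch form is my own choice, made only so that very negative scores do not overflow `math.exp`.

```python
import math

def sigmoid(score):
    """Logistic function 1 / (1 + e^(-score)), written to avoid overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)

print(sigmoid(0))                     # 0.5
print(sigmoid(-1000), sigmoid(1000))  # 0.0 1.0
print(round(sigmoid(0.8), 3))         # 0.69
```

The last line uses the score 0.8 from the sushi review earlier, so that review would be predicted positive with a probability of roughly 0.69.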

Also note that when we change our coefficients, the S shape of our sigmoid function changes: if we change the constant term \(w_0\) we shift the curve to the left or right, and if we change the magnitude of the other coefficients we make it sharper or flatter.
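A quick numeric sketch of both effects, for a single input feature x with score \(w_0 + w_1 x\). The coefficient values are arbitrary and chosen only for illustration.

```python
import math

def probability(x, w0, w1):
    """Sigmoid of the score w0 + w1 * x."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def midpoint(w0, w1):
    """Input value where the probability is exactly 0.5 (score == 0)."""
    return -w0 / w1

# Changing the constant w0 moves the point where the curve crosses 0.5.
print(midpoint(-2.0, 1.0))  # 2.0
print(midpoint(-4.0, 1.0))  # 4.0

# A larger w1 makes the curve steeper around that crossing point.
print(round(probability(1, 0.0, 1.0), 3))  # 0.731
print(round(probability(1, 0.0, 3.0), 3))  # 0.953
```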

So far our model looks like this:
$$P(y=+1 \mid x_i, w) = \frac{1}{1+e^{-w^T h(x_i)}}$$

To find the best value for \(\hat w\) we use a concept called likelihood, which I'll describe in the next post. With it we can find the best coefficients w, and hence the best decision boundary, using a technique called gradient ascent.

If you have numeric inputs, it makes sense to multiply them by the coefficients. With categorical data, such as the zipcodes of houses, that reasoning breaks down: a zipcode of 99150 is not 9 times bigger than a zipcode of 10432, they just label two different zones. Likewise, if you have categories of animals, you can't say that the category of dogs is greater than the category of kangaroos.
Here we use a concept called one-hot encoding, where we generate one boolean column for each category, and for each sample exactly one of these columns gets the value 1.
Bag of words is another example of turning categorical data into numbers.
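Both encodings can be sketched in plain Python. The helper names below are made up for this example; in practice a library would usually do this for you.

```python
from collections import Counter

def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

categories, rows = one_hot(["dog", "kangaroo", "cat", "dog"])
print(categories)  # ['cat', 'dog', 'kangaroo']
print(rows)        # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]

vocabulary = ["awesome", "awful", "great"]
print(bag_of_words("awesome sushi awesome food awful service", vocabulary))  # [2, 1, 0]
```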

For multiclass classification we use a concept called 1 versus all: each time, we split the data into two categories, one class versus all the remaining classes.
For example, if we have three classes (hearts, doughnuts and triangles), we train one classifier for each of them against the rest.
You can see these steps in the following images.
This is our data:

For example, here we separate triangles from doughnuts and hearts:

In general, we repeat the same procedure for each of the classes:

Then, when we want to predict a new instance, we run it through all of the classifiers (each of which has a different \(\hat w\)) and pick the class whose classifier predicts the highest probability.
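The prediction step can be sketched as follows. The three weight vectors are invented for illustration (in practice each one comes from training a separate logistic regression), and each input is assumed to have two numeric features.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights [w_0, w_1, w_2], one vector per 1 versus all classifier.
classifiers = {
    "heart": [0.0, 2.0, -1.0],
    "doughnut": [0.5, -1.0, -1.0],
    "triangle": [-1.0, -0.5, 2.0],
}

def predict(x):
    """Run every classifier on x and return the most probable class."""
    probabilities = {
        label: sigmoid(w[0] + w[1] * x[0] + w[2] * x[1])
        for label, w in classifiers.items()
    }
    best = max(probabilities, key=probabilities.get)
    return best, probabilities

label, probabilities = predict([0.2, 1.5])
print(label)  # triangle
print(predict([2.0, 0.0])[0])  # heart
```

Note that the three probabilities do not have to sum to 1, because each classifier is trained independently; we only compare them to pick the winner.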

Classification is one of the fundamental concepts and most widely used areas of machine learning. With classifiers, we learn a model that takes an input and predicts a class as its output, which can be thought of as a label or category. In other words, it’s a mapping from x to y.

For this part, we use a sentiment classifier that receives a review sentence as input and classifies it as positive or negative. Another example is spam classification, where the classifier receives an email as input and predicts the probability that the email is spam or not spam.

But classification is not just binary; it can also be about predicting one of multiple classes or categories. For example, suppose you want to show ads on a web page and would like to know what kind of ads to put on that page. In this example, our classifier receives a web page as input and categorizes it as a page about education, finance, technology, and so on.

As we mentioned earlier, spam filters are another great example of classifiers: they categorize an email based not only on its text, but also on the IP address of the sender, other emails this sender has already sent, the recipients and so on.

Image classification is another example that appears in many different areas: based on the image, the classifier predicts the output.

Personalized medical diagnosis is another example, where patients are not treated as if they were all in the same condition. Everybody has a different lifestyle, DNA sequence, medical history and so on, and taking that information into account makes a big difference when predicting a patient’s condition and choosing the treatment that is most effective for her or him.

Classification is at the core of many of the technologies we use today, whether it’s spam filtering, search engine results, product recommendation, ad placement, personalized medical treatment and so on.

To help you understand classification better, my next post presents a real and useful application: an intelligent restaurant review system!

Hope you enjoy it ;)

Welcome to the Machine Learning world;

It is said that the first post must be a hello world! SO, BE IT! ;)

I'm going to give you some simple step-by-step tutorials about Machine Learning and its use in the real world with our lovely open source libraries, Scikit-Learn and TensorFlow, two of the best ML libraries in the world!

So let's get started with our simple HELLO WORLD! It will be very simple and interesting, I promise.

First things first!

Let's start with a simple definition of Machine Learning:

Machine Learning can be seen as a subfield of Artificial Intelligence. It's the study of algorithms that learn from examples and experience instead of relying on hard-coded rules.

Imagine that you want to write a program that takes an image file as input and tells you whether it shows an orange or an apple. How can you do this?

Maybe by defining a set of rules that count the green and orange pixels, look at the proportions of the image and so on. But what if the image is in black and white? Or what if you get an image of a watermelon instead?
You can write as many rules as you want, but I can give you an exception for each one of them.
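To see how brittle hand-written rules are, here is a toy sketch. The two inputs (the fraction of orange pixels and the fraction of green pixels in the image) and the 0.5 thresholds are hypothetical; a real program would first have to compute them from the image file.

```python
def classify_by_rules(orange_pixel_ratio, green_pixel_ratio):
    """Hand-written rules on hypothetical pixel statistics of an image."""
    if orange_pixel_ratio > 0.5:
        return "orange"
    if green_pixel_ratio > 0.5:
        return "apple"
    return "unknown"

print(classify_by_rules(0.7, 0.1))  # orange
print(classify_by_rules(0.1, 0.8))  # apple (a green watermelon would get the same answer)
print(classify_by_rules(0.0, 0.0))  # unknown (a black-and-white image has no colored pixels)
```

Every new exception means another rule, and the list never ends.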

We have two classes, Orange and Apple, and we have to write a long list of rules to identify them.

So clearly, we need an algorithm that can figure out these rules for us, so we don’t need to write them by hand. Such an algorithm is called a classifier. You can think of a classifier as a function: it gets an input, processes it and generates an output. For our fruit example, it takes an image and assigns a label (apple or orange) to it as the output. The technique for creating the classifier automatically is called supervised learning, and it starts with examples of the problem you want to solve.

The recipe is very simple and consists of 3 steps:

Step 1: Collecting our training data
We use our examples as the training data, and the output will be our predictions. Our training data is a description of the apples and oranges in terms of measurements such as size, weight and texture, which we call features.

We create a table from our examples. Each row of the table is an example that describes one piece of fruit. The last column is the label, which identifies the type of fruit in that row; all the other columns are our features. A good feature is one that makes it easy to discriminate between different types of fruit.
In general, the more training data you have, the better the classifier you can create.

Step 2: Training our classifier
For this simple example our classifier is a decision tree, and for now we can think of our classifier as a box of rules.

After we import tree from Scikit-Learn, we can start our training by creating an empty classifier. The following code does that.
Note that all of the code is available on my GitHub page.

from sklearn import tree
clf = tree.DecisionTreeClassifier()

We also create some imaginary training data, as shown below.
Just note that Scikit-Learn needs numeric features: you cannot use "bumpy" or "smooth" as feature values, so encode them as numbers such as 0 or 1 instead.
# each row is [weight, texture], with texture encoded as 0 or 1
features = [[110, 1], [114, 0], [112, 0], [143, 1], [150, 1], [114, 1]]
labels = ['apple','apple','apple','orange','orange','orange']
In order to train it, we need a learning algorithm. If the classifier is a box of rules, you can think of the learning algorithm as the procedure that creates them. It does that by finding patterns in your training data. For example, it might find out that heavier fruits are more likely to be oranges, and that becomes a rule for our classifier.
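To get a feel for what "finding a pattern" could mean, here is a deliberately simplified sketch that searches for a single weight threshold. It is not how Scikit-Learn's decision tree works internally (that one chooses splits by an impurity measure and can use several features); the function below is just a hypothetical stand-in for the idea.

```python
features = [[110, 1], [114, 0], [112, 0], [143, 1], [150, 1], [114, 1]]
labels = ['apple', 'apple', 'apple', 'orange', 'orange', 'orange']

def count_errors(threshold):
    """Mistakes made by the rule 'heavier than threshold means orange'."""
    return sum(
        ('orange' if row[0] > threshold else 'apple') != label
        for row, label in zip(features, labels)
    )

def best_weight_threshold():
    """Try the midpoints between sorted distinct weights; keep the best one."""
    weights = sorted({row[0] for row in features})
    candidates = [(a + b) / 2 for a, b in zip(weights, weights[1:])]
    return min(candidates, key=count_errors)

threshold = best_weight_threshold()
print(threshold)                # 113.0 (128.5 is equally good; min keeps the first)
print(count_errors(threshold))  # 1
```

One fruit is still misclassified, because an apple and an orange both weigh 114; separating them needs the second feature, which is exactly what a decision tree can do by adding another rule.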

In Scikit-Learn the learning algorithm is included in the classifier object, and it’s called “fit”. You can train your classifier with the following code:
clf = clf.fit(features, labels)

fit can be considered a synonym for “find patterns in data”.

And voila! We have a trained classifier.

Step 3: Making predictions
Now let’s use our classifier to make predictions on some new data.
For example, we have the following features and we’d like to find out which fruits they are.
newData = [[141, 1], [118, 0]]
and now if we use our classifier, it says that they are:
print(clf.predict(newData))
['orange' 'apple'] 
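If you'd like to look inside the "box of rules", the three steps can be put together in one script like the sketch below. Two things here are my additions rather than part of the steps above: `random_state=0`, which only fixes how ties between equally good splits are broken, and `tree.export_text`, which prints the learned rules as text. The feature names passed to it are labels I chose for readability.

```python
from sklearn import tree

# Step 1: training data, each row is [weight, texture].
features = [[110, 1], [114, 0], [112, 0], [143, 1], [150, 1], [114, 1]]
labels = ['apple', 'apple', 'apple', 'orange', 'orange', 'orange']

# Step 2: create and train the classifier.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)

# Step 3: predict the labels of new examples.
predictions = clf.predict([[141, 1], [118, 0]])
print(predictions)

# Print the learned rules as nested conditions on the features.
rules = tree.export_text(clf, feature_names=['weight', 'texture'])
print(rules)
```

The exact rules printed can differ between runs or versions when several splits are equally good, but they all describe a small set of thresholds on weight and texture.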

That's it!

  • I. We can change our training data to get a new classifier for a new problem, such as types of cars. This means the approach is reusable: you don’t have to rewrite your rules for every new problem.
  • II. Programming in Machine Learning is simple, but it’s not easy!