We’re going to start with one of the most commonly used classifiers, the linear classifier, and in particular with logistic regression, the most widely used and one of the most useful linear classifiers. These concepts, however, are not limited to linear classifiers; they are core ideas in classification generally.
Our example in this part is a restaurant review system that we’d like to use to choose a great Japanese restaurant with awesome sushi. There are many aspects to a restaurant, but what I really care about is amazing food and ambience. Just reading a review might not give me what I want; instead, it might give me information about the location, service, parking and so on, which is not important to me, as well as information about dishes other than sushi, which I don’t care about either. The only thing I really care about is the sushi, and I only want the parts of the reviews that talk about it.
To keep things simple for now, though, we’re going to send every sentence to a classifier that tells us whether it’s positive or negative.

A linear classifier takes some input X (in our case, sentences from reviews), feeds it to its model, and predicts an output label telling us whether the review is negative or positive.
We give a sentence as input, and the classifier predicts the output class, which in our example is either positive or negative.
To do so, a linear classifier assigns a weight to each word; this weight defines how much positive or negative influence the word has on the review. These weights, or coefficients, are learned from the training data.

Note: The model is called a linear classifier because its output is a weighted sum of the inputs.

Now let’s have an example and calculate the score. Imagine we have this review:
Sushi was great, the food was awesome, but the service was terrible.
Only three words in this review have nonzero coefficients, so the score is:

Score = 1.2 + 1.7 − 2.1 = 0.8
Because the score is greater than zero, the review is classified as positive.

So, in general, our simple linear classifier computes the score of the input from the coefficients of its words: if the score is greater than zero, the prediction is positive; otherwise, it is negative.
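The scoring step above can be sketched in a few lines of Python. The word weights here are the made-up illustrative values from the example, not learned coefficients:

```python
# Made-up coefficients for illustration; in practice these are learned.
weights = {"great": 1.2, "awesome": 1.7, "terrible": -2.1}

def score(sentence, weights):
    """Sum the coefficient of every known word; unknown words count as 0."""
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return sum(weights.get(w, 0.0) for w in words)

def predict(sentence, weights):
    """Positive review if the score is greater than zero, else negative."""
    return "+1" if score(sentence, weights) > 0 else "-1"

review = "Sushi was great, the food was awesome, but the service was terrible."
print(round(score(review, weights), 1))  # 1.2 + 1.7 - 2.1 = 0.8
print(predict(review, weights))          # +1
```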

Now let's look inside the learning box and see how we can compute the coefficient for each word.
To train the weights of our linear classifier we use training data: reviews labeled as either positive or negative. We split them into a training set and a validation set, then feed the training set to a learning algorithm that learns a weight for each word. After training the classifier, we go back and evaluate its accuracy on the validation set.

To understand the coefficients, let's review the notion of a decision boundary, which is the boundary between positive and negative predictions.
Imagine every word has a weight of zero except awesome and awful, which have weights +1 and −1.5 respectively. Let's plot this and see what happens.
For example, if we plot the following sentence in the space of these two features:
The sushi was awesome, the food was awesome, but the service was awful.
it lands at the point (2, 1), and if we plot some other points as well, we end up with a plot like the one below:

The decision boundary for this classifier, with coefficients +1 and −1.5, is a line: everything below the line is classified as positive and everything above it as negative. On the line itself, (#awesome × 1) + (#awful × −1.5) equals zero (the line equation).
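A quick sanity check of this two-feature boundary, using the weights +1 for #awesome and −1.5 for #awful from the example:

```python
# Score for the two-feature example: +1 per "awesome", -1.5 per "awful".
def score(n_awesome, n_awful):
    return 1.0 * n_awesome - 1.5 * n_awful

# The sentence above has 2 "awesome" and 1 "awful":
print(score(2, 1))       # 0.5 -> positive side of the line
print(score(2, 1) > 0)   # True
# Points exactly on the decision boundary satisfy score == 0, e.g. (3, 2):
print(score(3, 2))       # 0.0
```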

Note: with 2 features the decision boundary is a LINE, with 3 features it is a PLANE instead of a line, and with more than 3 features it is a HYPERPLANE.

So in a three-dimensional space we have a plane, and if we write the plane equation, in other words our score function, it looks like this:
\(Score(x) = w_0 + w_1 \#awesome + w_2 \#awful + w_3 \#great\)

Note: Our general notation is as follows:

Our model then takes the sign of the score function, positive or negative:
$$Score(x_i) = w_0 + w_1 x_i[1] + w_2 x_i[2] + w_3 x_i[3] + \dots + w_d x_i[d] = w^T x_i$$
$$Model : \hat y_i = sign(Score(x_i))$$

Effect of Changing Coefficients:
Note that if we change our coefficients, our decision boundary line (plane or hyperplane) will change, for example if we change our coefficients in the previous example we will end up with a different decision boundary like this:

We also do some feature extraction and feature engineering on the input data x and use the resulting features instead of the raw inputs. For example, instead of using \(x_i[1]\) we can use \(h_1(x_i)\), where \(h_1(x_i)\) could be \(\log(x_i[1])\) or \(\sqrt{x_i[1]}\), or even better, its TF-IDF.
Thus we can rewrite our score function like this:

$$ Score(x_i) = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) = \sum_{j=0}^D w_j h_j(x_i) = w^T h(x_i) \\
Model: \hat y_i = sign(Score(x_i))$$
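Here is a minimal sketch of scoring with feature functions \(h_j\). The particular transforms (a constant intercept, a log, a square root) and the coefficients are illustrative choices, not values from the text:

```python
import math

def h(x):
    """Map raw word counts x = [#awesome, #awful] to features:
    intercept, log of (#awesome + 1), square root of #awful."""
    return [1.0, math.log(x[0] + 1), math.sqrt(x[1])]

w = [0.5, 1.0, -1.5]  # made-up coefficients for illustration

def score(x, w):
    """Weighted sum of the transformed features: w^T h(x)."""
    return sum(wj * hj for wj, hj in zip(w, h(x)))

x = [3, 1]  # a review with 3 "awesome" and 1 "awful"
print(1 if score(x, w) > 0 else -1)  # sign of the score is the prediction
```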

Logistic regression doesn't just tell us whether a review is positive or negative; it also tells us how confident we can be in that prediction, using the notion of probability.

If we say that the probability of the output being positive is 0.7:
$$P(y=+1) = 0.7 $$
it means that 70 percent of the reviews in our dataset are positive.

Also, probabilities are always between 0 and 1, and they sum up to 1.
A conditional probability is the probability of something given an input. For example, the following is the probability that the output label is +1 given that the input sentence contains 3 "awesome" and 1 "awful":
$$P(y=+1 \mid \#awesome = 3, \#awful = 1)$$

Now, how can we map our score, which ranges from \(-\infty \) to \(+\infty\), to the range 0 to 1?
Here we use a generalized linear model, which uses a link function to connect the score function to a probability: it squeezes the range \(-\infty \) to \(+\infty\) into 0 to 1.

This link function for logistic regression is called the logistic function, or sigmoid (its inverse is the logit function), and its formula is:
$$\frac {1}{1+e^{-Score}} $$

For \(-\infty \) it gives 0, for \(+\infty\) it gives 1, and for 0 it gives 0.5, which is exactly what we want.
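A quick sketch of the sigmoid and the three anchor points just mentioned (very large scores stand in for \(\pm\infty\)):

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) link: squeezes any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0))    # 0.5: a score of zero means a 50/50 prediction
print(sigmoid(20))   # ~1 for large positive scores
print(sigmoid(-20))  # ~0 for large negative scores
```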

Also note that when we change the coefficients, the S shape of the sigmoid curve changes: adding a constant shifts it left or right, and changing the magnitude of the other coefficients makes it sharper or flatter.

So far our model looks like this:

To find the best value for \(\hat w\) we use a concept called likelihood, which I'll describe in the next post; using it, we can find the best decision boundary and the best coefficients w with a technique called gradient ascent.

With numeric inputs it makes sense to multiply them by coefficients, but with categorical data, for example house zip codes, it doesn't: a zip code of 99150 is not roughly 9 times bigger than one of 10432; they simply denote two different zones. Similarly, if you have categories of animals, you can't say that dogs are greater than kangaroos.
Here we use a concept called one-hot encoding, where we generate one boolean column for each category, and exactly one of these columns gets the value 1 for each sample.
Bag of words is another example of turning categorical data into numbers.
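A hand-rolled one-hot encoding sketch (libraries such as scikit-learn provide this, but the idea fits in a few lines; the zip codes are the examples from the text):

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the category's position."""
    return [1 if c == value else 0 for c in categories]

zones = ["10432", "99150", "02139"]
print(one_hot("99150", zones))  # [0, 1, 0]: one column per zone, one 1 per sample
```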

For multiclass classification we use a concept called one-versus-all: each time, we split the data into two categories, one class versus all the remaining classes.
For example, if we have three classes, hearts, doughnuts, and triangles, we train a classifier for each of them against the rest.
You can see these steps in the following images.
This is our data:

For example, here we pit triangles against doughnuts and hearts:

But in general we do the same procedure for all of the classes:

Then, when we want to predict a new instance, we run all of the classifiers (each with its own \( \hat w\)) and pick the class with the highest predicted probability.
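The one-versus-all prediction step can be sketched as below. The class names match the example, but the weight vectors are toy values for illustration, not trained coefficients:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def predict_multiclass(x, classifiers):
    """Pick the class whose one-vs-rest classifier is most confident.
    `classifiers` maps each class name to its weight vector w (one binary
    classifier per class); x is the feature vector with x[0] = 1 for the
    intercept."""
    def prob(w):
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return max(classifiers, key=lambda c: prob(classifiers[c]))

# Toy weights for illustration only (not trained):
classifiers = {
    "triangle": [0.1, 1.0, -1.0],
    "heart":    [0.0, -1.0, 1.0],
    "doughnut": [-0.2, 0.5, 0.5],
}
x = [1.0, 2.0, 0.5]  # intercept + two features
print(predict_multiclass(x, classifiers))
```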

Classification is one of the fundamental concepts and most widely used areas of machine learning. With classifiers, we learn a model that takes an input and predicts a class, which can be thought of as a label or category, as output. In other words, it's a mapping from x to y.

For this part, we use a sentiment classifier that receives a review sentence as input and classifies it as a positive or negative sentence. Another example would be spam classification, where the classifier receives an email as input and predicts the probability that the email is Spam or Not Spam.

But classification is not just binary; it can also mean predicting one of multiple classes or categories. For example, suppose you want to show some ads on a web page and would like to know what kind of ads to put on that page. Here, our classifier receives a web page as input and categorizes it as a page about education, finance, technology, …

As we mentioned earlier, another great example of a classifier is a spam filter, which categorizes an email based not only on the text of the email, but also on the sender's IP address, other emails the sender has already sent, the recipients, and so on.

Image classification is another example, used in many different areas: based on the image, the classifier predicts the output.

Personalized medical diagnosis is an example in this area, where patients are not treated as if they were all in the same condition. Everybody has a different lifestyle, DNA sequence, conditions and so on, and taking that information into account makes a big difference in predicting a patient's condition and choosing the kind of treatment that is personalized and most effective for her or him.

Classification is at the core of almost every technology we use today, whether it's spam filtering, search engine results, product recommendations, ad targeting, personalized medical treatment and so on.

Now, to help you understand the classification concept better, I'll walk through a real and useful application in my next post: an intelligent restaurant review system!

Hope you enjoy it ;)