So we’re going to start with **one of the most commonly used classifiers**, the **linear classifier**, and in particular we’re going to use **Logistic Regression**, **the most commonly used linear classifier** and one of the most useful ones. However, these concepts are not limited to linear classifiers; they are core ideas in classification in general.

Our example in this part will be a *restaurant review system*, which we’d like to use to choose a great Japanese restaurant with awesome sushi. There are many aspects to a restaurant, but what I really care about is *amazing food and ambience*. So by just reading a whole review I might not get what I want: it might give me information about the location, service, parking and so on, which is not important to me, as well as information about foods other than sushi, which I again don’t care about at all. The only thing I really care about is the sushi, so I only care about the parts of the reviews that talk about the sushi.

However, to keep it simple for now, we’re going to send every sentence to a classifier that tells us whether the sentence is positive or negative.

A **linear classifier** **takes** some *quantity* *x* **(in our case sentences from reviews)**, **feeds it to its model**, and then **predicts the output label**, telling us whether the review is negative or positive.

We give a *sentence* as *input* to our classifier, and it *predicts* the *output class*, which in our example is *either positive or negative*.

To do so, a **linear classifier** **assigns** a **weight** to **each word**, which **defines how positive or negative an influence** each word has on the review. These **weights or coefficients** are **learned** **from the training data.**

*Note:* The model is called a *linear classifier* because **the output is the weighted sum of the inputs**.

Now let’s work through an example and calculate the score. Imagine we have this review:

*Sushi was great, the food was awesome, but the service was terrible.*

As only three words in this review have nonzero coefficients, the score would be:

Score = 1.2 + 1.7 − 2.1 = 0.8

Because the score is greater than zero, the review is classified as positive.

So in general, our simple linear classifier calculates the score of the input from the learned coefficients of its words: if the score is greater than zero the review is positive, and otherwise it is negative.
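As a quick illustration, here is a minimal sketch of this scoring rule. The text gives the three nonzero coefficients 1.2, 1.7 and −2.1 but not which word carries which, so the word-to-weight mapping below is an assumption:

```python
# Hypothetical word-to-weight mapping: the text only gives the three
# nonzero coefficients, so which word gets which value is an assumption.
weights = {"great": 1.2, "awesome": 1.7, "terrible": -2.1}

def score(sentence: str) -> float:
    """Sum the coefficient of every word; unlisted words count as 0."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(weights.get(w, 0.0) for w in words)

def classify(sentence: str) -> str:
    """Positive if the score is greater than zero, negative otherwise."""
    return "positive" if score(sentence) > 0 else "negative"

review = "Sushi was great, the food was awesome, but the service was terrible."
print(round(score(review), 1))  # 0.8
print(classify(review))         # positive
```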

Now let's see what our learning box is, and how we can calculate the coefficient for each word.

To **train the weights** of our linear classifier we use some **training data**: a set of **reviews labeled** as either positive or negative. We **split** them into a **training set** and a **validation set**, then **feed** the training set to a **learning algorithm** that learns the weight for each word. After we have learned our classifier, we go back and **evaluate its accuracy on the validation set**.
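The split step above could be sketched like this; the 80/20 ratio and the tiny toy dataset are assumptions for illustration:

```python
import random

# Toy labeled data: (review sentence, label). Real data would be larger.
labeled_reviews = [
    ("the sushi was awesome", +1),
    ("the service was awful", -1),
    ("awesome food, awesome place", +1),
    ("awful parking, awful wait", -1),
    ("great sushi, terrible service", -1),
]

random.seed(0)                  # fixed seed for a reproducible shuffle
random.shuffle(labeled_reviews)
cut = int(0.8 * len(labeled_reviews))
training_set = labeled_reviews[:cut]     # used to learn the weights
validation_set = labeled_reviews[cut:]   # held out to evaluate accuracy
print(len(training_set), len(validation_set))  # 4 1
```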

To understand the coefficients, let's review the notion of a decision boundary, which is the boundary between positive and negative predictions.

Imagine that every word has a weight of zero except *awesome* and *awful*, which have weights of +1 and −1.5 respectively. Let's plot this in a graph and see what happens.

For example, if we plot the following sentence in the space of these two features:

*The sushi was awesome, the food was awesome, but the service was awful.*

it lands at the point (2, 1), and if we plot some other points as well, we end up with a plot like the one below:

The **decision boundary** for this classifier with coefficients +1 and −1.5 is a **line**, where everything **below** the line is **positive** and everything **above** it is **negative**. On the line itself, **the number of awesome** times **+1** plus **the number of awful** times **−1.5** is **equal to zero** (the line equation).

*Note:* With 2 features the decision boundary is a **LINE**, with 3 features we have a **PLANE** instead of a line, and with more than 3 features we have a **HYPERPLANE.**
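This decision rule can be sketched directly, using the coefficients +1 and −1.5 from the example:

```python
# Classify points in the (#awesome, #awful) plane. The decision
# boundary is the line  1.0 * #awesome - 1.5 * #awful = 0.
def side(n_awesome: int, n_awful: int) -> str:
    score = 1.0 * n_awesome - 1.5 * n_awful
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "on the boundary"

print(side(2, 1))  # positive  (the example sentence: 2 awesome, 1 awful)
print(side(1, 2))  # negative
print(side(3, 2))  # on the boundary  (3 - 3 = 0)
```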

So in a three-dimensional space we have a **plane**, and if we write the plane equation, in other words our score function, it looks like this:
\(Score(x) = w_0 + w_1 \#awesome + w_2 \#awful + w_3 \#great\)

**Note:** Our general notation is as follows:

Now our model takes the sign of the score function, classifying it as positive or negative:

$$Score(x) = w_0 + w_1 x_i[1] + w_2 x_i[2] + w_3 x_i[3] + ... + w_d x_i[d] = W^T X_i$$

$$Model : \hat y_i = sign(Score(x_i))$$
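The vectorized form of the score, \(w^T x_i\), can be sketched with NumPy; the coefficient values here are made up for illustration:

```python
import numpy as np

# The score as the dot product w^T x_i from the equation above.
# w[0] is the intercept w_0, paired with a constant feature of 1.
# All coefficient values are made up for illustration.
w = np.array([0.0, 1.0, -1.5, 1.2])   # [w_0, w_awesome, w_awful, w_great]
x_i = np.array([1.0, 2.0, 1.0, 0.0])  # [1, #awesome, #awful, #great]

score = w @ x_i            # w^T x_i = 0 + 2.0 - 1.5 + 0 = 0.5
y_hat = np.sign(score)     # sign(0.5) = +1 -> positive
print(score, y_hat)        # 0.5 1.0
```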


**Effect of Changing Coefficients:** Note that if we change our coefficients, our decision boundary (line, plane, or hyperplane) changes. For example, if we change the coefficients in the previous example, we end up with a different decision boundary, like this:

We also do some **feature extraction and feature engineering** on our input data *x* and use the extracted features instead of the raw inputs. For example, instead of using \(x_i[1]\) we can use \(h_1(x_i)\), where \(h_1(x_i)\) could be \(\log(x_i[1])\) or \(\sqrt{x_i[1]}\), or even better, the **TF-IDF** of it. Thus we rewrite our score function like this:

$$ Score(x) = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) = \sum_{j=0}^D w_j h_j(x_i) = w^T h(x_i) \\

Model: \hat y_i = sign(Score(x_i))$$
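A sketch of such feature functions \(h_j\). The `+1` inside the log is my addition to avoid \(\log(0)\), and TF-IDF is omitted because it needs the whole corpus rather than a single input:

```python
import math

def h(x):
    """Map a raw feature vector x to [h_0(x), h_1(x), h_2(x)]."""
    return [
        1.0,                  # h_0: constant feature for the intercept w_0
        math.log(x[1] + 1),   # h_1: log of the count x[1] (+1 avoids log(0))
        math.sqrt(x[2]),      # h_2: square root of the count x[2]
    ]

print(h([1, 2, 4]))  # [1.0, log(3) = 1.0986..., 2.0]
```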

Logistic regression doesn't just tell us whether a review is positive or negative; it also tells us how sure we can be that it is positive or negative, using the notion of probability.


Review: If we say that the probability of the output being positive is 0.7:

$$P(y=+1) = 0.7 $$

it means that 70 percent of the reviews in our dataset are positive.

Also, probabilities are always between 0 and 1, and they sum up to 1.

Conditional probabilities give the probability of something given an input. For example, the following is the probability that the output label is +1 given that the input sentence has 3 "awesome" and 1 "awful" in it:

$$P(y=+1 \mid \#awesome = 3, \#awful = 1)$$

Now, how can we map our score, which ranges from \(-\infty \) to \(+\infty\), to a value between 0 and 1? Here we use a **generalized linear model**, which uses a **link function** to link the **score function** to a **probability**: **it squeezes \(-\infty \) to \(+\infty\) into the range 0 to 1.**

This **link function** for **logistic regression** is called the *logistic function or sigmoid*, and its formula is:
$$\frac {1}{1+e^{-Score}} $$

For a score of \(-\infty \) it gives 0, for \(+\infty\) it gives 1, and for 0 it gives 0.5, which is exactly what we want.
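A minimal sketch of the sigmoid, checking those three behaviors:

```python
import math

# The sigmoid (logistic) link function from the formula above.
def sigmoid(score: float) -> float:
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0))     # 0.5
print(sigmoid(100))   # ~1.0 (approaches 1 as the score grows)
print(sigmoid(-100))  # ~0.0 (approaches 0 as the score shrinks)
```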

Also note that when we change our coefficients, the **S shape** of our sigmoid function changes: if we **add a constant** we **shift it to the left or right**, and if we change the **magnitude** of the other **coefficients** we make it **sharper or flatter**.

So far our model looks like the following:

**NOTE:** To find the best value for \(\hat w\) we use a concept called **likelihood**, which I'll describe in the next post. Using it, we can find the best decision boundary and the best coefficients \(w\) with a technique called **gradient ascent**.

**NOTE:** If you have **numeric** inputs, it makes sense for them to be **multiplied by** the **coefficients**, but when you have **categorical data**, for example the **zip codes** of houses, that **doesn't make sense**: a zip code of 99150 is not 9 times bigger than one of 10432; they simply denote two different zones. Likewise, if you have categories of animals, you **can't say that the category of dogs is greater than the category of kangaroos**.
Here we use a concept called **one-hot encoding**, where we generate **one boolean column for each category**, and for each sample **only one of these columns** can take the **value 1**. A **bag of words** is another example of turning categorical data into numbers.
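A minimal sketch of one-hot encoding; the animal categories are made up for illustration:

```python
# One boolean column per category; exactly one column is 1 per sample.
categories = ["dog", "cat", "kangaroo"]

def one_hot(value: str) -> list:
    return [1 if value == c else 0 for c in categories]

print(one_hot("cat"))       # [0, 1, 0]
print(one_hot("kangaroo"))  # [0, 0, 1]
```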

**NOTE:** For **multiclass classification** we use a concept called **one versus all**: each time, we create two categories, one class versus all the remaining classes.
For example, if we have three classes of hearts, doughnuts, and triangles, each time we **train our classifier for one of them against the rest.**
You can see these steps in the following images:

This is our data:

For example, here we select triangles against doughnuts and hearts:

Then, when we want to **predict** a new instance, we **go over all of the classifiers** (each with its own \( \hat w\)) and pick the class with the **highest predicted probability**.