Saturday, April 27, 2019

Linear Regression Implementation with Python


Hello all, I hope the last few posts have given you a good theoretical understanding of the machine learning algorithms. Today, we will do a Simple Linear Regression implementation with Python. It won’t take much time, and I will try to explain every step in simple words.
It is called Simple Linear Regression because it considers only one feature of the input data to make the prediction. For example, here we will use a housing price data set. Since it is simple regression, it will consider only the size of the house to predict its price. Multiple regression, on the other hand, may consider several features to predict the price, such as locality, whether the house is front or back facing, etc. Below is the input data we will use for the prediction: house_size (x) is the input, ranging from 1k to 14k square meters, and the price (y) of the house ranges from 300 to 1100 dollars.






A scatter plot of the housing data looks like this:


Now we must find a line that fits this scatter plot, known as the regression line, so that we can predict the house price for any given size (x). The equation of the regression line looks like this:

- h(x_i) = B0 + B1 * x_i

where h(x_i) represents the prediction for x_i, and B0, B1 are the regression coefficients. To make the prediction, we need to estimate the regression coefficients (B0, B1). For the implementation, we need to follow the steps below:
  • Step 1: Import the libraries. NumPy is the fundamental package for scientific computing with Python. In machine learning we deal with huge amounts of data, so we use NumPy, whose arrays are faster than normal Python lists. Matplotlib is a plotting library for Python; we will use it for visualization.

  • Step 2: Take the mean of house_size (x) and price (y). Calculate the cross-deviation and the deviation by computing the sums of squares.
  • Step 3: Calculate the regression coefficients, which minimize the prediction error (explained in a previous blog).
  • Step 4: Plot the scattered points on the graph in red. The x-axis represents the size of the house (house_size) and the y-axis represents the price (figure above).

  • Step 5: Compute the regression line with minimum error and plot it in purple.

  • Step 6: Lastly, write the main function and call it. The final output of the code is:

Estimated coefficients:
b_0 = 295.95147839272175 
b_1 = 57.31614859742229
                And the graph should look like this-


You can download the full code (linearRegression.py) from GitHub here: source code
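For reference, here is a minimal sketch that follows the six steps above. The data values below are placeholders standing in for the housing data set, so the numbers it prints will differ from the output shown above, and the actual linearRegression.py may be organized differently.

```python
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # Step 2: means of house_size (x) and price (y)
    m_x, m_y = np.mean(x), np.mean(y)
    n = np.size(x)
    # cross-deviation (SS_xy) and deviation about the mean (SS_xx)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # Step 3: regression coefficients B1 (slope) and B0 (intercept)
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return b_0, b_1

def plot_regression_line(x, y, b):
    # Step 4: scatter the actual points in red
    plt.scatter(x, y, color="red")
    # Step 5: plot the fitted regression line in purple
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color="purple")
    plt.xlabel("house_size")
    plt.ylabel("price")
    plt.show()

def main():
    # Step 6: placeholder data (house_size in k sq. meters, price in dollars)
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
    y = np.array([300, 370, 430, 500, 550, 610, 680, 740,
                  800, 860, 920, 980, 1040, 1100])
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
```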
Hope you enjoyed today’s post. Stay tuned for more Python implementations. Do let me know your feedback and comments below.
I also want to share some good news: my blog was featured among the top machine learning blogs; please look at number 19 here: https://blog.feedspot.com/machine_learning_blogs/

Next blog: Harry Potter's magical cloak with OpenCV

Saturday, April 13, 2019

Top 10 interview questions in Machine Learning (ML) and Artificial Intelligence (AI)


Hello folks, if you have read my previous three posts on Artificial Intelligence (AI), then congratulations, you already have the basic knowledge about machine learning algorithms; if not, please read them first. Today I would like to discuss some of the most commonly asked interview questions in the field of Machine Learning and AI, which should help you crack your machine learning interviews. Most of the basics are already covered in those posts; the rest we will learn here.
Let’s get started



  1. What is Gradient Descent?
- Gradient descent is an optimization algorithm that minimizes a given function. It starts with an initial set of parameters and iteratively moves toward the set of parameters that gives the minimum of that particular function. It is a little difficult to visualize; I will try to give an example with figures for better understanding.
- In the above figure, the blue dots are the actual house prices (y_Actual) corresponding to the house size, the green line is the predicted house price (y_Prediction), and the yellow dotted lines are the prediction errors (prediction error = y_Prediction - y_Actual). So, the aim is to improve the prediction by minimizing the prediction error. Gradient descent is the algorithm used to minimize that error and optimize the function.
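- To make this concrete, here is a minimal gradient descent sketch that fits the regression line by repeatedly nudging B0 and B1 against the gradient of the mean squared prediction error. The data and learning rate are illustrative assumptions, not values from the figure.

```python
import numpy as np

# illustrative data: house_size (x) and actual price (y_actual)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_actual = np.array([300.0, 400.0, 480.0, 590.0, 650.0])

b0, b1 = 0.0, 0.0   # initial set of parameters
lr = 0.01           # learning rate (size of each step)

for _ in range(10000):
    y_pred = b0 + b1 * x          # current predictions
    error = y_pred - y_actual     # prediction error = y_Prediction - y_Actual
    # gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * np.mean(error)
    grad_b1 = 2 * np.mean(error * x)
    # move each parameter a small step in the direction that reduces the error
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(b0, b1)  # parameters that (approximately) minimize the prediction error
```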


  2. What are the differences between Random Forest and Gradient Boosting? Or, explain the difference between bagging and boosting algorithms.
The differences between Random Forest and Gradient Boosting are as follows:
- Random forest uses bagging and samples randomly, whereas gradient boosting uses boosting and samples with an increased weight on the instances it got wrong previously.
- Because all the trees in a random forest are built without any consideration of the other trees, it is incredibly easy to parallelize, which means it can train really quickly. Gradient boosting, on the other hand, is iterative: each tree relies on the results of the tree before it in order to apply a higher weight to the instances the previous tree got wrong. So boosting can't be parallelized, and it takes much longer to train.
- The final prediction of a random forest is typically an unweighted average or an unweighted vote, while boosting uses weighted voting.
- Lastly, random forest is easier to tune, faster to train and harder to overfit, while gradient boosting is harder to tune, slower to train, and easier to overfit.
So, with all that, why would you go with gradient boosting? Well, the trade-off is that gradient boosting is typically more powerful and better performing if tuned properly.
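- As a quick illustration, a hedged sketch using scikit-learn (my own choice here, with synthetic data) trains and compares the two ensembles side by side:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# synthetic data purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bagging: trees are built independently, so training parallelizes easily
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
# boosting: trees are built one after another, each correcting the previous ones
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```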

  3. What are the benefits of using gradient boosting?
- Well, it's one of the most powerful machine learning classifiers out there. It also accepts various types of inputs, just like random forest, which makes it very flexible. It can be used for classification or regression, and it outputs feature importances, which can be super useful. But it's not perfect. Some of the drawbacks are that it takes longer to train because it can't be parallelized, it's more likely to overfit because it obsesses over the instances it got wrong, and it can get lost pursuing outliers that don't really represent the overall population.

  4. What are Bias and Variance?
- The prediction error of machine learning algorithms can be divided into three types:
o Bias error,
o Variance error, and
o Irreducible error
- The irreducible error cannot be reduced whatever algorithm is used, so we will focus on the bias and variance errors.
- Bias comes from the assumptions made by the model to make the target function easier to approximate. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- Variance is the amount by which the estimate of the target function would change given different training data. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting).

  5. What is the Bias-Variance trade-off?
- The bias-variance trade-off is an important aspect of machine learning algorithms. To get an accurate model, an engineer’s goal is to reduce both the bias and the variance as much as possible. However, that is not feasible in practice. If a learning algorithm has low bias, it must be very flexible so that it can fit any data. But if the learning algorithm is too flexible, it will fit every training data set differently and increase the variance error. So there is a trade-off between bias and variance when selecting models of different flexibility or complexity, and when selecting appropriate training sets, to minimize these sources of error.
  6. Explain the difference between L1 and L2 regularization.
- L2 regularization tends to spread the error among all the terms, while L1 is more binary/sparse: many variables end up with a weight of exactly zero, and only the most relevant ones keep non-zero weights.
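- A small sketch with scikit-learn's Lasso (L1) and Ridge (L2) on made-up data shows the effect; the data and alpha values are assumptions chosen for illustration only:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# synthetic data: only the first three of ten features actually matter
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3 * X[:, 0] + 2 * X[:, 1] + 1 * X[:, 2] + 0.1 * rng.randn(200)

l1 = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant weights to exactly zero
l2 = Ridge(alpha=0.1).fit(X, y)   # L2: shrinks weights but keeps them non-zero

print("L1 (sparse):", np.round(l1.coef_, 2))
print("L2 (spread):", np.round(l2.coef_, 2))
```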
  7. Difference between k-means and KNN (K Nearest Neighbors) algorithms
- The main difference is that k-means clustering is unsupervised, whereas KNN is a supervised machine learning algorithm. This means KNN needs labelled data for prediction, but k-means doesn’t, as it is unsupervised.
- k-means is used for clustering problems, whereas KNN is used for classification and regression problems.
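- The difference is easy to see in code; a minimal scikit-learn sketch (the toy data is my own, for illustration) looks like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means: unsupervised, uses only X (no labels) to find 3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:10])

# KNN: supervised, needs the labels y before it can classify a new point
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("predicted class:", knn.predict([[0.0, 0.0]]))
```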

  8. What are the different Machine Learning techniques?
- The different types of machine learning algorithms are:
o Supervised Machine Learning Algorithms,
o Unsupervised Machine Learning Algorithms,
o Semi-Supervised Machine Learning Algorithms, and
o Reinforcement Machine Learning Algorithms
  9. Difference between Supervised and Unsupervised machine learning algorithms
- Please read my previous post here: Supervised, Un-Supervised, Semi-Supervised machine and Reinforcement Learning algorithms

  10. What are the most commonly used Machine Learning Algorithms?
- Please read my previous post here: 10 Most Commonly Used Machine Learning Algorithms

If you have any other questions that I can add to this list, please let me know in the comment section. Any feedback or suggestions are always welcome. Stay tuned for the next post. Regards, Mostafiz

Next post: Linear Regression Implementation with Python





Thursday, April 4, 2019

10 Most Commonly Used Machine Learning Algorithms



In machine learning, there’s something called the “No Free Lunch” theorem, which means no single algorithm performs best for every problem. So, you need to figure out which algorithm is best for your problem and the available data set. In today’s blog I will focus on the 10 most commonly used machine learning algorithms. As we are going to learn 10 different algorithms in this post, it will be a little longer than usual, but have patience, I will try to make it as simple as possible. So, let’s get started~


  1. Linear Regression
- Linear regression is supervised learning: as you may remember from our last lesson, regression is a supervised machine learning task. Linear regression is a model that assumes a linear relationship between the input variables (x) and the single output variable (y) and uses it to predict the output. The representation of linear regression is an equation that describes the line that best fits the relationship between the input variables (x) and the output variable (y), by finding specific weightings for the input variables called coefficients (B). For example: y = B0 + B1 * x. Example: we will consider the same regression example here (figure below); if we have a data set of house prices with respect to house size, it can predict an unknown house price (q) given the house size (P).
- Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible. It is a fast, simple technique and a good first algorithm to try.
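- A tiny sketch of the idea, using scikit-learn and made-up house data (both my own assumptions, separate from the implementation post above), would be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data: house size (x) and price (y)
x = np.array([[1], [2], [3], [4], [5], [6], [7]])
y = np.array([300, 380, 450, 540, 600, 680, 760])

model = LinearRegression().fit(x, y)   # learns B0 (intercept) and B1 (slope)
print(model.intercept_, model.coef_)   # B0, B1
print(model.predict([[10]]))           # predicted price q for a house of size P = 10
```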

  2. Logistic Regression
- Logistic regression is like linear regression, but instead of fitting a straight line or hyperplane, the prediction for the output is transformed using a non-linear function called the logistic function or sigmoid function. The function looks like a big S and transforms any output into the 0-to-1 range. For your reference, please see the figure below (taken from Wikipedia: https://en.wikipedia.org/wiki/Logistic_regression#/media/File:Exam_pass_logistic_curve.jpeg).
- Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable, as well as attributes that are very similar (correlated) to each other.
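- A minimal sketch in the spirit of the exam-pass figure, with invented hours/pass data and scikit-learn (both assumptions of mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# invented data: hours studied (x) and whether the student passed (1) or failed (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5],
                  [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)
# the sigmoid squeezes the linear output into the 0-to-1 range,
# so predict_proba returns probabilities of [fail, pass]
print(clf.predict_proba([[2.75]]))
```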

  3. Linear Discriminant Analysis
- It consists of statistical properties of your data, calculated for each class. For a single input variable this includes the mean value for each class and the variance calculated across all classes. Predictions are made by calculating a discriminant value for each class and predicting the class with the largest value.
- LDA assumes that your data follows a Gaussian distribution, so it is a good idea to remove outliers from your data beforehand. It’s a simple and powerful method for classification predictive modelling problems.

  4. Classification and Regression Trees or Decision Trees
- Decision trees are an important type of algorithm for predictive modelling. Each internal node represents a single input variable (x) and a split point on that variable. The leaf nodes of the tree contain an output value (y), which is the model's prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and outputting the class value at that leaf node.
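- A short sketch with scikit-learn's decision tree on the classic iris data (my own choice of example) shows the splits and the leaf-based prediction:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# each internal node tests one input variable against a split point;
# each leaf stores the output class that becomes the prediction
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))     # the splits that are walked to reach each leaf
print(tree.predict(X[:1]))   # prediction = class at the leaf reached for this sample
```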

  5. Naïve Bayes Algorithm
- Bayes' theorem states: P(A|B) = P(B|A) * P(A) / P(B), where A and B are events, and P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true. P(A) and P(B) are the probabilities of observing A and B independently of each other; these are known as the marginal probabilities.
- Naive Bayes is called ‘naïve’ because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data; nevertheless, the technique is very effective on a large range of complex problems.
- The model consists of two types of probabilities that can be calculated directly from the training data: A. the probability of each class, and B. the conditional probability of each class given each x value. Once calculated, the probability model can be used to make predictions for new data using Bayes' theorem.
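- As a quick illustration (scikit-learn's Gaussian Naive Bayes on the iris data, my own choice of example):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# GaussianNB estimates the probability of each class and, for each class,
# a per-feature Gaussian (the "naive" independence assumption), then
# combines them with Bayes' theorem to score new data
nb = GaussianNB().fit(X, y)
print(nb.class_prior_)    # probability of each class
print(nb.predict(X[:3]))  # predictions made via Bayes' theorem
```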

  6. K-NN (K Nearest Neighbors) algorithm
- The K nearest neighbors algorithm is a simple procedure that stores all available cases and classifies new cases based on a similarity measure. It is a simple, easy-to-implement supervised machine learning algorithm that can be used for both classification and regression problems. Predictions are made for a new data point by searching through the entire training set for the K most similar neighbors and summarizing the output variable for those K instances. The idea of distance or closeness between neighbors can break down in very high dimensions (lots of input variables), which can negatively affect the performance of the algorithm. This is called the curse of dimensionality, and it suggests you only use the input variables that are most relevant to predicting the output variable.
- As this algorithm is frequently used and easy to implement, I will try to explain it with the following diagrams and data set. Suppose we have a data set with two groups, group A (blue) and group B (yellow), as shown in the figure below, and we want to classify the unknown point p1 (red). To do so, the algorithm finds the 4 nearest neighbours (as k = 4) of the point p1 and labels it according to the majority among them.
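- The same idea in code, with invented two-group data and k = 4 (scikit-learn is my own choice here):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# invented data: group A (label 0, "blue") and group B (label 1, "yellow")
X = np.array([[1, 2], [2, 3], [2, 1], [3, 2],    # group A
              [6, 5], [7, 7], [8, 6], [7, 5]])   # group B
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# k = 4: the unknown point p1 is labelled by its 4 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=4).fit(X, y)
p1 = np.array([[3, 3]])
print(knn.predict(p1))     # majority class among the 4 nearest neighbours
print(knn.kneighbors(p1))  # distances to and indices of those neighbours
```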



  7. Learning Vector Quantization (LVQ)
- A downside of K Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
- The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. Once learned, the codebook vectors can be used to make predictions just like K Nearest Neighbors. The most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction.

  8. Support Vector Machine (SVM)
- Support Vector Machine (SVM) is a supervised machine learning algorithm. This is another algorithm which can be used for both classification and regression problems; however, it is mostly used for classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features we have), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best differentiates the two classes. In SVM, the hyperplane is chosen to best separate the points in the input variable space by their class, either class 0 or class 1. In two dimensions, you can visualize this as a line, and let's assume that all of our input points can be completely separated by this line. The SVM learning algorithm finds the coefficients that result in the best separation of the classes by the hyperplane.
- The best or optimal hyperplane that separates the two classes is the line with the largest margin. Only the points closest to the hyperplane are relevant in defining it and in the construction of the classifier; these points are called the support vectors, because they support or define the hyperplane. In practice, an optimization algorithm is used to find the values of the coefficients that maximize the margin.
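- A small sketch of a linear SVM on separable two-dimensional blobs (scikit-learn and the toy data are my own assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two linearly separable classes in two dimensions, so the hyperplane is a line
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# a linear kernel finds the maximum-margin separating line
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)   # the points that define (support) the hyperplane
print(svm.predict(X[:5]))
```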
  9. Bagging and Random Forest
- The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots of samples of your data, calculate the mean of each, and then average all of your mean values to give a better estimate of the true mean.
- In bagging, the same approach is used, but for estimating entire statistical models, most commonly decision trees. Multiple samples of your training data are taken, and a model is constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.
- Random forest is a tweak on this approach in which the decision trees are created so that, rather than selecting optimal split points, sub-optimal splits are made by introducing randomness.

  10. AdaBoost Classification
- Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.
- AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Models are created sequentially, one after the other, each updating the weights on the training instances, which affects the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and each tree's vote is weighted by how accurate it was on the training data.
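- For illustration, a hedged scikit-learn sketch (synthetic data of my own choosing); AdaBoostClassifier uses short one-level decision trees as its default weak learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# synthetic data for illustration only
X, y = make_classification(n_samples=500, random_state=1)

# each new weak tree re-weights the training instances the previous trees
# got wrong; the final prediction is a weighted vote over all the trees
ada = AdaBoostClassifier(n_estimators=50, random_state=1).fit(X, y)
print(ada.score(X, y))
```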

                      Congratulations guys, now you know the 10 most commonly used machine learning algorithms. In the next post I am planning to cover some commonly asked interview questions on machine learning algorithms. So stay tuned, I will share the link soon. And don't forget to comment below with any suggestions and feedback. Till then, bye, see you soon.

Next topic is here: Top 10 interview questions in ML/AI