Machine learning

Mahout - Machine Learning

Apache Mahout is a highly scalable machine learning library that enables developers to use optimized algorithms. Mahout implements popular machine learning techniques such as recommendation, classification, and clustering.

What is Machine Learning?

Machine learning is a branch of science that deals with programming the systems in such a way that they automatically learn and improve with experience. Here, learning means recognizing and understanding the input data and making wise decisions based on the supplied data.
It is very difficult to cater to all the decisions based on all possible inputs. To tackle this problem, algorithms are developed. These algorithms build knowledge from specific data and past experience with the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.
The developed algorithms form the basis of various applications such as: 
  • Vision processing
  • Language processing
  • Forecasting (e.g., stock market trends)
  • Pattern recognition
  • Games
  • Robotics
There are several ways to implement machine learning techniques; the most commonly used ones are supervised and unsupervised learning.

Supervised Learning

Supervised learning deals with learning a function from available training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Common examples of supervised learning include: 
  • classifying e-mails as spam,
  • labeling webpages based on their content, and
  • voice recognition.

There are many supervised learning algorithms such as neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers. Mahout implements the Naive Bayes classifier.
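To make the idea concrete, here is a minimal supervised-classification sketch. It uses scikit-learn's Multinomial Naive Bayes rather than Mahout itself, and the tiny spam/ham messages are invented purely for illustration.

```python
# Minimal supervised-learning sketch: Naive Bayes spam classification.
# Uses scikit-learn for illustration (not Mahout); the example messages are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_messages = ["win a free prize now", "meeting at noon tomorrow",
                  "cheap loans click here", "lunch with the team"]
train_labels = ["spam", "ham", "spam", "ham"]    # known labels = supervision

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)  # bag-of-words features

model = MultinomialNB()
model.fit(X_train, train_labels)                 # learn a function from labeled examples

X_new = vectorizer.transform(["free prize meeting"])
print(model.predict(X_new))                      # map a new example to a label
```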

Unsupervised Learning

Unsupervised learning makes sense of unlabeled data without having any predefined dataset for its training.
Common approaches to unsupervised learning include:
  • k-means
  • self-organizing maps, and
  • hierarchical clustering

Recommendation

Recommendation is a popular technique that provides personalized recommendations based on user information such as previous purchases, clicks, and ratings.
  • Amazon uses this technique to display a list of recommended items that you might be interested in, drawing information from your past actions. There are recommender engines that work behind Amazon to capture user behavior and recommend selected items based on your earlier actions.
  • Facebook uses the recommender technique to identify and recommend the "People You May Know" list.

Classification

Classification, also known as categorization, is a machine learning technique that uses known data to determine how the new data should be classified into a set of existing categories. Classification is a form of supervised learning.
  • Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The categorization algorithm trains itself by analyzing user habits of marking certain mails as spam. Based on that, the classifier decides whether a future mail should be deposited in your inbox or in the spam folder.
  • iTunes application uses classification to prepare playlists.

Clustering

Clustering is used to form groups or clusters of similar data based on common characteristics. Clustering is a form of unsupervised learning.
  • Search engines such as Google and Yahoo! use clustering techniques to group data with similar characteristics.
  • Newsgroups use clustering techniques to group various articles based on related topics.

Top Machine Learning Algorithms You Must Know

Basically, there are two ways to categorize Machine Learning algorithms you may come across in the field.
  • The first is a grouping of algorithms by the learning style.
  • The second is a grouping of algorithms by a similarity in form or function.

a. Machine Learning Algorithms Grouped by Learning Style

i. Supervised Learning
In supervised machine learning, the input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a point in time. A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves the desired level of accuracy on the training data.

ii. Unsupervised Learning
In unsupervised machine learning, the input data is not labeled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to extract general rules, or it may be done through a mathematical process to systematically reduce redundancy.

iii. Semi-Supervised Learning
In semi-supervised learning, the input data is a mixture of labeled and unlabeled examples. There is a desired prediction problem, but the model must learn the structures to organize the data as well as make predictions.

b. Algorithms Grouped By Similarity

i. Regression Algorithms
Regression algorithms are concerned with modeling the relationship between variables, iteratively refined using a measure of error in the predictions made by the model.
The most popular regression algorithms are:
  • Ordinary Least Squares Regression (OLSR)
  • Logistic Regression
  • Stepwise Regression
  • Multivariate Adaptive Regression Splines (MARS)
  • Locally Estimated Scatterplot Smoothing (LOESS)
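As a concrete example of the first algorithm in this list (OLSR), the sketch below fits an ordinary least squares line with NumPy; the data points are invented for illustration.

```python
# Ordinary Least Squares sketch with NumPy (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # roughly y = 2x

# Design matrix with a column of ones for the intercept.
A = np.vstack([x, np.ones_like(x)]).T
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
residuals = y - (slope * x + intercept)           # the error measure used to refine the fit
print("sum of squared errors:", np.sum(residuals**2))
```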

ii. Instance-based Algorithms 

Instance-based learning models a decision problem with instances of training data that are deemed important or required by the model.
Such methods build up a database of example data and compare new data to that database using a similarity measure, in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods or memory-based learning. The focus is put on the representation of the stored instances and the similarity measures used between instances.
The most popular instance-based algorithms are:
  • k-Nearest Neighbor (kNN)
  • Learning Vector Quantization (LVQ)
  • Self-Organizing Map (SOM)
  • Locally Weighted Learning (LWL)
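A minimal k-Nearest Neighbor sketch illustrating the "compare new data against a database of stored instances" idea; the points, labels, and choice of k are invented for illustration.

```python
# k-Nearest Neighbor sketch: store instances, compare by distance, vote.
import numpy as np
from collections import Counter

# Stored "database" of example instances (feature vector, label) - invented data.
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y_train = np.array(["A", "A", "B", "B"])

def knn_predict(x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)   # similarity measure
    nearest = np.argsort(distances)[:k]                    # best matches
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]                      # winner-take-all

print(knn_predict(np.array([1.1, 1.0])))   # expected: "A"
```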


iii. Decision Tree Algorithms 
Decision tree methods construct a model of decisions made based on actual values of attributes in the data.
Decisions fork in tree structures until a prediction is made for a given record. Decision trees are trained on data for classification and regression problems. They are often fast and accurate and a big favorite in machine learning. The most popular decision tree algorithms are:
  • Classification and Regression Tree (CART)
  • Iterative Dichotomiser 3 (ID3)
  • Decision Stump
  • Conditional Decision Trees


iv. Bayesian Algorithms
Bayesian methods are those that explicitly apply Bayes' Theorem to problems such as classification and regression.
The most popular Bayesian algorithms are: 
  • Naive Bayes
  • Gaussian Naive Bayes
  • Multinomial Naive Bayes
  • Averaged One-Dependence Estimators (AODE)
  • Bayesian Belief Network (BBN)
  • Bayesian Network (BN)


v. Clustering Algorithms 
Clustering, like regression, describes both the class of problem and the class of methods.

Clustering methods are typically organized by their modeling approach, such as centroid-based or hierarchical. All methods are concerned with using the inherent structures in the data to best organize it into groups of maximum commonality.
The most popular clustering algorithms are: 
  • k-Means
  • k-Medians
  • Expectation Maximisation (EM)
  • Hierarchical Clustering


vi. Association Rule Learning Algorithms
Association rule learning methods extract the rules that best explain observed relationships between variables in data.

These rules can discover important and useful associations in large multidimensional datasets that can be exploited by an organization.
The most popular association rule learning algorithms are:
  • Apriori algorithm
  • Eclat algorithm


vii. Artificial Neural Network Algorithms
Artificial neural networks are models inspired by the structure of biological neural networks.
They are a class of pattern-matching methods commonly used for regression and classification problems, and they form an enormous subfield comprising hundreds of algorithms and variations.
The most popular artificial neural network algorithms are:
  • Perceptron
  • Back-Propagation
  • Hopfield Network
  • Radial Basis Function Network (RBFN)
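A minimal training sketch for the first algorithm in the list above, the perceptron; the AND-gate data is a standard toy example, not taken from the original text.

```python
# Perceptron sketch: learn the logical AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                    # AND labels

w = np.zeros(2)
b = 0.0
lr = 0.1                                      # learning rate

for _ in range(20):                           # a few passes over the data
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0     # step activation
        error = target - pred
        w += lr * error * xi                  # adjust weights toward the target
        b += lr * error

print([1 if xi @ w + b > 0 else 0 for xi in X])   # should print [0, 0, 0, 1]
```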


viii. Deep Learning Algorithms 
Deep learning methods are a modern update to artificial neural networks that exploit abundant, cheap computation.
They are concerned with building much larger and more complex neural networks.
The most popular deep learning algorithms are:
  • Deep Belief Networks (DBN)
  • Convolutional Neural Network (CNN)
  • Stacked Auto-Encoders
  • Deep Boltzmann Machine (DBM)


ix. Dimensionality Reduction Algorithms 
Like clustering methods, dimensionality reduction seeks and exploits the inherent structure in the data, but in this case in order to summarize or describe it using less information.
This can be useful for visualizing high-dimensional data or for simplifying data that can then be used in a supervised learning method.
Many of these methods can be adapted for use in classification and regression. The most popular dimensionality reduction algorithms are:
  • Principal Component Regression (PCR)
  • Partial Least Squares Regression (PLSR)
  • Sammon Mapping
  • Multidimensional Scaling (MDS)
  • Projection Pursuit
  • Linear Discriminant Analysis (LDA)
  • Mixture Discriminant Analysis (MDA)
  • Quadratic Discriminant Analysis (QDA)
  • Flexible Discriminant Analysis (FDA)

List of Common Machine Learning Algorithms

a. Naïve Bayes Classifier Machine Learning Algorithm

It would be difficult, if not impossible, to classify a web page, a document, an email, or other lengthy text manually. This is where the Naïve Bayes Classifier machine learning algorithm comes to the rescue. A classifier is a function that assigns a class label to an element of a population. For instance, spam filtering is a popular application of the Naïve Bayes algorithm: the spam filter is a classifier that assigns the label "Spam" or "Not Spam" to every email.
It is among the most popular learning methods grouped by similarity and works on the well-known Bayes Theorem of probability. It performs a simple classification of words and is often used for the subjective analysis of content.

b. K Means Clustering Machine Learning Algorithm

K-means is a widely used unsupervised machine learning algorithm for cluster analysis. It is a non-deterministic, iterative method that operates on a given data set with a pre-defined number of clusters, k. The output of the K-means algorithm is k clusters, with the input data partitioned among them.
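A minimal from-scratch k-means sketch showing the iterative assign/update loop described above; the 2-D points and the choice k = 2 are invented for illustration.

```python
# k-means sketch: assign points to the nearest centroid, then recompute centroids.
import numpy as np

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (20, 2)),      # two invented blobs
                    rng.normal(5, 1, (20, 2))])
k = 2
centroids = points[rng.choice(len(points), k, replace=False)]

for _ in range(10):                                  # iterate a fixed number of times
    # Assignment step: nearest centroid for every point.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: centroids move to the mean of their assigned points.
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print("cluster centers:\n", centroids)
```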

c. Support Vector Machine Learning Algorithm

SVM is a supervised machine learning algorithm for classification or regression problems. The dataset teaches the SVM about the classes so that it can classify new data. It works by finding a hyperplane that separates the training data into classes. Since many such separating hyperplanes may exist, the SVM tries to maximize the distance between the classes; this is referred to as margin maximization. If the hyperplane that maximizes the distance between the classes is identified, the chance of generalizing well to unseen data is increased.
SVMs are classified into two categories:
  • Linear SVMs – the training data can be separated by a linear hyperplane.
  • Non-Linear SVMs – the training data cannot be separated by a linear hyperplane, so a kernel function is used to map it into a space where it can be.
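A small scikit-learn sketch contrasting a linear SVM with a non-linear (RBF-kernel) SVM on an invented XOR-style dataset, which no single hyperplane can separate; the gamma value is an arbitrary illustrative choice.

```python
# Linear vs. non-linear SVM sketch (scikit-learn, invented XOR-style data).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])              # XOR labels: not linearly separable

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear SVM accuracy:", linear_svm.score(X, y))   # cannot reach 1.0 on XOR
print("RBF SVM accuracy:", rbf_svm.score(X, y))         # typically 1.0
```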

d. Apriori Machine Learning Algorithm

Apriori is an unsupervised machine learning algorithm used to generate association rules from a given data set. An association rule implies that if item A occurs, then item B also occurs with a certain probability. Most of the association rules generated are in IF–THEN format.
For example: IF people buy an iPad, THEN they also buy an iPad case to protect it. The basic principle on which the Apriori algorithm works is: if an item set occurs frequently, then all subsets of that item set also occur frequently.
Conversely, if an item set occurs infrequently, then all supersets of that item set occur infrequently.
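A small sketch of the Apriori principle: count item-set support over a handful of invented transactions, keep only the item sets above a minimum support, and build larger candidates only from frequent smaller ones.

```python
# Apriori sketch: frequent item sets from invented transactions.
from itertools import combinations

transactions = [
    {"iPad", "iPad Case", "screen wipe"},
    {"iPad", "iPad Case"},
    {"iPad", "headphones"},
    {"headphones", "screen wipe"},
]
min_support = 2   # an item set must appear in at least 2 transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# 1-item sets that are frequent.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
print("frequent 1-item sets:", [set(s) for s in frequent])

# Candidate 2-item sets built only from frequent 1-item sets (the Apriori principle).
candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == 2}
frequent2 = [c for c in candidates if support(c) >= min_support]
print("frequent 2-item sets:", [set(s) for s in frequent2])   # {iPad, iPad Case}
```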

e. Linear Regression Machine Learning Algorithm

Linear regression shows the relationship between two variables and how a change in one variable impacts the other. The algorithm models the impact on the dependent variable of changing the independent variable. The independent variables are referred to as explanatory variables, as they explain the factors that impact the dependent variable. The dependent variable is often referred to as the factor of interest or the predicted variable.

f. Decision Tree Machine Learning Algorithm

A decision tree is a graphical representation that makes use of branching to exemplify all possible outcomes of a decision.
In a decision tree, an internal node represents a test on an attribute, each branch represents an outcome of that test, and a leaf node represents a class label, i.e. the decision made after evaluating all the attributes. A classification rule is represented by the path from the root to a leaf node.
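A short scikit-learn sketch that trains a decision tree and prints its root-to-leaf structure; the small weather-style dataset and feature names are invented for illustration.

```python
# Decision tree sketch (scikit-learn, invented numeric data).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [temperature, humidity]; label: 1 = play outside, 0 = stay in (invented).
X = [[30, 85], [27, 90], [22, 70], [18, 65], [25, 80], [20, 60]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node is a test on an attribute; each leaf is a class label.
print(export_text(tree, feature_names=["temperature", "humidity"]))
print(tree.predict([[21, 68]]))   # follow the path from the root to a leaf
```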

g. Random Forest Machine Learning Algorithm

Random forest is a go-to machine learning algorithm that uses a bagging approach to create a collection of decision trees, each built from a random subset of the data. The model is trained several times on random samples of the dataset in order to achieve good prediction performance. In this ensemble learning method, the outputs of all the decision trees are combined to make the final prediction, which is derived by polling the results of each decision tree.
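A short sketch of the bagging-and-voting idea using scikit-learn's RandomForestClassifier, reusing the same style of invented data as the decision tree example above.

```python
# Random forest sketch: many trees on random samples, final prediction by voting.
from sklearn.ensemble import RandomForestClassifier

X = [[30, 85], [27, 90], [22, 70], [18, 65], [25, 80], [20, 60]]
y = [0, 0, 1, 1, 0, 1]

# 25 trees, each trained on a bootstrap sample of the data (bagging).
forest = RandomForestClassifier(n_estimators=25, bootstrap=True, random_state=0)
forest.fit(X, y)

# The forest's prediction is the majority vote ("poll") of the individual trees.
print(forest.predict([[21, 68]]))
```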

h. Logistic Regression Machine Learning Algorithm

The name of this algorithm can be a little confusing, as the Logistic Regression algorithm is used for classification tasks rather than regression problems. The name 'Regression' implies that a linear model is fit in the feature space.
The algorithm applies a logistic function to a linear combination of features in order to predict the outcome of a categorical dependent variable based on predictor variables.
The probabilities describing the outcome of a single trial are modeled as a function of the explanatory variables.
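A small sketch of the logistic function applied to a linear combination of features, using NumPy; the weights, intercept, and observation are invented, standing in for values a fitted model would produce.

```python
# Logistic regression sketch: probability = logistic(linear combination of features).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented model: two features with fitted-looking weights and an intercept.
weights = np.array([0.8, -0.5])
intercept = 0.1

x_new = np.array([2.0, 1.0])                 # one new observation
z = weights @ x_new + intercept              # linear model in feature space
probability = sigmoid(z)                     # squash into (0, 1)

print(f"P(class = 1) = {probability:.3f}")
print("predicted class:", int(probability >= 0.5))
```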

Artificial Neural Network (ANN) in Machine Learning

An Artificial Neural Network (ANN) is a computational model based on the structure and functions of biological neural networks. It processes information in a way similar to the human brain. It includes a large number of connected processing units that work together to process information and generate meaningful results from it. In this tutorial, we will take you through a complete introduction to the Artificial Neural Network, the layers of an ANN, and the benefits and limitations of ANNs.

Artificial neural networks (ANNs), or simply neural networks, are computational algorithms intended to simulate the behavior of biological systems composed of "neurons". ANNs are computational models inspired by an animal's central nervous system, capable of machine learning as well as pattern recognition. They are presented as systems of interconnected "neurons" that can compute values from inputs.

A neural network is a directed graph. It consists of nodes, which in the biological analogy represent neurons, connected by arcs that correspond to dendrites and synapses. Each arc is associated with a weight, and at each node an activation function is applied to the values received as input along the incoming arcs, adjusted by the weights of those arcs.
A neural network may contain the following 3 layers:
  • Input layer – The activity of the input units represents the raw information that can feed into the network.
  • Hidden layer – The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between the input and the hidden units. There may be one or more hidden layers.
  • Output layer – The behavior of the output units depends on the activity of the hidden units and the weights between the hidden and output units.

Artificial Neural Network Layers

An artificial neural network is typically organized in layers. Layers are made up of many interconnected 'nodes', each of which contains an 'activation function'. A neural network may contain the following three layers:

a. Input layer

The purpose of the input layer is to receive as input the values of the explanatory attributes for each observation. Usually, the number of input nodes is equal to the number of explanatory variables. The input layer presents the patterns to the network, which passes them on to one or more hidden layers.
The nodes of the input layer are passive, meaning they do not change the data. They receive a single value on their input and duplicate it to their many outputs: each value is duplicated and sent to all the hidden nodes.

b. Hidden layer

The hidden layers apply transformations to the input values inside the network. Each hidden node is connected by incoming arcs from input nodes or other hidden nodes, and by outgoing arcs to output nodes or other hidden nodes. In the hidden layer, the actual processing is done via a system of weighted 'connections'. There may be one or more hidden layers. The values entering a hidden node are multiplied by weights, a set of numbers stored in the program, and the weighted inputs are then added to produce a single number.

c. Output layer

The hidden layers then link to an 'output layer'. The output layer receives connections from the hidden layers or from the input layer and returns an output value that corresponds to the prediction of the response variable. In classification problems, there is usually only one output node. The active nodes of the output layer combine and transform the data to produce the output values.
The ability of the neural network to provide useful data manipulation lies in the proper selection of the weights, which is different from conventional information processing.

Structure of a Neural Network

The structure of a neural network, also referred to as its 'architecture' or 'topology', consists of the number of layers, the elementary units, and the interconnection weight adjustment mechanism. The choice of structure determines the results that can be obtained and is the most critical part of the implementation of a neural network.
The simplest structure is one in which the units are distributed in two layers: an input layer and an output layer. Each unit in the input layer has a single input and a single output equal to that input. The output unit has all the units of the input layer connected to its input, with a combination function and a transfer function. There may be more than one output unit. In this case, the resulting model is a linear or logistic regression, depending on whether the transfer function is linear or logistic, and the weights of the network are the regression coefficients.
Adding one or more hidden layers between the input and output layers, and units within those layers, increases the predictive power of the neural network, but the number of hidden layers should be kept as small as possible. This ensures that the neural network does not merely store all the information from the learning set but can generalize from it, avoiding overfitting.
Overfitting occurs when the weights make the system learn the details of the learning set instead of discovering its underlying structures. This happens when the size of the learning set is too small in relation to the complexity of the model.
Whether or not a hidden layer is present, the output layer of the network can have many units when there are many classes to predict.

Advantages and Disadvantages of Neural Networks

Let us see few advantages and disadvantages of neural networks:
  • Neural networks perform well with linear and nonlinear data but a common criticism of neural networks, particularly in robotics, is that they require a large diversity of training for real-world operation. This is so because any learning machine needs sufficient representative examples in order to capture the underlying structure that allows it to generalize to new cases.
  • Neural networks keep working even if one or a few units fail to respond, but to implement large and effective software neural networks, much processing and storage resource needs to be committed. While the brain has hardware tailored to the task of processing signals through a graph of neurons, simulating even a very simplified form on Von Neumann technology may compel a neural network designer to fill millions of database rows for its connections, which can consume vast amounts of computer memory and hard disk space.
  • A neural network learns from the analyzed data and does not need to be reprogrammed, but neural networks are referred to as "black box" models and provide very little insight into what these models really do. The user just needs to feed it input, watch it train, and await the output.

Convolutional Neural Networks Architecture and Applications

1. Objective

In this blog, we will study Convolutional Neural Networks (CNNs), learn their important terminology and architecture, and look at their applications and advantages.

2. Introduction to Convolutional Neural Networks

In machine learning, a CNN is a class of deep, feed-forward network. It has been successfully applied to analyzing visual imagery. A CNN is made up of neurons that have learnable weights and biases: each neuron receives some inputs and performs a dot product. Everything that applies to ordinary neural networks still applies.

3. History of CNN

Convolutional networks were inspired by biological processes: the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap so that together they cover the entire visual field.

4. Important Terminologies of Convolutional Neural Networks

a. Shared Weights

In a CNN, each filter is replicated across the entire visual field. The replicated units share the same parameterization (weights and bias) and form a feature map.

Hidden units belonging to the same feature map share their weights, which are constrained to be identical. We use gradient descent to learn shared parameters, and the gradient of a shared weight is simply the sum of the gradients of the parameters being shared.

b. Details and Notation

First, we convolve the input image with a linear filter, then add a bias term, and then apply a non-linear function.
If we denote the k-th feature map at a given layer as h^k, whose filters are determined by the weights W^k and bias b_k, then the feature map is obtained as follows (for a tanh non-linearity):
h^k_ij = tanh((W^k * x)_ij + b_k)
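To make the formula concrete, here is a small NumPy sketch that computes one feature map by sliding a filter over an invented grayscale image, adding a bias, and applying tanh; the image, filter, and bias values are all made up for illustration.

```python
# One CNN feature map: convolve, add bias, apply tanh (NumPy sketch, invented data).
import numpy as np

image = np.arange(36, dtype=float).reshape(6, 6)   # a 6x6 "image"
W_k = np.array([[1.0, 0.0],                        # a 2x2 filter for feature map k
                [0.0, -1.0]])
b_k = 0.5                                          # bias for feature map k

out = 6 - 2 + 1                                    # output size is m - n + 1 = 5
h_k = np.zeros((out, out))
for i in range(out):
    for j in range(out):
        patch = image[i:i + 2, j:j + 2]
        h_k[i, j] = np.tanh(np.sum(W_k * patch) + b_k)   # h^k_ij = tanh((W^k * x)_ij + b_k)

print(h_k.shape)   # (5, 5)
```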

5. Architecture of Convolutional Neural Networks

A CNN consists of a number of convolutional and subsampling layers, optionally followed by fully connected layers. The input to a convolutional layer is an m x m x r image, where m is the height and width of the image and r is the number of channels.

For example:
An RGB image has r = 3. A convolutional layer will have k filters of size n x n x q, where:
n is smaller than the dimension of the image;
q can either be the same as the number of channels r or smaller;
The size of the filters gives rise to the locally connected structure. Each filter is convolved with the image to produce k feature maps of size (m − n + 1) x (m − n + 1). Each map is then subsampled, typically with mean or max pooling, over p x p contiguous regions, where p ranges from 2 for small images and is usually not more than 5 for larger inputs.
After the convolutional layers there may be any number of fully connected layers, one connecting to another. These fully connected layers are identical to the layers in a standard multilayer network.
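A sketch of this layer ordering using the Keras API (assuming TensorFlow is installed); the input size, filter counts, and kernel sizes are arbitrary illustrative choices rather than values prescribed by the text.

```python
# CNN architecture sketch in Keras: conv -> pool -> conv -> pool -> fully connected.
# Assumes TensorFlow/Keras is available; layer sizes are illustrative only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),               # m x m x r image, r = 3 (RGB)
    tf.keras.layers.Conv2D(16, (5, 5), activation="tanh"),  # k filters of size n x n
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),         # subsample over p x p regions
    tf.keras.layers.Conv2D(32, (5, 5), activation="tanh"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),        # standard fully connected layer
])
model.summary()
```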

6. Advantages of Convolutional Neural Networks

  • Once trained, the predictions are fast.
  • A CNN can be trained with any number of inputs and layers.
  • Neural networks work best when there are more data points.

7. Applications of Convolutional Neural Networks

a. Aircraft ejection seat testing.

b. Effective use of word order for text categorization with convolutional neural networks.

c. Semi-supervised convolutional neural networks for text categorization via region embedding.

d. Sensitivity analysis of convolutional neural networks for sentence classification.

e. Convolutional neural networks for modelling sentences.

f. Deep convolutional neural networks for sentiment analysis of short texts.

Recurrent Neural Networks – Deep Learning Fundamentals

1. Objective

In this blog, we will learn about Recurrent Neural Networks (RNNs) and study the important concepts related to them.

2. Introduction to Recurrent Neural Networks

A recurrent neural network is a type of advanced artificial neural network that involves directed cycles in memory. This gives the network the ability to build on earlier types of networks, which work with fixed-size input and output vectors.

3. Understanding the Recurrent Neural Networks

Let's say we have a task: to predict the next word in a sentence. To accomplish it, we could try to use a multilayer perceptron (MLP). An MLP has three kinds of layers: an input layer, a hidden layer, and an output layer. The input layer receives the input, the hidden layer activations are applied, and then we finally receive the output.
With a deeper network, these activations are sent on to the next hidden layer, and the successive activations produce the output. Each hidden layer is characterized by its own weights and biases.

Since the hidden layers have their own weights and activations, they behave independently. But our main objective is to identify a relationship between successive inputs. Can we supply the inputs directly to the hidden layers? Yes, we can.

However, these hidden layers are different, because their weights and biases are different. Each layer is independent, so we cannot combine them together. To combine the hidden layers, they need the same weights and bias.

Once all layers share the same weights and bias, we can roll all these hidden layers up into a single recurrent layer.

It is like supplying the input to the same hidden layer at every time step. The weights of the recurrent neuron are the same at all time steps, since it is a single neuron now. A recurrent neuron therefore stores the state of a previous input and combines it with the current input, preserving some relationship between the current input and the previous ones.
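A minimal NumPy sketch of a single recurrent step: the same weights are reused at every time step, and the hidden state carries information from previous inputs forward. All sizes and values are invented for illustration.

```python
# Recurrent step sketch: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b), same weights every step.
import numpy as np

rng = np.random.default_rng(1)
W_x = rng.normal(size=(3, 2))   # input -> hidden weights (shared across time steps)
W_h = rng.normal(size=(3, 3))   # hidden -> hidden (recurrent) weights
b = np.zeros(3)

inputs = [rng.normal(size=2) for _ in range(4)]   # a short invented sequence
h = np.zeros(3)                                   # initial hidden state

for x_t in inputs:
    h = np.tanh(W_x @ x_t + W_h @ h + b)          # state combines current input + previous state
    print(h)
```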

4. What can RNNs do?

RNNs have shown great success in many NLP tasks. The most commonly used type of RNN is the LSTM, which is much better at capturing long-term dependencies than vanilla RNNs are.

5. Why Recurrent Neural Networks?

Recurrent connections offer many advantages. They are very helpful in image recognition and in using context information: as the number of time steps increases, each unit is influenced by a larger neighborhood. With that information, recurrent networks can watch large regions of the input space, whereas in a CNN this ability is limited to units in higher layers. Furthermore, recurrent connections increase the network depth while keeping the number of parameters low through weight sharing; reducing the number of parameters is also a modern trend in CNN architectures.

Additionally, recurrent connections yield the ability to handle sequential data, which is useful for many tasks. Recurrent connections between neurons are also biologically inspired and are used for many tasks in the brain, so using such connections can enhance artificial networks and bring interesting behaviors. The last big advantage is that an RNN offers a kind of memory, which can be used in many applications.

6. Training RNNs

Training an RNN is similar to training a traditional neural network: we also use the backpropagation algorithm. However, because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step but also on those of the previous time steps.

For example:
To calculate the gradient at a later time step, we would need to backpropagate 3 steps and sum up the gradients. This is called Backpropagation Through Time (BPTT). If this doesn't make a whole lot of sense yet, don't worry, we'll have a whole post on the gory details. For now, just be aware that vanilla RNNs trained with BPTT have difficulties learning long-term dependencies due to what is called the vanishing/exploding gradient problem. There exists some machinery to deal with these problems, and certain types of RNNs (like LSTMs) were specifically designed to get around them.

The training of almost all networks is done by back-propagation, but with recurrent connections it has to be adapted. This is done by unfolding the net: a network consisting of one recurrent layer and one feed-forward layer can be unfolded into k instances of the recurrent layer.

For example, the network might be unfolded with a depth of k = 3. After unfolding, the network can be trained in the same way as a feed-forward network with backpropagation, except that each epoch has to run through each unfolded layer. The algorithm for recurrent nets is then called backpropagation through time (BPTT).

7. RNN Extensions

Over the years, researchers have developed more sophisticated types of RNNs to deal with some of the shortcomings of the vanilla RNN model.

a. Bidirectional RNNs

These are based on the idea that the output at a given time may depend not only on the previous elements in the sequence but also on future elements.

For example:
To predict a missing word in a sequence, you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other, one reading the sequence forward and one backward. The output is then computed based on the hidden states of both RNNs.

b. Deep (Bidirectional) RNNs

These are similar to Bidirectional RNNs, only that we now have multiple layers per time step. In practice, this gives us a higher learning capacity.

c. LSTM networks

LSTMs don't have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state.

The memory in LSTMs is held in cells. Internally, these cells decide what to keep in memory; they then combine the previous state, the current memory, and the input.

It turns out that these types of units are very effective at capturing long-term dependencies.

LSTMs can be quite confusing in the beginning, but many excellent explanations are available if you're interested in learning more.

8. Advantages of RNN

a. Store Information

An RNN can use its feedback connections to store information over time in the form of activations. This ability is significant for many applications; recurrent networks are often described as having a form of memory.

b. Learn Sequential Data

An RNN can handle sequential data of arbitrary length. A default feed-forward network can only map one fixed-size input to one fixed-size output, whereas with the recurrent approach one-to-many, many-to-one, and many-to-many mappings from inputs to outputs are also possible.

One example of a one-to-many network is labeling an image with a sentence. A many-to-one approach could handle a sequence of images and produce one sentence for it. Finally, many-to-many approaches can be used for language translation; another many-to-many use case is labeling each frame of a video sequence.

9. Applications of RNN

Particularly, RNNs are useful in training on any type of sequential data.
For example:
It would make sense to use an RNN for tasks such as image/video captioning, word prediction, translation, and image processing. However, an RNN can also be trained on non-sequential data in a non-sequential manner. Not too long ago I implemented an RNN for a computational neuroscience project. In case you want to implement your very first RNN, here are some tips:

a. Unfold your network

This allows you to visualize how the network interacts with itself at adjacent time-steps, and also how the error is back-propagated through the system (BPTT).

The rule of thumb is: any connection at time step 't' that isn't feed-forward should be connected to the next time step at 't+1'.

b. Keep track of your back-propagated errors

Don't duplicate parameters. Use one set of weights for all your states (time-steps). This ensures that you are using a minimal amount of memory and that your weights are the same across all states.
c. RNNs are used in speech processing, non-Markovian control, and music composition. In addition, RNNs are used successfully on sequential data such as handwriting recognition and speech recognition.
d. The advantage compared to a feed-forward network is that an RNN can handle sequential data.
A single RNN can be used for sequence labeling; the most successful applications of RNNs concern tasks like handwriting recognition and speech recognition.
e. They are also used in clinical decision support systems, for example with networks based on the Jordan/Elman architecture. Furthermore, recurrent fuzzy networks for the control of dynamic systems have been proposed as a newer application, using combinations of RNNs with CNNs.
f. A great application is in collaboration with Natural Language Processing (NLP). Many people on the internet have demonstrated that an RNN can represent a language model. These language models can take as input, say, a large set of Shakespeare's poems, and after training they can generate their own Shakespearean poems that are very hard to differentiate from the originals!

Artificial Neural Network Applications in the Real World

1. Objective

Artificial Neural Networks have seen an explosion of interest over the last few years and are being successfully applied across an extraordinary range of problem domains, such as handwriting recognition, image compression, the travelling salesman problem, and stock exchange prediction. In this Machine Learning tutorial, we will discuss all these Artificial Neural Network applications in detail.

2. Artificial Neural Network Applications

As we have already seen the introduction to Artificial neural networks, let us now see the major Artificial Neural Network Applications.
  • Handwriting Recognition – Handwriting recognition has become very important because handheld devices like the Palm Pilot accept handwritten input. Neural networks can be used to recognize handwritten characters.
  • Traveling Salesman Problem – Neural networks can also solve the traveling salesman problem, but only to a certain degree of approximation.
  • Image Compression – Neural networks can receive and process vast amounts of information at once, which makes them useful in image compression. With the Internet explosion and more sites using more images, using neural networks for image compression is worth a look.
  • Stock Exchange Prediction – The day-to-day business of the stock market is very complicated; many factors weigh in on whether a given stock will go up or down on any given day. Neural networks can examine a lot of information quickly and sort it all out, so we can use them to predict stock prices.

Below is a description of each ANN application to provide a proper understanding.

2.1. Handwriting Recognition

The idea of using feed forward networks to recognize handwritten characters is straightforward. The bitmap pattern of the handwritten character is input, with the correct letter or digit as the desired output. Such programs need the user to train the network by providing the program with their handwritten patterns.

The two common applications of handwriting recognition are:
  • Optical character recognition for data entry
  • Validation of signatures on bank cheques

Feed-forward networks have the following characteristics:

a. First, they arrange perceptrons in layers, with the first layer taking in inputs and the last layer producing outputs. The middle layers have no connection with the external world, and hence we call them hidden layers.

b. Each perceptron in one layer is connected to every perceptron on the next layer. Hence information is “fed forward” from one layer to the next in a continuous manner. This explains why we call these networks feed-forward networks.

c. There is no connection among perceptrons in the same layer.

2.2. Traveling Salesman Problem

The traveling salesman problem refers to finding the shortest possible route that visits all cities in a given area. We can use neural networks to approximate a solution to this problem.

To solve the problem, a neural network approach (or a related heuristic such as a genetic algorithm) starts with a random configuration of the network. The algorithm repeatedly chooses a city at random and finds the nearest part of the network, and this process continues many times. After every iteration, the shape of the network changes, and the network converges to a ring around all the cities.

The algorithm minimizes the length of the ring. In this way, we can approximate a solution to the traveling salesman problem.

2.3. Image Compression

A neural network used for image compression has input and output layers of equal size, with a smaller intermediate layer between them. The ratio of the input layer size to the intermediate layer size is the compression ratio of the network.

We can get the compression ratio for image compression using the following formula:

Compression Ratio = Input Layer Size / Intermediate Layer Size

The idea behind data-compression neural networks is to store, encode, and re-create the actual image. Thus, in such a network, the input itself can be used for training purposes.

2.4. Stock Exchange Prediction

The prediction accuracy of neural networks has made them useful for stock market prediction. For large companies, making stock exchange predictions is common, using parameters such as current trends, the political situation, public opinion, and economists' advice.

We can also use neural networks in currency prediction, business failure prediction, debt risk assessment, and credit approval.

Neural Network Algorithms – Learn How To Train ANN

1. Objective

In this Machine Learning tutorial, we will cover the top Neural Network Algorithms. These algorithms are used to train Artificial Neural Networks. This blog gives you a detailed look at Gradient Descent, Evolutionary Algorithms, and the Genetic Algorithm for neural networks.

2. Top Neural Network Algorithms

Learning in a neural network takes place on the basis of a sample of the population under study. During the course of learning, the value delivered by the output unit is compared with the actual value, and the weights of all units are then adjusted to improve the prediction.

Many algorithms are available for training Artificial Neural Networks. Let us now see some important ones:
  • Gradient Descent – Used to find the local minimum of a function.
  • Evolutionary Algorithms – Based on the concept of natural selection or survival of the fittest in Biology.
  • Genetic Algorithm – Enables the most appropriate rules for the solution of a problem to be selected, so that they pass their 'genetic material' on to 'child' rules. We will learn about them in detail below.

2.1. Gradient Descent

We use the gradient descent algorithm to find the local minimum of a function. The algorithm converges to a local minimum by taking steps proportional to the negative of the gradient of the function. To find a local maximum, take steps proportional to the positive gradient of the function instead; this is the gradient ascent process.

In linear models, the error surface is a well-defined and well-known mathematical object in the shape of a parabola, and the lowest point can be found by calculation. Unlike linear models, neural networks are complex nonlinear models whose error surface has an irregular layout, crisscrossed with hills, valleys, plateaus, and deep ravines. To find the lowest point on this surface, for which no map is available, it must be explored.

In this algorithm, you move over the error surface by following the direction of steepest descent, which offers the possibility of reaching the lowest possible point. You then have to work out the optimal rate at which to travel down the slope.

The speed of descent is proportional to the slope of the surface and to the learning rate. The learning rate controls the extent to which the weights are modified during the learning process.

Hence, the momentum of a neural network can also affect the performance of the multilayer perceptron.
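A tiny sketch of gradient descent on a simple one-dimensional error surface, f(w) = (w - 3)^2, to illustrate steps proportional to the negative gradient and the role of the learning rate; the function and values are an invented illustration, not a neural network error surface.

```python
# Gradient descent sketch on f(w) = (w - 3)^2, whose minimum is at w = 3.

def gradient(w):
    return 2.0 * (w - 3.0)      # derivative of (w - 3)^2

w = 0.0                         # start somewhere on the error surface
learning_rate = 0.1             # controls how far each step modifies w

for step in range(25):
    w -= learning_rate * gradient(w)   # move against the slope

print(round(w, 4))              # close to 3.0, the minimum
```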

2.2. Evolutionary Algorithms

This algorithm is based on the concept of natural selection, or survival of the fittest, in biology. The concept of natural selection states that, for a given population, environmental conditions apply a pressure that results in the rise of the fittest individuals in that population.

To measure fitness in a given population, a fitness function is applied as an abstract measure.

In the context of evolutionary algorithms, recombination is an operator applied to two or more candidates (known as parents) that results in one or more new candidates (known as children). Mutation is applied to a single candidate and results in a new candidate. By applying recombination and mutation, we obtain a set of new candidates to place in the next generation based on their fitness.

The two basic elements of evolutionary algorithms are:
  • Variation operators (recombination and mutation)
  • Selection process (selection of the fittest)
The common features of evolutionary algorithms are:
  • Evolutionary algorithms are population based.
  • Evolutionary algorithms use recombination to mix candidates of a population and create new candidates.
  • Evolutionary algorithms are based on random selection.

Hence, depending on the details and the problems being addressed, various forms of evolutionary algorithms are used.

Some common evolutionary algorithms are:
  • Genetic Algorithm – Provides solutions to optimization problems with the help of natural evolution processes such as mutation, recombination, crossover, and inheritance.
  • Genetic Programming – Provides a solution in the form of computer programs; the accuracy of a program in solving a computational problem measures its fitness.
  • Evolutionary Programming – Used to develop AI in a simulated environment.
  • Evolution Strategy – An optimization algorithm grounded in the concepts of adaptation and evolution in biological science.
  • Neuroevolution – Used to train neural networks; genomes specifying the structure and connection weights are used to develop the networks.

Of all these, the genetic algorithm is the most common evolutionary algorithm.

2.3. Genetic Algorithm

Genetic algorithms were developed by John Holland's group from the early 1970s. They enable the most appropriate rules for the solution of a problem to be selected, so that they pass their 'genetic material' (their variables and categories) on to 'child' rules.

Here, a rule refers to a set of categories of variables: for example, customers aged between 36 and 50, having financial assets of less than $20,000 and a monthly income of more than $2,000.

A rule is the equivalent of a branch of a decision tree; it is also analogous to a gene. Genes can be understood as units inside cells that control how living organisms inherit features from their parents. Genetic algorithms thus aim to reproduce the mechanisms of natural selection by selecting the rules best adapted to prediction and by crossing and mutating them until a good predictive model is obtained.

Together with neural networks, they form a family of algorithms that mimic natural mechanisms to explain phenomena that are not necessarily natural themselves.

The steps for executing genetic algorithms are:
  • Step 1: Random generation of initial rules – The rules are first generated with the constraint that they must all be distinct. Each rule contains a number of variables chosen by the user.
  • Step 2: Selection of the best rules – The rules are checked against the aim by a fitness function that guides the evolution toward the best rules. The best rules maximize the fitness function and are retained with a probability that increases as the rule improves. Some rules will disappear, while others will be selected several times.
  • Step 3: Generation of new rules by mutation or crossing – The chosen rules are randomly mutated or crossed, and the algorithm returns to step 2 until execution stops. Mutation is the replacement of a variable or a category of the original rule with another.

Crossing of two rules is the exchange of some of their variables or categories to produce two new rules. Crossing is more common than mutation.

The algorithm ends when one of the following two conditions is met:
  • The specified number of iterations is reached.
  • Starting from the generation of rank n, the rules of generations n, n-1, and n-2 are (almost) identical.
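A toy genetic-algorithm sketch following the three steps above: random initial candidates, fitness-based selection, and new candidates by crossing and mutation. The fitness function (maximize the number of 1s in a bit string) and all parameters are invented for illustration.

```python
# Toy genetic algorithm: evolve bit strings toward all ones (invented fitness function).
import random

random.seed(0)
LENGTH, POP_SIZE, GENERATIONS = 12, 20, 30

def fitness(candidate):                      # Step 2: fitness guides the selection
    return sum(candidate)

def crossover(a, b):                         # Step 3: crossing exchanges parts of two rules
    point = random.randrange(1, LENGTH)
    return a[:point] + b[point:]

def mutate(candidate, rate=0.05):            # Step 3: mutation replaces single elements
    return [bit ^ 1 if random.random() < rate else bit for bit in candidate]

# Step 1: random generation of the initial population.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Keep the fitter half (selection), then refill by crossing and mutating survivors.
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children

print(max(fitness(c) for c in population))   # should be at or near LENGTH (all ones)
```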

Deep Learning & Neural Networks in Machine Learning

1. Objective

In this tutorial, we will focus on Deep Learning and study its use cases, structures, and applications. Deep Learning is a very important topic that deals with artificial intelligence and machine learning.

2. Introduction to Deep Learning

Machine learning focuses on solving real-world problems and takes a few ideas from artificial intelligence. It does this through neural networks, which are designed to mimic human decision-making capabilities.

Deep learning is a narrow subset of machine learning tools and techniques, and we apply it to solve any problem that requires thought, human or artificial. Any deep neural network will consist of three types of layers:
  • The Input Layer
  • The Hidden Layer
  • The Output Layer


1. The Input Layer
It receives all the inputs; it is the first layer of the network.
2. The Hidden Layers
All the layers between the input and output layers are called hidden layers. There can be any number of hidden layers, and the number of hidden layers and of perceptrons in each layer will depend on the use case you are trying to solve.
3. The Output Layer
It is the last layer of the network and provides the desired output.

In deep learning, we feed a computer system a lot of data, which it then uses to make decisions about other data. This data feeding takes place through neural networks. Deep Learning is crucial because it focuses on developing these networks, which are therefore known as Deep Neural Networks.

3. Brief Introduction to Neural Networks

  • A neural network is a beautiful, biologically inspired programming paradigm that enables a computer to learn from observational data.
  • It also provides excellent solutions to many problems in image recognition, speech recognition, and natural language processing.

4. Deep Learning Use – Case

In this use case, we pass high-dimensional data to the input layer. To match the dimensionality of the input data, the input layer contains multiple sub-layers of perceptrons so that it can consume the entire input.

The output received from the input layer contains patterns and is able to identify the edges of the images based on contrast levels.

This output is fed to hidden layer 1, where it is able to identify various facial features like eyes, nose, and ears.

Now this is fed to hidden layer 2, where it is able to form entire faces. Then the output of layer 2 is sent to the output layer.

Finally, the output layer performs classification based on the result obtained from the previous layer and predicts the name.

5. Deep Learning Applications in Real Life

a. Navigation of Self-driving cars

Although it is too early to expect to see someone reading a newspaper while their car drives itself, self-driving cars are the future. Using sensors and onboard analytics, cars are learning to recognize obstacles and react to them appropriately using Deep Learning.

b. Recolouring Black and White Images

For this, computers need to recognize objects and learn what they should look like to humans. Computers can then be taught to return the colors of black-and-white pictures and videos.
Won’t it be amazing to see Devdas (1955) in color?

c. Predicting the outcome of Legal Proceedings

British and American researchers have developed a system that can predict a court's decision.

d. Precision Medicine

We use Deep Learning to develop medicines that are genetically tailored to an individual's genome.

e. Automated Analysis and Reporting

Thanks to deep learning techniques, systems can now analyze data and report insights from it in natural-sounding human language.

f. Pre-Natal Care

UK and Australian researchers use image recognition and deep learning techniques to interpret scans and guide pre-operative strategies.

g. Weather Forecasting and Event Detection

Computational fluid dynamics codes are being combined with neural networks and other genetic algorithm approaches to detect cyclone activity.

h. Finance

Popular technical indicators are used to generate buy and sell signals, both for individual stocks and for portfolios of stocks.

i. Automatic Machine Translation

Deep Learning has been achieving amazing results in the following areas:

1. Automatic translation of text
2. Automatic translation of images
Convolutional neural networks are used to identify images that contain letters and to locate where the letters are in the scene.

20 Deep Learning Terminologies You Must Know

1. Objective

In this blog, we will cover commonly used neural network and deep learning terminologies. These are the most important basics to understand before tackling more complex neural network and deep learning topics.

2. Introduction to Deep Learning Terminologies

a. Recurrent Neuron


In a recurrent neuron, the output is sent back to the neuron for t time steps, so the output is fed back as input t times. Connecting these steps together looks like an unrolled neuron. The important point is that this gives us a more generalized output.

b. RNN (Recurrent Neural Network)

We use recurrent neural networks especially for sequential data, where the previous output is used to help predict the next one. In this case, the network has loops within it. In a hidden neuron, these loops have the capability to store information; for example, previous words are stored to help predict the output.

Again, the output of the hidden layer is sent through the network for t time steps, which is what the unfolded neuron represents. Once the neuron has completed all its time steps, the result is passed to the next layer. As a result, the output is more generalized, and information fetched earlier is retained for a long time.

To update the weights of the unfolded network, the error has to be propagated back through all the time steps; hence this is called backpropagation through time (BPTT).

c. Vanishing Gradient Problem

This problem arises where the gradient of the activation function is very small. During backpropagation, weights are multiplied by these low gradients; the gradients become smaller and smaller and 'vanish' as they go deeper into the network. Because of this, the neural network forgets long-range dependence, which is a problem, since such dependence is often important for the network to remember.

This can be solved by using activation functions such as ReLU, which do not have small gradients.

d. Exploding Gradient Problem

This is the opposite of the vanishing gradient problem: the gradient of the activation function is too large, which makes the weight of a particular node very high. It can be solved by clipping the gradient so that it does not exceed a certain value.

e. Pooling

Pooling layers are usually introduced between convolution layers. They are used to reduce the number of parameters and to prevent over-fitting. The most common type of pooling layer uses a filter of size (2,2) with the MAX operation: it takes the maximum of each 2 x 2 block of the original image (or feature map).
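A small NumPy sketch of 2 x 2 max pooling on an invented 4 x 4 feature map.

```python
# Max pooling sketch: take the maximum of each 2x2 block of a feature map.
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 0],
                        [7, 2, 1, 3],
                        [0, 8, 9, 4]], dtype=float)

pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 4x4 -> 2x2
print(pooled)   # [[6. 5.]
                #  [8. 9.]]
```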

f. Padding

In this process, an extra border of zeros is added around the image so that the output image has the same size as the input; this is called padding. If only the actual (valid) pixels of the image are used, it is called valid padding.

g. Data Augmentation

It refers to the addition of new data derived from the given data, which might prove beneficial for prediction.

For example:
Let us assume we have the digit "9". Its recognition can become harder if it is rotated or tilted, so adding rotated and tilted versions of the image to the training data helps increase the accuracy of our model. By rotating, we increase the quality and quantity of the data; hence this is called data augmentation.

h. Softmax

We use the softmax activation function in the output layer for classification problems. It is like the sigmoid function, with the difference that the outputs are normalized to sum to 1.

The sigmoid function works when we have a binary output; for a multiclass classification problem, softmax makes it easy to assign a value to each class, and these values can be interpreted as probabilities.

It is easy to see it this way: suppose you are trying to identify a 6 which might also look a bit like an 8. The function assigns a probability to each digit, and the highest probability is assigned to 6, with the next highest assigned to 8, and so on.
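A small NumPy sketch of the softmax function applied to invented scores for the digits 0–9, where the score for 6 is highest and the score for 8 is second highest.

```python
# Softmax sketch: turn raw class scores into probabilities that sum to 1.
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return exp / exp.sum()

# Invented scores for digits 0-9; digit 6 scores highest, digit 8 second highest.
scores = np.array([0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 2.5, 0.1, 1.8, 0.2])
probs = softmax(scores)

print(np.round(probs, 3))
print("probabilities sum to", probs.sum())           # 1.0
print("most likely digit:", int(np.argmax(probs)))   # 6
```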

i. Neural Network

Neural networks form the backbone of deep learning. Their goal is to find an approximation of an unknown function. A neural network is a combination of interconnected neurons, which have weights and a bias that are updated during training depending on the error. The activation function applies a nonlinear transformation to the linear combination of inputs, which then generates an output. The combination of the activated neurons gives the final output.

j. Input layer/ Output layer / Hidden layer

The input layer is the one that receives the input; it is the first layer of the network. The output layer is the final layer of the network. The layers in between are the hidden layers of the network: they perform tasks on the incoming data and pass the generated output on to the next layer. The input and output layers are visible, while the intermediate layers are hidden.

k. MLP (Multi-Layer perceptron)

A single neuron cannot perform highly complex tasks, so we use stacks of neurons to generate the desired outputs. The simplest network has an input layer, a hidden layer, and an output layer, each with multiple neurons. In each layer, all neurons are connected to all the neurons in the next layer; such networks are called fully connected networks.

l. Neuron

Neurons form the basic elements of the brain and, likewise, the basic structure of a neural network. When the brain receives new information, it processes it and generates an output.
A neural network works in the same way: as soon as a neuron receives an input, it processes it and generates an output, which is either sent on to other neurons for further processing or taken as the final output.

m. Weights

When an input enters a neuron, it is multiplied by a weight.

For example:

If a neuron has two inputs, each input is assigned an associated weight. The weights are initialized randomly and are updated during the model training process. After training, inputs that turn out to be more important end up with higher weights.

Let's assume the input is a and its associated weight is W1. After passing through the node, the input becomes a*W1.

n. Bias

In addition to the weight, another linear component is added to the input: the bias. The bias is added to the result of the weight multiplication in order to change the range of the weighted input, so the result looks like a*W1 + bias. It is thus part of the linear component of the input transformation.

o. Activation Function

Once the linear component has been applied to the input, a non-linear function is applied to it. This is done by applying the activation function to the linear combination, which translates the input signals into output signals.

The output after application of the activation function would look something like f(a*W1 + b), where f() is the activation function.

Consider a neuron with n inputs X1 to Xn, corresponding weights Wk1 to Wkn, and a bias bk. We first multiply each weight by its corresponding input and then add these together along with the bias; call this sum u:

u = ∑(w*x) + b

The activation function is then applied to u, and we receive the final output from the neuron as yk = f(u).
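A toy R sketch of this computation, with made-up inputs, weights and bias, and the sigmoid chosen as the activation function f:

sigmoid <- function(u) 1 / (1 + exp(-u))
x <- c(0.5, -1.2, 2.0)     # n = 3 hypothetical inputs X1..Xn
w <- c(0.4, 0.1, -0.7)     # corresponding weights Wk1..Wkn
b <- 0.2                   # bias bk
u <- sum(w * x) + b        # u = sum(w*x) + b
yk <- sigmoid(u)           # yk = f(u), the neuron's output
yk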

p. Gradient Descent

Gradient descent is an optimization algorithm used to minimize the cost.
Mathematically, to find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function.
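A minimal gradient descent sketch in R on the toy cost f(x) = (x - 3)^2, whose gradient is 2*(x - 3); the cost function, starting point and learning rate are illustrative choices:

x <- 0                       # arbitrary starting point
lr <- 0.1                    # learning rate (step size)
for (i in 1:100) {
  grad <- 2 * (x - 3)        # gradient of the cost at the current x
  x <- x - lr * grad         # step in the direction of the negative gradient
}
x                            # converges towards the minimum at x = 3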

q. Learning Rate

The learning rate is the amount by which the cost function is minimized in each iteration. It must be chosen carefully: it should not be so large that the optimal solution is missed, nor so low that the network takes forever to converge.

r. Backpropagation

When we define a neural network, we assign random weights and bias values to our nodes. Once we have received the output for a single iteration, we can calculate the error of the network.

In back-propagation the movement through the network is backwards: the error, along with the gradient, flows back from the output layer through the hidden layers, and the weights are updated.

s. Batches

When training a neural network, instead of sending the entire input in one go, we divide the input into several randomly drawn chunks of equal size, called batches. Training on batches of data also makes the model more generalized.

t. Epochs

An epoch is a single training iteration, defined in terms of batches in forward and back propagation: one epoch is a single forward and backward pass over the entire input data.

Audio Analysis Using Deep Learning

1. Objective

In this Deep Learning Tutorial, we will study audio analysis using deep learning. We will also learn data handling in the audio domain along with applications of audio processing, and we will use graphs for a better understanding of audio data analysis.

2. Introduction to Audio Analysis

We are always in contact with audio, sometimes directly and sometimes indirectly. Our brain works continuously, processing and understanding this information and providing us with information about the environment.
Sometimes we catch the audio floating around us and find something constructive in it. There are devices which help capture these sounds and represent them in a computer-readable format.
Examples of these formats are:

  • wav (Waveform Audio File) format
  • mp3 (MPEG-1 Audio Layer 3) format
  • WMA (Windows Media Audio) format

3. Data Handling in Audio Domain

Audio is an unstructured data format, so audio data needs a couple of preprocessing steps before it can be presented for analysis.

First we have to load the data into a machine-understandable format. For this, we simply take values at fixed time steps.

For example, in a 2-second audio file we might extract a value every half second. This is called sampling of the audio data, and the rate at which it is sampled is called the sampling rate.

We can also represent the data in another way, by converting it into the frequency domain. When we sample audio data in the time domain, we need many more data points to represent the whole signal, and the sampling rate has to be as high as possible. Representing the same audio data in the frequency domain requires much less computational space. To get an intuition, consider the following decomposition.


Here one audio signal is separated into three different pure signals, which can easily be represented as three unique values in the frequency domain. There are also a few more ways in which we can represent audio data for analysis, for example using MFCCs (Mel-frequency cepstral coefficients); these are simply different ways of representing the data.
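As an illustration (the three tone frequencies are arbitrary choices), a short R sketch can build a signal from three pure sine waves and recover their frequencies with the base-R fft() function:

fs <- 1000                                    # assumed sampling rate in Hz
t  <- seq(0, 1 - 1 / fs, by = 1 / fs)         # exactly one second of samples
signal <- sin(2 * pi * 50 * t) +              # 50 Hz component
          sin(2 * pi * 120 * t) +             # 120 Hz component
          sin(2 * pi * 300 * t)               # 300 Hz component
spectrum <- Mod(fft(signal))                  # magnitude spectrum
freqs <- (seq_along(signal) - 1) * fs / length(signal)
freqs[order(spectrum[1:(fs / 2)], decreasing = TRUE)[1:3]]   # the three dominant frequencies: 50, 120, 300 Hz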

Next, we extract features from these audio representations; the algorithm then works on these features and performs the task it is designed for. After extraction, the features are sent to the machine learning model for further analysis.

4. Applications of Audio Processing

  • Indexing music collections according to their audio features.
  • Recommending music for radio channels
  • Similarity search for audio files (aka Shazam)
  • Speech processing and synthesis – generating artificial voice for conversational agents

XGBoost in Machine Learning – Features & Importance

1. Objective

In this Machine Learning Tutorial, we will study what XGBoost is. We will also learn the features of XGBoost and why we need the XGBoost algorithm, and we will try to cover basic concepts such as why we use XGBoost and what makes it good.

2. Introduction to XGBoost

XGBoost is an algorithm that has recently been dominating applied machine learning. It is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost is distributed as a software library that you can download and install on your machine and then access from a variety of interfaces.

Specifically, XGBoost supports the following main interfaces:
  • Command Line Interface (CLI).
  • C++ (the language in which the library is written).
  • Python interface as well as a model in scikit-learn.
  • R interface as well as a model in the caret package.
  • Julia.
  • Java and JVM languages like Scala and platforms like Hadoop.

3. XGBoost Features

a. Model Features

XGBoost model implementation supports the features of the scikit-learn and R implementations. Three main forms of gradient boosting are supported:

Gradient Boosting

This is also called the gradient boosting machine, and it includes the learning rate.

Stochastic Gradient Boosting

This is boosting with sub-sampling at the row, column, and column-per-split levels.

Regularized Gradient Boosting

It includes boosting with both L1 and L2 regularization.

b. System Features

For use in a range of computing environments, the library provides:
  • Parallelization of tree construction using all of your CPU cores during training.
  • Distributed Computing for training very large models using a cluster of machines.
  • Out-of-Core Computing for very large datasets that don’t fit into memory.
  • Cache Optimization of data structures and algorithm to make the best use of hardware.

c. Algorithm Features

The XGBoost algorithm was designed for efficiency of computing time and memory resources, and to make the best use of available resources to train the model.

Some key algorithm implementation features include:
  • Sparse aware implementation with automatic handling of missing data values.
  • Block structure to support the parallelization of tree construction.
  • Continued training so that you can further boost an already fitted model on new data.
  • XGBoost is free open source software, available for use under the permissive Apache-2 license.

4. Why Use XGBoosting?

a. XGBoost Execution Speed

Compared to other implementations of gradient boosting, XGBoost is fast. One benchmark compared XGBoost to other implementations of gradient boosting and bagged decision trees; the results were written up in May 2015 in the blog post titled "Benchmarking Random Forest Implementations".

Moreover, the author provides all the code on GitHub and a more extensive report of results with hard numbers.

b. XGBoost Model Performance

XGBoost dominates structured (tabular) datasets on classification and regression predictive modeling problems.

The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

5. Why is it so good?

a. Flexibility

XGBoost supports user-defined objective functions for classification, regression and ranking problems. An objective function is used to measure the performance of the model given a certain set of parameters. Furthermore, it supports user-defined evaluation metrics as well.

b. Availability

It is available for programming languages such as R, Python, Java, Julia, and Scala.

c. Save and Reload

We can easily save our data matrix and model and reload them later. For example, if we have a large dataset, we can simply save the model and use it in the future instead of wasting time redoing the computation.

XGBoost Algorithm – Applied Machine Learning

1. XGBoosting

In this Machine Learning Tutorial, we will cover an introduction to XGBoost, coding the XGBoost algorithm, advanced functionality of XGBoost, and its general parameters, booster parameters, linear booster specific parameters, and learning task parameters. Furthermore, we will study building models with XGBoost and its parameters.

2. Introduction to XGBoost Algorithm

XGBoost is an algorithm that has recently been dominating applied machine learning. It is an implementation of gradient boosted decision trees designed for speed and performance. It is distributed as a software library that you can download and install on your machine and then access from a variety of interfaces.

3. XGBoost Algorithm working With Main Interfaces

  • C++, Java and JVM languages.
  • Julia.
  • Command Line Interface.
  • Python interface along with integrated model in scikit-learn.
  • R interface as well as a model in the caret package.

4. Preparation of Data for using XGBoost Algorithm

Let's assume you have a dataset named 'campaign' and want to convert all categorical variables, except the response variable, into flag (dummy) variables. Here is how you do it:

sparse_matrix <- sparse.model.matrix(response ~ .-1, data = campaign)

Now let’s break down this code as follows:

"sparse.model.matrix" is the command, and all other inputs inside the parentheses are parameters.

"response" on the left of the formula tells the statement to leave the response variable out of the model matrix.
"-1" removes the extra (intercept) column which this command would otherwise create as the first column.
And finally, you specify the dataset name.

To convert the target variable as well, you can use the following code:
output_vector = df[,response] == "Responder"

Here is what the code does:

create output_vector and set it to 0
set output_vector to 1 for rows where the response is "Responder"
return output_vector

5. Building Model – Xgboost AlgorithmR

Here are simple steps you can use to crack any data problem using xgboost Algorithm:

Step 1: Load all the libraries

library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)

Step 2: Load the dataset

(Here I use bank data where we need to find whether a customer is eligible for a loan or not.)
set.seed(100)
setwd("C:\\Users\\ts93856\\Desktop\\datasource")
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
# Loading labels of train data
labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]
# combine train and test data
df_all = rbind(df_train, df_test)

Step 3: Data Cleaning & Feature Engineering

# clean variables: here I clean people with age less than 14 or more than 100
df_all[df_all$age < 14 | df_all$age > 100, 'age'] <- -1
df_all$age[df_all$age < 0] <- mean(df_all$age[df_all$age > 0])
# one-hot-encoding categorical features
ohe_feats = c('gender', 'education', 'employer')
dummies <- dummyVars(~ gender + education + employer, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[, -c(which(colnames(df_all) %in% ohe_feats))], df_all_ohe)
df_all_combined$agena <- as.factor(ifelse(df_all_combined$age < 0, 1, 0))
I am using a list of variables in "features_selected" to be used by the model. I have shared a quick and smart way to choose variables later in this article.
df_all_combined <- df_all_combined[, c('id', features_selected)]
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id, ]
y <- recode(labels$labels, "'True'=1; 'False'=0")
X_test = df_all_combined[df_all_combined$id %in% df_test$id, ]


Step 4: Tune and Run the model

xgb <- xgboost(data = data.matrix(X[,-1]),
label = y,
eta = 0.1,
max_depth = 15,
nround=25,
subsample = 0.5,
colsample_bytree = 0.5,
seed = 1,
eval_metric = "merror",
objective = "multi:softprob",
num_class = 12,
nthread = 3
)
Step 5: Score the Test Population
And that's it! You now have an object "xgb" which is an XGBoost model. Here is how you score a test population:
# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test[,-1]))

6. Xgboost Algorithm – Parameters

a. General Parameters

Following are the General parameters used in Xgboost Algorithm:
  • silent: The default value is 0. You need to specify 0 for printing running messages, 1 for silent mode.
  • booster: The default value is gbtree. You need to specify the booster to use: gbtree (tree based) or gblinear (linear function).
  • num_pbuffer: This is set automatically by xgboost Algorithm, no need to be set by a user. Read the documentation of xgboost for more details.
  • num_feature: This is set automatically by xgboost Algorithm, no need to be set by a user.


b. Booster Parameters
Below we discuss the tree-specific booster parameters in the XGBoost algorithm:
  • eta: The default value is set to 0.3. You need to specify the step size shrinkage used in an update to prevent overfitting. After each boosting step, we can directly get the weights of new features; eta shrinks the feature weights to make the boosting process more conservative. The range is 0 to 1. A low eta value means the model is more robust to overfitting.
  • gamma: The default value is set to 0. You need to specify the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the gamma, the more conservative the algorithm will be. The range is 0 to ∞.
  • max_depth: The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to ∞.
  • min_child_weight: The default value is set to 1. You need to specify the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process will give up further partitioning. In linear regression mode, this simply corresponds to a minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. The range is 0 to ∞.
  • max_delta_step: The default value is set to 0. This is the maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when a class is extremely imbalanced. Setting it to a value of 1-10 might help control the update. The range is 0 to ∞.
  • subsample: The default value is set to 1. You need to specify the subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees, which helps prevent overfitting. The range is 0 to 1.
  • colsample_bytree: The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1.


c. Linear Booster Specific Parameters

These are Linear Booster Specific Parameters in XGBoost Algorithm.
  • lambda and alpha: These are regularization terms on weights. The default value of lambda is 1 and of alpha is 0.
  • lambda_bias: L2 regularization term on bias and has a default value of 0.


d. Learning Task Parameters

Following are the learning task parameters in the XGBoost algorithm (a short sketch showing how all of these parameters can be passed to the model together follows this list):
  • base_score: The default value is set to 0.5. You need to specify the initial prediction score of all instances, global bias.
  • objective: The default value is set to reg:linear. You need to specify the type of learner you want. That includes linear regression, Poisson regression etc.
  • eval_metric: You need to specify the evaluation metrics for validation data. And a default metric will be assigned according to the objective.
  • seed: As always here you specify the seed to reproduce the same set of outputs.
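As a hedged sketch, these parameters can be collected into a params list and passed to xgb.train(); the values below simply mirror the call shown earlier in this article and are illustrative rather than recommendations:

params <- list(booster = "gbtree",
               eta = 0.1, gamma = 0, max_depth = 15,
               min_child_weight = 1, subsample = 0.5, colsample_bytree = 0.5,
               objective = "multi:softprob", num_class = 12,
               eval_metric = "merror", base_score = 0.5)
dtrain <- xgb.DMatrix(data.matrix(X[, -1]), label = y)   # X and y as built in the previous section
set.seed(1)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 25, nthread = 3)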

7. Advanced functionality of XGBoost Algorithm

XGBoost is simple to use in comparison with other machine learning techniques. If you did everything we have done till now, you already have a model.

Let’s take it one step further and try to find the variable importance in the model and subset our variable list.

# Lets start with finding what the actual tree looks like
model <- xgb.dump(xgb, with.stats = T)
model[1:10] #This statement prints top 10 nodes of the model
# Get the feature real names
names <- dimnames(data.matrix(X[,-1]))[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])
#In case last step does not work for you because of a version issue, you can try following :
barplot(importance_matrix[,1])

AdaBoost Algorithm For Machine Learning

1. Objective

Through this Machine Learning Tutorial, we will study boosting with the AdaBoost algorithm. We will also try to cover every concept related to adaptive boosting, with an AdaBoost example.

2. What is AdaBoost

First of all, AdaBoost is short for Adaptive Boosting. AdaBoost was the first really successful boosting algorithm developed for binary classification, and it is the best starting point for understanding boosting. Moreover, modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

Generally, AdaBoost is used with short decision trees. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas instances that are easy to predict are given less weight.

3. Learning – AdaBoost Model

Learn AdaBoost Model from Data
  • AdaBoost is best used to boost the performance of decision trees, and this is based on binary classification problems.
  • AdaBoost was originally called AdaBoost.M1 by the authors. More recently it may be referred to as discrete AdaBoost, because it is used for classification rather than regression.
  • AdaBoost can be used to boost the performance of any machine learning algorithm, but it is best used with weak learners.


Each instance in the training dataset is weighted. The initial weight is set to:
weight(xi) = 1/n
Where xi is the i’th training instance and n is the number of training instances.

4. How To Train One Model

A weak classifier is prepared on the training data using the weighted samples. Only binary classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class.
The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:
error = (N – correct) / N
Where error is the misclassification rate, correct is the number of training instances predicted correctly by the model, and N is the total number of training instances.

Example 1

If the model predicted 78 of 100 training instances correctly, the error would be (100 – 78) / 100 = 0.22.
This is modified to use the weighting of the training instances:
error = sum(w(i) * terror(i)) / sum(w)
which is the weighted sum of the misclassification rates,
where w(i) is the weight for training instance i and
terror(i) is the prediction error for training instance i, which is 1 if misclassified and 0 if correctly classified.

Example 2

Suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2,
the predicted values were -1, -1 and -1, and
the actual output variables in the instances were -1, 1 and -1; then the terrors would be 0, 1, and 0.
The misclassification rate would be calculated as:
error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)
or
error = 0.704
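Reproducing Example 2 in R:

w      <- c(0.01, 0.5, 0.2)      # instance weights
terror <- c(0, 1, 0)             # 1 if misclassified, 0 otherwise
error  <- sum(w * terror) / sum(w)
round(error, 3)                  # 0.704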
A stage value is calculated for the trained model. As it provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:
stage = ln((1-error) / error)
Where stage is the stage value used to weight predictions from the model. Also, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight.
The training weights are then updated, giving more weight to incorrectly predicted instances and less weight to correctly predicted instances.

Example 3

the weight of one training instance (w) is updated using:
w = w * exp(stage * terror)
Where w is the weight for a specific training instance,
exp() is the numerical constant e (Euler's number) raised to a power,
stage is the stage value of the weak classifier (as defined above) and
terror is the error the weak classifier made predicting the output, evaluated as:
terror = 0 if(y == p), otherwise 1
Where y is the output variable for the training instance and p is the prediction from the weak learner.
This has the effect of leaving the weight unchanged if the training instance was classified correctly, and making the weight larger if the weak learner misclassified the instance.
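A small R sketch of the stage value and weight update; the error of 0.22 is taken from Example 1 above, and the instance weight of 0.2 is an arbitrary illustrative value:

error <- 0.22
stage <- log((1 - error) / error)     # ln((1 - error) / error), about 1.27
w <- 0.2                              # weight of one training instance
w * exp(stage * 1)                    # misclassified (terror = 1): weight grows to about 0.71
w * exp(stage * 0)                    # correctly classified (terror = 0): weight stays 0.2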

5. AdaBoost Ensemble

  • Basically, weak models are added sequentially, trained using the weighted training data.
  • Generally, the process continues until a pre-set number of weak learners have been created.
  • Once completed, you are left with a pool of weak learners each with a stage value.

6. Making Predictions with AdaBoost

Predictions are made by calculating the weighted average of the weak classifiers. For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learner's stage value. The prediction for the ensemble model is taken as the sum of the weighted predictions. If the sum is positive, the first class is predicted; if negative, the second class is predicted.

For example-
5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks like the model will predict a value of 1.0, or the first class. But suppose these same 5 weak classifiers have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0, or the second class.
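The same calculation in R:

preds  <- c(1.0, 1.0, -1.0, 1.0, -1.0)   # predictions of the 5 weak classifiers
stages <- c(0.2, 0.5, 0.8, 0.2, 0.9)     # their stage values
sum(stages * preds)                      # -0.8, so the ensemble predicts the second class (-1.0)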

7. Data Preparation for AdaBoost

This section lists some heuristics for best preparing your data for AdaBoost.
Quality Data:
Because the ensemble method attempts to correct misclassifications in the training data, you need to be careful that the training data is of high quality.
Outliers:
Outliers will force the ensemble down the rabbit hole of working hard to correct cases that are unrealistic. These could be removed from the training dataset.
Noisy Data:
Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean such noise from your training dataset.


Top 50 Machine Learning Interview Questions & Answers

1) What is Machine learning?

Machine learning is a branch of computer science which deals with system programming in order to automatically learn and improve with experience. For example, robots are programmed so that they can perform tasks based on data they gather from sensors; they automatically learn programs from data.

2) Mention the difference between Data Mining and Machine learning?

Machine learning relates to the study, design and development of algorithms that give computers the capability to learn without being explicitly programmed. Data mining can be defined as the process of extracting knowledge or unknown interesting patterns from unstructured data; machine learning algorithms are used during this process.

3) What is ‘Overfitting’ in Machine learning?

In machine learning, 'overfitting' occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is normally observed when a model is excessively complex, because it has too many parameters with respect to the amount of training data. A model that has been overfit exhibits poor predictive performance.

4) Why overfitting happens?

The possibility of overfitting exists as the criteria used for training the model is not the same as the criteria used to judge the efficacy of a model.

5) How can you avoid overfitting ?

Overfitting can be avoided by using a lot of data; it happens mainly when you have a small dataset and try to learn from it. But if you have a small dataset and are forced to build a model based on it, you can use a technique known as cross validation. In this method the dataset is split into two sections, testing and training datasets: the testing dataset only tests the model, while the training dataset is used to build the model.

In this technique, a model is usually given a dataset of known data on which training is run (the training dataset) and a dataset of unknown data against which the model is tested. The idea of cross validation is to define a dataset to "test" the model in the training phase.
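A hedged sketch of a simple train/test split and k-fold cross validation with the caret package (the iris dataset and the rpart method are illustrative choices, not part of the original answer):

library(caret)
set.seed(100)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]                                 # training dataset
test_df  <- iris[-idx, ]                                # testing dataset, held back from training
ctrl     <- trainControl(method = "cv", number = 5)     # 5-fold cross validation
fit      <- train(Species ~ ., data = train_df, method = "rpart", trControl = ctrl)
mean(predict(fit, test_df) == test_df$Species)          # accuracy on the held-out test set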

6) What is inductive machine learning?

Inductive machine learning involves the process of learning by example, where a system tries to induce a general rule from a set of observed instances.

7) What are the five popular algorithms of Machine Learning?

  1. Decision Trees
  2. Neural Networks (back propagation)
  3. Probabilistic networks
  4. Nearest Neighbor
  5. Support vector machines

8) What are the different Algorithm techniques in Machine Learning?

The different types of techniques in Machine Learning are:

a) Supervised Learning
b) Unsupervised Learning
c) Semi-supervised Learning
d) Reinforcement Learning
e) Transduction
f) Learning to Learn

9) What are the three stages to build the hypotheses or model in machine learning?

  1. Model building
  2. Model testing
  3. Applying the model

10) What is the standard approach to supervised learning?

The standard approach to supervised learning is to split the set of examples into the training set and the test set.

11) What is ‘Training set’ and ‘Test set’?

In various areas of information science, such as machine learning, a set of data used to discover a potentially predictive relationship is known as the 'training set'. The training set is the set of examples given to the learner, while the test set is used to test the accuracy of the hypotheses generated by the learner; it is the set of examples held back from the learner. The training set is distinct from the test set.

12) List down various approaches for machine learning?

The different approaches in Machine Learning are
  • Concept Vs Classification Learning
  • Symbolic Vs Statistical Learning
  • Inductive Vs Analytical Learning

13) What is not Machine Learning?

  1. Artificial Intelligence
  2. Rule based inference

14) Explain what is the function of ‘Unsupervised Learning’?

  1. Find clusters of the data
  2. Find low-dimensional representations of the data
  3. Find interesting directions in data
  4. Interesting coordinates and correlations
  5. Find novel observations/ database cleaning

15) Explain what is the function of ‘Supervised Learning’?

  1. Classifications
  2. Speech recognition
  3. Regression
  4. Predict time series
  5. Annotate strings

16) What is algorithm independent machine learning?

Machine learning whose mathematical foundations are independent of any particular classifier or learning algorithm is referred to as algorithm independent machine learning.

17) What is the difference between artificial learning and machine learning?

Designing and developing algorithms that learn behaviours based on empirical data is known as machine learning. Artificial intelligence, in addition to machine learning, also covers other aspects such as knowledge representation, natural language processing, planning, robotics, etc.

18) What is classifier in machine learning?

A classifier in a Machine Learning is a system that inputs a vector of discrete or continuous feature values and outputs a single discrete value, the class.

19) What are the advantages of Naive Bayes?

A Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. The main disadvantage is that it cannot learn interactions between features.

20) In what areas Pattern Recognition is used?

Pattern Recognition can be used in
  1. Computer Vision
  2. Speech Recognition
  3. Data Mining
  4. Statistics
  5. Information Retrieval
  6. Bio-Informatics

21) What is Genetic Programming?

Genetic programming is one of the two techniques used in machine learning. The model is based on testing and selecting the best choice among a set of results.

22) What is Inductive Logic Programming in Machine Learning?

Inductive Logic Programming (ILP) is a subfield of machine learning which uses logic programming to represent background knowledge and examples.

23) What is Model Selection in Machine Learning?

The process of selecting models among different mathematical models, which are used to describe the same data set is known as Model Selection. Model selection is applied to the fields of statistics, machine learning and data mining.

24) What are the two methods used for the calibration in Supervised Learning?

The two methods used for predicting good probabilities in Supervised Learning are
  • Platt Calibration
  • Isotonic Regression

These methods are designed for binary classification; extending them beyond that setting is not trivial.

25) Which method is frequently used to prevent overfitting?

When there is sufficient data ‘Isotonic Regression’ is used to prevent an overfitting issue.

26) What is the difference between heuristic for rule learning and heuristics for decision trees?

The difference is that the heuristics for decision trees evaluate the average quality of a number of disjoint sets, while rule learners only evaluate the quality of the set of instances that is covered by the candidate rule.

27) What is Perceptron in Machine Learning?

In machine learning, the perceptron is an algorithm for supervised classification of an input into one of two possible outputs; it is a binary classifier.

28) Explain the two components of Bayesian logic program?

A Bayesian logic program consists of two components. The first component is a logical one; it consists of a set of Bayesian clauses, which capture the qualitative structure of the domain. The second component is a quantitative one; it encodes the quantitative information about the domain.

29) What are Bayesian Networks (BN) ?

A Bayesian network is used to represent the graphical model of the probability relationships among a set of variables.

30) Why instance based learning algorithm sometimes referred as Lazy learning algorithm?

Instance based learning algorithms are also referred to as lazy learning algorithms because they delay the induction or generalization process until classification is performed.

31) What are the two classification methods that SVM ( Support Vector Machine) can handle?

  • Combining binary classifiers
  • Modifying binary to incorporate multiclass learning

32) What is ensemble learning?

To solve a particular computational program, multiple models such as classifiers or experts are strategically generated and combined. This process is known as ensemble learning.

33) Why ensemble learning is used?

Ensemble learning is used to improve the classification, prediction, function approximation etc of a model.

34) When to use ensemble learning?

Ensemble learning is used when you build component classifiers that are more accurate and independent from each other.

35) What are the two paradigms of ensemble methods?

The two paradigms of ensemble methods are
  • Sequential ensemble methods
  • Parallel ensemble methods

36) What is the general principle of an ensemble method and what is bagging and boosting in ensemble method?

The general principle of an ensemble method is to combine the predictions of several models built with a given learning algorithm in order to improve robustness over a single model. Bagging is an ensemble method for improving unstable estimation or classification schemes, while boosting methods are applied sequentially to reduce the bias of the combined model. Both boosting and bagging can reduce errors by reducing the variance term.

37) What is bias-variance decomposition of classification error in ensemble method?

The expected error of a learning algorithm can be decomposed into bias and variance. A bias term measures how closely the average classifier produced by the learning algorithm matches the target function. The variance term measures how much the learning algorithm’s prediction fluctuates for different training sets.

38) What is an Incremental Learning algorithm in ensemble?

Incremental learning is the ability of an algorithm to learn from new data that may become available after a classifier has already been generated from an existing dataset.

39) What is PCA, KPCA and ICA used for?

PCA (Principal Components Analysis), KPCA ( Kernel based Principal Component Analysis) and ICA ( Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.
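A minimal illustration of PCA for dimensionality reduction with base R's prcomp(), using the built-in iris data as a stand-in dataset:

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)                 # proportion of variance explained by each principal component
reduced <- pca$x[, 1:2]      # the data projected onto the first two components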

40) What is dimension reduction in Machine Learning?

In machine learning and statistics, dimension reduction is the process of reducing the number of random variables under consideration; it can be divided into feature selection and feature extraction.

41) What are support vector machines?

Support vector machines are supervised learning algorithms used for classification and regression analysis.

42) What are the components of relational evaluation techniques?

The important components of relational evaluation techniques are
  • Data Acquisition
  • Ground Truth Acquisition
  • Cross Validation Technique
  • Query Type
  • Scoring Metric
  • Significance Test

43) What are the different methods for Sequential Supervised Learning?

The different methods to solve Sequential Supervised Learning problems are
  • Sliding-window methods
  • Recurrent sliding windows
  • Hidden Markov models
  • Maximum entropy Markov models
  • Conditional random fields
  • Graph transformer networks

44) What are the areas in robotics and information processing where sequential prediction problem arises?

The areas in robotics and information processing where sequential prediction problem arises are


  • Imitation Learning
  • Structured prediction
  • Model based reinforcement learning

45) What is batch statistical learning?

Statistical learning techniques allow learning a function or predictor from a set of observed data that can make predictions about unseen or future data. These techniques provide guarantees on the performance of the learned predictor on the future unseen data based on a statistical assumption on the data generating process.

46) What is PAC Learning?

PAC (Probably Approximately Correct) learning is a learning framework that has been introduced to analyze learning algorithms and their statistical efficiency.

47) What are the different categories into which you can categorize the sequence learning process?

  • Sequence prediction
  • Sequence generation
  • Sequence recognition
  • Sequential decision

48) What is sequence learning?

Sequence learning is a method of teaching and learning in a logical manner.

49) What are two techniques of Machine Learning ?

The two techniques of Machine Learning are
  • Genetic Programming
  • Inductive Learning

50) Give a popular application of machine learning that you see on day to day basis?

The recommendation engines implemented by major e-commerce websites use machine learning.
