%\title{Deep Learning}
%% 2012/12/27
%% by the Captain 
% OPTIONAL PACKAGES
%\documentclass[journal]{IEEEtran}
\documentclass[12pt,journal,compsoc]{IEEEtran}
%\usepackage{ifpdf}
\usepackage{cite}
\ifCLASSINFOpdf
  \usepackage[pdftex]{graphicx}
  % \graphicspath{{../pdf/}{../jpeg/}}
  % \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
  \usepackage[dvips]{graphicx}
  % \graphicspath{{../eps/}}
  % \DeclareGraphicsExtensions{.eps}
\fi
%\usepackage[cmex10]{amsmath}
%\usepackage{algorithmic}
\usepackage{amsfonts}
%\usepackage{subfig}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{multirow}
 \usepackage[table,xcdraw]{xcolor}
% \usepackage{graphicx}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{listings}
%\ifCLASSOPTIONcompsoc
%  \usepackage[caption=false,font=normalsize,labelfont=sf,textfont=sf]{subfig}
%\else
%  \usepackage[caption=false,font=footnotesize]{subfig}
%\fi
%\usepackage{fixltx2e}
%\usepackage{stfloats}
% \usepackage{dblfloatfix}
%\ifCLASSOPTIONcaptionsoff
%  \usepackage[nomarkers]{endfloat}
% \let\MYoriglatexcaption\caption
% \renewcommand{\caption}[2][\relax]{\MYoriglatexcaption[#2]{#2}}
%\fi
\usepackage[hyphens]{url}
\begin{document}
\title{Deep Learning}
\author{Nolan~Reis}
\markboth{nreis, May~2016}%
{Reis: Deep Learning}
\date{\normalsize\today}
\IEEEtitleabstractindextext{%
\begin{abstract}
Deep learning is a fast-growing field in tech that is often described as having limitless potential.  This paper describes its history, the reasons for its explosion in popularity, and how it works.  An example of classifying images of handwritten digits (MNIST) is explored using a fully connected network and a convolutional neural network.  Next, the tools a reader needs to implement his or her own network are briefly described.  Finally, the paper surveys the state of the art being developed by companies such as Google, Facebook, and Baidu.
\end{abstract}}
\maketitle
\IEEEpeerreviewmaketitle
\IEEEdisplaynontitleabstractindextext
\section{Trendy}
\IEEEPARstart{D}{eep} learning (DL) is one of the hottest terms in tech right now.  Companies like Facebook, Google, YouTube, Tesla, Spotify, Yelp, and Microsoft are investing heavily in this tool.  So what is it?
The secret is that deep learning is just a re-branding of artificial neural networks (ANN), which have been around since the 1960s.  
\section{History}
The earliest deep-learning-like algorithms were invented by Ivakhnenko and Lapa in 1965.  A lot of work and innovation happened in the 1980s (Fukushima's convolutional neural networks) and 1990s (LeCun's LeNet).\cite{NvidiaHistory}  Many of these techniques are still used today.  However, back then computers were slow and data sets were tiny.  Researchers did not find many applications for neural networks (NN), so research dropped off during the 2000s.  It was not until the last few years that NNs made a resurgence.
The big shift was due to increased computational power and increased data.  First was the introduction of the graphics processing unit (GPU).  GPUs increased the computational processing speed by a factor of 1000 in the span of 10 years.\cite{NvidiaHistory}  The second reason for NN's comeback was the exponential rise in data.  Technology has allowed us to store more data and the internet has allowed us to share that data.  
The combination of exponential technology and data has allowed deep learning to break record after record.
\begin{itemize}
  \item \emph{Speech Recognition}: In 2009, Microsoft and the University of Toronto improved speech recognition by 30\% using DL.\cite{NyTimes}
  \item \emph{Computer Vision}: There is a yearly competition called ImageNet where teams compete to classify a library of 14 million images into 20,000 categories.  In 2012, Alex Krizhevsky and Geoff Hinton submitted a deep learning algorithm, AlexNet, which achieved an error rate of 15\% (a 40\% improvement over the previous state of the art). \cite{AlexNetCompetition}
  \item \emph{Drug Discovery}: In 2011, Geoff Hinton and a team from the University of Toronto won the ``Merck Molecular Activity Challenge'' for automatic drug discovery.  They used deep learning to determine which molecule was most likely to be an effective drug agent.  Remarkably, nobody on the team had a background in chemistry, biology, or life science, and they did it in two weeks.   \cite{NyTimes}
\end{itemize}
\section{Core Concepts}
\subsection{Machine Learning 101}
Machine learning is a subfield of computer science in which computers learn from experience instead of being explicitly programmed. First we take some data, train a model on that data, then use that model to make predictions on brand new data.  Training is analogous to how humans learn.  The model is exposed to new data, makes a prediction, and gets feedback about how accurate its prediction was.  It uses that feedback to correct errors inside the model.\footnote{This feedback-driven approach is called supervised learning.  In contrast, unsupervised learning eliminates the feedback portion and looks for underlying structure in unlabeled data.} This process is repeated step by step, over multiple passes through the entire data set.
The input data has $n$ observations and $m$ features.  Features are attributes about the input data.  For example, take a bank.  There are $n$ customers who have $m$ features such as \textit{does someone have a checking account?  How much money is in that account?}    
Machine learning models have the ability to predict either continuous values (\textit{How much will someone spend per month?}) or classify $k$ discrete values (\textit{will someone open a savings account?}).   These discrete values are referred to as \emph{classes}.  In the classification example, the output has $k = 2$ classes (\textit{yes} or \textit{no}). This paper will focus on classification.
\subsection{Feature Engineering}
\emph{Feature engineering} is the process of using domain knowledge of the data to extract useful features or patterns that make machine learning easier.  For example, say you were training a model to predict whether a photo was taken indoors or outside.  You know that the sky is blue, so the percentage of blue pixels might be a good indicator (feature).  By engineering that feature ahead of time, the model does not have to learn that the sky is in fact blue.  This reduces the number of possibilities the model needs to consider: it works with a single number (the percentage of blue pixels) instead of first having to discover whether the sky is blue, green, or white.  Examples of feature extractors for images are SIFT, HOG, and RIFT.
While feature engineering is still a very important skill, it has its drawbacks.  It requires expert knowledge of the problem, it can be very problem specific, and it takes a lot of hand tuning, which is time consuming.
\subsection{Feature Learning}
\emph{Feature learning} is the process in which the algorithm autonomously finds distinguishing patterns, extracts them, and then feeds them to the classification layer.  In other words, feature learning is feature engineering done automatically by algorithms.  In deep learning, convolutional neural networks (ConvNets) form a hierarchy of abstraction that grows in complexity (blobs$\rightarrow$edges$\rightarrow$eyes, noses, ears$\rightarrow$face), see Fig \ref{fig:FeatLearn}.  The final layer takes these generated features and uses them for classification.\cite{NvidiaConcepts}
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.85\linewidth]{hierarchical_features.png}
 \caption{Hierarchical features learned by a deep learning algorithm\cite{NvidiaConcepts}}
 \label{fig:FeatLearn}
 \end{center}
 \end{figure}
\section{Logistic Classifier}
The next two sections explain the theory behind deep learning by starting with a logistic classifier and evolving it into a deep network.  The purpose of a logistic classifier is to predict a categorical class given input data: \textit{Is this an image of a 5 or a 4?}
Throughout all of deep learning, the fundamental ingredients are (a) data, (b) structure, (c) loss, and (d) an optimizer.
\subsection{Data}
As with all supervised machine learning algorithms, it is important to split the data into three sets: training, validation, and testing.  A common split is 70\% training, 20\% validation, and 10\% testing.
The training and validation sets are used during training.  The training set is used to adjust the weights of the model.  The validation set does not update the weights; it is used to check that the model is not overfitting.  \emph{Overfitting} occurs when a model is overly complex: it has superfluous freedom to align with the specific data it was trained on.
Overfitting can be seen in the following analogy.  A student, analogous to our network model, takes two exams of the same subject repeatedly.  Over many trials the student will improve.  However, if the student's accuracy increases on Test 1 but not on Test 2, then he may be memorizing the answers, not learning the material.  The same is true for our model.  It could be overfitting the training data and not learning the underlying relationship.
The test set is new, unseen data that is only used for testing the final model's predictive power.  To follow the student analogy, the test set is the real world.  
\subsection{Structure}
\subsubsection{Neuron or Node}
The basic building block of a network is the neuron or node (Fig \ref{fig:neuron}).  It takes some input data, applies a linear function to those inputs by calculating a weighted sum, and applies an activation function to that sum.    
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.8\linewidth]{neuron.png}
 \caption{Neuron or node: Basic unit of Deep Learning}
 \label{fig:neuron}
 \end{center}
 \end{figure}
 
 The linear function is defined as 
\begin{equation}
WX+b = Y
\end{equation}
where $X$ denotes an input vector, $W$ denotes a matrix of weights, $b$ denotes the biases, and $Y$ denotes the \emph{scores} or \emph{logits}.  The training happens by trying to find the weights and biases that are good at predicting the correct class.  
For example, take a model that is trying to learn handwritten digits, with an image of a handwritten ``5'' as its input.  The linear function (Fig \ref{fig:Lin}) takes that input and outputs logits.  At first these outputs do not mean much.  The task is to determine the probability that the image belongs to each class (digit).  The way to turn logits into probabilities is to apply a softmax as our activation function, see Fig \ref{fig:SftMax}.
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.6\linewidth]{LinFunc.png}
 \caption{Linear Function }
 \label{fig:Lin}
 \end{center}
 \end{figure}
 
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.6\linewidth]{sftMax.png}
 \caption{To turn logits into probabilities, the activation function was chosen to be softmax}
 \label{fig:SftMax}
 \end{center}
 \end{figure}
 
The softmax function outputs the probability that the image belongs to each class (the most likely class is close to 1 and the less likely ones are close to 0).  The technique of one-hot encoding is used to turn each label into a class-membership vector.  This vector has the value 1 for the correct class and 0 for the remaining entries.  In the above example, the five is the correct label, so the one-hot encoded vector is [0,0,1].
There are now two vectors, one from the classifier (the probabilities) and one that represents the correct label (encoded vector). 
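As an illustration of this step, the softmax of a logit vector $Y$ can be written as
\begin{equation}
S(Y)_{k} = \frac{e^{Y_{k}}}{\sum_{j} e^{Y_{j}}}
\end{equation}
where the notation $S(Y)_{k}$ for the $k$-th output is introduced here for convenience.  For a hypothetical set of three logits $Y = (2.0, 1.0, 0.1)$, chosen only for illustration and not taken from the figures, this gives
\begin{equation}
S(Y) \approx (0.66, 0.24, 0.10)
\end{equation}
The outputs are positive and sum to 1, so they can be read as class probabilities.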
\subsection{Loss}
For the feedback in the model to work, there must be a metric of success.  The distance between the two vectors (the predicted probabilities $S(Y)$ and the one-hot label $L$) is measured with the cross entropy, Eq \ref{Eq:X-entropy}. The goal is a low distance when the model assigns a high probability to the correct class and a high distance when it does not.
\begin{equation}\label{Eq:X-entropy}
D(S(Y),L) = -\sum_{k} L_{k}\log\left(S(Y)_{k}\right)
\end{equation}
The training loss, Eq \ref{Eq:TrainingLoss}, is defined as the average cross entropy over all $N$ examples in the training set, indexed by $i$. A good model has a low training loss.
\begin{equation}\label{Eq:TrainingLoss}
\mathfrak{L} = \frac{1}{N}\sum_{i} D\left(S(WX_{i}+b),L_{i}\right)
\end{equation}
The loss is a function of the weights and the biases, so we are going to minimize that function using an optimizer.\cite{Udacity} 
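As a worked sketch of the cross entropy, assume a hypothetical softmax output $S(Y) = (0.7, 0.2, 0.1)$ and a one-hot label $L = (1, 0, 0)$; these numbers are illustrative only, and the natural logarithm is assumed.  Only the entry where $L_{k} = 1$ contributes, and the cross entropy carries its usual negative sign so that the distance is positive:
\begin{equation}
D = -\ln(0.7) \approx 0.36
\end{equation}
Had the correct class been the third one, $L = (0, 0, 1)$, the same prediction would give
\begin{equation}
D = -\ln(0.1) \approx 2.30
\end{equation}
so a prediction that assigns little probability to the correct class is penalized much more heavily.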
\subsection{Optimizer}
One of the most popular optimization techniques in machine learning is Stochastic Gradient Descent (SGD).  It takes small steps along the loss surface, moving against the gradient (downhill) until it finds a minimum.  It is called stochastic because each step estimates the gradient from a small random sample of the training data rather than from the entire set.  Recall that the gradient is the multivariate slope of a function.  The size of the step is called the \emph{learning rate}.  The bigger the learning rate, the faster the model learns, but it may overshoot and never reach the minimal loss.  In practice, SGD is performed over multiple passes through the data set, called epochs.
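One step of this procedure can be sketched as follows, where the symbol $\alpha$ for the learning rate is notation introduced here and the gradient is assumed to be estimated on a small random batch of training examples:
\begin{equation}
W \leftarrow W - \alpha \frac{\partial \mathfrak{L}}{\partial W}
\end{equation}
\begin{equation}
b \leftarrow b - \alpha \frac{\partial \mathfrak{L}}{\partial b}
\end{equation}
As a toy illustration (not the classifier above), take a one-parameter loss $f(w) = w^{2}$ with slope $2w$.  Starting from $w = 1$ with $\alpha = 0.1$, one step gives
\begin{equation}
w \leftarrow 1 - 0.1 \times 2 = 0.8
\end{equation}
which moves $w$ toward the minimum at $w = 0$.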
SGD is popular in machine learning because it scales well with data and model size.  However, it comes with additional hyper-parameters.  These differ from the ordinary parameters (weights and biases) that the model optimizes.  Examples of hyper-parameters that the user must tune are:
\begin{itemize}
\item Learning Rate initialization
\item Learning Rate decay
\item Weight initialization
\item Number of Epochs 
\end{itemize}
\subsection{Summary}
To summarize, we have created a linear model that outputs probabilities [structure].  We evaluate how the model is doing by calculating the cross entropy [loss] and use SGD [optimizer] to minimize that loss.  It is still a shallow model, but these are the fundamental tools for going deeper.   
\section{Deep Learning}
\subsection{MultiLayer Perceptron [MLP]}
To turn the logistic classifier into a network, a second neuron is linked between the current neuron and the input (Fig \ref{fig:basic2L}).  This is called a two-layer Neural Network (the input layer is not counted).   
Layers are the highest-level building block of a network.  The new layer is called the hidden layer because its output values are not directly visible at the network output.  The hidden layer gives the model the opportunity to represent the data in a simpler way.
The depth of the network is defined by the number of hidden layers. 
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.8\linewidth]{basic_2layer.png}
 \caption{Basic two layer Neural Network}
 \label{fig:basic2L}
 \end{center}
 \end{figure}
In addition to layers, the number of nodes per layer can increase as well.  The number of nodes on a layer represents the degrees of freedom of that layer.
When the output of every node on one layer is connected to the input of every node on the next layer, the network is called \emph{Fully Connected} or \emph{Dense}.  
The size of a network is defined by the number of layers and the number of nodes or parameters.  Fig \ref{fig:NN2} has 2 layers, 4+2=6 nodes (not counting the inputs), $[3\times4]+[4\times2] = 20$ weights, and 4+2=6 biases, for a total of 26 learnable parameters.  Fig \ref{fig:NN3} has 3 layers, 4+4+1=9 nodes, $[3\times4]+[4\times4]+[4\times1] = 32$ weights, and 9 biases, for a total of 41 learnable parameters.
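The counts above follow a general pattern, sketched here with notation that does not appear in the figures.  If layer $l$ has $n_{l}$ nodes and $n_{0}$ is the number of inputs, a fully connected network with $K$ layers has
\begin{equation}
P = \sum_{l=1}^{K} \left( n_{l-1} n_{l} + n_{l} \right)
\end{equation}
learnable parameters, where the first term counts weights and the second counts biases.  For a network with 3 inputs, 4 hidden nodes, and 2 outputs this gives
\begin{equation}
P = (3 \cdot 4 + 4) + (4 \cdot 2 + 2) = 26
\end{equation}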
Modern convolutional neural networks can have on the order of 100 million parameters and around 20 layers (hence deep learning).
\begin{figure}[!h]
\centering
\begin{subfigure}{.4\linewidth}
  \includegraphics[width=\linewidth]{NN_2layer.jpeg}
  \caption{2-Layer}
  \label{fig:NN2}
\end{subfigure}%
\hfill
\begin{subfigure}{.6\linewidth}
  \includegraphics[width=\linewidth]{layerView.jpg}
  \caption{3-Layer}
  \label{fig:NN3}
\end{subfigure}
\caption{Two Fully connected Neural Networks\cite{Stanf}}
\label{fig:test}
\end{figure}
Evolving the structure from a single node into a network has given the model more opportunities to represent the data in a simpler way (layers) and more degrees of freedom (nodes).  However, the model is still linear. Because a composition of linear transformations is itself linear (superposition), stacking 100 purely linear layers can be simplified to a single layer. The solution is to introduce non-linear functions.
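The collapse of stacked linear layers can be checked directly.  As a sketch, take two linear layers with weights $W_{1}$, $W_{2}$ and biases $b_{1}$, $b_{2}$ (subscripts introduced here for illustration):
\begin{equation}
W_{2}(W_{1}X + b_{1}) + b_{2} = W'X + b'
\end{equation}
where $W' = W_{2}W_{1}$ and $b' = W_{2}b_{1} + b_{2}$.  The result has the same form as a single layer, and repeating the argument shows that any number of stacked linear layers reduces in the same way.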
\subsection{Non-Linearities}
To preserve the network's structure (and the benefits gained with this structure), each hidden layer is given a non-linear activation function.  By adding non-linearity, the entire model is now non-linear and cannot be simplified down to a single transformation.  This creates a hierarchy of abstraction that grows in complexity with every layer.\cite{NvidiaConcepts} \cite{DataWknd}    
This is the foundation for building deep models.    
%Non-linear transformations increase the complexity of the relationships.  In DL, this creates increasingly complex features with every layer.  In contrast, stacking 100 purely linear transformations can be simplified to a single layer.  That is, even when multiple node layers are added, because of superposition, the layers can be rearranged into a single mapping.  For nonlinear layers, the mapping is non-separable - forcing different levels of complexity to be modeled.  
There are multiple types of non-linear activation functions: softmax, sigmoid (logistic), tanh, and the rectified linear unit [ReLU].  A ReLU is a very simple, very powerful non-linearity.  Its output is linear for $x$ greater than zero and zero everywhere else (Fig \ref{fig:ReLU}). Since it was popularized around 2012, ReLU has become the most popular non-linearity because it is far less prone to the vanishing gradient problem than the sigmoid and tanh functions.\cite{DataWknd}
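Written out, the ReLU described above is
\begin{equation}
f(x) = \max(0, x)
\end{equation}
Its slope is 1 for $x > 0$ and 0 for $x < 0$.  By comparison, the slope of the logistic sigmoid never exceeds $1/4$, so a gradient passed back through many sigmoid layers tends to shrink, which is one common explanation for the difference in training behavior.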
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.6\linewidth]{Relu.jpeg}
 \caption{Rectified Linear Unit is the most popular nonlinear function.  }
 \label{fig:ReLU}
 \end{center}
 \end{figure}
\subsection{Summary}
By constructing this MLP network we have given the model a better structure:
\begin{itemize}
\item Hidden Layers - number of moves to figure out a simpler way to represent the data
\item Number of nodes per hidden layer - the degrees of freedom for that move
\item Non-linearities allow increasing feature complexity with each layer  
\end{itemize}
We then told the network to learn the best parameters in order to correctly classify the input.
This is the core of deep learning.\cite{Udacity} \cite{DataWknd} \cite{playground}
\section{Convolutional Neural Networks}
Deep networks are powerful but can quickly increase in complexity.  Return to the example of classifying handwritten digits.  If the input image is 32x32 pixels with 3 color channels and the network has 2 fully connected layers with 2 outputs (similar to Fig \ref{fig:NN2}), then there would be a total of about 9.4 million learnable parameters (assuming a hidden layer as wide as the 3072-value input).  That is a lot of parameters for a small image and a simple structure.  To help out the model, the user can use his or her domain knowledge (the fact that it is an image).
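The figure of 9.4 million can be reproduced with a short calculation, under the assumption (not stated explicitly in the text) that the hidden layer has as many nodes as the input.  The input has
\begin{equation}
32 \times 32 \times 3 = 3072
\end{equation}
values, so the first layer alone contributes
\begin{equation}
3072 \times 3072 = 9{,}437{,}184
\end{equation}
weights.  The output layer adds $3072 \times 2 = 6144$ weights and the two layers add $3072 + 2 = 3074$ biases, which leaves the total at roughly 9.4 million.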
Take an image of a cat.  It does not matter where in the image the cat is; it is still an image of a cat.  This is called translational invariance.  Identifying invariant structure is a key aspect of machine learning because it is a direct path to efficient learning.  In a fully connected network, the model learns one set of weights for cats in the right corner and a different set for cats in the left corner.  Instead, the user would like the model to learn features once and share the weights across the entire image.  This weight sharing is achieved with \emph{convolutions}.
\subsection{Convolutions}
Fig \ref{fig:CNN} is an example of an input image ($X$)  with a cat in it.  The image has a width, height, and a depth (represented by the RGB colors).  Take a small patch of the image and run a tiny neural network on it with $k$ outputs.  Sliding that patch across the entire image creates a new image with a new width, height, and a depth of $k$.  If the patch was the size of the original image, it would be no different than a fully connected layer.  However, by sweeping a smaller patch across the image  there are fewer weights and the weights are shared across space.  
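To make the saving concrete, here is a rough count for a hypothetical convolution; the patch size, output depth, stride, and image size below are illustrative choices, not values taken from the figure.  A $p \times p$ patch over an input of depth $d$ with $k$ outputs has
\begin{equation}
P = p \cdot p \cdot d \cdot k + k
\end{equation}
parameters (weights plus one bias per output).  With $p = 5$, $d = 3$, and $k = 16$,
\begin{equation}
P = 5 \cdot 5 \cdot 3 \cdot 16 + 16 = 1216
\end{equation}
regardless of the image size.  If the patch moves one pixel at a time over a $32 \times 32$ image without padding, the new width and height are $32 - 5 + 1 = 28$.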
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.7\linewidth]{CNN.png}
 \caption{Sketch of how a Convolution passes over an image \cite{Udacity}}
 \label{fig:CNN}
 \end{center}
 \end{figure}
\subsection{Network}
Convolutions are stacked on top of each other to form a convolutional pyramid, Fig \ref{fig:ConvNet}.  The layers progressively squeeze the spatial dimensions, while increasing the network depth.  The depth can be thought of as the semantic representation.  At the end, a fully connected classifier is attached.  Through training, these convolutional layers form the hierarchy of abstraction seen in Fig \ref{fig:FeatLearn}. \cite{Udacity} \cite{DataWknd} 
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.85\linewidth]{CNN_stack.png}
 \caption{Structure of a ConvNet \cite{Udacity}}
 \label{fig:ConvNet}
 \end{center}
\end{figure}
\section{Example: MNIST}
The challenge of classifying handwritten digits is a classic machine learning problem.  The dataset used is called MNIST and it was one of the first real world problems solved by neural networks.  
\subsection{Data}
MNIST contains 60,000 training images and 10,000 test images of handwritten digits from 500 different writers.  Each image is a grey scale 28x28 pixel image.  10\% of the training data was reserved for validation.  
\subsection{Structure}
Two structures were evaluated: MultiLayer Perceptron [MLP] and a ConvNet.
\subsubsection{MLP}
A 2-Layer Perceptron [MLP] was built with fully connected layers. The input was an image flattened to a vector of length 784 (28x28).  This input was fully connected to a hidden layer with 512 nodes with a ReLU activation function.  The hidden layer was fully connected to the output layer of length 10 (0-9) with a softmax.\footnote{Dropout was applied to combat overfitting}
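For a sense of scale, the size of this MLP can be worked out from the layer widths given above (dropout adds no learnable parameters).  The hidden layer has
\begin{equation}
784 \times 512 + 512 = 401{,}920
\end{equation}
parameters and the output layer has
\begin{equation}
512 \times 10 + 10 = 5{,}130
\end{equation}
for a total of 407,050.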
\subsubsection{ConvNet}
The input image was kept in its original form, 1x28x28.  It was passed through two convolutional layers, which squeezed it to a shape of 32x14x14.  That output was fed into the same MLP classifier as described above.
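As a rough consistency check, and assuming the $32 \times 14 \times 14$ output is flattened before entering the 512-node hidden layer, the classifier input has
\begin{equation}
32 \times 14 \times 14 = 6272
\end{equation}
values, so its first fully connected layer would hold
\begin{equation}
6272 \times 512 + 512 = 3{,}211{,}776
\end{equation}
parameters.  The parameter counts of the convolutional layers themselves depend on patch sizes that are not specified here.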
\subsection{Loss and Optimizer}
Both models defined the loss as cross entropy and used RMSprop as their optimizer (RMSprop is a variant of SGD with an adaptive learning rate).
\subsection{Results}
The results are shown in Table \ref{tbl:MNIST} below.  Both algorithms performed very well. Out of the 10,000 test images, the MLP misclassified 188.  The ConvNet misclassified only 93; it made half as many errors in half the number of epochs.  Fig \ref{fig:MNIST} shows the loss curves for each, suggesting that neither model overfit the training data.  The time per epoch is listed because this example was run on the author's personal computer (2014 13in Macbook Pro running OS X 10.11.4, 2.6GHz 8GB, Intel Iris 1536MB).  At the time of purchase, the author had no intention of demanding more than basic performance from his machine.  Had this experiment been run on a GPU, training would likely have been about an order of magnitude faster.
\begin{table}[h!]
\centering
\caption{Results from Classifying MNIST}
\label{tbl:MNIST}
\resizebox{\linewidth}{!}{%
\begin{tabular}{ccc|ccc}
\hline
\multicolumn{1}{l}{} & \multicolumn{1}{l}{} & \multicolumn{1}{l|}{} & \multicolumn{3}{c}{Accuracy}            \\ \hline
                     & Epochs               & Time per epoch [s]    & Training set & Validation set & Test set \\ \cline{2-6} 
MLP                  & 10                   & 10                    & 0.9822       & 0.9828         & 0.9812   \\
CNN                  & 5                    & 360                   & 0.9928       & 0.9915         & 0.9907   \\ \hline
\end{tabular}%
}
\end{table}
\begin{figure}[!h]
\centering
\begin{subfigure}{.5\linewidth}
  \includegraphics[width=\linewidth]{MLP.png}
  \caption{MLP}
  \label{fig:MLP}
\end{subfigure}%
\hfill
\begin{subfigure}{.5\linewidth}
  \includegraphics[width=\linewidth]{CNN_loss.png}
  \caption{ConvNet}
  \label{fig:CNNloss}
\end{subfigure}
\caption{Loss as a function of epoch.  The fact that the training and validation losses converge is evidence that the model is not overfitting}
\label{fig:MNIST}
\end{figure}
\section{Tools}
The good news is that the field of DL is exploding and a majority of it is open-source.  The bad news is that this field is in its infancy so there are a lot of options and it is difficult to configure your system.  
 
\subsection{Programming}
Deep learning is mostly programmed in Python or C++. Since it is just math, one can program all of the operations from scratch.  However, many common functions have been built into open-source toolkits.  The community has not consolidated yet, so there are over 50 different toolkits, each with its advantages and disadvantages.  The most popular include
\begin{itemize}
\item TensorFlow
\item Keras
\item Theano
\item Caffe
\item Torch
\item CNTK
\end{itemize}
Google open-sourced its toolkit, TensorFlow, in November 2015, and it has gained a lot of traction in the six months since its release.  It is a very powerful toolkit on which Google is reportedly building many of its own algorithms.
The best place to start is a toolkit called Keras.  It is built to run on top of TensorFlow or Theano.  The purpose of this toolkit is to enable fast, easy prototyping.  While it is not as flexible as the lower-level options, it allows beginners to get their hands dirty quickly.\footnote{In this author's experience, creating an environment inside Anaconda was the most successful way to get started.}
In addition to toolkits, pre-trained networks are usually open-sourced.  The authors of winning networks, such as AlexNet (the first ConvNet to win the ImageNet competition), release their learned parameters and structure.
\subsection{Transfer Learning}
One of the continuing limitations of DL is data.  While the amount of data is increasing exponentially, good, clean data is hard to come by.  Large corporations like Facebook or Google can pay for manual labeling of data, but everyone else uses shared data sets like MNIST over and over again.  Few DL models are trained from scratch.  It is logical to assume that this would stall innovation; however, this is not true.  It has been shown that ConvNets trained on large data sets can learn generic features and be repurposed for other, smaller data sets. \cite{lin2015learning}
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.8\linewidth]{alexnet_small.png}
 \caption{The structure of AlexNet  \cite{AlexNet}}
 \label{fig:AlexNet}
 \end{center}
\end{figure}
\subsubsection{Fine Tuning}
Take a ConvNet pre-trained on ImageNet and cut off the last fully connected layer (which classifies the 1000 classes defined by ImageNet), replacing it with a new classifier for the new dataset.  Retrain the ConvNet by fine-tuning the existing weights on this new dataset. Essentially, the original weights are used as an initialization for the new task.  The motivation behind this strategy is that the lowest levels of the ConvNet contain generic features (edge or blob detectors) that can be useful for many tasks; however, later layers become progressively more specific to the nuances of the classes in the original dataset.   For example, ImageNet has a number of dog breeds, so AlexNet likely has a number of later filters that can distinguish between the breeds. \cite{Stanf}
\subsection{Technology}
As previously stated, the exponential rise in computational power has enabled DL to grow at an incredible rate.  In the 2000s, researchers recognized that the GPU inside gaming computers was perfect for quickly multiplying very large matrices.  They originally rode the rise of gaming computers, but lately hardware companies such as Nvidia have taken notice of DL and begun building chips specifically designed for it.  Choosing GPU specifications is beyond the scope of this report, but there are many resources for those who are interested. \cite{GPUs}
It is this author's opinion that a reader planning to do deep learning should spend time researching a suitable computer.  While it may be possible to train models using CPUs, it is not practical.  Even basic GPUs are 10x faster, which means faster iterating.
Another option is Amazon Web Services [AWS].  A user can rent time on Amazon's GPUs to run an algorithm for a couple of hours.  This is great for testing models out without the investment in hardware.
Currently, most DL is done on clusters of GPUs by big companies and requires the cloud.  Movidius is trying to change that.  Last month it announced its Fathom Neural Compute Stick, a modular deep learning accelerator in the form of a standard USB stick.  This chip allows DL to be embedded in new places such as robots, drones, cameras, and VR.  The goal is to be able to add a visual cortex to any device. This will take the learning out of the cloud and allow the devices to be more natively intelligent.\cite{Thumb1} \cite{Thumb2}
\section{State of the Art}
Here is a snapshot of the state of the art at the time this report was written. (note: new advances are announced almost every week)
\subsection{Google}
\subsubsection{Google Photos [Sept 2014]} 
Google Photos is a downloadable app that stores the user's photos in the cloud and learns from them.  It is built off Google's first-place finish in the 2014 ImageNet competition: GoogLeNet.\cite{GPhotos}  The program learns to recognize your friends through facial recognition and also learns about photo context, which allows for easy searching.
\subsubsection{Deep Dream [July 2015]}
The Google engineers wanted a way to visualize what the network was detecting in the middle layers of GoogLeNet, so they invented a technique called Inceptionism.  The network is fed an arbitrary image and asked to enhance whatever it detected on a selected layer.  Each layer deals with a different level of abstraction, so lower layers tend to produce strokes (Fig \ref{fig:DreamLines}) while higher layers identify more sophisticated features (even objects), see Fig \ref{fig:DreamSky} and Fig \ref{fig:DreamAnimals}.
\begin{quote}
This creates a feedback loop: if a cloud looks a little bit like a bird, the network will make it look more like a bird. This in turn will make the network recognize the bird even more strongly on the next pass and so forth, until a highly detailed bird appears, seemingly out of nowhere.\cite{GDream}
\end{quote}
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.85\linewidth]{Dream_lines.png}
 \caption{Visualization of a lower layer produces strokes \cite{GDream}}
 \label{fig:DreamLines}
 \end{center}
\end{figure}
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.85\linewidth]{Dream_sky.png}
 \caption{Visualization of a higher level produces more complex objects \cite{GDream}}
 \label{fig:DreamSky}
 \end{center}
\end{figure}
\begin{figure}[!h]
 \begin{center}
 \includegraphics[width=.9\linewidth]{DreamFunny-Animals.png}
 \caption{Zoomed-in view of the sky visualization in Fig \ref{fig:DreamSky} shows the advanced objects \cite{GDream}}
 \label{fig:DreamAnimals}
 \end{center}
\end{figure}
Due to the images' trippy, psychedelic nature, the engineers joke that this might be what a computer's daydreams look like.  Google has open-sourced the code, DeepDream, so that anyone can create their own images.
Building off DeepDream, a paper \cite{artifyPaper} was published that uses a ConvNet to separate an image into style and content.  This allows the creation of new images that combine the style of one image with the content of another (Fig \ref{fig:Artify}).
\begin{figure}[h!]
 \begin{center}
 \includegraphics[width=.7\linewidth]{Gart.jpg}
 \caption{Using a NN to cross a photo with a painting style: for example, Neil deGrasse Tyson in the style of Kandinsky's Jaune Rouge Bleu. \cite{GArt}}
 \label{fig:Artify}
 \end{center}
\end{figure}
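One way to see how style can be separated from content is the Gram matrix of a layer's feature maps, a correlation statistic commonly used as the style representation in neural style transfer.  The sketch below is a minimal illustration under stated assumptions: each feature map is assumed to be already flattened into a list with one activation per spatial position, and the function name \texttt{gram\_matrix} and the sample numbers are hypothetical.

\begin{lstlisting}[language=Python]
def gram_matrix(feature_maps):
    """Correlations between every pair of feature maps.

    feature_maps: list of channels, each a flat list
    with one activation per spatial position.
    """
    return [[sum(a * b for a, b in zip(f_i, f_j))
             for f_j in feature_maps]
            for f_i in feature_maps]


# Two hypothetical feature maps over four positions.
features = [[1.0, 2.0, 0.0, 1.0],
            [0.0, 1.0, 3.0, 1.0]]

# The same activations with the positions rearranged.
order = [2, 0, 3, 1]
moved = [[f[i] for i in order] for f in features]

print(gram_matrix(features))  # [[6.0, 3.0], [3.0, 11.0]]
print(gram_matrix(features) == gram_matrix(moved))  # True
\end{lstlisting}

Because the Gram matrix does not change when spatial positions are rearranged, it records which features occur together (textures, brush strokes) while discarding where they occur; the spatial layout is left to the content representation.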
\subsubsection{PlaNet [February 2016]}
Google created a network, PlaNet, that has the ``superhuman'' ability to determine the location of almost any image.  They trained a deep learning network to work out the location of a photo using only the image's pixels.  Their approach was to divide the world into a grid of 26,000 squares.  The size of each square varied with the number of images taken in that location: big cities had fine-grained grids, while oceans and the poles were ignored.  The data set consisted of 126M photos with Exif geolocations mined from all over the web (training: 91M; validation: 34M).  To test the model's localization accuracy, they fed it 2.3M geotagged Flickr photos from across the world, see Fig \ref{fig:PlaNet}.
\begin{quote}
PlaNet is able to localize 3.6\% of the images at street-level accuracy and 10.1\% at city-level accuracy. 28.4\% of the photos are correctly localized at country level and 48.0\% at continent level. \cite{PlaNet}
\end{quote}
\begin{figure}[h!]
 \begin{center}
 \includegraphics[width=.8\linewidth]{PlaNet.jpg}
 \caption{Google's PlaNet geolocating a picture in one of its 26,000 zones \cite{PlaNet_paper}}
 \label{fig:PlaNet}
 \end{center}
\end{figure}
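The adaptive grid can be sketched as a recursive subdivision: a cell is split into four while it holds too many photos, and cells with too few photos are dropped.  This is a simplified sketch on a plain latitude/longitude rectangle, not the partitioning code used for PlaNet; the function \texttt{partition}, the thresholds \texttt{max\_photos} and \texttt{min\_photos}, and the sample coordinates are all assumptions made for illustration.

\begin{lstlisting}[language=Python]
def partition(photos, bounds, max_photos, min_photos,
              max_depth=10):
    """Split bounds into cells sized by photo density.

    photos: list of (lat, lon) pairs.
    bounds: (lat_min, lat_max, lon_min, lon_max).
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    inside = [(lat, lon) for lat, lon in photos
              if lat_min <= lat < lat_max
              and lon_min <= lon < lon_max]
    if len(inside) < min_photos:
        return []        # too few photos: drop the cell
    if len(inside) <= max_photos or max_depth == 0:
        return [bounds]  # small enough: keep as one cell
    lat_mid = (lat_min + lat_max) / 2.0
    lon_mid = (lon_min + lon_max) / 2.0
    children = [
        (lat_min, lat_mid, lon_min, lon_mid),
        (lat_min, lat_mid, lon_mid, lon_max),
        (lat_mid, lat_max, lon_min, lon_mid),
        (lat_mid, lat_max, lon_mid, lon_max),
    ]
    cells = []
    for child in children:
        cells.extend(partition(inside, child, max_photos,
                               min_photos, max_depth - 1))
    return cells


# Hypothetical photo locations: two dense clusters,
# two photos in a sparse region, one isolated photo.
photos = [
    (40.7, -74.0), (40.8, -73.9), (40.6, -74.1),
    (40.7, -73.8),
    (34.0, -118.2), (34.1, -118.3), (33.9, -118.1),
    (34.0, -118.4),
    (10.0, 20.0), (12.0, 22.0),
    (-60.0, -150.0),
]
world = (-90, 90, -180, 180)
cells = partition(photos, world, max_photos=4,
                  min_photos=2)
print(len(cells))  # 3
\end{lstlisting}

In this toy run the quadrant containing the two dense clusters is split into smaller cells, the sparse quadrant stays as one large cell, and the quadrant holding the single isolated photo is discarded.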
\subsubsection{AlphaGo [March 2016]}
In March 2016, Google's AlphaGo won 4--1 against Lee Sedol, the legendary Go champion of the last decade.
Go is a 2,500-year-old Chinese game.  Players take turns placing black or white stones on the board, trying to capture the opponent's stones or surround empty territory.  The game is incredibly complex, with more possible positions than there are atoms in the universe.  Go is a googol times more complex than chess.  The game is played primarily through intuition and feel.
In 1997, IBM's Deep Blue beat the world chess champion using a brute-force tree search that evaluated enormous numbers of possible positions.  This is not possible with Go.  Instead, AlphaGo combined Monte Carlo Tree Search with deep neural networks to play out the rest of the game in its imagination.  Not only was the match one-sided, it came 10 years before experts had predicted a computer would win at Go. \cite{AlphaGo_paper} \cite{AlphaGo} \cite{AlphaGo_blog}
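The idea of playing out the rest of the game can be illustrated with plain Monte Carlo playouts on a much smaller game.  The sketch below is not AlphaGo's algorithm: it uses uniformly random playouts with no search tree and no neural network, and the game (a hypothetical take-away game in which players remove 1 to 3 stones and whoever takes the last stone wins) is chosen only because it fits in a few lines.  Each candidate move is scored by the fraction of random playouts that end in a win.

\begin{lstlisting}[language=Python]
import random


def legal_moves(stones):
    """Take 1, 2, or 3 stones, but no more than remain."""
    return [n for n in (1, 2, 3) if n <= stones]


def random_playout(stones, rng):
    """Play random moves to the end of the game.

    Returns True if the player to move at the start
    takes the last stone (and therefore wins).
    """
    player_to_move_wins = False
    my_turn = True
    while stones > 0:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            player_to_move_wins = my_turn
        my_turn = not my_turn
    return player_to_move_wins


def estimate_moves(stones, n_playouts=2000, seed=0):
    """Estimate each move's win rate by random playouts."""
    rng = random.Random(seed)
    win_rate = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            win_rate[move] = 1.0  # took the last stone
            continue
        # After our move the opponent is the one to move.
        losses = sum(random_playout(remaining, rng)
                     for _ in range(n_playouts))
        win_rate[move] = 1.0 - losses / float(n_playouts)
    return win_rate


rates = estimate_moves(5)
best = max(rates, key=rates.get)
# Taking 1 stone scores highest (leaves the opponent 4).
print(best)
\end{lstlisting}

Monte Carlo Tree Search refines this idea by building a search tree that concentrates playouts on the most promising moves, and AlphaGo additionally uses neural networks to guide the search.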
\subsubsection{SyntaxNet [May 2016]}
Google just released SyntaxNet, an open-source neural network framework that provides a foundation for Natural Language Understanding (NLU).  Not only did they provide all the code (written in TensorFlow) needed to train models on your own data, they also included Parsey McParseface, a parser that has been pre-trained to analyze English text.
SyntaxNet is built on syntactic parsing.  Given a sentence as input, it tags each word with a part of speech that describes its syntactic function.  Then it determines the syntactic relationships between the words, which are related to the underlying meaning of the sentence.  Parsey McParseface can handle complex sentences like the one in Fig \ref{fig:Parse_long}.  This allows users to ask questions like \textit{whom did Alice see?} and \textit{when did Alice see Bob?}
\begin{figure}[h!]
 \begin{center}
 \includegraphics[width=\linewidth]{Parse_long.png}
 \caption{Parsey McParseface understanding a sample sentence  \cite{SyntaxNet}}
 \label{fig:Parse_long}
 \end{center}
\end{figure}
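To make the link between a parse and such questions concrete, the sketch below stores a dependency parse as a list of tokens, each pointing at its head word, and answers the two questions by looking up the right relation.  The parse is written by hand for the hypothetical sentence \textit{Alice saw Bob yesterday}; it is not SyntaxNet output, and the relation labels and the helper \texttt{dependents} are assumptions chosen for illustration.

\begin{lstlisting}[language=Python]
# Each token: (index, word, part of speech, head, relation).
# A head index of 0 marks the root of the sentence.
parse = [
    (1, "Alice", "NOUN", 2, "nsubj"),   # subject
    (2, "saw", "VERB", 0, "ROOT"),
    (3, "Bob", "NOUN", 2, "dobj"),      # direct object
    (4, "yesterday", "NOUN", 2, "tmod"),  # time modifier
]


def dependents(parse, head_word, relation):
    """Words attached to head_word by the given relation."""
    heads = [idx for idx, word, _, _, _ in parse
             if word == head_word]
    return [word for _, word, _, head, rel in parse
            if head in heads and rel == relation]


# Whom did Alice see?
print(dependents(parse, "saw", "dobj"))  # ['Bob']
# When did Alice see Bob?
print(dependents(parse, "saw", "tmod"))  # ['yesterday']
\end{lstlisting}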
Unfortunately, natural language is full of ambiguities.  A sentence of moderate length (20--30 words) can have tens of thousands of possible syntactic structures.  For example, the sentence in Fig \ref{fig:Parse_amb} can be read two ways.  The first, correct reading is that Alice is driving the car.  The second (absurd, but possible) reading is that the street is located in the car.  The preposition \textit{in} can modify either \textit{drove} or \textit{street}, which causes the ambiguity.  Drawing on vast experience, humans do a great job of navigating these ambiguous cases.  SyntaxNet uses deep learning to resolve them.
\begin{figure}[h!]
 \begin{center}
 \includegraphics[width=\linewidth]{Parse_ambig.png}
 \caption{An example of a prepositional phrase attachment ambiguity. \cite{SyntaxNet}}
 \label{fig:Parse_amb}
 \end{center}
\end{figure}
Parsey McParseface parses sentences from news articles with an accuracy of 94\%.  On sentences scraped from the internet, its accuracy is 90\%.
The goal of Natural Language Understanding is to make our interactions with computers more natural.  Instead of relying on memorized phrases, users will soon be able to simply talk with a computer.  \cite{SyntaxNet} \cite{ParserPaper}
\subsection{Facebook [2014]} 
Facebook AI Research (FAIR) is a research team at Facebook advancing the field of machine learning.  One of their big projects is DeepFace.  DeepFace performs facial verification (deciding whether two images show the same face), not facial recognition (putting a name to a face).
To derive a facial representation, they created a nine-layer deep neural network.  The network has more than 120 million parameters and uses several locally connected layers that, unlike standard convolutional layers, do not share weights.
Trained on a dataset of 4 million photos of 4,000 individuals, DeepFace achieved 97.25\% accuracy at predicting whether two images show the same person.  This is a remarkable result, as it is almost at human-level performance (97.53\%). \cite{FB} \cite{FB_article}
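Verification of this kind is often reduced to comparing two feature vectors: each face image is mapped to a vector by the network, and the pair is declared a match if the vectors are similar enough.  The sketch below assumes the vectors have already been computed; the three-dimensional vectors, the cosine-similarity measure, and the threshold of 0.8 are hypothetical choices for illustration and are not taken from DeepFace.

\begin{lstlisting}[language=Python]
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_person(u, v, threshold=0.8):
    """Verification: do the two vectors match?"""
    return cosine_similarity(u, v) >= threshold


# Hypothetical feature vectors (assumptions).
photo_a = [0.9, 0.1, 0.4]  # person 1
photo_b = [0.8, 0.2, 0.5]  # person 1, another photo
photo_c = [0.1, 0.9, 0.2]  # person 2

print(same_person(photo_a, photo_b))  # True
print(same_person(photo_a, photo_c))  # False
\end{lstlisting}

Note that no names are involved: the function only answers whether two images match, which is what distinguishes verification from recognition.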
\subsection{FaceYou [October 2015]}
FaceYou is an entertainment app that allows users to merge their own face with another face.  It is able to capture facial expressions and speech in real time.
Baidu researchers developed FaceYou as a demonstration to show the sophistication of deep learning on a smartphone.  Traditionally, this level of real-time face tracking was only possible on large systems used in film and animation studios.\cite{FaceYou}  Another application of this technology could be fashion.  Instead of merging two faces, a shirt or dress could be projected onto a potential buyer. 
\begin{figure}[h!]
 \begin{center}
 \includegraphics[width=.5\linewidth]{NolanSkel.JPG}
 \caption{The author with a skeleton merged onto his face}
 \label{fig:NolanSkel}
 \end{center}
\end{figure}
\section{Future of Deep Learning}
By now it is clear that deep learning has incredible potential.  But what sets DL apart from other machine learning techniques?  Why is it causing renowned scientists to make bold claims?
\begin{quote}
``Deep Learning is an algorithm which has no theoretical limitations of what it can learn; the more data you give and the more computational time you provide, the better it is'' \\
\hfill - Geoff Hinton\cite{geoff} 
\end{quote}
Andrew Ng, cofounder of Google Brain and Chief Scientist at Baidu Research, illustrates this with the graph in Fig \ref{fig:WhyDL}.  Other machine learning algorithms get better with more data, but at some point their performance plateaus.  In deep learning, the exponential growth of both data and computational power allows models to grow in structure and complexity, and thus to keep increasing their performance.
\begin{figure}[h!]
 \begin{center}
 \includegraphics[width=.8\linewidth]{WhyDL.png}
 \caption{Andrew Ng's slide about the immense potential of DL over traditional machine learning algorithms \cite{AndrewNg}}
 \label{fig:WhyDL}
 \end{center}
\end{figure}
\section{Summary}
This paper described the history of deep learning and how it is just a rebranding of artificial neural networks.  The reason for DL's recent explosion in popularity is the exponential growth of both computing power and data.
Next, we walked through the foundations of a deep network: \textit{data}, \textit{structure}, \textit{loss}, and an \textit{optimizer}.  We built a multilayer perceptron by adding nodes, layers, and non-linearities to a simple logistic classifier.  By using domain knowledge (the input was an image), we showed how convolutional neural networks form a hierarchy of abstractions that grows in complexity.
We talked about how tools like TensorFlow, transfer learning, and GPUs can greatly increase productivity when training DIY models.  We examined the state-of-the-art problems big companies are solving using deep learning.
Finally, we explained why experts claim that deep learning has limitless potential.
\ifCLASSOPTIONcaptionsoff
  \newpage
\fi
\bibliography{ref.bib}
\bibliographystyle{IEEEtran}
\newpage
\onecolumn
\appendices
\section{MLP Code for Learning MNIST}
\begin{lstlisting}[language=Python]
# ------ MNIST - MLP ------
# MNIST: grayscale hand-written digits, 28x28 pixels.
# There are 10 classes in the dataset, corresponding to the digits 0-9.
# Note: this listing uses the Keras 1.x API (e.g. nb_epoch).
import matplotlib.pyplot as plt
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adam, RMSprop
from keras.utils import np_utils
#load the data and split it into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)
#show a random example
plt.imshow(X_train[np.random.randint(len(X_train))], cmap='Greys')
## flatten the 28x28 images to a 784 dimensional vector. 
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# encode our labels as one-hot vectors.
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)
#function to plot the Training Loss
def plot_loss(hist):
    loss = hist.history['loss']
    val_loss= hist.history['val_loss']
    #plt.plot(range(len(loss)), loss)
    plt.plot(range(len(loss)), loss, 'b', val_loss, 'r')
    plt.legend(['loss','val_loss'])
## - - - Construct the MLP structure
model = Sequential()
model.add(Dense(512, input_dim=784)) #784 = 28x28 input; and 512 nodes
model.add(Activation('relu')) #define the activation func as 'RELU'
model.add(Dropout(.5)) #add dropout with a probability of 50%
model.add(Dense(10)) #add a fully connected layer with 10 outputs (0-9 digits)
model.add(Activation('softmax'))
model.summary()		#print out structure
#define LOSS and OPTIMIZER
model.compile(loss='categorical_crossentropy',	
               optimizer='rmsprop',
               metrics=['accuracy'])
history = model.fit(X_train, y_train, nb_epoch=10,
                      batch_size=128, verbose=1,
                      validation_split=0.1)
plot_loss(history)
# Final test evaluation
score = model.evaluate(X_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
\end{lstlisting}
\newpage
\section{ConvNet Python Code for Learning MNIST}
\label{App:Conv}
\begin{lstlisting}[language=Python]
# 
#  Classifying MNIST with CNNs
# 
import matplotlib.pyplot as plt
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.optimizers import SGD, RMSprop
from keras.utils import np_utils
# FORMAT INPUT DATA
# Keep each image in its original 2-D shape instead of flattening it.
# Note: the reshape below adds a dimension of size 1.
#       This is the number of **channels** in the image,
#       which is just 1 because these are grayscale images.
#       If they were color, this would be 3 for RGB.
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28)
X_test = X_test.reshape(X_test.shape[0], 1, 28, 28)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(X_train.shape)
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)
#function to plot the Training Loss
def plot_loss(hist):
    loss = hist.history['loss']
    val_loss= hist.history['val_loss']
    #plt.plot(range(len(loss)), loss)
    plt.plot(range(len(loss)), loss, 'b', val_loss, 'r')
    plt.legend(['loss','val_loss'])
from keras.layers.core import Dense, Dropout, Activation
from keras.layers import Convolution2D, MaxPooling2D, AveragePooling2D,Flatten
numb_labels = 10
#
# ### design structure of CNN###
model = Sequential()
# input_shape=(channels, rows, cols) assumes Theano-style dimension ordering
model.add(Convolution2D(32, 3, 3, border_mode='same',
                        input_shape=(1, 28, 28),
                        subsample=(1, 1), activation='relu'))
# 2x2 pooling cuts the image width and height in half
model.add(MaxPooling2D(pool_size=(2, 2), strides=None, border_mode='same'))
model.add(Convolution2D(32, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(.4))
model.add(Dense(numb_labels))
model.add(Activation('softmax'))
model.summary()
#define LOSS and OPTIMIZER
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
history = model.fit(X_train, y_train, nb_epoch=5,
                      batch_size=128, verbose=1,
                      validation_split=0.1)
plot_loss(history)
score = model.evaluate(X_test, y_test, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
# ## Saving a trained model
with open('mnist_cnn.json', 'w') as f:
    f.write(model.to_json())
model.save_weights('mnist_cnn_weights.h5')
\end{lstlisting}
\end{document}