## Wednesday, 30 March 2016

### Unsupervised Learning: Self Organizing Maps

Most machine learning techniques like logistic regression, linear regression, and neural networks do something that is called supervised learning. This means that we need to supply examples of "correct" data to the machine learning algorithm for it to learn the patterns in the data. The goal of unsupervised learning is to automatically find clusters of similar data in a huge unlabeled set of data. Humans can do this easily with some types of data. If you saw a 2D scatter plot of points you'd easily be able to identify clusters in the data by visual examination. I've always considered unsupervised learning cooler than supervised mostly because it feels like that's what a general AI should be able to do.

Self organizing maps (SOMs) are one technique used to find these clusters in the data. A self organizing map looks a little similar to a neural network at first. But the kind of operation that it does is very different. An SOM has a lattice of nodes. This lattice is usually one dimensional(arranged in a linear array) or 2 dimensional (arranged in a matrix). As the self organizing map is trained, the lattice of will be partitioned into separate classes. Since one of the main uses of self organizing maps is visualizing higher dimensional data, the lattice is is rarely more than two dimensions wide. The clusters in the higher dimensional data can be visualized on the 2D lattice.  Associated with each of the nodes in the lattice is a weight vector that has the same dimension as the input data vectors. To train the self organizing map we take each element in the data set and find the node with the weight that's closest to the element (Using the euclidean distance as the measure of closeness. Other measures can be used too.). Then we define a neighbourhood of nodes that are 'close' to this best matching unit and we adjust the weights of this entire neighbourhood of weights to be a little closer to the value of the element from the dataset. How much the weights are adjusted depends on the neighbour falloff function and on the learning rate. This can be expressed in one simple equation.
$$W_{uv}(t+1) = W_{uv}(t) + \Omega (u,v,t) \alpha (t)[x_n - W_{uv}(t)]$$
In each iteration, the learning rate and the size of the neighbourhood function is reduced by a small amount. The neighbourhood size is initially set to a large value. This allows the algorithm to initially affect the weights on a global scale and then over time adjust smaller and smaller regions of the map to train it.

In my implementation I used a Gaussian neighbourhood function whose variance decayed exponentially. The learning rate started off at 0.1 and decayed exponentially from there.

To check my implementation, I tested something that is a common use for SOMs. Clustering colors in an RGB color space. Colors are easy to visualize and I thought it would make a good demo. I first generated a random data set of bluish, reddish, greenish and dark bluish colors.

 A random sampling of 4 different colors for the dataset.
I initialized the weights randomly and started the training. For this particular experiment I used a 2D 50x50 lattice of nodes. Since each node has a weight that is a 3 dimensional vector that represents a color, we can visualize the SOM node lattice as an RGB image. Stacking together the weight visualization after each epoch (One pass through the training set) of training gave me this really beautiful looking animation! :D
 Animation that shows the self organizing map after each epoch of training.
You can clearly see the different clusters of color slowly resolving as the training happens. The initial changes are more global due to the large neighbourhood. But as the training progresses the adjustments to the weights become more and more local. There is also a very nice separation of the different colors that were present in the first image. You can see that the brighter colors seem to be congregating towards the top of the image and the darker colors towards the bottom. The darkest blue (maybe even a little black) ended up in the bottom left corner surrounded by the darker hues of red, green and blue.
 The final map after 30 epochs of training.

## Tuesday, 22 March 2016

### Machine Learning Part 2: Implementing Multi Class Logistic Regression

In my last post on machine learning I talked about how to implement simple linear regression in Python. This time I am going to implement logistic regression. This technique is used to classify input data into one of several classes.

First let's take a look at two class regression.

Let's have a set of input vectors $\{x_1, x_2, ... , x_N\}$ and a set of targets $\{t_1, t_2, t_3, ..., t_N\}$ as our training data. A logistic regression classifier defines a model $y(x,w) = h(w^Tx)$. The function $h(a)$ is a logistic sigmoid function or a "squashing" function that takes the output of the linear function $w^Tx$ and sqeezes it into a range of values from 0 to 1. Using the logistic sigmoid function allows us to interpret the output of the classifier as the probability that the input belongs to class 1 under certain assumptions.

So just like in linear regression we need to define cost function in terms of the parameters so that we have a measure of the performance of the model and a way to train the model. In the case of logistic regression there is a cost function - that is very effective and commonly used - called the cross entropy cost function. This cost function is again selected based on a probabilistic interpretation of the classifier.

$$J = -\sum_{n=1}^{N}t_n ln(y(x_n) + (1-t_n)ln(1-y(x_n))$$

The gradient of this cost function is surprisingly the same as the gradient of the square error cost function.

$$\frac{\partial J}{\partial W} = \sum_{n=1}^{N}(y(x_n) - t_n)x_n$$

The weights of the model can be adjusted using the gradient descent algorithm just like in linear regression. The classifier will generate a straight that will separate the two classes.

Multiclass logistic regression can be done using many logistic regression units if the goal is to perform multiple binary classifications. However, if you have a vector that needs to be assigned to one of K classes, the procedure is slightly different. Multiclass regression is performed using the model $y(x) = softmax(W^Tx + \vec{b})$ where x is the input vector, W is a matrix with K columns where each column is a parameter vector, $\vec{b}$ is a vector of biases and the softmax function is defined as follows.

$$softmax(a_j) = \frac{e^{a_j}}{\sum_{k=1}^{K}e^{a_k}}$$

Each of the targets to train the classifier will be an array of size K. If an input belongs to class k, the kth position of the array will be a one. The rest of the elements in the array will be zero. This is called 'one-hot encoding'. It is easy to see from the definition of the softmax function that the output of the classifier will be an array of values  that are between zero and one and that the sum of all the elements in the output array will be 1. This means that we can interpret the output array as a probability distribution that tells you the probabilities that the input vector belongs to each of the K classes.

Like in two class classification, the input data (with N vectors and dimension M) set can be arranged in a NxM array $X$ where each row is an input vector. The corresponding targets can be arranged in a matrix T that has N rows and K columns. If the input vector $x_n$ belongs to class K the element $T_{nk}$ will be one. The rest of the elements of the matrix will be zero.

The cost function is the same cross entropy function extended to work with the matrix T.
$$J = -\sum_{n=1}^{N}\sum_{k=1}^{K}T_{nk}ln(y_k(x_n))$$
However, since we are using a softmax function instead of the logistic sigmoid function, the gradient of the cost function will be different. The derivation to obtain expression for the gradient gets very hairy but the expression for the gradient is surprisingly simple.

$$\frac{\partial J}{\partial W_{uv}} = \sum_{n=1}^{N}X^T_{vn}(t_{nu} - y_{nu})$$

Once we have this gradient, gradient descent can be performed on the parameters in the same way. I gathered all the vectorized implementations of the equations I've mentioned into a classifier class in Python for ease of usage.

To test the classifier I created a scatter of normally distributed points centred around different means on a 2D plane and applied the classifier. With a little bit of help from stackoverflow I was able to figure out how to color the regions of the plot according to the trained classifier.

 This is a scatter plot of the unclassified data. The points were generated using multivariate normal distributions using different means.

 This is a plot of the error or the cost over time as the gradient descent algorithm runs.

 A plot of the decision regions learned by the classifier. If a region is shaded a particular color, that means that the classifier identifies all the points in that region as belonging to the same class.

## Saturday, 27 February 2016

### My First Build of BB-8

Being a robotics person, the part of Star Wars: The Force Awakens that gave me the most delight was BB-8! The cute little droid who rolls along on a sphere. The moment I saw it on screen I knew that I had to build it! This blogpost documents my entire build process.

I decided that the first thing I should do is come up with a good control mechanism. The first thing I thought of was a pendulum hanging from a diametric bar inside the sphere. The pendulum would be free hanging and an actuator would apply torque on it and make the sphere roll. This seemed a simple enough system to work with. So I started the process of simulating the robot and coming up with a good control system.

### The Mathy Stuff

The easiest way to model a sort of complex system like this is to use Lagrangian mechanics. Write down the difference between the kinetic and potential energies of the system, apply the Euler-Lagrange equations to get the equations of motion and plug the equations into a differential equation with initial conditions. This general procedure can be used to simulate a wide range of mechanical systems. It's an essential part of any roboticist's toolkit.

In the case of the BB-8 sphere pendulum system the equations of motion come out as:

$\frac{5}{3} m_b \ddot{x}(t)+m_p \left(l \ddot{\theta}(t) \cos (\theta (t))-l \dot{\theta}(t)^2 \sin (\theta (t))+\ddot{x}(t)\right)=\frac{u(t)}{r}$
$l m_p \left(g \sin (\theta (t))+l \ddot{\theta}(t)+\cos (\theta (t)) \ddot{x}(t)\right)=-u(t)$

Here $m_b$ is the mass of the ball, $m_p$ is the mass of the pendulum, $l$ the length of the pendulum rod and $r$ the radius of the sphere. The state of the robot is represented by the state vector $\left[ \begin{array}{c} x \\ \theta \\ \dot{x} \\ \dot{\theta} \end{array} \right]$ where $x$ is the distance the sphere has rolled and $\theta$ is the angle the pendulum makes with the vertical.

When the motor applies a torque on the pendulum, it also applies an equal and opposite torque on the shaft on the spherical body. This is the control torque we're going to use to make BB-8 roll back and forth. For the mass of the pendulum I am using another motor with a weighted disk. This motor can react against the weighted disk to make the robot turn left and right. These two mechanisms are enough to make BB-8 go anywhere! Since this robot is an open chain manipulator, the equations of motion can be written in the manipulator equation form $H(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = Bu$.

$\left[\begin{array}{cc}\frac{5}{3}m_b + m_p & m_pl\cos\theta \\ m_pl\cos\theta & m_pl^2\end{array}\right] \left[\begin{array}{c}\ddot{x} \\ \ddot{\theta}\end{array}\right] + \left[\begin{array}{cc}0 & -m_pl\dot{\theta}\sin\theta \\ 0 & 0 \end{array}\right] \left[\begin{array}{c}\dot{x} \\ \dot{\theta}\end{array}\right] + \left[\begin{array}{c}0 \\ m_pgl\sin\theta\end{array}\right] = \left[\begin{array}{c} \frac{1}{r} \\ -1 \end{array}\right] u$

This is what the system looks like without any motors.

We can easily write the manipulator equation to the standard form ( $\dot{x} = f(x,u)$ ), linearize it around any equilibrium point to get a linear state space model and implement a linear controller. I'm so glad that I had Mathematica to help me with the modeling. It made things very simple and I could get a simple LQR for controlling the sphere working without too much hassle. Here's what the motion of the sphere looks like with the controller.

### The Body

Now it's time to build! First I was stuck on what to make the sphere out of. I didn't want the robot to be too small and at first, I couldn't think of any place to find a sphere. So i started designing a spherical shell in Blender. I had to do it in parts because of size constraints on the 3D printer. This was very time consuming. And then it struck me! I could use on of these! :D

I bought a 20 cm diameter globe off amazon. The sphere is made out of ABS plastic which is really nice. I used a hacksaw to cut the sphere in half. The next challenge is to mount the motors. I tried reusing empty potato chip cans but most of them didn't have the dimensions that I wanted. So in the end I decided to 3D print a motor housing and some parts to couple the sphere to the motors. I used Blender (Open Source! Yay! :D) to design the parts.

 Initial Render of what the internals were going to look like.
However, it turned out that 3D printing parts is a bit time consuming. Also, the motor couplings were difficult to print without using a lot of support material. After a few iterations of redesigning and checking, the final motor coupling was printed. I made a decent looking motor housing out of some wood and metal L-shaped motor mounts.

 The motor assembly

### The Electronics

An Arduino Nano with an Atmega328 is the brain of the robot. A GY521 accelerometer/gyro board helps me measure the angle of the motor assembly with respect to the ground and a rotary encoder coupled to the dead axle helps me measure the speed and position of the ball. A HC05 Bluetooth module communicates with an Android app that I made using MIT App Inventor so that I can control the robot.

Putting it all together lead to this huge mess of wires.

After multiple attempts at rewiring, I got to this slightly less messy (but still quite messy) state.

 The battery is also included in this one.
Despite the messiness, it actually works! :D I made it roll around for a bit.

Since this was my first build, I decided that it should more of a proof of concept than a robust build. So I built it very fast. Now  that I know that it works, I can spend some time designing a nice PCB to make the wiring neater and more compact. I also need to fine tune the control loop so that the whole thing moves smoothly. No point in tuning it on this since I'll be rebuilding a part of it anyway.

Watch out for the next post about a more robust build!

## Friday, 20 November 2015

### Prime Spirals in Python

I watched this really fascinating video on prime spirals recently and decided to see if I could write my own code to generate them. It's not terribly efficient but it works! Code is on github! :D

### Machine Learning: Implementing Linear Regression and Gradient Descent in Python

It's been a while since I went through Andrew Ng's Machine Learning course on Coursera. The course is very good at introducing the basic concepts but it lacks the depth of knowledge that I need if I really want to understand how to implement and use some of the algorithms on my own. So I decided to work through a more difficult textbook on machine learning. One of the first things I decided to do was implement the machine learning algorithms from scratch using my language of choice: Python.

The most simple idea out there that can be called machine 'learning' is linear regression. I first learned about linear regression in high school math class. So when I saw that linear regression was the first topic in the machine learning course, I was a bit skeptical. I felt that linear regression was too simple to be called machine learning. But it turns out that linear regression is quite a powerful tool to make predictions from data. The term 'linear' in linear regression is a bit misleading because it makes you think about straight line graphs and gives you the impression that trying to fit straight lines to data can't really be that useful. The trick to levelling up the power of linear regression lies in how you choose your variables (or features)! So you can have features that are crazy functions of your initial variables.

To understand this think about how we can use a straight line to fit a set of data points that you think might be exponential in nature. Say yoususpect that output $y$ is related to the variable $x$ by $y = ke^{cx}$. You can rearrange this equation to $log(y) = cx + k_1$ by taking the natural logarithm of both sides of the equation. Now we can plot $log(y)$ vs $x$ on a graph and try to fit a straight line to it. Here the parameter that we 'learn' is $c$ and $k_1$. The variable $x$ is called a feature.

In a similar way you can add features that are functions of the existing features if you think that the relationship is not linear. In a sense we are choosing a 'basis' of functions which we can mix together in different amounts to (hopefully) accurately model the output $y$. The way a Fourier Series builds up a periodic sequence using sine waves of different frequencies is exactly the same as how linear regression works with non linear basis functions.

So I decided to make the system 'learn' the square wave in my first implementation of linear regression. First some math:

The output prediction $\hat{y}$ can is represented by the linear model below. Here 'n' is the number of feature vectors.

$$\hat{y} = \sum_{i=1}^{n} \theta_i x_i$$

To use gradient descent to train this model we need two things: a cost function and a gradient function.

The cost function is a function that takes the values of theta, and the features and the data samples and calculates a cost based on the error between the prediction that the model makes and the actual value. The most commonly used cost function is the mean square error cost function.

In this equation, m is the total number of training samples that we have and $x^{(i)}$ represents a feature vector that is ith training example. The cost function sums over all the training examples available to give one real number that represents the cost of the current values of the parameters.

$$J = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 \\ \frac{\partial J}{\partial \theta_j} = \frac{2}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j$$

The gradient of a function is a vector that points in the direction in which the function changes the most. So if I want to find the place where the cost is minimum all I have to do is to start out at a point, find the gradient at that point and go in the opposite direction. I have to keep doing this until the gradient is zero (or very close to it). Think of it like a ball rolling down slopes until it comes to rest at a place where there is no slope. So in Python I have to make a loop and each time the loop is run, I update the parameters theta using the rule. Here alpha is the learning parameter.
$$\theta_j := \theta_j - \alpha \frac{2}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j$$

So I got a large set of points that were in the shape of a square wave and I used sine wave basis functions as features. After running gradient descent for a thousand iterations I got coeffecients that were really close to the actual Fourier Series expansion of a square wave!
 The learned function superimposed over the original square wave

 How the error decreases over time.

You can find my implementation on my github page.

## Sunday, 15 November 2015

### The One Thing that my College Gets Wrong About Exams

I'll be honest. Despite it's shortcomings, many of the courses that are offered actually do contain material that I find interesting. In fact, when you're learning something at the level of a college course, there are few subjects that aren't interesting. It's like Richard Feynman said; everything is interesting if you go deep enough into it. My biggest complaint about the system isn't the lack of depth in the courses or for that matter the professor's depth of knowledge (although I have had a fair share of instructors who had a less than optimal grasp of the course material themselves). My biggest complaint is the focus of the examinations that they conduct that are supposed to evaluate how well the students know the material. The exams  - in addition to having a low correlation to the students' grasp of the material - also actively discourage them from gaining a true understanding of the course material.

Typically, the exams I've written (particularly exams of courses that are highly mathematical) test only the ability of the students to memorize two things: derivations and formulae. I'm not saying that people should be exempted from memorizing formulae. There are plenty of formulae that are so important that they must be memorized. The problem is when you have to memorize formulae that look like this:

$L_F = -10 log(\cos{\theta} \{\frac{1}{2} - \frac{1}{\pi} p(1-p^2)^{\frac{1}{2}} - \frac{1}{\pi}arcsin(p) - q[\frac{1}{\pi}y(1-y^2)^{\frac{1}{2}} + \frac{1}{\pi}arcsin(y) + \frac{1}{2}]\})$

$p = \frac{\cos{(\theta_c)}(1-\cos{(\theta)})}{\sin{(\theta_c)}\sin{(\theta)}}$

$q = \frac{\cos^3(\theta_c)}{(\cos^2(\theta_c) - \sin^2(\theta_c))^\frac{1}{2}}$

$y = \frac{\cos^2(\theta_c)(1 - \cos \theta) - \sin^2(\theta)}{\sin{(\theta_c)} \cos{(\theta_c)} \sin{(\theta)}}$

For anyone interested it's a formula that tells you the coupling loss between two fiber optic cables when they're misaligned by an angle theta. The book doesn't even attempt to derive this formula because the authors know that the derivation on its own doesn't really add to a person's understanding of fiber optics. It may be a nice exercise that tests the student's algebraic manipulation but that's pretty much it. In the real world, if there is a situation in which this formula needs to be used, it's pretty much guaranteed that the engineer will just look it up. It's completely pointless to memorize it.

Why is this focus bad? Firstly, it does not actually test understanding and leads to students not getting the score that they deserve. There have been countless instances where I knew  how to solve a problem but didn't get any credit because I had to remember some stupidly long equation. Secondly, it forces examiners to limit their questions to really simple ones because they know that remembering the formula or it's contrived derivation is half the question. So the number of questions that test my understanding of the material is next to zero. Thirdly, it actively discourages students from trying to understand the material. Once people figure out that they can get scores by just memorization, they will start focusing on memorization. Sometimes, there is so much to memorize that there isn't enough time to try and understand even if I wanted to.

If I were the person who was in charge of the setting the examinations, I would do the sensible thing and give every student a booklet full of the formulae that need to be used for the questions. This allows the questions to be far more interesting and test how the student is able to use the information at his disposal to solve actual problems that practicing engineers face. The brain is a processor with a limited amount of cache! It's not a hard disk! Introducing this simple change of focus will go a long way towards courses that are actually useful. It will also give the professors the freedom to ask tougher questions!

Now I know that this approach actually works! Before college I did my Cambridge IGCSEs and A-Levels. Both these exams were very focused on understanding rather than recall. Exams either had all the required formulae included as part of the questions or (as was the case for Further Mathematics at A-Levels) a large booklet of formulae. This didn't affect the difficulty of the exam at all. It's no use knowing formulae if you don't know how to use it. This might sound a bit strange but I actually enjoyed writing my IGCSEs and my A-Levels. The exams were designed well enough that sometimes I came out of the exam hall with a new understanding of what I had learned. I still feel that I understood and learned more in class during my high school than I did in most of my college courses. Any understanding that I have about my course material is a result of my independent reading and efforts rather than the course instruction.

Now I'm not the first person to talk about this. People have been complaining about the education system for a very very long time. And many have even tried to change it by approaching the administration. Invariably though, I've seen that - at least in my college - the people in charge are very reluctant to bring about any of these changes. The biggest roadblock is probably the fact that a lot the of the professors in my college disagree with what I have said above. They believe that memorization is very important perhaps because that is how they were taught. I've seen that things are changing, but not nearly fast enough. I don't really think that the changes that I've suggested will be implemented any time soon. So why am I writing all this down? I really just really needed to vent. XP

## Thursday, 15 October 2015

### Simulating a Double Pendulum in Mathematica

I've been playing around with Mathematica's Non-Linear Control Systems toolbox over the past few days and it's been brilliant! One of the first systems that I tried to simulate is the double pendulum since it's such a commonly used sample problem in non-linear controls.

The first step is to write down the total kinetic and potential energies of the system and find the Lagrangian. Once you have that, you just need to run the Lagrangian through Mathematica's EulerEquations function and then set up a non-linear state space model using the differential equations.

Once you have the system model it's really simple to get the response and make a simple simulation.

And out pops a wonderful demonstration of chaos! Here are are three simulations with slightly different initial conditions showing diverging trajectories.

 Initial conditions: {160, -160}
 Initial Conditions: {161, -160}

 Initial Conditions: {161,-161}

The next thing I tested was whether I could make a simple controller to stabilize this double pendulum pointing straight up. I went with a standard Linear Quadratic regulator. In this simulation there is only one actuator at the elbow joint. So this is also an example of an underactuated robotic system. Works great! :D

## Sunday, 11 October 2015

### Simulating Mechanical Systems in Mathematica

I've been trying a lot of different software for simulating mechanical systems for my project. By far, the most fun I've had is when I was simulating it on Mathematica. Mathematica is a seriously cool piece of software. I got a simulation of the simple pendulum up and running almost 3 times faster than I did when I was working with Python and Matplotlib. Although, to be fair, the time I did it in Python was the first time I was doing the simulation. So I was learning about how to do the simulation as well as trying to make it. So that might have affected the amount of time I took.