F# digits recognisor

How can a computer recognise handwritten digits done by many people?

Get a picture of each digit written, turn the image into a long list of numbers representation colours and find a way to compare these with ones it already knows the answer to.

That’s what the digits recogniser I discuss below does.

This is the first moderately complex F# code I’ve given time to and is about 6 hours of proper coding time (conveniently ignoring twitter and google searching time!). However, I’ve highlighted each of the steps and hopefully made the code reasonably accessible to follow.

Any comments and suggestions, please find me on twitter or github. The code can be found here- LiveScript.fsx is the code file in question.

A simple project that springboards directly from a digits recogniser coding dojo found via the F# community – thank you to both!.

The code dojo format offers a scripted step by step approach to learning f# whilst attempting a moderately complex task. As my aim with F% involves machine learning and AI, this is an ideal first step.

The digits recogniser works on vectorised images of handwritten digits – the training set consists of 10,000 samples, the validation set, 50,000. Vectorised – imagine of pulling the thread from a square coloured piece of cloth until you have just a thread and no cloth: you get bits of colour on a long thread. When we vectorise an image, our thread is simply a line of numbers – an array with each number between 0 and 255 (representing 256 colours).

It is a simple example of supervised learning, As we have known data to work with. What will happen is each example in the training set is mathematically compared to an example in the validation set to find the nearest match. The calculation is a simple matrix calculation that finds the Euclidean distance between one vector and another good explanation here!. For our purposes it’s a calculation that simply tries to match on of the vectors in the training set with an unknown sample that we present to the program; in other words, one thread of numbers that looks most like another!

The steps are: for each point in vectors A and B take the difference then square it, then sum them together and take the square root, the result is the Euclidean distance.

However, this is not the only possible calculation we could apply to two vectors and we may need to try others.

The organisation of my code reflects this need by allowing for an arbitrary list of functions to be applied to a given set of data, timing and measuring each for accuracy.

The reason is it may be reasonable to implement a basic genetic algorithm, using code quotations, to generate new functions ; the only rule being they have to fit the data type expected. Using such a procedure we could use a second level of machine learning to attempt to evolve a better formula as measured by time and accuracy.

For each of the stages, I’ll provide the code and explanation next:

First, acquire the data, strip the headers with an array slice then using mapi (this allows us to get a tuple of an index label and an array of pixels to work with) pipe the data straight into an array of records.

open System
open System.IO

let trainingFile = @"C:\Dropbox\Dojo-Digits-Recognizer\Dojo\trainingsample.csv"
let validationFile = @"C:\Dropbox\Dojo-Digits-Recognizer\Dojo\validationsample.csv"

type Example = { Label:int; Pixels:int[] }

let getData file  =
     File.ReadAllLines(file).[1 ..]
         |> Array.map(fun x -> x.Split(',') |> Array.map(fun s -> (int) s)) 
         |> Array.mapi(fun i x -> {Label = i; Pixels = x})

let trainingExamples,validationExamples = getData trainingFile, getData validationFile

We then define the basic distance function that will work through the numbers in both vectors:

let distance (points1:int[],points2:int[], calcFn : int->int->float) = 
                let f (n:int) = float n
                Array.map2(calcFn) points1 points2 
                |> Array.sum |> sqrt

Next we have a simple comparison function that will use the distance function to compare a single unknown sample against all the vectors in the training set:

let findMinimumPixels s sample calcFn = 
    |> Array.map(fun x -> (x.Label,  x.Pixels), distance (x.Pixels , s.Pixels, calcFn ))
    |> Array.minBy(fun x -> snd x) |> fun x -> (fst x |> fun a -> snd(a).[0]) , s.Pixels.[0] 

Then we need the basic classifier function that will map all the unknown vectors in a required set using the findMinimumPixels function. It takes an array of integers to use as an index thus allowing us to choose how many or even which samples in the validation set to work with. Here I simply want the first 50 as a quick trial.

let classify (unknown:int[]) calcFn =
    unknown |> Array.map(fun i-> validationExamples.[i] |> fun s -> findMinimumPixels s trainingExamples calcFn) |> Array.map(fun x ->     fst x, snd x)
let dataSet = [|0 .. 50|]

Collate the results, calculate the accuracy:

let results data calcFn = classify data calcFn|> Array.map(fun x -> match x with | n, m when n = m -> 1 | _,_ -> 0) |> Array.sumBy(fun x -> float (x) / float (data.Length) * 100. ) 

Use a simplistic timer to gauge performance – optional:

let timedResult cmd =
let timer = System.Diagnostics.Stopwatch.StartNew()
printfn "%f" timer.Elapsed.TotalMilliseconds 

Define the functions we’d like to trial against – this work is not complete but I’ve setup the code to allow for more functions than Euclidean distance to be explored. The list is a simple example of using functions as data as well. It would be fairly straightforwards to utilise quotations to compile together other functions for this list, provided they matched the data type required.

let functions  = [fun p1 p2 -> (float p1  -  float p2)**2.; //find minimum euclidean distance
              fun p1 p2 -> ((float p1  -  float p2)/2.)**2.; //variation of the first one, no difference in accuracy.
              fun p1 p2 -> sqrt(float p1 * float p2)]    //A nonsense one added to test the notion of a list of functions

Finally, for each function defined try the algorithm plugging each function in and return the accuracy/timing:

for i in functions do
  timedResult (printfn "%A  accuracy achieved" (results dataSet i)) |> ignore

The code so far produces 94.06% accuracy on 100 samples, however at time of writing I confess to an index error with 500 samples. There are several issues to address to complete this code: index error @ > 500 samples, speed and the fact that sqrt is fixed rather than a parameter.

Future versions could include exploring parallel execution – trying out the functions listed in their own separate tasks for example, in addition, the function list could be populated using a genetic algorithm that tries to find a better function that fits int->int->float. The accuracy and timing could be used in combination as a form of cost function – where we hunt for the best tradeoff.

Notes on F#: I like this language, it connects well to other languages with type providers, doesn’t require excessive formality and if you can make the compiler happy then code generally just runs.Problems tend to come from thinking more than forgetting to check for null and so on. Though the trade off is making the compiler happy can be a challenge due to it’s strong type inference.

There are a lot more people and sites to mention, so I’ll limit this to a few that will lead easily to the rest: