Recently several open-source tools for exploring the possibilities unlocked by machine learning have been released, and at the front of the pack is scikit-learn.
Scikit-learn gives you a set of powerful tools you can easily use to do things like make predictions about new, unlabeled data based on previously observed, labeled data.
You can, for example, take a collection of labeled data, such as email subject lines tagged "spam" or "not spam", and split it into two piles: one for training and one for testing.
Next, you "train a model" on the first pile, then test the accuracy of its predictions by feeding it the data in the second pile.
Finally, check whether the labels it predicts match the actual labels on the data. If accuracy is poor, what can be done to improve it?
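The train/test workflow described above can be sketched in a few lines of scikit-learn. The subject lines and labels below are made-up placeholder data, and the choice of a bag-of-words vectorizer with a Naive Bayes model is just one reasonable setup, not the only one:

```python
# A minimal train/test split sketch using a tiny, made-up spam dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled data: email subject lines and their labels.
subjects = [
    "WIN a FREE cruise now",
    "Lowest prices on meds, act today",
    "You have been selected for a prize",
    "Claim your reward before midnight",
    "Meeting notes from Tuesday",
    "Lunch on Thursday?",
    "Quarterly report attached",
    "Re: project timeline update",
]
labels = ["spam", "spam", "spam", "spam",
          "not spam", "not spam", "not spam", "not spam"]

# Turn the text into word-count feature vectors.
X = CountVectorizer().fit_transform(subjects)

# Split into two piles: one for training, one for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0
)

# Train a model on the first pile, then score it on the second.
model = MultinomialNB().fit(X_train, y_train)
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```

With a dataset this small the accuracy number is not meaningful; the point is the shape of the workflow: vectorize, split, fit, predict, compare.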
We released a playground project for anyone who wants to start experimenting with scikit-learn. The project uses scikit-learn to train and test a simple Nearest Neighbor classifier in just a few lines of code.
That demo loads scikit-learn's built-in Iris dataset, which contains petal and sepal measurements for three species of iris (Setosa, Versicolour, and Virginica).
The goal is to train a model that can predict the species of an iris from its petal and sepal measurements.
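A Nearest Neighbor classifier on the Iris dataset looks roughly like this (the split fraction and number of neighbors here are illustrative defaults, not values taken from the playground project):

```python
# Train and evaluate a Nearest Neighbor classifier on scikit-learn's
# built-in Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out a quarter of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Classify each test flower by majority vote among its 5 nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

score = clf.score(X_test, y_test)
print(f"accuracy: {score:.2f}")

# Predict the species of a single flower's measurements.
species = iris.target_names[clf.predict([X_test[0]])[0]]
print("predicted species:", species)
```

Because the Iris classes are well separated, even this simple model tends to score well on the held-out pile.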
You could use the project to quickly get insights from your own labeled data.