Collect your data
For this tutorial, we assume that your data has already been collected and labeled. For collecting and labeling the data, we recommend other tutorials, such as: https://towardsdatascience.com/how-to-build-a-data-set-for-your-machine-learning-project-5b3b871881ac
When working with Brainome, you’ll always be doing classification. That is, your predictor will always place its output into one of two or more categories. It’s up to you to decide what categories you need. For example, in binary classification you are asking a predictor to choose between two possible outcomes (such as Yes/No, Up/Down or Red/Green). Alternatively, you may want to perform multi-way classification (“multi-class”), in which the predictor picks one of several possible outcomes (such as Small/Medium/Large or Spring/Summer/Autumn/Winter). Brainome can deal with any number of categories as long as there are enough samples per category (we suggest at least 100) in the training data.
Whether you go with binary or multi-class, you’ll need to define your outcome categories clearly and make sure your training data is labeled accordingly. If labeling is hard because the task is difficult (as in the case of emotion classification), consider using more than one annotator to label your training data and explicitly measure the agreement between annotators to check that the labels are consistent. Be aware that your predictor can never be better than your annotation: garbage in, garbage out! Another thing to consider when creating your training data set is the balance between the number of training examples available for each output class. Ideally, the training data should be evenly balanced across all outputs (targets).
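One common way to measure agreement between two annotators is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The following is a minimal sketch in plain Python, not part of Brainome; the function name `cohen_kappa` and the sample labels are made up for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels from two annotators for six items.
annotator_1 = ["happy", "sad", "happy", "angry", "sad", "happy"]
annotator_2 = ["happy", "sad", "angry", "angry", "sad", "sad"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A kappa of 1 means perfect agreement and a value near 0 means agreement no better than chance. Low values suggest the category definitions need tightening before the data is used for training.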
Once your data is labeled, the next step is to tabularize it. For Brainome, a training data set consists of one or more columns of data that capture input variables representing your problem, combined with one column that contains the expected outcome (or “target”). The values in the output column have to match the output classes you chose above, and they have to match exactly. The input columns can be anything, as long as all the values in a column are of the same type (Brainome currently supports integers, floating point numbers and text strings). Each row represents a single training data point. Regardless of the real-world situation or simulated model from which you’re drawing your training data, what matters is that the input values in a row correspond reliably to the output value (target) of that row.
Choosing which columns to use in your initial training data set usually comes down to a combination of (1) what you think is likely to create good predictions and (2) what can readily be gathered and used. The beauty of working with Brainome is that once you’ve made an initial choice about what data to use, it’s easy and quick to figure out whether your choice was a good one. The measurements generated by Brainome give you immediate feedback on what, if anything, needs to be done to improve your choices. And because Brainome operates very quickly, executing a few iterations of the tool to improve your training data is straightforward.
One important thing to keep in mind: a column that contains a unique value in each row (for example, a database key or a timestamp) will never contribute to generalization. We therefore advise against including database keys or other unique ID columns in any table that is used for machine learning.
If you decide to include one or more columns of this sort in your training data set (perhaps for provenance or quality assurance reasons), you must be sure to tell Brainome to ignore the column, using the -ignorecolumns option (see below).
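Before training, it can help to audit the table for the two column problems described above: columns whose value is unique in every row, and columns that mix text with numbers. The sketch below uses only Python's standard library and does not call Brainome. The function names, the sample table, and the rule of skipping columns that contain floats (continuous measurements are often unique in every row without being IDs) are assumptions made for illustration, so treat the output as a prompt for manual review.

```python
import csv
import io

def infer_type(value):
    """Classify a cell as 'int', 'float', or 'str'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "str"

def audit_columns(csv_text, target):
    """Return (unique_cols, mixed_cols) for the input columns of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    unique_cols, mixed_cols = [], []
    for col in reader.fieldnames:
        if col == target:
            continue  # the target column is not an input column
        values = [row[col] for row in rows]
        types = {infer_type(v) for v in values}
        # Text mixed with numbers breaks the "one type per column" rule;
        # ints mixed with floats are simply treated as numeric.
        if "str" in types and len(types) > 1:
            mixed_cols.append(col)
        # Every value distinct and no floats: likely a key or ID column.
        all_distinct = len(rows) > 1 and len(set(values)) == len(values)
        if all_distinct and "float" not in types:
            unique_cols.append(col)
    return unique_cols, mixed_cols

# Hypothetical table: 'id' is a database key, 'height_cm' has a bad cell.
sample = "\n".join([
    "id,height_cm,color,label",
    "1001,170.5,red,Yes",
    "1002,165.0,blue,No",
    "1003,abc,red,Yes",
    "1004,180.2,green,No",
])
unique_cols, mixed_cols = audit_columns(sample, target="label")
print("candidate columns to ignore:", unique_cols)
print("columns with mixed types:", mixed_cols)
```

Columns reported as unique are candidates for removal or for the -ignorecolumns option. Columns reported as mixed usually point to data entry errors that should be cleaned up first.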
There are no minimum or maximum requirements for how many rows to include in your training data set. However, if you are doing multi-class prediction, we recommend that you start with at least 100 rows (samples) per output class.
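The per-class guidance can be checked with a quick count of the target column. This is a minimal sketch in plain Python, independent of Brainome. The function name `class_report`, the sample targets, and the use of a largest-to-smallest ratio as a rough balance indicator are illustrative assumptions; the threshold of 100 is the suggestion given above.

```python
from collections import Counter

MIN_PER_CLASS = 100  # suggested minimum number of rows per output class

def class_report(targets, min_per_class=MIN_PER_CLASS):
    """Count rows per class, list under-represented classes, and
    return the ratio between the largest and smallest class."""
    if not targets:
        raise ValueError("targets must not be empty")
    counts = Counter(targets)
    short = sorted(c for c, n in counts.items() if n < min_per_class)
    imbalance = max(counts.values()) / min(counts.values())
    return counts, short, imbalance

# Hypothetical target column for a three-class problem.
targets = ["Small"] * 250 + ["Medium"] * 120 + ["Large"] * 40
counts, short, imbalance = class_report(targets)
print("rows per class:", dict(counts))
print("classes below the suggested minimum:", short)
print(f"largest/smallest class ratio: {imbalance:.2f}")
```

A ratio close to 1 indicates an evenly balanced data set. Classes listed as below the minimum are the first place to collect more samples.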
Big picture questions before building a model
Creating a predictor is like buying a car — there are many options even though all of them should ultimately bring you from an origin to a destination. With the data collected, you already have a good idea about the big picture, including questions like:
- What do you want to predict?
- What do you gain from predicting it with a certain accuracy?
- What are the consequences of correct or incorrect predictions?
Another set of high-level questions you may want to ask when building a model is:
- Is it more important that predictions be accurate right now or is it more important that they adapt properly to changing operating conditions? For the former, accuracy is more important; for the latter, generalization is more important.
- Where will my predictor run? If you are planning to run a predictor on a small device, you want a small model that does exactly what you need. If you can budget for cloud computing, you can afford a larger model that is not as finely tuned.
- How resilient does my model have to be? For example, if adversarial examples to your predictor could pose a risk to property or life, you want a small model with high generalization.
It’s also important to think up front about how your classification results will be used in your deployment workflow. This can vary from “just helping the humans out a little” to “fully automated decision-making”. You might, for instance:
- Deploy a predictor that helps a human by calling attention to or prioritizing something in an existing workflow, with the goal of increasing efficiency or effectiveness.
- Create a predictor that fully replaces humans who are already doing a task, with the goal of doing things more reliably or at lower cost.
- Invent a new predictor to do a task that humans were never previously able to do in order to expand production capabilities.
As you progress through the steps laid out below, you’ll be making choices that trade off various aspects of your predictor’s performance. Having thought in advance about how the predictor will actually be used will help you a great deal when you make these choices.