Frequently Asked Questions
Browse our frequently asked questions about the Brainome Table Compiler, or contact us if you have more questions.
Brainome is the first data compiler that automatically solves supervised machine learning problems with repeatable and reproducible results and creates standalone Python predictors.
It is not. Brainome is licensed software from Brainome Inc. However, Brainome comes with a generous demo license for personal use and evaluation that allows individuals to try it out on data sets that have up to 100 features and 20,000 rows of data. For problems of larger dimensionality and for commercial use, please contact our sales team at firstname.lastname@example.org.
Brainome Table Compiler
Brainome requires approximately the following:
- Memory size : about 3 times your data file size
- Disk space: about 4 times your data file size.
- Processor: most of the computation is done on CPU cores. Multiple cores will speed up random forest modeling. A GPU will speed up neural network training (GPU docker images only).
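As a rough illustration of the sizing guidance above, the following sketch turns a data file size into approximate memory and disk figures. The function names are hypothetical and are not part of Brainome; the multipliers are the approximate ones quoted in the list, so treat the result as a ballpark estimate only.

```python
import os


def estimate_resources(data_file_size_bytes):
    """Return (memory_bytes, disk_bytes) from the approximate multipliers above.

    Hypothetical helper: memory is about 3x and disk about 4x the data file size.
    """
    memory_bytes = 3 * data_file_size_bytes
    disk_bytes = 4 * data_file_size_bytes
    return memory_bytes, disk_bytes


def estimate_for_file(path):
    """Apply the same estimate to a file on disk."""
    return estimate_resources(os.path.getsize(path))
```

For example, a 1 GB data file would call for roughly 3 GB of memory and 4 GB of free disk space under these assumptions.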
Missing targets are handled differently than missing feature cells.
- Missing values in the target column of a row of training data will generate an error.
- Empty cells within a row of training data are handled as if they were empty strings. A unique numerical identifier is generated for each empty cell. This allows Brainome to learn relationships between missing values and the target column.
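Because a missing target value produces an error while an empty feature cell does not, it can help to scan a training file for empty targets before compiling. The sketch below is one possible pre-check using only the Python standard library. The function name and the example column name are assumptions for illustration; this is not Brainome's own validation logic.

```python
import csv
import io


def find_rows_missing_target(csv_text, target_column):
    """Return 1-based data-row indices whose target cell is empty or absent.

    Empty feature cells are deliberately ignored, since only a missing
    target is described as an error.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = []
    for row_index, row in enumerate(reader, start=1):
        value = row.get(target_column)
        if value is None or value.strip() == "":
            missing.append(row_index)
    return missing


# Hypothetical data: row 1 has an empty feature cell, row 2 has no target.
sample = "a,b,label\n1,,yes\n2,3,\n"
rows_to_fix = find_rows_missing_target(sample, "label")
```

Here only the second data row is reported, because its `label` cell is empty; the empty feature cell in the first row is left alone.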
Brainome supports Neural Network (NN), Random Forest (RF), Decision Tree (DT), and Support Vector Machine (SVM) classification models, as well as Linear Regression models.
Starting with version 1.9, Brainome supports Linear Regression.
In some cases, a regression target can be quantized into a series of consecutive classes and can thus be trained using Brainome after mapping.
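One way to quantize a regression target into consecutive classes is equal-width binning, sketched below. The choice of equal-width bins and the function name are assumptions made for illustration; the text above does not prescribe a particular mapping, and other schemes (for example quantile-based bins) may suit some data better.

```python
def quantize_target(values, num_classes):
    """Map numeric targets to consecutive class indices 0..num_classes-1.

    Uses equal-width bins between the minimum and maximum value.
    """
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All targets are identical, so everything falls into class 0.
        return [0] * len(values)
    width = (hi - lo) / num_classes
    classes = []
    for v in values:
        index = int((v - lo) / width)
        # The maximum value lands on the upper edge; clamp it into the last bin.
        classes.append(min(index, num_classes - 1))
    return classes


class_labels = quantize_target([0.0, 2.5, 5.0, 7.5, 10.0], 4)
```

The resulting class column can then replace the original numeric target in the training file.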
If you’re starting from scratch and don’t have any data to measure yet, a good rule of thumb is to gather a minimum of 100 data points for each class. For some problems, it is possible to train with fewer samples than that if the data is highly generalizable. For others, it is possible that more data might be required. The best thing to do is to run the learnability measurement process on a regular basis to be sure you’re investing in collecting the right amount of training data for your task.
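The 100-points-per-class rule of thumb can be checked with a few lines of Python. This is a minimal sketch with a hypothetical function name; the threshold of 100 comes from the guideline above and is a starting point, not a guarantee of learnability.

```python
from collections import Counter


def classes_below_minimum(labels, minimum_per_class=100):
    """Return {class_label: count} for classes with fewer than the minimum samples."""
    counts = Counter(labels)
    return {label: count for label, count in counts.items() if count < minimum_per_class}


# Hypothetical label column: class "a" has 120 samples, class "b" only 40.
labels = ["a"] * 120 + ["b"] * 40
underrepresented = classes_below_minimum(labels)
```

Any class reported here is a candidate for further data collection before training.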
In that repository, the file testfiles.csv contains one column ($ID) with IDs and another column with filenames ending in CSV ($Fname). Each URL has the form https://www.openml.org/d/$ID/$Fname and the script validate.sh composes that URL and uses wget to copy the file locally.
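The URL composition described above can be sketched in Python as follows. This is not the validate.sh script itself; it assumes that testfiles.csv has no header row and that the ID is in the first column and the filename in the second, which the text does not confirm. The function name and the sample row are hypothetical.

```python
import csv
import io


def compose_urls(csv_text):
    """Build one URL per row from (ID, filename) pairs.

    Assumes no header row, ID in column 1 and filename in column 2.
    """
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        dataset_id, filename = row[0].strip(), row[1].strip()
        urls.append(f"https://www.openml.org/d/{dataset_id}/{filename}")
    return urls


# Hypothetical row, for illustration only.
example_urls = compose_urls("123,example.csv\n")
```

Each resulting URL could then be fetched with a tool such as wget, as the script does.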
Without doing the up-front work of measuring and “right-sizing” the machine learning model you’re building before you train on your data, you have no way of knowing whether the predictor you build will actually do what you want it to do.
There are lots of things that can go wrong if you only look at the performance accuracy of your machine learning model, including:
- You might accidentally build a model whose predictions are fitted too closely to the training data, and which won’t generalize well when new data is processed. (This is called “overfitting”.)
- You have no way of knowing whether you built your model using the right amount of training data – did you gather enough data? Or, perhaps, did you use way more data than you really needed?
If you measure first, you can quickly find out if there are any pre-processing steps needed to prepare the data. You can also detect bias in the data, determine which sort of machine learning model would work best with your data, and learn how resilient your eventual model will be to changes in input data and operating conditions.
We have tested our measurements on a wide variety of data sets, including a 176-task subset from OpenML which includes binary and multiclass classification problems from a multitude of sources, including bio/medical, finance, speech, vision, and natural language data. Our measurement-based process is able to automate the creation of predictors on these datasets and completely eliminate the need for hyper-parameter tuning. Our predictors are usually two orders of magnitude smaller than the state of the art, and the reduction in training time is usually at least one order of magnitude. In 70% of the cases, our measurement-based approach beats the state-of-the-art accuracy reported on OpenML.
Our results are reproducible at: http://github.com/brainome/OpenML
When you’re building a machine learning model for deployment in the real world, you want to be sure that the model is the right one for use on all the data that may show up to be processed. To be successful, you must have the ability to trade off generalization and accuracy to get the most successful performance of your system. In most cases, it’s much better to give away a few small points of accuracy to gain the stability and long-term usefulness that a more general model will provide.
As part of its up-front measurement process, Brainome can rapidly analyze which of your data attributes (columns) contain useful information for the predictions you’re trying to make. Knowing what subset of your data is actually needed to create general and accurate models is very powerful, both because you can greatly reduce your training data creation costs and because you can build much faster and smaller models.
Without doing the up-front work of measuring and “right-sizing” the machine learning model you’re building before you do training on your data, you have no way of knowing whether you’re spending the right amount of compute and storage to achieve your goals.
There are multiple things that often go wrong if you don’t “measure before you cut”:
- You build a model that is much larger than it needs to be.
- Your training time is much longer than it needs to be.
- Your run times are much longer than they need to be.
- Your model is less general than it could be.
- Your model relies on more attributes (columns) than it should.
If you don’t measure the learnability of your training data before you build your model, you have no way of knowing whether you can really build a quality model from what you have. Is the data essentially random? At the other extreme, is it demonstrably reliable? Or, as is often the case, is it somewhere in between? We use an iterative measurement approach over what we call a “capacity progression” to give a quantifiable answer to this question for your training data so that you can make conscious choices about investing in the gathering and labeling of the right amount of training data.
Noise resilience is just another way of talking about generalization. Generalization is measured in bits/bit: the higher the generalization, the more a machine learning model can predict for a given model size. Noise resilience is measured in decibels (dB): the more general the model, the greater the noise resilience.