Accuracy
Number of correctly predicted outcomes as a percentage of the total number of predictions made. Measured in percent (%).
Example: John predicted 2 out of 10 dice rolls correctly. John’s accuracy is 20%.
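The dice example can be checked with a few lines of Python. This is a minimal sketch: the function name `accuracy_percent` and the concrete rolls and guesses are made up for illustration, chosen only so that 2 of the 10 guesses match.

```python
def accuracy_percent(predictions, outcomes):
    """Return the share of predictions that match the outcomes, in percent."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return 100.0 * correct / len(predictions)


# Hypothetical data: ten dice rolls and John's guesses (two guesses match).
rolls = [3, 1, 6, 2, 5, 4, 1, 6, 2, 3]
guesses = [3, 2, 1, 2, 1, 1, 2, 1, 1, 1]

john_accuracy = accuracy_percent(guesses, rolls)
print(f"John's accuracy: {john_accuracy:.0f}%")
```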
Attribute
Attribute is synonymous with experimental factor, feature, dimension, or simply a column in the data file.
Each column available in the data set is evaluated to determine how important the information it contains is for creating the desired prediction model. Brainome can rank order columns by their importance and can filter out columns that are not important.
Bit (Binary digit)
Unit of measurement for memory capacity and for information content. One bit of information corresponds to the maximum possible reduction of uncertainty achieved when one instance of a binary classification task with balanced classes has been predicted correctly.
Example: A data set contains 100 distinct, uniform random entries for a binary classification task. The classes are balanced. The information gained by the observation of each classification outcome is exactly 1 bit. Equivalently, each data-to-classification mapping requires exactly 1 bit of Memory Equivalent Capacity in a machine learner.
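The "1 bit per balanced binary outcome" figure follows from the Shannon entropy of the class distribution. The sketch below computes that entropy from a list of labels; the helper name `class_entropy_bits` and the label lists are illustrative assumptions, not part of any Brainome API.

```python
import math
from collections import Counter


def class_entropy_bits(labels):
    """Shannon entropy of the empirical class distribution, in bits per instance."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# 100 instances with balanced binary classes: each outcome carries 1 bit.
balanced = [0] * 50 + [1] * 50
balanced_bits = class_entropy_bits(balanced)

# With imbalanced classes, each outcome carries less than 1 bit on average.
imbalanced = [0] * 90 + [1] * 10
imbalanced_bits = class_entropy_bits(imbalanced)

print(balanced_bits, imbalanced_bits)
```

Under this reading, the balanced data set of 100 instances holds 100 bits in total, which matches the 1 bit of Memory Equivalent Capacity per mapping stated in the example.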
Capacity Progression
Capacity progression measures the learnability of a dataset by plotting the number of decision points an ideal model needs to memorize the function presented by the training data against the number of instances presented to the predictor. Ideally, once the subset of presented data grows beyond the size needed to infer a rule, the number of decision points stops growing. This is called convergence, and it indicates that there is enough training data to identify a generalizable function. If, instead, the capacity progression increases linearly as more data is presented, then all the predictor can do is memorize (see Memorization / Overfitting). Anything between convergence and linear increase indicates that parts of the data can be generalized while other parts need to be memorized.
Defined as: a list of the numbers of decision points for increasing subset sizes, ordered from left to right.
Example: Our data consists of 100 fair coin tosses and capacity progression for a decision tree is measured with 10%, 25%, 50% and 100% of the data. The output is [10, 25, 50, 100].
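One simple way to read a capacity progression is to divide each number of decision points by the subset size it was measured on. The sketch below does only that; it does not measure decision points itself. The function name `progression_ratio` and the second, converging progression are hypothetical and serve only as contrast to the coin-toss example.

```python
def progression_ratio(subset_sizes, decision_points):
    """Decision points per presented instance for each subset size."""
    if len(subset_sizes) != len(decision_points):
        raise ValueError("both lists must have the same length")
    return [d / n for n, d in zip(subset_sizes, decision_points)]


sizes = [10, 25, 50, 100]  # 10%, 25%, 50% and 100% of 100 instances

# Coin tosses from the example: decision points grow linearly with the data.
coin_ratio = progression_ratio(sizes, [10, 25, 50, 100])

# Hypothetical converging progression: the count stops growing after a rule is found.
converging_ratio = progression_ratio(sizes, [10, 12, 12, 12])

print(coin_ratio)        # constant ratio: pure memorization
print(converging_ratio)  # falling ratio: a generalizable rule exists
```

A ratio that stays constant signals linear growth, while a ratio that keeps falling signals convergence.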
Decision Tree
The basic predictive model used in machine learning to go from experimental observations (represented in the branches) to conclusions (represented in the leaves). All machine learners can be converted into decision trees.
Generalization
A measure of the ability of a predictor to give the right answer on input data without relying on having memorized answers to specific inputs. The more a machine learning model can correctly handle a variety of inputs without having to add parameters to do its processing correctly, the more general it is. Measured in bits/bit. For binary balanced classes, generalization is defined as:
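A form consistent with the examples in this entry, written out here as an assumption, is the ratio of correctly classified instances (each worth 1 bit for balanced binary classes) to the Memory Equivalent Capacity of the machine learner:

$$
G = \frac{\text{number of correctly classified instances}}{\text{Memory Equivalent Capacity}} \quad \text{[bits/bit]}
$$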
Brainome understands this generalization ratio as being equivalent to a compression ratio and will warn about overfitting if the machine learner’s Memory Equivalent Capacity is larger than or equal to the information capacity needed to memorize the dataset using the machine learner as the encoder for the function implied by the dataset. This information capacity depends on the number of classes, the number of attributes, the class balance, and the concrete machine learner used. In general, we want to maximize generalization G. A value of G <= 1 is definitely overfitting. Predictors with G greater than 1 can still be overfitting. However, the larger the generalization, the lower the chance of overfitting. Generalization is closely related to resilience (see Resilience).
- The measured generalization G of a binary classifier is 0.1 bits/bit. This means, on average, there are parameters worth 10 binary predictions for one correctly classified instance. The machine learner overfits. We expect the correctly classified instances of the test set to be equivalent to the best guess.
- The measured generalization G of a binary classifier trained on balanced classes is 10 bits/bit. That is, on average, 10 instances (worth 1 bit of information each) are correctly classified by a set of parameters that can handle 1 bit each. We expect this predictor to work well with a test set of the same complexity.
- The pre-training estimate of G is 5 bits/bit for a neural network and 10 bits/bit for a decision tree. We expect the decision tree to be able to model the training dataset with fewer parameters than the neural network, which, given a representative test dataset, should lead to a smaller difference between training and test accuracy.
- A two-variable Boolean XOR has a truth table with 4 rows. That is, the function can be completely memorized with 4 bits of Memory Equivalent Capacity. A two-variable Boolean AND also has a truth table with 4 rows. However, if the first variable is 0, the value of the second variable does not matter. The table can therefore be reduced to three rows. This means that a Memory Equivalent Capacity of 4 bits memorizes AND, but it is possible that a machine learner trained on the 3 rows (0,1)->0, (1,0)->0, (1,1)->1 uses only 3 bits of Memory Equivalent Capacity and can generalize to predict the unseen case (0,0)->0. That is, the generalization achieved is G = 4/3 > 1 bits/bit for AND and G = 4/4 = 1 bits/bit for XOR.
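The arithmetic in the examples above can be reproduced with a short sketch. It assumes that G is the number of correctly classified instances divided by the Memory Equivalent Capacity in bits, which is how the examples read; the function names are hypothetical and not part of any Brainome tool.

```python
def generalization(correct_instances, mec_bits):
    """Assumed ratio: correctly classified instances per bit of capacity (bits/bit)."""
    if mec_bits <= 0:
        raise ValueError("Memory Equivalent Capacity must be positive")
    return correct_instances / mec_bits


def is_definitely_overfitting(g):
    """G <= 1 is definitely overfitting; G > 1 may still overfit."""
    return g <= 1.0


# AND: all 4 rows predicted correctly using 3 bits of capacity.
g_and = generalization(4, 3)
# XOR: all 4 rows predicted correctly, but all 4 bits are needed.
g_xor = generalization(4, 4)
# First bullet: one correct instance per 10 bits of parameters.
g_poor = generalization(1, 10)

for name, g in [("AND", g_and), ("XOR", g_xor), ("poor", g_poor)]:
    print(f"{name}: G = {g:.2f} bits/bit, overfitting: {is_definitely_overfitting(g)}")
```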
Memorization / Overfitting
A model that corresponds too closely to a particular set of training data, and may therefore fail to predict future observations. The closest possible correspondence to a particular set of data is a copy or any isomorphic transcoding of the data. Brainome therefore equates overfitting with memorization. A machine learner is memorizing when it reproduces the training predictions with 100% accuracy and the generalization G (see Generalization) is below the information capacity of the machine learner. At that point, every single internal parameter is set to the number and values required to get the exact right answer on every instance of a specific training set. The model acts as a dictionary and will perform very poorly on unseen data.
Memory Equivalent Capacity (MEC)
The explanation of this concept boils down to three definitions.
- Representation Function: The parameterized function a machine learner uses (either standalone or in composition) to adapt to the target function represented in the training data. For example, the representation function in a neural network is the activation function.
- Intellectual Capacity: The number of unique target functions a machine learner is able to represent (as a function of the number of model parameters).
- Memory Equivalent Capacity: With the identity function as representation function, N bits of memory are able to adapt to 2^N target functions. A machine learner’s intellectual capacity is memory-equivalent to N bits when the machine learner is able to represent all 2^N binary labeling functions of N uniformly random inputs.
The Memory Equivalent Capacity for a machine learner is dependent on the number of parameters used, the topology of the machine learner, the training method, as well as the training efficiency. It can be estimated as an upper limit. Measured in bits.
Example: A machine learner with 10 bits of Memory Equivalent Capacity is guaranteed to memorize any binary classification task of 10 instances or fewer.
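The counting argument behind this definition can be made concrete with a plain lookup table, which stores one bit per instance. The sketch below uses 3 instances instead of 10 to keep the enumeration small; the instance names and the `memorize` helper are hypothetical.

```python
from itertools import product


def memorize(inputs, labels):
    """A lookup table: stores one bit (the label) per distinct input."""
    return dict(zip(inputs, labels))


inputs = ["a", "b", "c"]  # N = 3 hypothetical instances

# All 2^N possible binary labelings of the N inputs.
labelings = list(product([0, 1], repeat=len(inputs)))

# Count the labelings that the N-entry table reproduces exactly.
reproduced = sum(
    all(memorize(inputs, labeling)[x] == y for x, y in zip(inputs, labeling))
    for labeling in labelings
)

print(len(labelings), reproduced)
```

The table reproduces every one of the 2^N labelings, so a machine learner that can do the same on N uniformly random inputs has at least N bits of Memory Equivalent Capacity.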
Neural Network (NN)
Neural Networks are networks of perceptrons. Brainome interprets each perceptron as an energy threshold in an electrical circuit and predicts architecture and other parameters using electrical engineering techniques. Just as the notion of energy generalizes to kinetic, electric, gravitational, etc., neural networks can generalize any type of data when configured correctly.
Capacity Utilized by a trained NN
The portion of the Neural Network’s Memory Equivalent Capacity (MEC) that is actually utilized to implement the trained classification function. This is analogous to the utilization of the total capacity of a hard drive.
Resilience
Resilience is the amount of variance an instance is allowed to assume before the prediction outcome changes. It is the inverse of generalization (see Generalization). Historically, signal variance is measured in decibels (dB), while Brainome measures generalization in bits/bit. The conversion is:
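A conversion consistent with the example in this entry (about 6 dB per bit of noise) is sketched below. The exact form is an assumption, with R denoting resilience and G the generalization ratio:

$$
R = -20 \log_{10}(G) \quad \text{[dB]}
$$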
Measured in dB.
The higher the generalization, the higher the noise resilience. However, generalization, being a ratio of information, is a positive number, while noise resilience measures the amount of uncertainty that can be added to an instance. Since information is reduction of uncertainty, resilience is expressed as a negative number. That is: -n dB of resilience can withstand n dB of noise.
Example: The noise resilience of a machine learner is -6 dB. That is, each instance can absorb on average one bit of noise before the classification result changes.
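The -6 dB figure can be reproduced numerically under one assumption: that resilience is -20·log10(G), the usual amplitude-ratio decibel formula with the sign flipped. This is a sketch of that assumed conversion, not a confirmed Brainome formula, and `resilience_db` is a hypothetical name.

```python
import math


def resilience_db(g):
    """Assumed conversion from generalization G (bits/bit) to resilience in dB."""
    if g <= 0:
        raise ValueError("generalization must be positive")
    return -20.0 * math.log10(g)


# G = 2 bits/bit gives roughly -6 dB, i.e. about one bit of noise per instance.
r_two = resilience_db(2.0)
# Doubling G again adds roughly another 6 dB of tolerated noise.
r_four = resilience_db(4.0)

print(f"G=2: {r_two:.2f} dB, G=4: {r_four:.2f} dB")
```

Under this assumption, every doubling of G corresponds to about 6 dB, which matches the "one bit of noise" reading in the example.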
% Training Data Memorized
The percentage of the dataset that is directly encoded into the machine learner without any generalization. Sometimes referred to as “exceptions to the rule” whereby the rule is the general rule encoded by the machine learner to predict outcomes.