WRONG! There is no data like enough data and the right data.
Let’s play a game!
If I give you the sequence 2, 4, 6, 8 and ask you what comes next, you will more than likely answer 10. For that same sequence, if I ask what comes after 1000, you will tell me 1002. There is no benefit in extending the initial sequence all the way to 100. You will tell me very quickly: That’s enough, I got it! It’s +2. More data takes more time and more compute. For your ML task, that means more unnecessary dollars spent.
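As a hypothetical sketch of that point: a plain least-squares line fit recovers the "+2" rule from just the four given points, so piling on more examples would add cost without adding signal. The variable names and the use of `numpy.polyfit` are my own illustration, not anything prescribed by the text.

```python
import numpy as np

x = np.array([1, 2, 3, 4])   # positions in the sequence
y = np.array([2, 4, 6, 8])   # observed values

# Fit y = a*x + b with ordinary least squares; four points suffice.
a, b = np.polyfit(x, y, 1)

def next_after(value):
    """Predict the value that follows `value` using the learned step."""
    return value + a

print(round(a))                   # learned step size: 2
print(round(next_after(1000)))    # 1002
```

Four clean points pin down the rule exactly; a thousand more would only make the same fit slower.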
Let’s try another one. Given the sequence 6, 5, 1, 3, can you guess what comes next? Don’t waste brain power; you won’t be able to answer. Why? Because it’s the last four digits of my phone number. No matter how many data points I give you, there is no way to extrapolate a rule. There is no rule: phone numbers are random precisely so that people can’t guess them. That is why having the right data is key. Wasting time and money trying to solve the unsolvable makes no sense.
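The same point can be sketched in code (a toy illustration of my own, not from the text): a lookup table can memorize random training digits perfectly, yet on fresh random digits it does no better than chance, because there is no rule to extrapolate.

```python
import random

random.seed(0)
# position -> random digit; there is no underlying rule to learn
train = [(i, random.randrange(10)) for i in range(100)]
test = [(i, random.randrange(10)) for i in range(100, 200)]

table = dict(train)  # pure memorization of the training pairs

def predict(position):
    # Return the memorized digit, or a constant fallback guess.
    return table.get(position, 0)

train_acc = sum(predict(i) == d for i, d in train) / len(train)
test_acc = sum(predict(i) == d for i, d in test) / len(test)

print(train_acc)  # 1.0 -- every training pair is memorized
print(test_acc)   # roughly 0.1 -- no better than guessing a digit
```

Perfect training accuracy here is worthless: the model has memorized noise, which is exactly the worst-case behavior the next paragraph's quote warns about.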
Gerald Friedland, CTO at Brainome and UC Berkeley professor, says: “Memorization is worst case generalization.” The holy grail in ML is to achieve best-case generalization and avoid overfitting.
The state of the art in machine learning is to take existing models, throw as much data as possible at them, and see how they perform. Then you tune hyperparameters to increase accuracy. This method only promotes the continued collection of data and an ever-increasing need for compute power. Should we keep doing that?
When building a bridge, the construction crew doesn’t duplicate the 100 other bridges they built previously and wait to see which one doesn’t collapse. That would take far too much time, cost too much, and would most likely not be a perfect fit.