SMOTE is a Machine Learning method that has been around for a while. In fact, it has been around since 2002, and the original paper has been cited over 12,600 times. However, I first heard about this method about a month ago during an interview. After looking it up, I realized it was total nonsense. The fact that unsubstantiated methods like this have been promoted throughout the Machine Learning community should shock and disappoint us. Like the boy in the emperor’s new clothes, it took me only a short time to see through this sham. In this article I will provide a logical debunking of the method and reframe whatever success it appears to have as accidental.
Per Jason Brownlee’s article:
“One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.”
The simple fact is that YOU CANNOT ADD NEW INFORMATION BY COMMINGLING OBSERVATIONAL DATA. Yes, that is right. The idea makes absolutely no sense. You can try to create alternative methods to extract the information in the data, but you can’t add new information. Any “new information” added is not indicative of predictive improvements. It is simply error, though perhaps random error. Duplicating examples in the minority class and weighting the evaluation criteria used in model updates are both more reliable, in the sense that they are more faithful to the original data-generating process.
The SMOTE method logically will not work, since the distribution of features which are not observed (i.e. synthetically produced) does not determine the targets with certainty. To produce the synthetic feature set, the modeler assumes the target is the same as that of the observational data points being combined. However, in actuality the probability that the target differs from the targets of the two points being combined is consistent with the amount of “surprise” created by the new data point. In other words, we are not able to create new, meaningful information by synthesizing observed data. This should intuitively make sense. If you could just create new information by blending two sources of independent information, then why even bother collecting additional data?
Imagine for a moment two concentric circles that are very clearly divided in space but are relatively close. The outer circle is target one, and the inner circle is target two. Now, we want to make a synthetic data point, so we take two points on target circle one and, consistent with SMOTE, draw a vector between them, and then shift one of the original points by a random fraction of this vector. This new synthetic data point will start to encroach upon target two. If those two observed data points are diametrically opposed, the new (synthetic) data point will actually be more likely to have been generated from the actual distribution of target two. If the observed data points lie near one another, the synthetic point will be less likely to lie in the density of target two. The same pattern would obviously not apply to the inner circle, since a point synthesized as a linear combination of two points on that circle cannot leave the circle’s circumference.
However, we can make these two circles arbitrarily close, so that no matter how tightly the data is spread around the circumference, it will still be more likely that a point synthesized from target one’s data was generated by circle two. Increasing or decreasing the noise will further distort the likelihood of each label for the synthetic data. In essence, our label is undefined (and therefore uncertain), yet we assume a label with certainty under the SMOTE method.
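The thought experiment above can be sketched as a small simulation. This is only an illustration under stated assumptions: the ring radii (5 and 4) match the example used later in this article, the noise level of 0.1 and the sample sizes are arbitrary, and pairs are drawn at random from the outer ring rather than from nearest neighbors, which exaggerates the effect compared with a typical SMOTE configuration. The helper names are hypothetical.

```python
import math
import random

def sample_ring(mean_radius, sd, n, rng):
    """Draw n points with radius ~ Normal(mean_radius, sd) and a uniform angle."""
    points = []
    for _ in range(n):
        r = rng.gauss(mean_radius, sd)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

def interpolate(p, q, gap):
    """SMOTE-style synthetic point: p + gap * (q - p), with gap in [0, 1]."""
    return (p[0] + gap * (q[0] - p[0]), p[1] + gap * (q[1] - p[1]))

rng = random.Random(0)
outer = sample_ring(5.0, 0.1, 200, rng)  # observed points for target one

# Two diametrically opposed points on the outer circle meet at the center.
center = interpolate((5.0, 0.0), (-5.0, 0.0), 0.5)

# Assumption: random pairs, not k nearest neighbors.
trials = 10_000
closer_to_inner = 0
for _ in range(trials):
    p, q = rng.sample(outer, 2)
    s = interpolate(p, q, rng.random())
    # Radius below 4.5 means the point is nearer the inner ring (mean radius 4).
    if math.hypot(s[0], s[1]) < 4.5:
        closer_to_inner += 1

fraction_inner = closer_to_inner / trials
print(center, fraction_inner)
```

Under these assumptions a large share of the synthetic points carry the outer ring’s label while sitting closer to the inner ring. Restricting the pairs to near neighbors shrinks that share, which is the dependence on spacing described above.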
Let’s visualize this now. We will use a normal distribution to determine a radius and then generate (x, y) points out of the polar coordinate system.
C1 = Normal(5, 1), C2 = Normal(4, 1)
Let’s draw 4 points from the first distribution, rotate them to be along the x and y axes, and then create a synthetic data point via the SMOTE algorithm. Then we will check how likely that data is to come from either distribution. In producing random numbers from C1, we might get this: [5.9123, 4.8277, 6.4053, 5.719]
Rotated, the points are (5.9123, 0), (0, 4.8277), (-6.4053, 0), (0, -5.719)
A simple y = mx + b can give us the line to synthesize along. For example, on points 1 and 2, we have b = 4.8277 (since x = 0) and therefore m = -0.8165519341 (i.e. -4.8277/5.9123). So we have y = -0.8165519341 * x + 4.8277. Now we need to generate a data point on this line. We will multiply a random uniform draw between 0 and 1 by the width of the domain x = [0, 5.9123]. A random generator gave us 0.2295, so let’s do the math. Our synthetic data point is:
x = 0.2295 * 5.9123 = 1.35687285
y = -0.8165519341 * 1.35687285 + 4.8277 = 3.71974285
So let’s plot it: (1.35687285, 3.71974285)
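The arithmetic for this point can be reproduced in a few lines. The sketch below also checks that the y = mx + b construction gives the same point as the usual SMOTE interpolation formula, base + gap * (neighbor - base). The variable names are illustrative, and gap = 1 - 0.2295 is simply the value that makes the two forms coincide.

```python
# Observed points 1 and 2 from the text, both drawn from C1.
p1 = (5.9123, 0.0)
p2 = (0.0, 4.8277)
u = 0.2295  # the random uniform draw quoted in the text

# Line through the two points, as in the text.
slope = (p2[1] - p1[1]) / (p2[0] - p1[0])  # -4.8277 / 5.9123
intercept = p2[1]                          # value of y at x = 0
x = u * p1[0]
y = slope * x + intercept

# Same point written as a SMOTE-style interpolation from p1 toward p2.
gap = 1.0 - u
x_smote = p1[0] + gap * (p2[0] - p1[0])
y_smote = p1[1] + gap * (p2[1] - p1[1])

print(round(x, 8), round(y, 8))
print(round(x_smote, 8), round(y_smote, 8))
```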
The radius is sqrt(1.35687285 ^ 2 + 3.71974285 ^ 2) = 3.9595.
This is actually less than the mean radius of our Normal distribution C2. Let’s take Z scores and find the likelihood of each label from the p-values, even though both observed points used in the synthesis were drawn from C1.
z1 = (3.9595 - 5) / 1 = -1.04 => p = .149
z2 = (3.9595 - 4) / 1 = -.04 => p = .484
So, it is a little over 3 times as likely that data this extreme comes from the second distribution, BUT we synthesized it using (x, y) features generated from the first distribution (which is our target). The likelihood could be determined as:
ps(c1) = .149 / 0.633 = .2354
ps(c2) = .484 / 0.633 = 0.7646.
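As a check on these figures, here is a minimal sketch using Python’s standard-library `statistics.NormalDist`. It takes the radius 3.9595 from the text as given and uses the lower-tail CDF as the p-value, which is an assumption about how the figures above were obtained. Results differ from the text in the third or fourth decimal place because the text rounds z and p before dividing.

```python
from statistics import NormalDist

radius = 3.9595  # radius of the synthetic point, from the text

c1 = NormalDist(mu=5.0, sigma=1.0)  # distribution of the assumed label
c2 = NormalDist(mu=4.0, sigma=1.0)  # the other distribution

# Lower-tail probabilities: chance of a radius at least this small.
p_c1 = c1.cdf(radius)
p_c2 = c2.cdf(radius)

# Normalize the two tail probabilities so they sum to 1.
share_c1 = p_c1 / (p_c1 + p_c2)
share_c2 = p_c2 / (p_c1 + p_c2)

print(round(p_c1, 3), round(p_c2, 3))
print(round(share_c1, 3), round(share_c2, 3))
```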
These numbers are the same whether using a one-sided or two-sided test. Using the PDF instead of the CDF gives us numbers leaning slightly more towards .5 (and is the statistically correct approach), but proves the same point.
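Both statements in the preceding paragraph can be checked with a short sketch, again with `statistics.NormalDist` and the radius 3.9595 from the text. The helper `share_of_c2` is a hypothetical name. The two-sided p-value is taken as twice the lower tail, which is valid here because the radius lies below both means.

```python
from statistics import NormalDist

radius = 3.9595
c1 = NormalDist(mu=5.0, sigma=1.0)
c2 = NormalDist(mu=4.0, sigma=1.0)

def share_of_c2(pair):
    """Normalize a (C1 value, C2 value) pair and return the C2 share."""
    return pair[1] / (pair[0] + pair[1])

one_sided = (c1.cdf(radius), c2.cdf(radius))
# Doubling both tails leaves the normalized shares unchanged.
two_sided = (2.0 * one_sided[0], 2.0 * one_sided[1])
# Density (PDF) values at the observed radius.
density = (c1.pdf(radius), c2.pdf(radius))

share_one_sided = share_of_c2(one_sided)
share_two_sided = share_of_c2(two_sided)
share_density = share_of_c2(density)

print(round(share_one_sided, 3), round(share_two_sided, 3), round(share_density, 3))
```

The density-based share for C2 lands between .5 and the tail-based share, so the point still leans toward C2, only less strongly.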
Let’s calculate the likelihoods on the original data points.
z1 = (5.9123 - 5) / 1 = .91 => p = .1814
z2 = (5.9123 - 4) / 1 = 1.91 => p = .0281
p1(c1) = .1814 / 0.2095 = 0.8659
p1(c2) = .0281 / 0.2095 = 0.1341
z1 = (4.8277 - 5) / 1 = -.17 => p = .4325
z2 = (4.8277 - 4) / 1 = .83 => p = .2033
p2(c1) = .4325 / 0.6358 = 0.6802
p2(c2) = .2033 / 0.6358 = 0.3198
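The same calculation can be wrapped in one function and applied to both observed radii and to the synthetic radius. This is a sketch under one assumption: the p-value for each distribution is the tail on whichever side of its mean the radius falls, which matches the mix of upper and lower tails used in the figures above. The function names are hypothetical, and z is left unrounded, so results differ slightly from the hand calculations.

```python
from statistics import NormalDist

C1 = NormalDist(mu=5.0, sigma=1.0)
C2 = NormalDist(mu=4.0, sigma=1.0)

def tail(dist, radius):
    """Probability of a radius at least this far from the mean, on the same side."""
    lower = dist.cdf(radius)
    return min(lower, 1.0 - lower)

def share_c1(radius):
    """Normalized share of the C1 tail probability against the C2 tail probability."""
    t1 = tail(C1, radius)
    t2 = tail(C2, radius)
    return t1 / (t1 + t2)

observed_1 = share_c1(5.9123)  # first observed point
observed_2 = share_c1(4.8277)  # second observed point
synthetic = share_c1(3.9595)   # point interpolated between them

print(round(observed_1, 3), round(observed_2, 3), round(synthetic, 3))
```

Both observed points lean toward C1, while the point interpolated between them leans toward C2.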
The method of synthesizing a new point is the one suggested in applying SMOTE. As you can see, the commingled data point is more likely to have come from distribution C2 than either of the original samples from distribution C1 is on its own. This might be thought of as the continuous analog to Simpson’s Paradox. Real information is distorted when data is linearly combined in a way that ignores the underlying process. This demonstrates that synthesizing data by mixing features and assuming the same target as both observational points is not logically sound. Although this is a single worked example, the logic generalizes to any data with nonlinear components. We simply cannot say with certainty what happens between two data points.
It’s trivial to show that by shrinking the standard deviations of C1 and C2 arbitrarily close to 0, and setting the mean of C2 to the radius of the linear combination of our observations from C1, the probability that the linear combination of points from C1 came from the distribution C2 becomes arbitrarily close to 1. In this case, the label of the point synthesized from two observed data points differs from the labels of the original observations with certainty. QED!
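A numeric sketch of this limiting argument follows. It makes several simplifying assumptions: the synthetic radius is held fixed at the 3.9595 from the worked example, both distributions share one standard deviation, and the comparison uses normalized densities rather than tail probabilities. With a shared standard deviation the normalizing constants cancel, so only the exponential term of the C1 density matters.

```python
import math

synthetic_radius = 3.9595   # radius of the interpolated point, from the text
c1_mean = 5.0               # mean radius of the assumed label's distribution
c2_mean = synthetic_radius  # C2 is centered exactly on the synthetic point

def share_c2(sd):
    """Normalized density share of C2 at the synthetic radius, for a shared sd."""
    z1 = (synthetic_radius - c1_mean) / sd
    # The C2 density is at its peak (z = 0), so the ratio of the C1 density
    # to the C2 density reduces to exp(-z1^2 / 2).
    ratio = math.exp(-0.5 * z1 * z1)
    return 1.0 / (1.0 + ratio)

sds = (1.0, 0.5, 0.1, 0.01)
shares = [share_c2(sd) for sd in sds]
for sd, share in zip(sds, shares):
    print(sd, share)
```

As the standard deviation shrinks, the share attributed to C2 climbs toward 1, even though the point is still labeled as C1.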
Where SMOTE appears to improve predictive performance, the improvement is not a consequence of the actual SMOTE algorithm at all. It is the consequence of using a prior to define the bounds of the observed data (and therefore how to synthesize it). We have just shown above that the entirety of the probability of the label in the synthetic nonlinear data is due to variance. SMOTE can be seen as using a surrogate model on the data. In the typical SMOTE algorithm, which uses nearest neighbors and places each new point somewhere along the vector that joins a point to its neighbor, we can imagine the distribution of potential synthetic data as a sort of lattice defined between all points and their nearest neighbors. We could even expand the algorithm to allow synthetic data points to connect to observational data, or synthetic points to other synthetic points.
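To make the lattice picture concrete, here is a minimal pure-Python sketch of the nearest-neighbor interpolation step as it is commonly described. It is not the reference implementation from the original paper or from any library. The function names, the default k, and the toy data are all assumptions made for illustration.

```python
import math
import random

def k_nearest(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    origin = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(origin, points[i]))
    return others[:k]

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic points on segments joining minority points to neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(minority, i, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        base, neighbor = minority[i], minority[j]
        # Every synthetic point silently inherits the minority label.
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy minority class (hypothetical data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_sketch(minority, n_synthetic=20, k=2)
print(new_points[:3])
```

Every generated point lies on a segment between two observed minority points, so the set of reachable points is exactly the lattice described above. Nothing in the procedure consults the majority class or the label distribution.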
None of this does anything to define new, meaningful information about the distribution we are concerned with predicting, namely P(Y | X). All it does is create a new predictive distribution P(Y | X*), where X* is the combination of both observational and synthetic data. In other words, SMOTE is, at best, as good as its surrogate model. If the synthetic data puts more emphasis on certain features of the data that you are more concerned with predicting (for example, maybe they account for a larger proportion of predictive accuracy), then we might see what appears to be improvement. However, even that standard is shaky.
Treating all data as equal brings us back to the problem SMOTE was created to solve, namely imbalanced datasets. Now, instead of imbalance in the targets, we end up with imbalance in the features through a surrogate model (with the intuition that a larger proportion of targets will be predicted successfully, and that this behavior is optimal). The utility of correctly predicting a target with unusual features may be much higher. Imagine an autonomous vehicle careening off the road. The utility of knowing which of the objects in blurry images or sensor readings are people is much higher than the utility of identifying people during normal activity. It is the same label set but a different feature set, and the predictions are not all of equal value. This problem may exist anyway, and it should be considered more often than it currently is in predictive modeling. SMOTE both adds and removes information in unusual ways, without regard for the primary issue of extracting accurate information from the data.
Jonas Peters’s work makes a strong claim about the predictive nature of causal data as well. Since we technically do not have labels for the synthetic feature set, creating new, synthesized data is consistent with defining a broader distribution of unlabeled data. If our feature data X is the cause of our target data Y, then even our surrogate model (i.e. the prior on the feature distribution) should offer us no predictive hints about the desired distribution P(Y | X). If the feature data is causal, we are essentially out of luck. This is due to the independence of cause and mechanism. Anticausal data should still be viewed with doubt, since SMOTE adds no information and there is therefore no logical reason it should add predictive value.
The original SMOTE paper does something that should be frowned upon (but, it seems, is often done) when first introducing a method to Machine Learning. It uses real-world, disparate data to prove a methodological point. No matter how much this data is cleaned, it almost certainly has quirks that make it unreliable. Furthermore, none of the datasets used in this paper is a well-established dataset for testing algorithms. We are expected to take the authors at their word. As the paper grows older (and the original data harder to validate), the only criteria for evaluation are how many times it has been cited and how many people are actively using the method. Groupthink is not an adequate tool for the discovery of scientific truth. To reliably prove their method (which I believe would instead have been disproved), the authors should have developed a purely statistical generative process to demonstrate success in multiple scenarios (if they chose a computational approach). In the paper, the authors’ explanation of why SMOTE should work is founded on literally nothing. It claims to be inspired by the idea of transforming data via rotation and skew (which has nothing in common with commingling IID data). It was more or less an artistic, creative data idea that should have been immediately shot down by legitimate researchers.
I have consistently claimed that all science is data science (otherwise it is philosophy, politics, art, or religion; yes, science is a religion to many). It appears, however, that not all data science is science, and certainly some of what is pitched as data science is not. Or it might be better to say that SMOTE is not data science, and is not an honest or effective Machine Learning technique. It is more of a parlor trick that can be used to confuse and defraud those without esoteric knowledge of Machine Learning.