Gradient Descent is, in essence, a simple optimization algorithm. It iteratively adjusts the parameters of a model (here, the intercept and slope of a line) to minimize an error or cost function, so that the resulting line best fits the observed data with the smallest possible error. It is THE inner working of the linear functions we get taught in university statistics courses, yet many of us finish our Master's (business) degree without ever having heard the term. Hence, this blog.

Linear regression is among the simplest and most frequently used supervised learning algorithms. It reduces observed data to a linear function (Y = a + bX) in order to retrieve a set of general rules, or to predict the Y-values for instances where the outcome is not observed.

One can define various linear functions to model a set of data points (e.g. below). However, each of these may fit the data better or worse than the others. How can you determine which function fits the data best? Which function is an optimal representation of the data? Enter stage Gradient Descent. By iteratively testing values for the intercept (a; where the line crosses the Y-axis, at X = 0) and the slope (b; the change in Y when X increases by 1) and comparing the resulting predictions against the actual data, Gradient Descent finds the optimal values for the intercept and the slope. These optimal values are the ones that result in the smallest difference between the predicted values and the actual data – the least error.
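The idea above can be captured in a few lines of code. Below is a minimal, hand-rolled sketch in Python (the function name and the toy data are mine, purely for illustration): starting from an arbitrary guess, the intercept and slope are repeatedly nudged in the direction that reduces the mean squared error.

```python
# Minimal gradient descent for simple linear regression (illustrative sketch).
# Fits y = a + b*x by repeatedly adjusting a (intercept) and b (slope)
# in the direction that reduces the mean squared error.

def gradient_descent(x, y, learn_rate=0.01, n_iter=10000):
    n = len(x)
    a, b = 0.0, 0.0                      # arbitrary starting guess
    for _ in range(n_iter):
        pred = [a + b * xi for xi in x]  # current predictions
        err = [p - yi for p, yi in zip(pred, y)]
        # Partial derivatives of the mean squared error:
        grad_a = (2 / n) * sum(err)
        grad_b = (2 / n) * sum(e * xi for e, xi in zip(err, x))
        a -= learn_rate * grad_a         # step against the gradient
        b -= learn_rate * grad_b
    return a, b

# Toy data generated from y = 1 + 2x, so the fit should recover those values
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
a, b = gradient_descent(x, y)
print(round(a, 3), round(b, 3))  # close to intercept 1 and slope 2
```

The learning rate controls the step size: too small and convergence is slow, too large and the updates overshoot the minimum and diverge.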

The video below is part of Stanford University's machine learning course on Coursera, and it provides a very intuitive explanation of the algorithm and its workings:

A recent blog demonstrates how one could program the gradient descent algorithm in R for themselves. Indeed, the code below produces the same results as the linear modelling function in R's base environment.

gradientDesc <- function(x, y, learn_rate, conv_threshold, n, max_iter) {
  m <- runif(1, 0, 1); c <- runif(1, 0, 1)  # random starting slope & intercept
  MSE <- sum((y - (m * x + c))^2) / n
  iterations <- 0
  repeat {
    yhat <- m * x + c
    m <- m - learn_rate * (1 / n) * sum((yhat - y) * x)  # step down the gradient
    c <- c - learn_rate * (1 / n) * sum(yhat - y)
    MSE_new <- sum((y - (m * x + c))^2) / n
    if (MSE - MSE_new <= conv_threshold || iterations > max_iter) {
      abline(c, m)  # draw the fitted line on the existing plot
      return(paste("Optimal intercept:", c, "Optimal slope:", m))
    }
    MSE <- MSE_new; iterations <- iterations + 1
  }
}
# compare resulting coefficients
coef(lm(mpg ~ disp, data = mtcars))
plot(mtcars$disp, mtcars$mpg)
gradientDesc(x = mtcars$disp, y = mtcars$mpg, learn_rate = 0.0000293, conv_threshold = 0.001, n = 32, max_iter = 2500000)

Although the algorithm may end up in a so-called "local optimum" (a set of values for a and b that fits better than any nearby alternative, but not necessarily better than all alternatives), such issues can be handled but deserve a separate discussion. For simple linear regression with squared error, the cost surface is convex, so the local optimum is also the global one.

A time series can be considered an ordered sequence of values of a variable at equally spaced time intervals. To model such data, one can use time series analysis (TSA). TSA accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend, or seasonal variation) that should be accounted for.
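One simple way to check for the internal structure mentioned above is the lag-1 autocorrelation: the correlation between each value and the value one time step earlier. The sketch below (a hand-rolled illustration; the function name is mine) shows that a trending series scores high on this measure.

```python
# Lag-1 autocorrelation: a quick check for the "internal structure"
# (e.g. trend) that ordinary regression ignores. Illustrative sketch.

def autocorr_lag1(series):
    n = len(series)
    mean = sum(series) / n
    # covariance of the series with itself shifted by one step,
    # divided by the overall variance
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

trending = [1, 2, 3, 4, 5, 6, 7, 8]  # strong internal structure
print(round(autocorr_lag1(trending), 2))  # clearly positive
```

A value near zero would suggest the observations are roughly independent; a clearly positive value, as here, signals that neighbouring observations move together and TSA methods are warranted.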

TSA has several purposes:

Descriptive: Identify patterns in correlated data, such as trends and seasonal variations.

Explanation: These patterns may help in obtaining an understanding of the underlying forces and structure that produced the data.

Forecasting: In modelling the data, one may obtain accurate predictions of future (short-term) trends.

Intervention analysis: One can examine how (single) events have influenced the time series.

Quality control: Deviations on the time series may indicate problems in the process reflected by the data.

TSA has many applications, including:

Economic Forecasting

Sales Forecasting

Budgetary Analysis

Stock Market Analysis

Yield Projections

Process and Quality Control

Inventory Studies

Workload Projections

Utility Studies

Census Analysis

Strategic Workforce Planning

AlgoBeans has a nice tutorial on implementing a simple time series model in Python. They explain and demonstrate how to deconstruct a time series into daily, weekly, monthly, and yearly trends, how to create a forecasting model, and how to validate such a model.
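The core trick behind such deconstructions is easy to sketch by hand: a centred moving average whose window spans a full seasonal cycle averages the seasonal ups and downs away, leaving the trend. Below is a minimal illustration with synthetic data (the helper name and the numbers are mine, not from the tutorial).

```python
# Extracting the trend from a seasonal series with a centred moving
# average. Illustrative sketch on synthetic data.

def moving_average(series, window):
    half = window // 2
    return [sum(series[i - half:i + half + 1]) / window
            for i in range(half, len(series) - half)]

# Synthetic data: an upward trend t plus a repeating 5-period season
season = [3, -1, -2, 1, -1]              # seasonal pattern, sums to zero
series = [t + season[t % 5] for t in range(15)]

trend = moving_average(series, 5)        # window = season length
print(trend)                             # the underlying trend t re-emerges
```

Subtracting the recovered trend from the original series then isolates the seasonal component, which is exactly the decomposition the tutorials automate.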

Analytics Vidhya hosts a more comprehensive tutorial on TSA in R. They elaborate on the concepts of a random walk and stationarity, and compare autoregressive and moving average models. They also provide some insight into the metrics one can use to assess TS models. This web-tutorial runs through TSA in R as well, showing how to perform seasonal adjustments on the data. Although the datasets they use have limited practical value (for businesses), the stepwise introduction of the different models and their modelling steps may come in handy for beginners. Finally, business-science.io has four amazing posts on how to implement time series in R following the tidyverse principles using the tidyquant package (Part 1; Part 2; Part 3; Part 4).

Data preparation forms a large part of every data science project. Some claims go as far as stating that 80-95% of a data scientist's workload consists of data preparation.

Outlier detection is one of the actions that make up this preparation phase. It is the process by which the analyst takes a closer look at the data and checks whether there are data points that behave differently from the rest. Such anomalies are called outliers, and depending on their nature the analyst may (or may not) want to handle them before continuing on to the modeling phase.

Outliers exist for several reasons, including:

The data may be incorrect.

The data may be missing but has not been registered as such.

The data may belong to a different sample.

The data may have (more) extreme underlying distributions (than expected).

Moreover, there are various types of outliers:

Point outliers are individual data points that are different from the rest of the dataset. This is the most common type of outlier in practice. An example would be a person 2.10 meters tall in a dataset of sampled population heights.

Contextual outliers are individual data points that would not necessarily be outliers based on their value alone, but are because of the combination of their value with their current context. An example would be an outside temperature of 25 degrees Celsius, which is not necessarily weird but is most definitely unusual in December.

Collective outliers are collections of data points that are collectively different from the rest of the data sample. Again, the individual data points would not necessarily be outliers based on their individual values, but are because of their values combined. An example would be a prolonged period of extreme drought. Where individual days without rain may not be outliers, a long stretch without precipitation can be considered an anomaly.

There is no rigid definition of what makes a data point an outlier. One could even state that determining whether or not a data point is an outlier is a quite subjective exercise. Nevertheless, there are multiple approaches and best practices to detecting (potential) outliers.

Univariate outliers: When a case or data point has an extreme value on a single variable, we refer to it as a univariate outlier. Standardized values (Z-scores) are a frequently used method to detect univariate outliers on continuous variables. However, here the researcher will have to determine a certain threshold. For example, (-)3.29 is frequently used, where data points whose Z-value lies beyond this value are considered outliers. If the variable follows a normal distribution, the chance of observing a value more than 3.29 standard deviations above the mean is about 0.05%, or 1 in 2,000. As you can see, the larger the dataset, the more likely you are to find such extreme values.
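In code, this amounts to standardizing the variable and flagging values beyond the threshold. A minimal sketch (function name and sample data are mine); note that in a sample this small the |Z| > 3.29 cut-off can never trigger, so a lower threshold is used for the demonstration:

```python
# Flagging univariate outliers with standardized (Z) scores.
# Illustrative sketch using the thresholding approach described above.

def z_outliers(values, threshold=3.29):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5  # sample SD
    return [v for v in values if abs((v - mean) / sd) > threshold]

heights = [1.76, 1.82, 1.69, 1.74, 1.80, 1.71, 1.78, 2.10]
print(z_outliers(heights, threshold=2))  # only the 2.10 m person is flagged
```

This also illustrates a known weakness of the Z-score approach: the outlier itself inflates the mean and standard deviation used to detect it, which is why robust variants (e.g. based on the median) are sometimes preferred.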

Bi- & multivariate outliers: A combination of unusual values on multiple variables simultaneously is referred to as a multivariate outlier. Here, a bivariate outlier is an outlier based on two variables. Normally you’d first check and handle univariate outliers, before turning to bi- or multivariate outliers. The process here is somewhat more complicated than for univariate outliers, and there are multiple approaches one can take (e.g. distance, leverage, discrepancy, influence). For example, you can look at the distance of each data point in the multivariate space (X1 to Xp) compared to the other data points in that space. If the distance is larger than a certain threshold, the data point can be considered a multivariate outlier, as it is that much different from the rest of the data considering multiple variables simultaneously.
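The distance idea can be sketched very simply: standardize each variable, then measure how far each point lies from the centre of the data. The example below is a simplified illustration (names and numbers are mine); Mahalanobis distance refines this by also accounting for the correlations between variables.

```python
# A simple distance-based sketch of multivariate outlier detection:
# standardize each variable, then compute each point's Euclidean
# distance to the centre of the data.

def distances_from_centre(points):
    dims = len(points[0])
    n = len(points)
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    sds = [(sum((p[d] - means[d]) ** 2 for p in points) / (n - 1)) ** 0.5
           for d in range(dims)]
    return [sum(((p[d] - means[d]) / sds[d]) ** 2 for d in range(dims)) ** 0.5
            for p in points]

# Height (m) and weight (kg): the last point is unusual mainly
# because of the *combination* of being short and heavy
points = [(1.70, 65), (1.80, 80), (1.75, 72), (1.85, 85), (1.60, 95)]
dists = distances_from_centre(points)
print(max(dists) == dists[-1])  # the short, heavy point lies furthest out
```

Points whose distance exceeds a chosen threshold would then be flagged as multivariate outliers.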

Visualization: In trying to detect univariate outliers, data visualizations may come in handy. For example, histograms or frequency distributions will quickly demonstrate any data point that has unusually high or low values. Boxplots can similarly hint at values that fall only just outside of the expected range or are really extreme outliers. Boxplots thus combine visualization with rule-based detection.

Model-based: Apart from the above-mentioned standardization with Z-values, there are multiple model-based methods for outlier detection. Most assume that the data follows a normal or Gaussian distribution, and hence identify which data points are unlikely based on the data's mean and standard deviation. Examples are Dixon's Q-test, Tukey's test, the Thompson tau test, and Grubbs' test.

Grouped data: If there is a grouping variable involved in the analysis (e.g., logistic regression, analyses of variance) then the data of each group can best be assessed for outliers separately. What can be considered an outlier in one group is not necessarily an unusual observation in a different group. If the analysis to be performed does not contain a grouping variable (e.g., linear regression, SEM), then the complete dataset can be assessed for outliers as a whole.
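The group-wise logic can be sketched directly: compute Z-scores within each group rather than over the pooled data. In the hypothetical example below (names and numbers are mine), a salary of 52 looks ordinary across the whole sample but stands out within its own group.

```python
# Assessing outliers per group: a value that is unremarkable overall
# may still be unusual within its own group. Illustrative sketch.

def group_z_scores(records):
    """records: list of (group, value) pairs -> within-group Z-scores."""
    groups = {}
    for g, v in records:
        groups.setdefault(g, []).append(v)
    stats = {}
    for g, vals in groups.items():
        mean = sum(vals) / len(vals)
        sd = (sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)) ** 0.5
        stats[g] = (mean, sd)
    return [(v - stats[g][0]) / stats[g][1] for g, v in records]

# Salaries (in thousands) for juniors and seniors: 52 sits comfortably
# between the groups overall, but is extreme among the juniors
records = [("junior", 30), ("junior", 32), ("junior", 31), ("junior", 52),
           ("senior", 55), ("senior", 60), ("senior", 58)]
z = group_z_scores(records)
print(round(z[3], 1))  # the 52k junior has the largest within-group Z-score
```

With larger groups one would apply the same thresholds as in the univariate case, just computed per group.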

There are several ways to handle outliers:

Keep them as is.

Exclude the data point (i.e., censoring/trimming/truncating).

Replace the data point with a missing value.

Replace the data point with the nearest 'regular' value (i.e., Winsorizing).

Run models both with and without outliers.

Run model corrections within the analysis (only possible in specific models).
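To make one of these options concrete, Winsorizing replaces extreme values with the nearest 'regular' value instead of dropping them. A minimal sketch (the function name and the percentile choice are mine, for illustration):

```python
# Winsorizing: clip extreme values to the values found at chosen
# percentile positions of the sorted data. Illustrative sketch.

def winsorize(values, lower_pct=0.10, upper_pct=0.90):
    s = sorted(values)
    n = len(s)
    lo = s[round(lower_pct * (n - 1))]   # value at the lower percentile
    hi = s[round(upper_pct * (n - 1))]   # value at the upper percentile
    return [min(max(v, lo), hi) for v in values]

data = [1, 12, 13, 14, 15, 15, 16, 17, 18, 99]
print(winsorize(data))  # 1 and 99 are pulled in to 12 and 18
```

Unlike trimming, this keeps the sample size intact while limiting the leverage of the extreme values.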

There are several reasons why you may not want to deal with outliers:

When taking a large sample, outliers are part of what one would expect.

Outliers may be part of the explanation for the phenomena under investigation.

Several machine learning and modeling techniques are robust to outliers or may be able to correct for them.

To end on a light note, Malcolm Gladwell wrote a wonderful book called Outliers. In it, he examines the factors behind personal success; the reasons why "outliers" such as Bill Gates have become so stinking rich. He goes on to show that these successful people are not necessarily random anomalies or outliers, but that there are perfectly sensible explanations for their success.

References and further reading:

Barnett, V., & Lewis, T. (1974). Outliers in statistical data. Wiley.

Tabachnick, B. G., Fidell, L. S., & Osterlind, S. J. (2001). Using multivariate statistics. Pearson.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.

Gladwell, M. (2008). Outliers: The story of success. Hachette UK.