To test how predictable a channel's statistics are, John wants to use regression to predict the last segment of each channel's time series. The progression of these indicators does not appear to be very linear, but he has heard of a well-known library called XGBoost that implements gradient boosting. It supports regression and is reportedly usable for time series forecasting, so he wants to try it on the three base indicators of a channel: its views, its number of subscribers, and its number of videos.

Unfortunately, training the machine learning model on the whole dataset takes far too long, but since this is only a proof of concept, he decides to run the model on a sample of channels. If the approach proves successful, he plans to run the model on the full dataset on a cluster, as the library supports Hadoop and Spark; for now, this small scale will suffice.
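The sampling step can be sketched as follows. This is a minimal illustration with synthetic data: the column names, the per-channel weekly layout, and the sample size of 50 are all assumptions, not John's actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one row per (channel, week) pair.
# The real dataset would be loaded from disk; this stand-in is random.
rng = np.random.default_rng(0)
channels = [f"channel_{i}" for i in range(1000)]
df = pd.DataFrame({
    "channel": np.repeat(channels, 52),
    "views": rng.integers(0, 1_000_000, size=1000 * 52),
})

# Draw a manageable subset of channels for the proof of concept.
sampled = rng.choice(df["channel"].unique(), size=50, replace=False)
subset = df[df["channel"].isin(sampled)]

print(subset["channel"].nunique())  # 50
```

Sampling whole channels (rather than random rows) keeps each time series intact, which matters when the model is fitted per channel.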

As in the clustering step, exponential smoothing is applied to remove the oscillations and make the time series easier to interpret visually. The indexes are adjusted to work with XGBoost, and John trains the model to predict the last quarter of the data points, plotting the predictions against the real evolution of the statistics.


As we can see, John has as little success with forecasting as he had with clustering. Even if the model occasionally manages to follow reality by chance, most of the time it is simply wrong. The best guess generally seems to be the last known value, from which the model rarely deviates much. This is typical of time series that are too chaotic to predict. The failure could also stem from a lack of data, or another type of model might handle the complexity of these series better, but most likely forecasting is simply not feasible on this kind of data. At this point, John gives up on the idea.
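The observation that the model barely improves on the last known value can be made precise with a persistence baseline: forecast every held-out point as the last training value and compare errors. If a learned model cannot beat this trivial forecast, the series has little learnable structure at that horizon. The series below is synthetic, standing in for one smoothed channel indicator.

```python
import numpy as np

# Synthetic stand-in for one channel's smoothed indicator (random walk).
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0, 1, size=100)) + 100

# Same split as the experiment: last quarter held out.
split = int(len(series) * 0.75)
actual = series[split:]

# Naive persistence baseline: every forecast is the last training value.
persistence = np.full_like(actual, series[split - 1])

rmse = np.sqrt(np.mean((actual - persistence) ** 2))
print(round(rmse, 3))
```

Reporting the model's RMSE alongside this baseline would turn "flat-out wrong" into a measurable gap, and makes giving up a defensible conclusion rather than a visual impression.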