Preparing the data

Having seen that one-off events do not have a significant impact on the evolution of a channel, John now wants to know whether there are general trends in the growth of a channel's basic statistics: views, subscribers and videos. He wants to cluster the channels into groups that progressed similarly over time and see whether certain "profiles" exist on the platform. This would later help him grow his own YouTube channel, since he could target a specific profile likely to lead it into the top charts of the platform.

He noticed that the time series showed significant oscillations. A first step was therefore to reduce this phenomenon. To do this, he tried two techniques: a rolling average, which was not conclusive, and exponential smoothing, which gave satisfactory results. He then divided the dataset into training, validation and testing sets.
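
To make the smoothing step concrete, here is a minimal sketch of simple exponential smoothing. The function name `exponential_smoothing` and the value of `alpha` are assumptions chosen for illustration; they are not taken from John's actual analysis.

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing.

    Each smoothed value is a weighted average of the current observation
    and the previous smoothed value. `alpha` (hypothetical default) controls
    how quickly older observations are forgotten.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in series:
        if not smoothed:
            # The first smoothed value is the first observation itself.
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Toy oscillating series standing in for a channel statistic.
print(exponential_smoothing([10, 20, 10, 20], alpha=0.5))
# [10.0, 15.0, 12.5, 16.25]
```

A smaller `alpha` gives a smoother curve but reacts more slowly to real changes in the series.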

A second problem was the scale at which these channels evolve: as seen previously, there are important quantitative differences between them. It was therefore necessary to normalize the data. He first used the min-max method, but it was problematic: this technique relies on the whole series to normalize it, which leaks data from the "future". To counter this problem, it is possible to do a partial min-max normalization, which relies only on a segment at the beginning of the data. Even though it overcomes the leakage problem, this normalization is sensitive to outliers, which are common in this type of data. John therefore switches to a max-abs normalization, which he considers more resilient to outliers, also implemented as a partial normalization so that no future information is leaked.
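
The idea of a partial normalization can be sketched as follows: the scale is computed only from the first points of the series and then applied to the whole series. The function name and the `n_reference` parameter are hypothetical, and the handling of an all-zero reference segment is an assumption of this sketch.

```python
def partial_max_abs_normalize(series, n_reference):
    """Scale a series by the largest absolute value of its first points.

    Only the first `n_reference` observations are used to compute the
    scale, so later ("future") values never influence the normalization.
    """
    if not 0 < n_reference <= len(series):
        raise ValueError("n_reference must be between 1 and len(series)")
    scale = max(abs(x) for x in series[:n_reference])
    if scale == 0:
        # Assumption: leave the series unscaled if the reference segment is all zeros.
        return [float(x) for x in series]
    return [x / scale for x in series]


print(partial_max_abs_normalize([2, -4, 1, 8], n_reference=2))
# [0.5, -1.0, 0.25, 2.0]
```

Values after the reference segment can exceed 1 in absolute value, as the last point of the example shows; this is the price of not looking at the future.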


After researching several clustering algorithms, he decides to use K-Means because it is fast and easily interpretable. To decide on the number of clusters, he uses cross-validation on the validation dataset, with the silhouette score of the clusters as the metric.
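
One possible way to pick the number of clusters with the silhouette score is sketched below, using scikit-learn. It assumes that K-Means is fitted on the training set and scored on the validation set; the function name, the candidate range of k and the synthetic data are all illustrative assumptions, not John's actual setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def choose_n_clusters(X_train, X_val, candidate_ks=range(2, 7), random_state=0):
    """Return (best_k, scores), where scores maps k to the validation silhouette."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X_train)
        val_labels = model.predict(X_val)
        # The silhouette score is undefined if every point gets the same label.
        if len(np.unique(val_labels)) < 2:
            continue
        scores[k] = silhouette_score(X_val, val_labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the channel features: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
X = rng.permutation(X)  # shuffle rows before splitting
X_train, X_val = X[:90], X[90:]

best_k, scores = choose_n_clusters(X_train, X_val)
print(best_k)
```

On real time series, each row of `X` would be one channel's normalized series. It is also worth inspecting the cluster sizes, for example with `np.bincount` on the predicted labels, since a high silhouette score can hide one dominant cluster.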

For this clustering task, John decides to use the delta-indicators because, as they are not cumulative, the differences between them are easier to see, even if they are harder to explain. He starts the analysis with the delta-views.

As we can see, the data is not partitioned equally between the clusters at all. On the contrary, there appears to be one large cluster that concentrates almost all the time series. Before drawing any conclusions, John repeats this clustering approach on the delta-subscribers and delta-videos.

The results are similar across all the statistics analyzed: one big cluster, and therefore no significant results. John, unhappy with this outcome, thinks about why the clustering would fail. He tries other clustering algorithms but still gets the same results. He also tries to discard the low-end and high-end data, for example the lowest and highest quartiles of a statistic, but this does not change the results either.
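
Discarding the lowest and highest quartiles of a statistic could look like the sketch below. The function name is hypothetical, and the choice to keep values lying exactly on the quartile boundaries is an assumption.

```python
import numpy as np


def interquartile_mask(values):
    """Boolean mask selecting values between the 25th and 75th percentiles."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [25, 75])
    # Assumption: boundary values are kept (inclusive comparison).
    return (values >= low) & (values <= high)


# Toy per-channel statistic, e.g. total views.
stat = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
mask = interquartile_mask(stat)
print(stat[mask].tolist())
# [3, 4, 5, 6, 7]
```

The same mask can then be used to select the matching rows of the time-series matrix before clustering.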

Focusing on one category

At this point in the analysis, John thinks that trying to cluster all of the data at once may be too difficult. There is too much noise, which might explain the results he gets. Since different communities use the platform very differently, it may be better to separate them. His channel focuses on the category "Howto & Style", so why should he bother with the rest? He decides to retry his clustering approach, limiting it to the channels that make videos in the "Howto & Style" category, to see if he gets better results.

But as you can see in the plots above, focusing on a single category does not help with the clustering either. The distribution between clusters is still extremely uneven, with one of them concentrating almost all the data.


At this point, John is forced to conclude that his clustering approach will not work. There may still be too much noise in the data to cluster it. The results he gets when trying to cluster the channels into profiles will not help him grow his channel. It could be different in other categories, but that would not help him either: he has decided to focus his channel on the "Howto & Style" category, which, as we have seen previously, could be a good category to start with.

Seeing that he cannot group the channels based on their evolution, he wants to know whether the channels' progress in their main statistics (views, subscribers and videos) is truly unpredictable and chaotic, or whether it would still be possible to forecast their evolution to some extent. He moves on to the next step of his analysis.