The role of streaming machine learning in analyzing encrypted traffic

Organizations are now creating and moving more data than at any other time in human history. Network traffic continues to grow: global internet bandwidth increased by 29% in 2021, reaching 786 Tbps. On top of record traffic volumes, 95% of traffic is now encrypted, according to Google. As threat actors continue to evolve their tactics and techniques (for example, by hiding attacks in encrypted traffic), securing organizations becomes increasingly difficult.

To help address these issues, many network operations and security teams are relying more on machine learning (ML) technologies to identify faults, anomalies, and threats in network traffic. But as encrypted traffic becomes the norm, traditional ML technologies must also evolve. In this article, I’d like to examine the types of ML models in use today and explore how they can be combined with Deep Packet Dynamics (DPD) technology to gain visibility into threats that might be lurking in encrypted traffic.

To be successful with ML, NOC and SOC teams need three things: data collection, data engineering, and model scoring.

Data collection involves extracting metadata directly from the network packet stream. Data engineering is the process of moving raw data to the right place and transforming it to fit into a model. This includes tasks such as data normalization and feature creation. Model scoring is the final step where ML algorithms are applied to the data. This includes the necessary steps of training and testing the models.
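
To make the data engineering step concrete, here is a minimal sketch of feature creation followed by normalization. The flow records, field names, and the two derived features are assumptions chosen for illustration; they are not taken from any particular product or dataset.

```python
from statistics import mean, pstdev

# Hypothetical flow metadata records; the field names are assumptions.
flows = [
    {"bytes_out": 1200, "bytes_in": 48000, "duration_s": 2.0},
    {"bytes_out": 900, "bytes_in": 36000, "duration_s": 1.5},
    {"bytes_out": 500000, "bytes_in": 2000, "duration_s": 4.0},
]

def create_features(flow):
    """Feature creation: derive model inputs from raw metadata."""
    total = flow["bytes_out"] + flow["bytes_in"]
    return {
        "bytes_per_second": total / flow["duration_s"],
        "out_fraction": flow["bytes_out"] / total,
    }

def normalize(rows):
    """Data normalization: z-score each feature column."""
    names = list(rows[0])
    stats = {}
    for name in names:
        column = [row[name] for row in rows]
        stats[name] = (mean(column), pstdev(column))
    return [
        {
            # Guard against a zero standard deviation (constant column).
            n: (row[n] - stats[n][0]) / stats[n][1] if stats[n][1] else 0.0
            for n in names
        }
        for row in rows
    ]

features = [create_features(f) for f in flows]
normalized = normalize(features)
```

In this toy data the third flow sends far more than it receives, so its normalized out_fraction stands well apart from the other two, which is the kind of separation a downstream model can score.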

Historically, ML has relied on batch processing. With big data of all kinds, traditional data pipelines work reasonably well. Models are trained offline using historical and retrospective data. Later, they are deployed on data that has been recorded for analysis.

It works something like this: first, the team builds a heavily engineered data pipeline to transfer all the data into a huge data lake. Then, historical features are created by running preprocessing queries and scripts. Finally, the models are trained on the large collection of data. Once ready, the trained model is put into production, which requires translating each data processing step into an outward-facing application.
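
The batch workflow above can be reduced to a small sketch: fit once on the full historical dataset, freeze the result, then score recorded data. The z-score threshold model and the sample values are assumptions standing in for whatever model and data lake a real team would use.

```python
from statistics import mean, stdev

def train_offline(history):
    """Batch training: one pass over the full historical dataset."""
    return {"mean": mean(history), "std": stdev(history)}

def score_batch(model, recorded, threshold=3.0):
    """Apply the frozen model to data that was recorded for later analysis."""
    return [
        x for x in recorded
        if abs(x - model["mean"]) / model["std"] > threshold
    ]

# Hypothetical feature values pulled from the data lake,
# e.g. kilobytes transferred per flow.
history = [48.0, 52.0, 50.0, 49.0, 51.0, 50.0]
model = train_offline(history)
flagged = score_batch(model, [50.5, 49.2, 400.0])
```

Note that the model never changes after training. Any drift in the traffic means going back to the data lake and retraining, which is where the storage and processing costs come from.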

The cost of storing and processing heavy data such as network data can be prohibitive. (Heavy data is “big data” that requires specialized tools for storage and processing and is not stored in traditional database record formats.) This method of ML requires significant scale and resources. It is useful for model development and for predictive models with an extended time horizon.

However, as network traffic has increased, a new alternative has emerged: streaming ML. It uses a much smaller resource footprint while exceeding the performance requirements of the highest bandwidth networks. When combined with encrypted traffic analysis, it gives organizations a powerful tool that provides visibility into network threats. Historically, network traffic was examined using deep packet inspection (DPI), but as more of that traffic is now encrypted, DPI is becoming less and less useful. This has driven the market towards a new technology called Deep Packet Dynamics (DPD), which offers a rich set of metadata without payload inspection.
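
The small footprint of the streaming approach comes from scoring and learning one observation at a time instead of storing the data first. The sketch below illustrates the idea with a running z-score based on Welford's online algorithm; it is an assumption chosen for brevity, not the specific model any vendor uses, and the round trip times are made-up sample values.

```python
import math

class StreamingZScore:
    """Running mean/variance via Welford's algorithm; constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def score(self, x):
        """Standard deviations between x and the mean seen so far."""
        if self.n < 2:
            return 0.0  # not enough history to estimate a spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        """Fold one observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

def detect(stream, threshold=3.0):
    """Score each value as it arrives, then learn from it."""
    model = StreamingZScore()
    for x in stream:
        if model.score(x) > threshold:
            yield x
        model.update(x)

# Hypothetical per-flow round trip times in milliseconds.
rtts = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 95.0, 20.1]
anomalies = list(detect(rtts))
```

Only three numbers are kept per feature, no matter how much traffic has passed. One simplification to be aware of: this sketch also learns from the values it flags, which a production system would likely want to handle more carefully.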

DPD features include traffic characteristics such as producer/consumer ratio, jitter, RSTs, retransmissions, sequence of packet lengths and times (SPLT), byte distributions, connection setup time, and round trip time. These features are well suited to ML and effective in identifying patterns and anomalies that simpler approaches fail to detect. However, they cannot be calculated retrospectively; they must be captured in real time as the traffic passes through. This form of encrypted traffic analysis also preserves privacy by eliminating the need for intensive man-in-the-middle (MITM) decryption and inspection of traffic.
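
As an illustration of why these features must be captured as packets pass, the sketch below accumulates three of them per flow from header-level information only (timestamp, length, direction). The exact definitions are assumptions made for this example: producer/consumer ratio as (bytes out - bytes in) / total bytes, jitter as the mean absolute change between consecutive inter-arrival gaps, and SPLT as signed lengths paired with gaps for the first few packets. A real DPD implementation may define them differently.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FlowState:
    """Per-flow accumulator updated once per packet; payload is never read."""
    max_splt: int = 10
    bytes_out: int = 0
    bytes_in: int = 0
    last_ts: Optional[float] = None
    last_gap: Optional[float] = None
    jitter_sum: float = 0.0
    jitter_count: int = 0
    splt: list = field(default_factory=list)

    def observe(self, ts, length, outbound):
        if outbound:
            self.bytes_out += length
        else:
            self.bytes_in += length
        # Inter-arrival gap; undefined for the first packet of the flow.
        gap = None if self.last_ts is None else ts - self.last_ts
        if gap is not None and self.last_gap is not None:
            self.jitter_sum += abs(gap - self.last_gap)
            self.jitter_count += 1
        if len(self.splt) < self.max_splt:
            signed = length if outbound else -length  # sign encodes direction
            self.splt.append((signed, 0.0 if gap is None else gap))
        self.last_ts, self.last_gap = ts, gap

    def features(self):
        total = self.bytes_out + self.bytes_in
        return {
            "pcr": (self.bytes_out - self.bytes_in) / total if total else 0.0,
            "jitter": (self.jitter_sum / self.jitter_count
                       if self.jitter_count else 0.0),
            "splt": list(self.splt),
        }

# Hypothetical packets: (timestamp in seconds, length in bytes, outbound?)
packets = [(0.00, 100, True), (0.10, 1500, False),
           (0.30, 1500, False), (0.40, 60, True)]
state = FlowState()
for ts, length, outbound in packets:
    state.observe(ts, length, outbound)
result = state.features()
```

Timing-based values such as the gaps and jitter exist only at the moment of capture. Once the packets are gone, a stored byte count cannot recover them, which is why these features pair naturally with a streaming model.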

By combining streaming ML with DPD, SOC and NOC teams can more easily detect advanced threats in real time. This approach can, for example, uncover ongoing ransomware attacks on the network, including lateral movement, as well as advanced phishing and watering hole attacks, insider threat activity, and much more. It also eliminates encryption blindness and restores visibility for network defenders.

By 2025, almost all network traffic will be encrypted. As encryption grows (along with new threats), enterprises must rely more on streaming ML (including ML engines) and analysis of encrypted traffic to gain the necessary visibility into abnormal traffic. Without it, attackers will continue to bypass traditional security mechanisms, hide in encryption, and carry out attacks.

Charles J. Kaplan