Seasonality in Malicious Machine-Machine Communication
Is AI essential for DLP? On sequence analysis for cyber security
Target’s infamous breach, in which some 40 million credit card numbers were stolen and sold on the darknet (fall 2013), marked one of the most influential cyber attacks of all time. More recently, another high-profile breach was documented at Home Depot. That breach leveraged a covert exfiltration channel, as detailed in a collaborative paper between Ben Gurion University and Akamai Technologies. The paper describes various Domain Name System (DNS) vulnerabilities, as well as the architecture of a Command and Control server (C&C, also called C2).
The most damaging cyber attacks are carried out through C&C. In a typical scenario, an attack starts by infecting a single computer and later spreads to the entire network. In this context, the C&C server sends control commands to the infected computers. Such commands often trigger the theft of credentials, documents, and other types of data (see Figure 1, top left).
Recently I had the opportunity to build an important component in a Data Loss Prevention (DLP) system that identifies and mitigates such threats. In what capacity? I am part of Tikal Knowledge, a software consultancy. Our customers are software development companies who seek the help of experts; in this case, the customer is an Israeli unicorn. I was very excited to take on this engagement, as this specific customer owns a huge and well-organized data lake.
A key component of a malicious C&C server is often a keep-alive signal (see Figure 1, bottom left). The malware uses the keep-alive signal to keep track of its ‘inventory’ of breached machines (much as an anti-virus sends a keep-alive signal to track its install base and collect telemetry data). So I decided to focus on C&C beacon behavior, that is, a signal repeated at a fixed (or semi-fixed) time interval.
- Data. I did not have a large labeled set of C&C communication. To address this, I decided to simulate beacon behavior with two parameters: periodicity and jitter (see Figure 2). For example, with periodicity=50m and jitter=30%, the sleep time between signals is 50 minutes, with a 30% jitter factor added to the time window between fires.
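A minimal sketch of such a simulator in Python (the function name, the uniform jitter model, and the default parameters are my assumptions for illustration, not the project’s exact code):

```python
import random

def simulate_beacon(periodicity_min=50.0, jitter=0.30, n_signals=100, seed=42):
    """Generate beacon timestamps (in minutes): a fixed sleep time
    plus a random jitter factor, mimicking malware that perturbs its
    period to evade naive periodicity detection."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_signals):
        # jitter of +/-30% around the base period (uniform, by assumption)
        gap = periodicity_min * (1 + rng.uniform(-jitter, jitter))
        t += gap
        times.append(t)
    return times

timestamps = simulate_beacon()
```

With periodicity=50 and jitter=0.30, every gap between consecutive fires lands in the 35–65 minute range, which is exactly the kind of smeared signal that trips up a plain spectral approach.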
- Preparing test data. This meant picking customers from among the company’s hundreds of customers, all of whom send telemetry data for all their agents. We finally picked three customers of various sizes: a huge telecommunications company, a medium-size hotel network, and a service company that was convenient to approach from the product-management perspective (a semi design partner).
- The last mile of pushing ideas to production, and specifically prioritizing engineering effort over non-data-driven quick wins. Here, the stakeholders (both engineering managers and product) are veterans of pushing anomaly detection algorithms to production. They had a creative idea about the specific tool I was to use: an open-source framework named Yurita (based on a Scala Jupyter notebook deployed to a Spark cluster). Put simply, Yurita bridges the research-to-production gap by enforcing a rigid API (rather than allowing any library out there), so the pipeline is production-grade from the moment it is first coded. My role, in effect, was to ‘beta test’ this concept.
- Applying algorithms at high velocity (telemetry from tens of millions of agents), huge volume (data is kept forever), and large variety (process creation, network events, and some 50 other event types). Fortunately, the customer has adopted the most modern AWS architecture and tools, including a proper data lake (S3) and cloud computing data infrastructure (including Spark on EMR and Presto).
- Cost. To balance need and price, I used Spark on EMR. In a typical experiment I used 21 m5.8xlarge instances (a total of 672 cores and 2,688 GiB of RAM), at about $38/hour on-demand. A typical run of Yurita would perform detection on a single day of traffic for a medium-size customer in about two hours.
To this end, I tested a few seasonality algorithms, including FFT, which had been fruitful in a previous seasonality project for cyber security that I led (see Figure 2). In the case of beacons, it turned out that FFT does not perform well on real-life data, as the seasonality signal is often jittered deliberately by cyber criminals (see Figure 3b). Instead, I built the following algorithm: first, identify C&C candidates by calculating the mean and standard deviation of the count of DNS requests per time window (see Figure 3c). Then, pass to the next step only candidates whose standard deviation is below a threshold. Finally, apply URL reputation to clean up the data.
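The candidate-filtering step can be sketched in plain Python (the windowing scheme, the event layout, and the threshold value here are illustrative assumptions, not the production Yurita pipeline):

```python
import statistics
from collections import Counter

def beacon_candidates(dns_events, window_min=60, std_threshold=1.0):
    """dns_events: list of (timestamp_in_minutes, domain) pairs.
    Count requests per domain per time window, then keep only domains
    whose per-window counts have a low standard deviation -- a steady
    request rate is the fingerprint of a beacon."""
    counts = Counter()
    for ts, domain in dns_events:
        counts[(domain, int(ts // window_min))] += 1
    last_window = int(max(ts for ts, _ in dns_events) // window_min)
    candidates = {}
    for domain in {d for _, d in dns_events}:
        # zero-fill empty windows so bursty (non-beacon) traffic is penalized
        series = [counts.get((domain, w), 0) for w in range(last_window + 1)]
        mu, sigma = statistics.mean(series), statistics.pstdev(series)
        if sigma <= std_threshold:
            candidates[domain] = (mu, sigma)
    return candidates
```

A domain that fires roughly once per window yields a near-constant series and passes; a domain that sends a single burst of requests yields a high standard deviation and is dropped.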
In other words, the idea is to reduce the data from all DNS requests of tens of millions of computers (the company’s install base) using a simple statistical threshold, and then apply more sophisticated algorithms, including STL and an RNN, to the narrow set of candidates. This approach, coupled with some domain expertise in the form of a few known lists (known-malicious C2 domains as well as known-legitimate ones such as Trusteer and WhatsApp), surfaced highly interesting findings at the site of our design partner (active viruses).
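The known-lists step can be as simple as a pair of set lookups before the heavier models run. A sketch (the function name and list contents are hypothetical placeholders, not real threat-intelligence data):

```python
# Hypothetical example lists -- in practice these come from threat
# intelligence feeds and an allow-list of legitimate beaconing software.
KNOWN_MALICIOUS = {"evil-c2.example"}
KNOWN_LEGIT = {"trusteer.com", "whatsapp.net"}

def triage(candidate_domains):
    """Split std-filtered beacon candidates using domain knowledge:
    known-legitimate domains are dropped, known-malicious domains are
    flagged immediately, and the rest go on to STL / RNN scoring."""
    flagged, to_model = [], []
    for domain in candidate_domains:
        if domain in KNOWN_LEGIT:
            continue          # legitimate beaconing software, ignore
        elif domain in KNOWN_MALICIOUS:
            flagged.append(domain)
        else:
            to_model.append(domain)
    return flagged, to_model
```

This keeps the expensive models focused on the genuinely unknown domains while giving immediate alerts on list hits.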
At this point I had a clear idea of the desired algorithm, and one thing that proved very handy was a tight connection with the product management director. We had a few iterations of brainstorming and model improvements, mainly to convey the message clearly. At the same time, the product director started to engage with the customer, drawing their attention to the malicious activity on their servers. This resulted in stronger business support for the entire project, which was actually what I was brought in for in the first place: to show that the methodology can be fruitful in real-life scenarios. The last phase of the project was turning that promise into a working production pipeline. This is where all the details ignored at the research phase have to be completed and the technical debt paid, including proper logs, monitoring, cost management, and identification of relevant prospective customers. However, with a working MVP, all those challenges are much easier to evaluate and address.
In the future, this analysis will be part of a DLP solution that identifies anomalous or suspicious data transfers. Other components of the system include the detection of DNS covert channels (data exfiltration through a DNS server, hidden in DNS requests), as well as suspicious TCP communication events. The solution will also alert the security staff to a possible data leak.
The code for the simulation and detection is here.
Seasonality in Malicious Machine-Machine Communication was originally published in Everything Full Stack on Medium.