Machine Learning applied to Cybersecurity and Hacking

Machine Learning (ML) is a hot topic these days…why is that ? And first of all, what is ML all about ? What are the mainstream applications of ML ? What are the potential applications to Cybersecurity ? Are Threat Actors likely to use ML in the near future, if so, in which context ? I try to explore these questions in this blog post, and more

What is ML

Although it is a hot topic nowadays, ML, like other IT-related research fields, is not a brand new technique

ML started to be theorized in the 50s with the first breakthroughs in Artificial Intelligence. The first mainstream application of ML was a spam filter in the 90s, while a more complex architecture of algorithms, known as deep learning, started to grow some years ago

Machine Learning is the science of programming computers so they can learn from data. The system learns from examples, also called the training set, using an ML algorithm. The accuracy of the learning output is measured, so the cycle can repeat and converge to a better result

The typical ML process is as follows :

1. Get data : acquire a big enough dataset/training set

2. Clean, Prepare & Manipulate Data : load your dataset from storage, do basic data analysis and visualization, transform the input data into numeric data

3. Train Model : run the data through a machine learning algorithm, use the model to make predictions, classifications and other tasks

4. Test Data : verify the accuracy of the output versus expectations

5. Improve : repeat the above cycle trying to improve the accuracy
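The steps above can be sketched in a few lines of Python. This is only a toy illustration using the standard library: the dataset is synthetic, the labels "normal" and "attack" are made up, and a simple nearest-centroid classifier stands in for a real ML algorithm

```python
import random

def make_dataset(n_per_class, seed=0):
    """Get data: build a synthetic labeled dataset of (features, label) pairs."""
    rng = random.Random(seed)
    data = []
    for label, center in (("normal", (0.0, 0.0)), ("attack", (3.0, 3.0))):
        for _ in range(n_per_class):
            point = [rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)]
            data.append((point, label))
    rng.shuffle(data)
    return data

def train(training_set):
    """Train model: compute one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))

def accuracy(model, test_set):
    """Test: fraction of held-out examples that are classified correctly."""
    hits = sum(1 for features, label in test_set if predict(model, features) == label)
    return hits / len(test_set)

dataset = make_dataset(n_per_class=200)
split = int(0.8 * len(dataset))  # 80% for training, 20% held out for testing
training_set, test_set = dataset[:split], dataset[split:]
model = train(training_set)
print(f"accuracy on held-out data: {accuracy(model, test_set):.2f}")
```

The "Improve" step would then consist of changing the features, the algorithm or its parameters, and measuring the accuracy again on the held-out data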

Machine learning is great for fluctuating environments as the ML system can adapt to new data, getting insights about complex problems and large amounts of data, problems for which classical approaches would require a lot of fine tuning or long lists of rules

There are three main categories of ML systems :

Supervised learning : the training set you feed to the algorithm includes the desired solutions, also called labels, and the machine will learn from it, it is “task driven”
Unsupervised learning : the training set is unlabeled, the machine tries to learn without a teacher, it is “data driven”
Reinforcement learning : the learning machine can observe the environment, select and perform actions, and get rewards in return. It must learn by itself what is the best strategy to get the most reward over time, the algorithm “learns to react to an environment”

There are many algorithms used in ML, and the trick is to select the best algorithm for the use case, to achieve the best accuracy in the shortest time, as computations can be very long in some cases

From a mathematical point of view, the main tools used by all these ML algorithms are a blend of :

Linear algebra

Analytic geometry

Matrix decompositions

Vector calculations

Optimization

Probability

Statistics

Why is ML a hot topic these days

As ML is not a brand new technique but dates back decades, one could wonder why ML is making the headlines nowadays. There are several good reasons for this :

Computers are more powerful than ever : running an ML algorithm on a huge dataset is not a small task. Today’s personal computers are so powerful that most ML algorithms can be run from home in a reasonable amount of time and with reasonable computing resources

A lot of data is available : there are many good datasets and public sources available on the Internet to feed ML algorithms

Many tools and libraries are available : ML tools, mostly based on Python, such as scikit-learn, Keras and TensorFlow, are easily available with many good tutorials online

Research is very active : advances in mathematics, new algorithms, are stronger than ever. Improving existing algorithms and designing new ones is a key topic, especially in fields such as deep learning

There are great books to discover this topic, often with ready-made code and solutions that you can download from GitHub as Jupyter Notebooks

Startups have been growing very fast, and there is now a strong ML ecosystem offering efficient solutions to common business issues

For these reasons, ML is now used at scale in many real life applications

What are the mainstream applications of ML ?

Regression

Forecasting a company’s revenue based on many performance metrics

Predicting stock prices

Clustering

Segmenting clients based on their purchases

Segmenting articles in a library

Convolutional Neural Network (CNN)

Analyzing images of products on a production line to automatically classify them

Detecting tumors in brain scans

Making your app react to voice commands using speech recognition, which requires processing audio samples

Recurrent Neural Network (RNN) and Natural Language Processing (NLP)

Automatically classifying news articles based upon text classification

Automatically flagging offensive comments on discussion forums

Summarizing long documents automatically

Creating a chatbot or a personal assistant

What are the current ML applications to Cybersecurity

Cybersecurity is a very wide discipline, and so is ML, so the possibilities are huge. Let’s have a look at some existing applications, which are not necessarily fully fledged ML applications, but embed ML technologies at some point

Darktrace : https://www.darktrace.com/en/

Darktrace was founded in 2013, in a collaboration between British intelligence agencies and Cambridge University mathematicians. It’s a solution based upon behavior analytics, evolving towards an active defense

As perimeter-based protection strategies are not always sufficient, Darktrace is focused on detecting and mitigating attacks in their earliest stages. It calls its detection piece the Enterprise Immune System, modeled after the human body’s defenses. It relies on unsupervised machine learning and does not look for signatures or known examples of malware. Without knowing in advance what to look for, it develops a pattern of “normal” for the network, then looks for anomalies
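To make the idea of "learning normal, then looking for anomalies" more concrete, here is a deliberately simple sketch. It is not Darktrace's method, which is proprietary and far more sophisticated: it only learns the mean and standard deviation of one metric and flags values that deviate too much. The traffic figures and the threshold of 3 standard deviations are assumptions chosen for illustration

```python
import statistics

def fit_baseline(samples):
    """Learn what "normal" looks like: mean and standard deviation of a metric."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomaly(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations away from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# Hypothetical outbound traffic of one device, in megabytes per hour
normal_hours = [12.1, 11.8, 12.5, 13.0, 11.9, 12.2, 12.7, 12.4]
baseline = fit_baseline(normal_hours)

for observed in (12.6, 95.0):
    status = "ANOMALY" if is_anomaly(observed, baseline) else "ok"
    print(f"{observed:6.1f} MB/h -> {status}")
```

No labeled attack is needed here: the baseline is built from unlabeled observations only, which is what makes the approach unsupervised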

Darktrace is based upon the following technologies :

Probabilistic approach and machine learning to analyse your network, data exchanges, user and devices behaviour. It looks for trends, can cluster and find subtle deviations

It can alert IT administrators of any discrepancies to normal behaviours; as such it will complement traditional firewall systems

The platform is based upon a physical appliance running on CentOS, together with a distributed database to store metadata (such as IP headers, Ethernet, app logs,…), and follows all interactions inside and outside your Company

Raw data is neither stored nor retained, for security and confidentiality reasons

If an anomaly is detected, an alert will be triggered and sent to the IT administrator with a copy of the metadata

An HTML interface (called Threat Visualizer) displays the network and device topology and offers user-friendly 3D navigation, allowing you to drill down into the activity of a specific device, and watch for actions such as calling out to a suspicious region or sending out data

https://www.youtube.com/watch?v=nc32nUDyxaI

Although a very interesting appliance, there are several drawbacks to take into account :

The very definition of the product requires visibility of all network traffic to get the full potential of the tool. In distributed and complex networks, this can be very expensive in deployment and configuration

It requires regular health checks, like all ML applications, which means maintaining the deployment and paying extra attention to false alerts

Splunk : Splunk User Behavior Analytics

Splunk UBA is a machine learning driven solution that helps organizations find hidden threats and anomalous behavior across users, devices, and applications. Its data science-driven approach produces actionable results with risk ratings and supporting evidence, augmenting SOC analysts’ existing techniques

Splunk User Behavior Analytics not only captures the footprint of threat actors as they traverse enterprise, cloud, and mobile environments, but also runs them through its advanced machine learning algorithms to baseline, detect deviations and find anomalies continuously. These aberrations are then stitched into a meaningful sequence over time using pattern detection and advanced correlation to reveal the actual kill chain, which is not only comprehensible but also immediately actionable

Splunk is based upon the following technologies :

Big data platform

Peer group analytics

Unsupervised machine learning algorithms

Behaviour analytics

Cylance : https://www.cylance.com/en-us/index.html

Cylance Inc. is a software firm that develops antivirus programs, to block computer viruses or malware before they have an effect on a user’s computer. Cylance has been described as the first company to apply artificial intelligence and machine learning to cybersecurity

In February 2019, the company was acquired by BlackBerry Limited

In 2017, the Cylance team released an excellent, must read, E-book to train cybersecurity professionals : http://shorturl.at/ijzM6

Some of the key concepts used for antivirus detection are explained in this value added document

A more extensive view of ML applications to cybersecurity

Network protection

Regression to predict the network packet parameters and compare them with the normal ones

Classification to identify different classes of network attacks such as scanning and spoofing

Clustering for forensic analysis

Endpoint protection

Regression to predict the next system call for executable process and compare it with real ones

Classification to divide programs into such categories as malware, spyware and ransomware

Clustering for malware protection on secure email gateways (e.g., to separate legal file attachments from outliers)

Application security

Regression to detect anomalies in HTTP requests (for example, XXE and SSRF attacks and auth bypass)

Classification to detect known types of attacks like injections (SQLi, XSS, RCE, etc.)

Clustering user activity to detect DDOS attacks and mass exploitation

User behaviour

Regression to detect anomalies in User actions (e.g., logins at unusual times)

Classification to group different users for peer-group analysis

Clustering to separate groups of users and detect outliers

Process behaviour

Regression to predict the next user action and detect outliers such as credit card fraud

Classification to detect known types of fraud

Clustering to compare business processes and detect outliers

What Black Hat Hackers and Threat Actors could do with ML

There could be many applications for hackers and offensive security teams. Below are a few examples

Breaking captcha : http://shorturl.at/cdfI0. The captcha protection is broken using computer vision and image classification ML techniques

Keystroke detection and password guessing : http://shorturl.at/ruwAX. By listening to the keystrokes, the model is able to detect the key pressed with good reliability, and hence guess passwords and other credentials

Automated spear phishing : http://shorturl.at/gjnrE. This technique allows attackers to target high profile accounts on Twitter

A Machine Learning Case Study : Network intrusion detection using CNN

In this long section, we are going to follow a realistic application of ML, using a Convolutional Neural Network (CNN)

Nowadays, an enormous number of attacks over the Internet put our information continuously at risk. Intrusion Detection Systems (IDS) are used as a second line of defense. They observe suspicious actions in the network to detect attacks

Intrusion detection in Networks requires being able to identify the typical signatures of intruders, without confusing them with regular anomalies or normal traffic

In an attempt to find known attacks or unusual behavior, IDS traditionally inspect the contents of every packet. The problem with packet inspection, however, is that it is hard, or even impossible, to perform at speeds of multiple Gigabits per second. Moreover, the quantity of data is usually so large that the analysis becomes impractical

For high-speed lines, it is therefore important to investigate alternatives to packet inspection. One option is flow-based intrusion detection. With such an approach, the communication patterns within the network are analyzed, instead of the contents of individual packets
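As an illustration of the difference between packets and flows, the sketch below groups packets sharing the same 5-tuple (source address, source port, destination address, destination port, protocol) into one flow record. The packet records are made up, and this is a simplification: a real flow tool such as argus builds bidirectional flows and many more attributes

```python
from collections import defaultdict

# Hypothetical packet records:
# (timestamp, src_ip, src_port, dst_ip, dst_port, protocol, size_in_bytes)
packets = [
    (0.00, "10.0.0.5", 51000, "93.184.216.34", 443, "tcp", 517),
    (0.02, "10.0.0.5", 51000, "93.184.216.34", 443, "tcp", 1460),
    (0.05, "10.0.0.7", 40222, "10.0.0.1", 53, "udp", 74),
    (0.30, "10.0.0.5", 51000, "93.184.216.34", 443, "tcp", 1460),
]

def aggregate_flows(packets):
    """Summarize packets into unidirectional flows keyed by their 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "start": None, "end": None})
    for ts, src, sport, dst, dport, proto, size in packets:
        flow = flows[(src, sport, dst, dport, proto)]
        flow["packets"] += 1
        flow["bytes"] += size
        flow["start"] = ts if flow["start"] is None else min(flow["start"], ts)
        flow["end"] = ts if flow["end"] is None else max(flow["end"], ts)
    return dict(flows)

flows = aggregate_flows(packets)
for key, stats in flows.items():
    duration = stats["end"] - stats["start"]
    print(key, stats["packets"], "packets,", stats["bytes"], "bytes,", f"{duration:.2f}s")
```

The payload of the packets is never looked at: only who talks to whom, how much and for how long, which is far less data to analyze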

This Case Study uses this approach, selecting only the features related to the network flow. Because of their ability to handle huge amounts of data and identify patterns, deep learning techniques are a tool of choice

So, we are going to go through some recent research to showcase the ability of deep learning for IDS. We will use the Research Study available here : https://github.com/CharlesMure/cassiope-NIDS

The overall principle will be as per the flow chart below. The target is to collect data from our Network, and compare it with the training set, to be able to predict if our Network is free from cyber threats. We will be able to categorize the attacks per usual types

Basically, the software argus will collect data from our Network flow

The ML model will perform the following tasks :

Process the data from our network flow
Train using an existing dataset
Categorize our network flow

The training dataset will be downloaded from the University of New South Wales : The UNSW-NB15 data set description (adfa.edu.au)

This dataset was obtained by recording Network data and detecting suspicious flows of data (attacks). These attacks are categorized as per the tables below. They can be categorized because each type of attack has a typical Network signature (number of connections to the same IP, packet size…)

Along the way, we will be using some well-known ML libraries :

scikit-learn : https://scikit-learn.org/stable/

It is a toolset for ML, in Python

Simple and efficient tools for predictive data analysis

Accessible to everybody, and reusable in various contexts

Built on NumPy, SciPy, and matplotlib

TensorFlow : https://github.com/tensorflow/tensorflow

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications

Keras : https://keras.io/

Keras is an API. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages

“Keras is to Deep Learning what Ubuntu is to Operating Systems”

So, let’s dive into our Case Study

Data collection from our Network flow

So, we are going to use argus : openargus – Using Argus

Argus processes packet data and generates summary network flow data. If you have packets, and want to know something about what’s going on, argus is a great way of looking at aspects of the data that you can’t readily get from packet analyzers. How many hosts are talking, who is talking to whom, how often, is one address sending all the traffic, are they doing the bad thing? Argus is designed to generate network flow status information that can answer these and a lot more questions that you might have

Many sites use argus to generate audits from their live networks. argus can run in an end-system, auditing all the network traffic that the host generates and receives, and it can run as a stand-alone probe, running in promiscuous mode, auditing a packet stream that is being captured and transmitted to one of the system’s network interfaces. This is how most universities and enterprises use argus, monitoring a port mirrored stream of packets to audit all the traffic between the enterprise and the Internet. The data is collected and then stored in what is described as an argus archive, or a MySQL database

From there, the data is available for forensic analysis, or anything else one may want to do with the data, such as performance analysis, or operational network management

To install argus, we need to download both argus and argus-clients : openargus – Getting Argus

You get two archives :

argus-3.0.8.2.tar.gz

argus-clients-3.0.8.2.tar.gz

You just need to extract these files with the Linux command : tar -xvf myfile

argus will also need additional packages

To install and run argus, follow the steps : openargus – Documentation

You can start collecting network data with argus ! Here’s an example of captured data

Installation of ML packages

Then, you can download the necessary files from https://github.com/CharlesMure/cassiope-NIDS

You will get some Python scripts

model_train.py

To train the model, you just need to launch the script model_train.py

python3 model_train.py

This script is the “learning” component, that will train using the dataset

It will train using the CNN model. Below is an example of a CNN model structure, applied to image recognition (interpreting a handwritten “3” as the correct digit)

The corresponding Python script is the following. It includes several layers, with convolutions, and the activation function “ReLU”
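To give an intuition of what a convolution layer followed by ReLU actually computes, here is a minimal sketch in plain Python. It is not the model from the repository, which is built with Keras and learns its kernels during training: here the input signal and the kernel are hand-picked for illustration

```python
def conv1d(signal, kernel):
    """1D convolution as used in CNN layers (sliding dot product, no padding)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]

def relu(values):
    """ReLU activation: keep positive values, replace negative ones with zero."""
    return [max(0.0, v) for v in values]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
edge_kernel = [-1.0, 1.0]  # hand-picked kernel that responds to rising edges

raw = conv1d(signal, edge_kernel)
feature_map = relu(raw)
print("convolution :", raw)
print("after ReLU  :", feature_map)
```

The kernel slides over the input and produces a high value where the pattern it encodes is present. ReLU then discards the negative responses. A CNN stacks many such layers, and training consists of finding the kernel values (the “weights”) that make the final prediction accurate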

Once the learning phase is done, the results are recorded into a file together with the associated CNN “weights”, for re-use in the next steps of prediction

monitoring.py

This script is going to extract features from the logs collected by argus

These features are useful to check if the captured packets are suspicious or not

Some features that can be extracted by argus are the ones below. They will be necessary for our model

This allows us to extract the necessary features for our ML model

First, we open one argus log

The features are calculated and extracted “on the go”, in a Python list : for example with the feature ct_dst_sport_ltm

An additional processing step consolidates the data
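As an example of how such a feature can be computed on the go, here is a sketch for ct_dst_sport_ltm. In the UNSW-NB15 documentation, this feature counts the connections sharing the same destination address and source port among the last 100 connections. The code below is our own illustration, not the code of monitoring.py: the field names dstip and sport follow the dataset's naming, and counting the current flow itself is an assumption

```python
from collections import deque

def ct_dst_sport_ltm(flows, window=100):
    """For each flow, count the flows among the last `window` records
    (current one included) with the same destination address and source port."""
    recent = deque(maxlen=window)  # oldest keys are dropped automatically
    counts = []
    for flow in flows:
        key = (flow["dstip"], flow["sport"])
        recent.append(key)
        counts.append(sum(1 for other in recent if other == key))
    return counts

# Hypothetical flow records, in chronological order
flows = [
    {"dstip": "10.0.0.1", "sport": 4444},
    {"dstip": "10.0.0.1", "sport": 4444},
    {"dstip": "10.0.0.2", "sport": 4444},
    {"dstip": "10.0.0.1", "sport": 4444},
]
print(ct_dst_sport_ltm(flows))
```

A high value means that the same source port keeps hitting the same destination in a short period, which is one of the typical signatures mentioned earlier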

DeepLInspect.py

Now, this script can “correlate” the extracted features from argus, with our training set output

First, it will load all necessary files

Then, the script will perform all necessary “correlations” to predict if our Network flow is suspicious or not. If so, it will classify it into the appropriate category, taken from the training dataset
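The last step, turning the raw output of a model into a category, can be sketched as follows. This is an illustration and not the code of DeepLInspect.py: the category names are those of the UNSW-NB15 dataset, but their order and the example scores are assumptions, as the real order depends on how the labels were encoded during training

```python
import math

# UNSW-NB15 categories; the order used here is hypothetical
CATEGORIES = [
    "Normal", "Fuzzers", "Analysis", "Backdoors", "DoS",
    "Exploits", "Generic", "Reconnaissance", "Shellcode", "Worms",
]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return the most likely category and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]

# Made-up raw scores for one network flow, one score per category
scores = [0.2, 0.1, -1.0, 0.0, 4.5, 1.2, 0.3, 0.8, -0.5, -2.0]
category, confidence = classify(scores)
print(f"{category} (probability {confidence:.2f})")
```

In practice, a threshold on the probability helps to decide when a prediction is too uncertain to raise an alert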

What are the next steps ?

From the prediction script, we could automate the detection and classification of Network flows as normal or suspicious, making our IDS a powerful tool

There are a number of limitations to bear in mind, however

First, to make good predictions, you will need a good Dataset. That means fresh data (difficult to get and classify, as this is quite a huge task) and a good balance between categories (our dataset above is from years ago, and we don’t have a good balance between categories). That also means that such a dataset needs to be representative of the latest types of attacks and their typical signatures. In addition, the data needs to be comprehensive (our dataset misses the timestamp, for example)
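Checking the balance between categories is easy to do before training. The sketch below uses made-up labels, not the real proportions of our dataset. It also computes per-category weights with the common heuristic n_samples / (n_classes * count), which is one possible way to compensate for an imbalance during training

```python
from collections import Counter

def class_balance(labels):
    """Share of each category in the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def balanced_weights(labels):
    """Weight each category inversely to its frequency."""
    counts = Counter(labels)
    total, n_classes = sum(counts.values()), len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

# Hypothetical, heavily imbalanced labels
labels = ["Normal"] * 90 + ["DoS"] * 9 + ["Worms"] * 1

for label, share in class_balance(labels).items():
    weight = balanced_weights(labels)[label]
    print(f"{label:8s} share={share:.2%} weight={weight:.2f}")
```

With such an imbalance, a model that always answers "Normal" would already be right 90% of the time, which is why accuracy alone can be misleading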

Secondly, an IDS machine learning model would need to dive deeper into the analysis, by capturing some parts of the packets (at least the headers), to understand which content and which applications are used by the suspicious packets. This would require much more data (“Big Data”) to be acquired, as well as huge repositories and hardware to perform the calculations

Third, an IDS that would be using a Hybrid approach, both signature based and Machine Learning based, would be quite powerful, as it would be able to classify the attacks with more accuracy, thus allowing to implement retaliation strategies against these attacks, to block them without doubt

Last but not least, to continue developing efficient IDS, there will be a strong need to deeply understand the attacks, the techniques used by them, in order to define the right features and categories. And, a strong understanding of the appropriate ML algorithms will be needed. Therefore a combined Expertise in Network Forensics and Machine Learning will be necessary

Recommendations for the future

With machine learning, cybersecurity systems can analyze patterns and learn from them to help prevent similar attacks and respond to changing behavior. It can help cybersecurity teams be more proactive in preventing threats and responding to active attacks in real time. It can reduce the amount of time spent on routine tasks and enable organizations to use their resources more strategically

As ML is based upon data, a must-have for efficient ML-based cybersecurity is to gather data in a structured way. The data must have complete, relevant and rich context, collected from every potential source, whether that is at the endpoint, on the network or in the cloud

It all starts with taking the right approach to data. One of the biggest challenges is getting data from the endpoint, network and cloud and normalizing it into one state, so that it can be used effectively for machine learning
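As a rough illustration of what normalizing into one state can look like, the sketch below maps records from three sources onto one common schema. All field names are hypothetical and do not correspond to any particular product. Timestamps are assumed to be Unix epoch seconds in every source, which is rarely the case in real life

```python
from datetime import datetime, timezone

def normalize(source, record):
    """Map a source-specific record onto one common schema."""
    if source == "endpoint":
        ts, entity, event = record["time"], record["hostname"], record["action"]
    elif source == "network":
        ts, entity, event = record["ts"], record["src_ip"], record["proto"]
    elif source == "cloud":
        ts, entity, event = record["eventTime"], record["principal"], record["eventName"]
    else:
        raise ValueError(f"unknown source: {source}")
    return {
        "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(),
        "source": source,
        "entity": entity,
        "event": event,
    }

# Made-up records, one per source
raw_records = [
    ("endpoint", {"time": 1600000000, "hostname": "laptop-42", "action": "process_start"}),
    ("network", {"ts": 1600000005, "src_ip": "10.0.0.5", "proto": "tcp"}),
    ("cloud", {"eventTime": 1600000010, "principal": "alice", "eventName": "ConsoleLogin"}),
]
events = [normalize(source, record) for source, record in raw_records]
for event in events:
    print(event)
```

Once every event has the same shape, it can be sorted by time, joined across sources and turned into features for an ML model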

When it comes to cybersecurity, the potential for machine learning to have a dramatic and lasting impact is real. But only for companies that are forward-thinking enough to take care of their data first

Additional reading

A very good synthesis and potential applications of ML to cybersecurity : https://towardsdatascience.com/machine-learning-for-cybersecurity-101-7822b802790b

A huge ML repository dedicated to cybersecurity : https://github.com/jivoi/awesome-ml-for-cybersecurity

The challenges of data gathering for an efficient ML application to cybersecurity : https://www.securityroundtable.org/the-growing-role-of-machine-learning-in-cybersecurity/
