Machine Learning (ML) is a hot topic these days. Why is that ? And first of all, what is ML all about ? What are the mainstream applications of ML ? What are the potential applications to Cybersecurity ? Are Threat Actors likely to use ML in the near future and, if so, in which context ? I explore these questions, and more, in this blog post
What is ML
Although it is a hot topic nowadays, ML, like many other IT-related research fields, is not a brand new technique
ML was first theorized in the 1950s, with the first breakthroughs in Artificial Intelligence. The first mainstream application of ML was the spam filter in the 1990s, while a more complex architecture of algorithms, known as deep learning, started to grow some years ago
Machine Learning is the science of programming computers so they can learn from data. The system learns from examples, called the training set, using an ML algorithm. The accuracy of the learning output is measured, so the cycle can repeat and converge towards a better result
The typical ML process is as follows :
# Get data : acquire a big enough dataset/training set
# Clean, Prepare & Manipulate Data : load your dataset from storage, do basic data analysis and visualization, transform the input data into numeric data
# Train Model : run the data through a machine learning algorithm, use the model to make predictions, classifications and other tasks
# Test Data : verify the accuracy of the output versus expectations
# Improve : repeat the above cycle trying to improve the accuracy
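The steps above can be sketched in a few lines with scikit-learn. This is only a minimal illustration under stated assumptions: the dataset is synthetic (generated by `make_classification`), the model is a plain logistic regression, and the function name `run_workflow` is made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def run_workflow(seed: int = 0) -> float:
    # 1. Get data: a synthetic, labeled dataset stands in for a real one.
    X, y = make_classification(
        n_samples=1000, n_features=10, n_informative=5,
        n_clusters_per_class=1, class_sep=2.0, random_state=seed,
    )

    # 2. Prepare: hold out a test set, then scale the features using
    #    statistics computed on the training part only.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # 3. Train the model.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # 4. Test: compare predictions with the held-out labels.
    return accuracy_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    # 5. Improve: in practice, change features or model and run again.
    print(f"test accuracy: {run_workflow():.3f}")
```

Keeping the test set out of the scaling and training steps is what makes the measured accuracy a fair estimate of how the model behaves on new data.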
Machine learning is well suited to fluctuating environments, since an ML system can adapt to new data. It is also good at extracting insights from complex problems and large amounts of data, and at handling problems for which classical approaches would require a lot of fine tuning or long lists of rules
There are three main categories of ML systems :
- Supervised learning : the training set you feed to the algorithm includes the desired solutions, also called labels, and the machine learns from them. It is “task driven”
- Unsupervised learning : the training set is unlabeled and the machine tries to learn without a teacher. It is “data driven”
- Reinforcement learning : the learning machine can observe the environment, select and perform actions, and get rewards in return. It must learn by itself the best strategy to get the most reward over time. The algorithm “learns to react to an environment”
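Reinforcement learning is the least intuitive of the three, so a toy example may help. The sketch below uses an epsilon-greedy strategy on a "multi-armed bandit": the environment is reduced to a list of hypothetical reward probabilities, one per action, and the learner has to discover which action pays best. The function name and all the numbers are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Return the estimated value of each action after `steps` trials."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running mean reward per action

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(values)), key=values.__getitem__)

        # The environment answers with a reward of 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Incremental update of the running mean for that action.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values


if __name__ == "__main__":
    estimates = epsilon_greedy([0.2, 0.5, 0.8])
    best = max(range(len(estimates)), key=estimates.__getitem__)
    print("estimated values:", [round(v, 2) for v in estimates])
    print("best action:", best)
```

No labels are given at any point: the only feedback is the reward, which is what separates this category from supervised learning.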
Many algorithms are used in ML, and the trick is to select the best one for the use case, to achieve the best accuracy in the shortest time, as computations can be very long in some cases
From a mathematical point of view, the main tools used by all these ML algorithms are a blend of :
Linear algebra
Analytic geometry
Matrix decompositions
Vector calculations
Optimization
Probability
Statistics
Why is ML a hot topic these days
As ML is not a brand new technique but dates back decades, one could wonder why ML is making the headlines nowadays. There are several good reasons for this :
Computers are more powerful than ever : running an ML algorithm on a huge dataset is not a small task. Today’s personal computers are powerful enough that many ML algorithms can be run from home with a reasonable amount of time and computing resources
A lot of data is available : there are many good datasets and public sources available on the Internet to feed ML algorithms
Many tools and libraries are available : ML tools, mostly based on Python, such as scikit-learn, Keras and TensorFlow, are easily available, with many good tutorials online
Research is very active : advances in mathematics and new algorithms are more numerous than ever. Improving existing algorithms and designing new ones is a key topic, especially in fields such as deep learning
There are great books to discover this topic, often with ready-made code and solutions that you can download from GitHub as Jupyter Notebooks
Startups have been growing very fast, and there is now a strong ML ecosystem offering efficient solutions to common business issues
For these reasons, ML is now used at scale in many real life applications
What are the mainstream applications of ML ?
Regression
Forecasting a company’s revenue based on many performance metrics
Predicting stock prices
Clustering
Segmenting clients based on their purchases
Segmenting articles in a library
Convolutional Neural Network (CNN)
Analyzing images of products on a production line to automatically classify them
Detecting tumors in brain scans
Making your app react to voice commands using speech recognition, which requires processing audio samples
Recurrent Neural Network (RNN) and Natural Language Processing (NLP)
Automatically classifying news articles based upon text classification
Automatically flagging offensive comments on discussion forums
Summarizing long documents automatically
Creating a chatbot or a personal assistant
What are the current ML applications to Cybersecurity
Cybersecurity is a very wide discipline, and so is ML, so the possibilities are huge. Let’s have a look at some existing applications, which are not necessarily fully fledged ML applications, but embed ML technologies at some point
Darktrace : https://www.darktrace.com/en/
Darktrace was founded in 2013, in a collaboration between British intelligence agencies and Cambridge University mathematicians. It’s a solution based upon behavior analytics, evolving towards an active defense
As perimeter-based protection strategies are not always sufficient, Darktrace is focused on detecting and mitigating attacks in their earliest stages. It calls its detection piece the Enterprise Immune System, modeled after the human body’s defenses. It uses unsupervised machine learning : rather than looking for signatures or known examples of malware, it develops a pattern of “normal” for the network without being told what to look for, then looks for anomalies
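The idea of learning a pattern of “normal” and then looking for anomalies can be illustrated with a deliberately simple statistical baseline. This is not Darktrace's algorithm, which is not described here; it is a minimal sketch assuming a single hypothetical metric (megabytes sent per hour by one device) and a z-score threshold.

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Learn a 'normal' profile (mean and standard deviation) from history."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from normal."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical history: megabytes sent per hour by one device.
history = [12.1, 11.8, 12.5, 13.0, 11.9, 12.2, 12.7, 12.4, 11.6, 12.9]
baseline = fit_baseline(history)

for observed in (12.3, 48.0):
    print(observed, "anomalous" if is_anomalous(observed, baseline) else "normal")
```

No labeled attack is needed to build the baseline, which is the point of the unsupervised approach. A real product would model many metrics per user and device at once.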
Darktrace is based upon the following technologies :
A probabilistic approach and machine learning to analyse your network, data exchanges, and user and device behaviour. It looks for trends, and can cluster and find subtle deviations
It can alert IT administrators to any deviation from normal behaviour; as such, it complements traditional firewall systems
The platform is based upon a physical appliance running on CentOS, together with a distributed database to store metadata (such as IP headers, Ethernet, app logs,…), and follows all interactions inside and outside your Company
Raw data is neither stored nor retained, for security and confidentiality reasons
If an anomaly is detected, an alert is triggered and sent to the IT administrator with a copy of the metadata
An HTML interface (called Threat Visualizer) displays the network and device topology and offers user-friendly 3D navigation, allowing you to drill down into the activity of a specific device, and watch for actions such as calling out to a suspicious region or sending out data
Although this is a very interesting appliance, there are several drawbacks to take into account :
By design, the product needs visibility of all network traffic to reach its full potential. In distributed and complex networks, this can make deployment and configuration very expensive
Like all ML applications, it requires regular health checks, which means maintaining the deployment and taking extra care with false alerts
Splunk : Splunk User Behavior Analytics
Splunk UBA is a machine learning driven solution that helps organizations find hidden threats and anomalous behavior across users, devices, and applications. Its data science-driven approach produces actionable results with risk ratings and supporting evidence, augmenting SOC analysts’ existing techniques
Splunk User Behavior Analytics not only captures the footprint of threat actors as they traverse enterprise, cloud, and mobile environments, but also runs them through its advanced machine learning algorithms to baseline, detect deviations and find anomalies continuously. These aberrations are then stitched into a meaningful sequence over time using pattern detection and advanced correlation to reveal the actual kill chain, which is not only comprehensible but also immediately actionable
Splunk is based upon the following technologies :
Big data platform
Peer group analytics
Unsupervised machine learning algorithms
Behaviour analytics
Cylance : https://www.cylance.com/en-us/index.html
Cylance Inc. is a software firm that develops antivirus programs, to block computer viruses or malware before they have an effect on a user’s computer. Cylance has been described as the first company to apply artificial intelligence and machine learning to cybersecurity
In February 2019, the company was acquired by BlackBerry Limited
In 2017, the Cylance team released an excellent, must read, E-book to train cybersecurity professionals : http://shorturl.at/ijzM6
Some of the key concepts used for antivirus detection are explained in this value added document
A more extensive view of ML applications to cybersecurity
Network protection
Regression to predict the network packet parameters and compare them with the normal ones
Classification to identify different classes of network attacks such as scanning and spoofing
Clustering for forensic analysis
Endpoint protection
Regression to predict the next system call of an executable process and compare it with the real one
Classification to divide programs into categories such as malware, spyware and ransomware
Clustering for malware protection on secure email gateways (e.g., to separate legitimate file attachments from outliers)
Application security
Regression to detect anomalies in HTTP requests (for example, XXE and SSRF attacks and auth bypass)
Classification to detect known types of attacks like injections (SQLi, XSS, RCE, etc.)
Clustering user activity to detect DDoS attacks and mass exploitation
User behaviour
Regression to detect anomalies in user actions (e.g., logging in at an unusual time)
Classification to group different users for peer-group analysis
Clustering to separate groups of users and detect outliers
Process behaviour
Regression to predict the next user action and detect outliers such as credit card fraud
Classification to detect known types of fraud
Clustering to compare business processes and detect outliers
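The regression entries in the list above share one idea: predict what a value should be under normal conditions, then compare the prediction with what is actually observed. Here is a minimal sketch of that idea with NumPy, assuming hypothetical flow statistics (packets and bytes per flow) and a tolerance of three standard deviations of the residuals. All values are invented for illustration.

```python
import numpy as np

# Hypothetical "normal" flows: packets per flow and bytes per flow.
packets = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
n_bytes = np.array([5200, 9900, 15300, 19800, 25100, 30400, 34700, 40200], dtype=float)

# Fit n_bytes ~ slope * packets + intercept on normal traffic only.
slope, intercept = np.polyfit(packets, n_bytes, deg=1)

# The spread of the residuals on normal traffic sets the tolerance.
residuals = n_bytes - (slope * packets + intercept)
tolerance = 3 * residuals.std()


def is_outlier(observed_packets, observed_bytes):
    """True when the observed bytes are far from what the regression predicts."""
    predicted = slope * observed_packets + intercept
    return bool(abs(observed_bytes - predicted) > tolerance)


print(is_outlier(45, 22600))   # close to the fitted line
print(is_outlier(45, 90000))   # far more bytes than expected for 45 packets
```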
What Black Hat Hackers and Threat Actors could do with ML
There could be many applications for hackers and offensive security teams. Below are a few examples
Breaking captcha : http://shorturl.at/cdfI0. The captcha protection is broken using computer vision and image classification ML techniques
Keystroke detection and password guessing : http://shorturl.at/ruwAX. By listening to the keystrokes, the model can identify each key with good reliability, and hence guess passwords and other credentials
Automated spear phishing : http://shorturl.at/gjnrE. This technique makes it possible to target high profile accounts on Twitter
A Machine Learning Case Study : Network intrusion detection using CNN
In this long section, we are going to follow a realistic application of ML, using a Convolutional Neural Network (CNN)
Nowadays, an enormous number of attacks over the Internet put our information continuously at risk. Intrusion Detection Systems (IDS) are used as a second line of defense. They observe suspicious actions in the network to detect attacks
Intrusion detection in networks requires identifying the typical signatures of intruders, without confusing them with regular anomalies or normal traffic
In an attempt to find known attacks or unusual behavior, IDS traditionally inspect the contents of every packet. The problem with packet inspection, however, is that it is hard, or even impossible, to perform at speeds of multiple gigabits per second. In addition, the quantity of data is usually so huge that the analysis becomes difficult in practice
For high-speed lines, it is therefore important to investigate alternatives to packet inspection. One option is flow-based intrusion detection. With this approach, the communication patterns within the network are analyzed, instead of the contents of individual packets
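To make the difference between packets and flows concrete, here is a minimal sketch that groups packets sharing the same 5-tuple (source, destination, ports, protocol) into one flow and keeps only counters. The packet records are hypothetical, and a real flow exporter such as argus records many more attributes.

```python
from collections import defaultdict

# Hypothetical packet records:
# (src_ip, dst_ip, src_port, dst_port, protocol, size_in_bytes)
packets = [
    ("10.0.0.5", "192.0.2.10", 51000, 443, "tcp", 1500),
    ("10.0.0.5", "192.0.2.10", 51000, 443, "tcp", 400),
    ("10.0.0.7", "192.0.2.20", 40000, 53, "udp", 80),
    ("10.0.0.5", "192.0.2.10", 51000, 443, "tcp", 1500),
]


def aggregate_flows(packets):
    """Group packets by 5-tuple and keep only per-flow counters."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


for key, stats in aggregate_flows(packets).items():
    print(key, stats)
```

Four packets collapse into two flow records, and the payload is never examined. That reduction is what makes the approach workable on high-speed links.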
This Case Study uses this approach, selecting only the features related to the network flow. Because of their ability to handle huge amounts of data and identify patterns, deep learning techniques are a tool of choice
So, we are going to go through some recent research to showcase the ability of deep learning for IDS. We will use the Research Study available here : https://github.com/CharlesMure/cassiope-NIDS
The overall principle will be as per the flow chart below. The target is to collect data from our Network, and compare it with the training set, to be able to predict if our Network is free from cyber threats. We will be able to categorize the attacks per usual types
Basically, the software argus will collect data from our Network flow
The ML model will perform the following tasks :
- Process the data from our network flow
- Train using an existing dataset
- Categorize our network flow
The training dataset will be downloaded from the University of New South Wales : The UNSW-NB15 data set description (adfa.edu.au)
This dataset was obtained by recording network data and detecting suspicious flows of data (attacks). These attacks are categorized as per the tables below, and can be categorized because each type of attack has a typical network signature (number of connections to the same IP, packet size…)
Along the way, we will be using some well-known ML libraries :
scikit-learn : https://scikit-learn.org/stable/
It is a toolkit for ML, in Python
Simple and efficient tools for predictive data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Tensorflow : https://github.com/tensorflow/tensorflow
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications
Keras : https://keras.io/
Keras is a high-level deep learning API written in Python. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages
“Keras is to Deep Learning what Ubuntu is to Operating Systems”
So, let’s dive into our Case Study
Data collection from our Network flow
So, we are going to use argus : openargus – Using Argus
Argus processes packet data and generates summary network flow data. If you have packets, and want to know something about what’s going on, argus is a great way of looking at aspects of the data that you can’t readily get from packet analyzers. How many hosts are talking, who is talking to whom, how often, is one address sending all the traffic, are they doing the bad thing? Argus is designed to generate network flow status information that can answer these and a lot more questions that you might have
Many sites use argus to generate audits from their live networks. argus can run in an end-system, auditing all the network traffic that the host generates and receives, and it can run as a stand-alone probe, running in promiscuous mode, auditing a packet stream that is being captured and transmitted to one of the system’s network interfaces. This is how most universities and enterprises use argus, monitoring a port mirrored stream of packets to audit all the traffic between the enterprise and the Internet. The data is collected and then stored in what is described as an argus archive, or a MySQL database
From there, the data is available for forensic analysis, or anything else one may want to do with the data, such as performance analysis, or operational network management
To install argus, we need to download both argus and argus-clients : openargus – Getting Argus
You get two archives :
argus-3.0.8.2.tar.gz
argus-clients-3.0.8.2.tar.gz
You just need to extract these archives with the Linux command tar, for example : tar -xvf argus-3.0.8.2.tar.gz
argus will also need additional packages
To install and run argus, follow the steps : openargus – Documentation
You can start collecting network data with argus ! Here’s an example of captured data
Installation of ML packages
Then, you can download the necessary files from https://github.com/CharlesMure/cassiope-NIDS
You will get some Python scripts
model_train.py
To train the model, you just need to launch the script model_train.py
python3 model_train.py
This script is the “learning” component, which trains on the dataset
It trains a CNN model. Below is an example of a CNN model structure applied to image recognition (interpreting a handwritten “3” as the correct number)
The corresponding Python script is the following. It includes several layers, with convolutions, and the activation function “ReLU”
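The two building blocks named above, convolution and ReLU, can be shown in isolation with NumPy. This is not the model from the repository. It is only a minimal sketch of what a single one-dimensional convolution layer computes, with a hand-picked kernel where a real CNN would learn the kernel values during training.

```python
import numpy as np


def conv1d(signal, kernel):
    """Slide the kernel over the signal and take a dot product at each position.

    This is "valid" cross-correlation, the operation deep learning
    libraries call convolution.
    """
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])


def relu(x):
    """ReLU activation: negative values become zero."""
    return np.maximum(x, 0)


signal = np.array([1.0, 2.0, 4.0, 3.0, 1.0])
kernel = np.array([-1.0, 0.0, 1.0])  # responds where values are rising

feature_map = relu(conv1d(signal, kernel))
print(feature_map)  # [3. 1. 0.]
```

The kernel gives a strong response where the input rises and ReLU removes the negative response where it falls. Stacking many such filters is how a CNN builds up pattern detectors.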
Once the learning phase is done, the results are recorded into a file together with the associated CNN “weights”, for re-use in the next steps of prediction
monitoring.py
This script is going to extract features from the logs collected by argus
These features are useful to check if the captured packets are suspicious or not
Some features that can be extracted by argus are the ones below. They will be necessary for our model
This makes it possible to extract the necessary features for our ML model
First, we open one argus log
The features are calculated and extracted on the fly into a Python list : for example, the feature ct_dst_sport_ltm
An additional processing step is applied to consolidate the data
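As an illustration of how such a feature can be computed on the fly, here is a minimal sketch for ct_dst_sport_ltm. It assumes the definition given in the UNSW-NB15 feature list (number of connections with the same destination address and source port among the last 100 connections). The record layout and field names are assumptions, and this is not the code from monitoring.py.

```python
from collections import deque


def ct_dst_sport_ltm(flows, window=100):
    """For each flow, count the flows among the last `window` records
    (current one included) sharing its destination address and source port."""
    recent = deque(maxlen=window)  # oldest entries drop out automatically
    counts = []
    for flow in flows:
        key = (flow["dstip"], flow["sport"])
        recent.append(key)
        counts.append(sum(1 for k in recent if k == key))
    return counts


# Hypothetical flow records, ordered by time.
flows = [
    {"dstip": "192.0.2.10", "sport": 51000},
    {"dstip": "192.0.2.10", "sport": 51000},
    {"dstip": "192.0.2.20", "sport": 40000},
    {"dstip": "192.0.2.10", "sport": 51000},
]
print(ct_dst_sport_ltm(flows))  # [1, 2, 1, 3]
```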
DeepLInspect.py
Now, this script can “correlate” the extracted features from argus, with our training set output
First, it will load all necessary files
Then, the script will perform all necessary “correlations” to predict whether our Network flow is suspicious or not. If so, it will classify it into the appropriate category, taken from the training dataset
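The last step, turning the model output into a category, can be sketched as follows. It assumes the model returns one probability per class. The category names follow the UNSW-NB15 description, but their order here, the threshold and the function name are assumptions, and the order must match the label encoding used during training.

```python
import numpy as np

# Category names as listed for UNSW-NB15; the order is an assumption.
CATEGORIES = [
    "Normal", "Analysis", "Backdoor", "DoS", "Exploits",
    "Fuzzers", "Generic", "Reconnaissance", "Shellcode", "Worms",
]


def classify(probabilities, threshold=0.5):
    """Map one row of class probabilities to a category name.

    Returns "Uncertain" when the best score is below `threshold`.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    best = int(np.argmax(probabilities))
    if probabilities[best] < threshold:
        return "Uncertain"
    return CATEGORIES[best]


# Hypothetical model output for one flow: most of the mass on index 3.
scores = [0.02, 0.01, 0.01, 0.85, 0.03, 0.02, 0.02, 0.02, 0.01, 0.01]
print(classify(scores))
```

Returning "Uncertain" for weak scores is a design choice: it leaves borderline flows to an analyst instead of forcing them into a category.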
What are the next steps ?
From the prediction script, we could automate the detection and classification of Network flows as normal or suspicious, making our IDS a powerful tool
There are a number of limitations to bear in mind, however
First, to make good predictions, you need a good dataset. That means fresh data (difficult to get and classify, as this is quite a huge task) and a good balance between categories (our dataset above is from years ago, and does not have a good balance between categories). It also means that the dataset needs to be representative of the latest types of attacks and their typical signatures. In addition, the data needs to be comprehensive (our dataset misses the timestamp, for example)
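Checking the balance between categories takes only a few lines. The sketch below counts the labels and derives weights inversely proportional to class frequency, one common way to compensate for imbalance during training. The label distribution is invented for illustration and does not reflect the actual dataset.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class inversely to its frequency.

    With this formula the average weight over all samples is 1.
    """
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Hypothetical, deliberately unbalanced labels.
labels = ["Normal"] * 90 + ["DoS"] * 8 + ["Worms"] * 2

print(Counter(labels).most_common())
print(class_weights(labels))
```

Rare categories get a large weight, so that mistakes on them cost more during training. This mitigates the imbalance but does not replace collecting more examples of the rare attacks.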
Secondly, an IDS machine learning model would need to go deeper into the analysis, by capturing some parts of the packets (at least the headers), to understand which content and which applications are used by the suspicious packets. This would require much more data (“Big Data”) to be acquired, as well as huge repositories and hardware to carry out the calculations
Third, an IDS using a hybrid approach, both signature based and Machine Learning based, would be quite powerful, as it would be able to classify the attacks with more accuracy, thus making it possible to implement retaliation strategies against these attacks and block them with confidence
Last but not least, to continue developing efficient IDS, there will be a strong need to deeply understand the attacks and the techniques they use, in order to define the right features and categories. A strong understanding of the appropriate ML algorithms will also be needed. Therefore, combined expertise in Network Forensics and Machine Learning will be necessary
Recommendations for the future
With machine learning, cybersecurity systems can analyze patterns and learn from them to help prevent similar attacks and respond to changing behavior. It can help cybersecurity teams be more proactive in preventing threats and responding to active attacks in real time. It can reduce the amount of time spent on routine tasks and enable organizations to use their resources more strategically
As ML is based upon data, a must-have for efficient ML-based cybersecurity is to gather data in a structured way. The data must have complete, relevant and rich context, collected from every potential source, whether that is at the endpoint, on the network or in the cloud
It all starts with taking the right approach to data. One of the biggest challenges is getting data from the endpoint, network and cloud and normalizing it into one state, so that it can be used effectively for machine learning
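Normalizing data “into one state” can be pictured as mapping every source onto a single schema. The sketch below is a minimal illustration: the `Event` fields and the two input formats are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Event:
    """Common schema shared by every source (field names are assumptions)."""
    timestamp: datetime
    source: str
    host: str
    action: str


def from_endpoint(record):
    # Hypothetical endpoint agent format: epoch seconds, "hostname", "event".
    return Event(
        timestamp=datetime.fromtimestamp(record["ts"], tz=timezone.utc),
        source="endpoint",
        host=record["hostname"].lower(),
        action=record["event"],
    )


def from_network(record):
    # Hypothetical network sensor format: ISO 8601 string, "src_ip", "verdict".
    return Event(
        timestamp=datetime.fromisoformat(record["time"]),
        source="network",
        host=record["src_ip"],
        action=record["verdict"],
    )


events = [
    from_endpoint({"ts": 1609459200, "hostname": "PC-01", "event": "login"}),
    from_network({"time": "2021-01-01T00:05:00+00:00", "src_ip": "10.0.0.5", "verdict": "blocked"}),
]
for event in sorted(events, key=lambda e: e.timestamp):
    print(event)
```

Once every record has the same fields and a timezone-aware timestamp, events from different sources can be sorted, joined and fed to the same model.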
When it comes to cybersecurity, the potential for machine learning to have a dramatic and lasting impact is real. But only for companies that are forward-thinking enough to take care of their data first
Additional reading
A very good synthesis and potential applications of ML to cybersecurity : https://towardsdatascience.com/machine-learning-for-cybersecurity-101-7822b802790b
A huge ML repository dedicated to cybersecurity : https://github.com/jivoi/awesome-ml-for-cybersecurity
The challenges of data gathering for an efficient ML application to cybersecurity : https://www.securityroundtable.org/the-growing-role-of-machine-learning-in-cybersecurity/