Introduction
The idea behind Machine Learning is dynamic automation: automation that improves itself over time by analyzing data. Depending on the variety of data and the use case, its applications are practically limitless. This post gives an overview of the status of Machine Learning in network security: how it helps build efficient Intrusion Detection Systems (IDS) such as PAYL, how attackers can leverage ML to break through these systems, and how we can improve these systems and make them more secure.
Machine Learning in Network Security
History
The concept of using Machine Learning for security is not new. Research papers on the subject go back more than 20 years. In those earlier days, however, it was much harder for Machine Learning to flourish than it is today. Some known difficulties were:
- Unavailability of Data: Machine Learning cannot work without data, and because internet access was limited, it was difficult to collect data to train models. As a result, even if a Ph.D. student had a great idea in the field, it was not easy to test and verify it, so the scope for experimentation was very limited.
- Feature building needed expert domain knowledge: When analyzing data in Machine Learning, we use a set of features that describe the entity the ML model is being trained on. For image processing, for example, features could be raw pixels, histograms, etc. Finding such features for network traffic required experts in the field.
- Security experts were skeptical about using ML for security: First, the models produced at that time were not very effective. Second, since this approach already needed expert domain knowledge to build features, the experts could always argue that it was more feasible to hard-code expert rules into the system than to depend on an ML model.
Current Situation
Nowadays, the situation is far better than before for the following reasons, although one obstacle remains:
- People are encouraging the use of ML: Because of technological advancements in the field, people now have faith in using Machine Learning even in sensitive fields like Security Analytics.
- Data is easily accessible: Organisations now support data logging, and it is easier to get labeled data to train ML models for experimentation. A variety of malware samples can also be found publicly, which boosts growth in the field.
- Wide scope for experimentation and iteration: Because of hardware advancements, even students can assemble high-end systems for learning and experimentation. As a result, we see more and more research in the area.
- Privacy concerns and sensitive data: This is still an issue. Because of privacy concerns, some kinds of standard datasets are still rare, for example datasets containing sensitive malware samples.
Future Scope
Unfortunately, as much as ML can be used to enforce security, it can also be used to breach security systems. Adversarial Machine Learning details how attackers leverage ML for malicious activities. Therefore, the future scope of ML for security includes making ML-based systems themselves more secure. Let’s understand this with the help of an example: PAYL, an IDS that uses ML to enforce network security.
PAYL - Payload Based Anomaly Detector
What is PAYL?
PAYL is a payload-based anomaly intrusion detection system. Some other IDS filter traffic based only on the header data present in packets, or on known signatures of malicious content. PAYL, on the other hand, is powered by Machine Learning and is based on the frequency of byte values in the payload of the packets, not just on headers or signatures. It trains a model on normal network traffic, so that once a normal traffic profile is built, it can filter out malicious packets that do not match that profile.
A detailed explanation of PAYL can be found in this research paper from Columbia University, where the authors mention that “In one case nearly 100% accuracy is achieved with 0.1% false-positive rate for port 80 traffic.”
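The byte-frequency idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the smoothing constant `alpha`, and the sample payloads are all assumptions made up for this example. The score is a simplified per-byte distance between a payload's byte frequencies and the mean and standard deviation observed in normal traffic.

```python
from collections import Counter
from math import sqrt


def byte_frequencies(payload):
    """Relative frequency of each of the 256 possible byte values."""
    if not payload:
        return [0.0] * 256
    counts = Counter(payload)
    return [counts[b] / len(payload) for b in range(256)]


def train_profile(normal_payloads):
    """Per-byte mean and standard deviation over normal payloads."""
    vectors = [byte_frequencies(p) for p in normal_payloads]
    n = len(vectors)
    mean = [sum(v[i] for v in vectors) / n for i in range(256)]
    std = [
        sqrt(sum((v[i] - mean[i]) ** 2 for v in vectors) / n)
        for i in range(256)
    ]
    return mean, std


def anomaly_score(payload, profile, alpha=0.001):
    """Sum of per-byte deviations from the profile, scaled by the spread.

    alpha is a hypothetical smoothing constant; it avoids division by zero
    for byte values whose frequency never varies in the training data.
    """
    mean, std = profile
    freq = byte_frequencies(payload)
    return sum(abs(f - m) / (s + alpha) for f, m, s in zip(freq, mean, std))


# Made-up sample data, for illustration only.
normal_traffic = [
    b"GET /index.html HTTP/1.1",
    b"GET /about.html HTTP/1.1",
    b"GET /images/logo.png HTTP/1.1",
]
profile = train_profile(normal_traffic)

typical = anomaly_score(b"GET /contact.html HTTP/1.1", profile)
unusual = anomaly_score(b"\x90" * 24 + b"\xcc\xeb\xfe", profile)
print(typical, unusual)
```

A payload whose score exceeds a chosen threshold would be treated as anomalous. Picking that threshold is a tuning decision that this sketch leaves out.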
How can attackers break it?
To avoid signature-based IDS, attackers already knew how to alter the content of their malicious traffic, a technique called polymorphism. But since PAYL is not based on signatures, polymorphism alone was not sufficient to evade it. So attackers stepped up their game and started to use the same kind of learning against it, in what is called a polymorphic blending attack.
Polymorphic blending attack: In this attack, an attacker alters the network packets so that they match the normal profile, and an ML-based IDS such as PAYL can’t distinguish them from normal packets. The attacker performs the following 3 steps:
- Learning the IDS normal profile: Find out the traffic profile that the IDS allows to pass through.
- Attack body encryption: Encrypt the malicious content so that the resulting content matches the normal profile and evades the IDS.
- Polymorphic decryptor: Append to the packets a decryptor that can decrypt the encrypted malicious content once it reaches the target.
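From a defender's point of view, it can be useful to reproduce these three steps in toy form to see how easily a byte-frequency profile can be matched. The sketch below is an assumption-laden illustration: a simple rank-based byte substitution stands in for the encryption step, the body is a harmless placeholder, and all function names and sample data are hypothetical. It is not necessarily the scheme used in real attacks.

```python
from collections import Counter


def byte_frequencies(data):
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    return [counts[b] / len(data) for b in range(256)]


def profile_distance(a, b):
    """L1 distance between the byte-frequency vectors of two byte strings."""
    return sum(
        abs(x - y) for x, y in zip(byte_frequencies(a), byte_frequencies(b))
    )


def frequency_ranking(data):
    """Distinct byte values, most frequent first (ties broken by value)."""
    counts = Counter(data)
    return sorted(counts, key=lambda b: (-counts[b], b))


def build_table(body, normal_sample):
    """Map each distinct byte of body to a normal byte of the same rank."""
    src = frequency_ranking(body)
    dst = frequency_ranking(normal_sample)
    if len(src) > len(dst):
        raise ValueError("normal sample has too few distinct byte values")
    return dict(zip(src, dst))


def apply_table(data, table):
    """Substitute every byte of data according to table."""
    return bytes(table[b] for b in data)


# Step 1: a sample of traffic the IDS accepts (made-up example).
normal_sample = b"GET /index.html HTTP/1.1"

# Step 2: encode a harmless stand-in body so its bytes resemble the sample.
body = b"\x01\x02\x02\x03\x03\x03"
table = build_table(body, normal_sample)
encoded = apply_table(body, table)

# Step 3: the "decryptor" is the inverse mapping applied at the target.
inverse = {v: k for k, v in table.items()}
decoded = apply_table(encoded, inverse)

print(profile_distance(body, normal_sample))
print(profile_distance(encoded, normal_sample))
```

Comparing the two printed distances shows how much closer the encoded body sits to the normal sample than the original body does, which is the property a purely byte-frequency-based detector relies on.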
Scope of improvement
First, since PAYL is based on the frequency of byte values in the payload, it cannot understand the meaning of the packets. That is why attackers can blend their malicious packets to make them look normal. This creates a need for models that either use more complex statistics or process the data based on its semantics, so that they can distinguish the intention behind transmitted network packets.
One example of such an enhanced IDS concept is PCNAD. Because PAYL works on the entire payload, it is difficult to use in high-speed, high-bandwidth networks, such as on port 80, where the payload size can be huge. PCNAD uses “content based payload partitioning” (CPP) to process the payload in partitions, making it suitable for scenarios with larger payloads.
Second, we can also add more randomness to the features of the model. For example, we can use multiple PAYL IDS and randomly set the variation range of their allowed byte frequencies, so that an attacker has no single fixed profile to learn and blend with.
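One way to picture this randomization is an ensemble of byte-frequency detectors, each with its own randomly drawn tolerance. The sketch below is only an illustration under assumptions: the class name, the tolerance range `k_range`, the `floor` constant, the voting rule, and the sample traffic are all hypothetical and do not come from any published PAYL variant.

```python
import random
from collections import Counter
from math import sqrt


def byte_frequencies(payload):
    """Relative frequency of each of the 256 possible byte values."""
    if not payload:
        return [0.0] * 256
    counts = Counter(payload)
    return [counts[b] / len(payload) for b in range(256)]


def train_profile(normal_payloads):
    """Per-byte mean and standard deviation over normal payloads."""
    vectors = [byte_frequencies(p) for p in normal_payloads]
    n = len(vectors)
    mean = [sum(v[i] for v in vectors) / n for i in range(256)]
    std = [
        sqrt(sum((v[i] - mean[i]) ** 2 for v in vectors) / n)
        for i in range(256)
    ]
    return mean, std


class RandomizedDetector:
    """One byte-frequency detector with its own randomly drawn tolerance."""

    def __init__(self, profile, rng, k_range=(1.5, 3.0), floor=0.05):
        self.mean, self.std = profile
        # Allowed variation range: k standard deviations plus a small floor.
        self.k = rng.uniform(*k_range)
        self.floor = floor

    def is_anomalous(self, payload):
        freq = byte_frequencies(payload)
        return any(
            abs(f - m) > self.k * s + self.floor
            for f, m, s in zip(freq, self.mean, self.std)
        )


def ensemble_flags(payload, detectors, min_votes=1):
    """Flag the payload if at least min_votes detectors call it anomalous."""
    votes = sum(d.is_anomalous(payload) for d in detectors)
    return votes >= min_votes


# Made-up sample data, for illustration only.
profile = train_profile([
    b"GET /index.html HTTP/1.1",
    b"GET /about.html HTTP/1.1",
    b"GET /images/logo.png HTTP/1.1",
])

# Seeded only so the example is repeatable; a real deployment would keep
# the drawn tolerances unpredictable.
rng = random.Random(42)
detectors = [RandomizedDetector(profile, rng) for _ in range(5)]

flagged = ensemble_flags(b"\x90" * 24 + b"\xcc\xeb\xfe", detectors)
print(flagged)
```

The design choice here is that every detector shares the same trained profile but applies a different tolerance, so learning the boundary of one detector does not reveal the boundaries of the others.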
A lot of research has been done to enhance PAYL. For reference, see Payload Content-based Network Anomaly Detection and An Improvement of Payload-Based Intrusion Detection Using Fuzzy Support Vector Machine.
Conclusion
In this post, we talked about the growing role of Machine Learning in network security and how the situation in the field has changed over time. We looked at an example of an IDS that uses ML to enforce security, saw how an attacker can still evade such systems using the same technology, and finally explored possibilities for enhancing these systems further.
Further Reading
- Adversarial Machine Learning
- Anomaly Based Intrusion Detection System
- PAYL - Anomalous Payload-based Network Intrusion Detection
- Analysis of Payload Based Application level Network Anomaly Detection
- Polymorphic Blending Attacks
- Payload Content based Network Anomaly Detection
- An Improvement of Payload-Based Intrusion Detection Using Fuzzy Support Vector Machine