2018 is, unfortunately, a year of huge income for hackers. The year before, WannaCry, one of the most broad-scale cyber attacks in the history of the internet, infected well over 150 thousand computers across roughly 150 countries. In response, numerous IT organizations have focused their efforts on creating universal and highly advanced software protection solutions. This article covers one particular direction in such protective software, namely machine learning, and explains how to apply machine learning in cybersecurity.
Machine learning in cybersecurity: classification and prediction
Machine learning is one of the most complex approaches to software development to date. Let’s figure out what fundamental principles it relies on and, in particular, how machine learning in cybersecurity works when detecting spam among emails.
To put this in perspective, an ordinary firewall merely applies pre-installed filters (blacklisted addresses, words or sentences, links, etc.). ML-based software, in turn, compares incoming emails with those already checked and marked as spam by the firewall, and uses that analysis to filter more consistently later on. This procedure is called classification, and it is one of the basic notions of machine learning in cybersecurity automation. Another core notion is prediction: ML mechanisms use historical data to distinguish ordinary behavior and data states from abnormal ones.
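The classification idea above can be sketched with a tiny naive Bayes text classifier. This is only an illustration under stated assumptions: the article does not name a specific algorithm, so naive Bayes is a stand-in here, and the training emails, the "spam"/"ham" labels and the function names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into whitespace-separated words."""
    return text.lower().split()


def train(labelled_emails):
    """Count word frequencies per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in labelled_emails:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Return the class with the higher log-probability (add-one smoothing)."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_emails = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_emails)
        for word in tokenize(text):
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: emails already checked and marked by the filter.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]

word_counts, class_counts = train(TRAINING)
print(classify("claim your free money", word_counts, class_counts))   # spam
print(classify("monday project meeting", word_counts, class_counts))  # ham
```

A real filter would be trained on far more messages and richer features; the point is only that the decision is learned from previously marked emails rather than hard-coded.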
Implementation of machine learning algorithms in cybersecurity: types and examples
Algorithms applied in machine learning can be either supervised or unsupervised (the two essential types of ML).
Currently, the overwhelming majority of ML-based cybersecurity tools are of the supervised type. They are trained on strictly predetermined pairs of “in” and “out” parameters (inputs and their known labels) to detect malware and to decide whether the system operates normally or abnormally based on the extracted data. Such solutions work best for detecting commonly known web threats whose behavior has been thoroughly studied.
Conversely, software based on unsupervised machine learning can be very useful in dealing with barely studied or unidentified malware. In this case, the “out” parameters are absent, and the tool makes decisions based only on a picture of normal system behavior composed in real time.
There is also a third type of machine learning: reinforcement learning. With this type of learning, there are no predetermined pairs of “in” and “out” parameters; instead, the agent (i.e. the system being trained) acts on the state of the environment and gradually learns from the feedback it receives.
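As a rough illustration of deciding from a picture of normal behavior alone, the sketch below flags values that deviate strongly from a baseline built without any labels. It is a simplified assumption: the baseline here is static rather than composed in real time, and the metric, the sample values and the threshold of three standard deviations are hypothetical.

```python
import statistics


def build_baseline(normal_values):
    """Summarize normal behavior as a mean and a sample standard deviation."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical metric: outbound connections per minute on a healthy host.
normal_connections = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
baseline = build_baseline(normal_connections)

print(is_anomalous(14, baseline))  # False: close to the usual level
print(is_anomalous(90, baseline))  # True: far outside the usual range
```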
Supervised learning implementation example. Let’s examine a case where supervised learning is a reasonable choice: detecting algorithmically generated domain names. Here, the behavior of the inspected environment can be completely useless as an “in” parameter, because random generation of domain names does not by itself clearly define normal or abnormal behavior. On the other hand, such unusual domain names are often characterized by a typical arrangement of vowels and consonants before the dot. To build an effective ML-based analysis tool, developers use recurrent neural networks (RNN).
First, a set of “normal” domain names is compiled (for instance, from the Alexa Top 1 Million Sites list) alongside a set of randomly generated domains. Domains in the first set are labeled “zero”, and domains in the second set are labeled “one”. Next, an RNN with two or more layers is created. The domain name (i.e. a sequence of characters) is the “in” parameter, while the desired classification result (zero for a legitimate domain) is the “out” parameter.
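To make the vowel and consonant observation concrete, here is a minimal sketch that computes a few character statistics for the part of a domain name before the first dot. The feature names and the two example domains are assumptions chosen for illustration, and the sketch assumes bare domains without a "www." prefix; an RNN would learn such patterns from the raw characters rather than from hand-written features.

```python
VOWELS = set("aeiou")


def domain_features(domain):
    """Compute simple character statistics for the label before the first dot."""
    label = domain.lower().split(".")[0]
    letters = [ch for ch in label if ch.isalpha()]
    vowel_count = sum(ch in VOWELS for ch in letters)

    # Longest run of consecutive consonants in the label.
    longest_run = run = 0
    for ch in label:
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            longest_run = max(longest_run, run)
        else:
            run = 0

    vowel_ratio = vowel_count / len(letters) if letters else 0.0
    return {
        "length": len(label),
        "vowel_ratio": vowel_ratio,
        "longest_consonant_run": longest_run,
    }


print(domain_features("wikipedia.org"))
print(domain_features("xkqzvtrpbd.com"))  # hypothetical generated-looking name
```

A human-readable name tends to alternate vowels and consonants, while the random-looking one has no vowels at all and one long consonant run.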
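The data preparation and the shape of the network can be sketched as follows. This is a deliberately small illustration, not a working detector: the recurrent network has a single layer instead of the two or more mentioned above, its weights are random and untrained, and the alphabet, class name and example domains are hypothetical.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(domain):
    """Turn a domain name into a list of integer character indices."""
    return [CHAR_TO_INDEX[ch] for ch in domain.lower() if ch in CHAR_TO_INDEX]


def make_dataset(normal_domains, generated_domains):
    """Label normal domains 0 and generated domains 1."""
    data = [(encode(d), 0) for d in normal_domains]
    data += [(encode(d), 1) for d in generated_domains]
    return data


class TinyRNN:
    """Single-layer recurrent network with random, untrained weights."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_in = matrix(vocab_size, hidden_size)    # one row per character
        self.w_rec = matrix(hidden_size, hidden_size)  # hidden-to-hidden weights
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]

    def predict(self, sequence):
        """Return a score in (0, 1); after training, values near 1 would mean 'generated'."""
        hidden = [0.0] * self.hidden_size
        for index in sequence:
            # New hidden state depends on the current character and the previous state.
            hidden = [
                math.tanh(
                    self.w_in[index][j]
                    + sum(hidden[k] * self.w_rec[k][j] for k in range(self.hidden_size))
                )
                for j in range(self.hidden_size)
            ]
        logit = sum(h * w for h, w in zip(hidden, self.w_out))
        return 1.0 / (1.0 + math.exp(-logit))


dataset = make_dataset(["google.com", "wikipedia.org"], ["xkqzvtrpbd.com"])
model = TinyRNN(vocab_size=len(ALPHABET), hidden_size=8)
for sequence, label in dataset:
    print(label, round(model.predict(sequence), 3))
```

Because the weights are untrained, the printed scores carry no meaning yet. Training would adjust the weights so that the score approaches the 0 or 1 label of each domain, which in practice is usually done with a deep learning framework rather than by hand.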
Unsupervised learning implementation example. Now, let’s consider unsupervised machine learning in cybersecurity using the example of Kohonen’s self-organizing maps (SOM), which are based on vectors and clusters. Such maps are very useful for processing large data streams when it is unknown how to characterize the behavior of an environment affected by harmful software. For each piece of “in” data, the map searches its neurons for the one most similar to it. The best-matching neuron is updated to become more similar to the “in” data, and its neighbors’ values are also moved towards the “in” data.
Applied in software for IP traffic analysis, a SOM checks how close incoming data lies to the neurons trained on the instructional examples and, when the match is close, pulls the neighboring neurons towards it. The farther a data point lies from the neurons most similar to it, the higher the chance that there is an anomaly in the network traffic.
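The update rule described above can be sketched in a few lines. This is a minimal one-dimensional map (neurons arranged in a line) under simplifying assumptions: the neighborhood is a fixed radius, neighbors move at half the winner's rate, and the sample vectors, parameter names and default values are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(neurons, sample):
    """Index of the neuron whose weight vector is closest to the sample."""
    return min(range(len(neurons)), key=lambda i: distance(neurons[i], sample))


def train_som(samples, n_neurons=5, epochs=50, learning_rate=0.5, radius=1, seed=0):
    """Train a one-dimensional SOM and return its neuron weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    neurons = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        rate = learning_rate * (1 - epoch / epochs)  # shrink the step over time
        for sample in samples:
            bmu = best_matching_unit(neurons, sample)
            for i, weights in enumerate(neurons):
                if abs(i - bmu) <= radius:
                    # The winner moves at the full rate, its neighbors at half of it.
                    influence = rate if i == bmu else rate * 0.5
                    for d in range(dim):
                        weights[d] += influence * (sample[d] - weights[d])
    return neurons


# Hypothetical two-feature samples scaled to the range [0, 1].
samples = [[0.10, 0.10], [0.15, 0.10], [0.90, 0.90], [0.85, 0.95]]
som = train_som(samples)
for weights in som:
    print([round(w, 2) for w in weights])
```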
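One common way to turn a trained map into an anomaly detector is to measure how far each new data point lies from its closest neuron. The sketch below assumes the neurons have already been trained; their values, the two traffic features and the threshold are hypothetical.

```python
import math


def quantization_error(neurons, sample):
    """Distance from the sample to its closest neuron (the best matching unit)."""
    return min(math.dist(neuron, sample) for neuron in neurons)


def flag_anomalies(neurons, samples, threshold):
    """Return the samples that lie farther than `threshold` from every neuron."""
    return [s for s in samples if quantization_error(neurons, s) > threshold]


# Hypothetical trained neurons; features are (packet rate, mean packet size),
# both scaled to the range [0, 1].
TRAINED_NEURONS = [[0.10, 0.20], [0.15, 0.25], [0.30, 0.40]]

traffic = [[0.12, 0.22], [0.28, 0.41], [0.95, 0.05]]
print(flag_anomalies(TRAINED_NEURONS, traffic, threshold=0.2))
# prints [[0.95, 0.05]]
```

The threshold controls the trade-off between missed anomalies and false alarms, and it would normally be tuned on traffic known to be normal.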
Reinforcement learning implementation example. The third and last example of machine learning in cyber defence is reinforcement learning. It is usually applied in robotics, when a machine has to decide what to do next after completing a particular sequence of tasks. Nevertheless, it is sometimes applied in cybersecurity as well. Specifically, guided by historical data, a reinforcement-learning tool first executes strictly predetermined instructions. Then a system administrator (or engineer, technologist, etc.) tells it what to do next. The tool becomes more precise and intelligent with each expert tip, and over time it requires much less third-party involvement.
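The feedback loop can be sketched with a very small agent that keeps a value estimate for every pair of situation and action and adjusts it after each expert response. This is a simplified, bandit-style value update rather than a full reinforcement learning algorithm, and the states, actions, rewards and class name are hypothetical.

```python
import random


class FeedbackAgent:
    """Keeps a running value estimate for each (state, action) pair."""

    def __init__(self, actions, learning_rate=0.5, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.learning_rate = learning_rate
        self.epsilon = epsilon  # probability of trying a random action
        self.values = {}
        self.rng = random.Random(seed)

    def choose(self, state):
        """Mostly pick the best-known action; occasionally explore."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values.get((state, a), 0.0))

    def learn(self, state, action, reward):
        """Move the estimate for (state, action) toward the received reward."""
        old = self.values.get((state, action), 0.0)
        self.values[(state, action)] = old + self.learning_rate * (reward - old)


# Hypothetical expert: what the administrator would have done in each situation.
EXPERT_CHOICE = {"port_scan": "block", "normal_login": "allow"}

agent = FeedbackAgent(actions=["allow", "block"])
for _ in range(200):
    for state in EXPERT_CHOICE:
        action = agent.choose(state)
        # The expert's tip is turned into a reward: +1 for agreement, -1 otherwise.
        reward = 1.0 if action == EXPERT_CHOICE[state] else -1.0
        agent.learn(state, action, reward)

agent.epsilon = 0.0  # stop exploring once the feedback phase is over
print(agent.choose("port_scan"), agent.choose("normal_login"))  # block allow
```

After enough feedback the agent reproduces the expert's decisions on its own, which mirrors the reduced need for third-party involvement described above.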
Machine learning in cybersecurity: general attributes of an ML tool
So, how do you use machine learning? Described below are the stages to complete when developing any kind of machine learning tool for cybersecurity.
Data gathering. The vast majority of ML tools are based on predictive data analysis. That means incoming network traffic is thoroughly analyzed and compared with previously gathered and inspected data. There are exceptional situations, though, when traffic is inspected in real time without involving previously received results (such tools are usually not used for fraud detection, but they are actively employed on stock markets).
Data sorting and aggregation. It is not enough just to have the analyzed information. Your agent may hold a lot of similar data while being unable to tell what exactly is useful in a given situation (e.g. a locked PC screen can be a symptom of numerous kinds of web attacks). That is why related data is grouped by a shared attribute, to give a clearer picture of the inspected environment’s behavior. This is a complex and variable process that requires developers not only to possess deep knowledge of harmful software but also to dedicate much time to it.
Analysis algorithms creation. Once the data is analyzed and aggregated, the developers start writing the actual code: the code that blocks network traffic and protects user data when web anomalies are detected.
Testing. The created software solution is tested in an environment that was not pre-analyzed. In some cases, having found the chosen ML type ineffective, the developers return to the previous step and employ another type of machine learning (for example, switching from supervised to unsupervised).
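As a small illustration of grouping data by a shared attribute, the sketch below aggregates raw events by source address and derives a few summary features per source. The event fields, the choice of source IP as the grouping key and the sample records (using documentation-range addresses) are all assumptions.

```python
from collections import defaultdict


def aggregate_by_source(events):
    """Group raw events by source IP and summarize each group."""
    groups = defaultdict(list)
    for event in events:
        groups[event["src_ip"]].append(event)

    summary = {}
    for src_ip, group in groups.items():
        summary[src_ip] = {
            "event_count": len(group),
            "distinct_ports": len({e["dst_port"] for e in group}),
            "failed_logins": sum(e["failed_login"] for e in group),
        }
    return summary


# Hypothetical raw events collected during the data gathering stage.
EVENTS = [
    {"src_ip": "192.0.2.10", "dst_port": 22, "failed_login": True},
    {"src_ip": "192.0.2.10", "dst_port": 23, "failed_login": True},
    {"src_ip": "192.0.2.10", "dst_port": 3389, "failed_login": True},
    {"src_ip": "192.0.2.20", "dst_port": 443, "failed_login": False},
]

summary = aggregate_by_source(EVENTS)
for src_ip, features in summary.items():
    print(src_ip, features)
```

Taken one at a time, the events look unremarkable; grouped by source, the first address stands out with three failed logins across three different ports.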
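The testing stage can be sketched as an evaluation on data the tool has never seen, with known labels for comparison. The detector, the held-out samples and the choice of precision and recall as metrics are assumptions made for this illustration.

```python
def evaluate(predict, labelled_samples):
    """Compare predictions with known labels (1 = attack, 0 = normal)."""
    tp = fp = fn = tn = 0
    for sample, label in labelled_samples:
        predicted = predict(sample)
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall,
            "false_alarms": fp, "missed_attacks": fn}


def threshold_detector(connections_per_minute):
    """Hypothetical detector under test: flags more than 40 connections per minute."""
    return 1 if connections_per_minute > 40 else 0


# Hypothetical held-out data that was not used while building the detector.
HELD_OUT = [(10, 0), (14, 0), (80, 1), (35, 1), (45, 0)]

report = evaluate(threshold_detector, HELD_OUT)
print(report)
```

Here the detector raises one false alarm and misses one attack, so both precision and recall come out at 0.5. Results like these are the kind of signal that would send the developers back to the previous step.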
Machine learning in cybersecurity: conclusion
As we can see, implementing machine learning in software is an extremely complex procedure. To create software that prevents web attacks, you can try your hand at platforms such as Microsoft Azure, IBM, Amazon Machine Learning, etc. However, we recommend leaving the development of professional, highly effective solutions to experts. Do not hesitate to contact us to start your exciting project with a team of professionals in the field.