Machine Muscle Learning Project
Goal and Background
The goal of this project is to build proactive, real-time cyber defenses using large and varied log datasets from network infrastructure together with social information. In this research, we analyze cybersecurity big data in order to protect facilities from cyber threats. The research is characterized by its use of both reactive and proactive approaches to cybersecurity. The former is the analysis of cyber threats with online learning algorithms, and the latter is the prediction of future attacks.
The key feature of our reactive approach is extracting knowledge from cybersecurity big data with a convolutional neural network. In our prior work, we constructed a cybersecurity big data set comprising sampled network traffic, DNS queries, the contents of malware and malicious websites, and so on. We also attempted to extract knowledge from this data with heuristics. For example, collected data was compared against a discrimination threshold derived from our empirical experience of network operations. In this project, we will adopt deep learning technologies to extract knowledge automatically.
To mitigate cyber threats and risks, a proactive approach is needed in addition to the reactive one, because cybersecurity incidents must be handled within a very short time. If we could predict future cyber threats targeting us or our organization, we would gain time for incident handling, which would help provide better protection against those threats. The key feature of our proactive approach is analyzing social data with natural language processing and machine learning algorithms. Attacks have explicit motivations, and those motivations follow social trends. Text extracted from social networking services (SNS) such as Twitter and Facebook, and from blogs and news articles, can be regarded as context that reflects these motivations.
This project is joint research by The University of Tokyo, Nara Institute of Science and Technology (NAIST), Tokyo Institute of Technology (TITECH), and IIJ Innovation Institute.
Significance of this Research
Cybersecurity, which aims to protect cyberspace from cyber threats, has become one of the most important research agendas for the Internet. Enterprise organizations must be aware of the devices that connect to their networks and should collect information for detecting network anomalies. However, because the number of connected devices is increasing rapidly, organizations must analyze large volumes of diverse data in real time, and the analysis results must be produced with low latency. Our work addresses this issue with the reactive and proactive approaches. The former needs to be automated, and artificial intelligence has begun to garner attention for this purpose. However, with limited computational resources, it is very hard to analyze cyber threat big data in a short time. Our research addresses this by employing JHPCN’s analysis nodes. The latter, the proactive approach, is necessary to gain time to prepare against cyber attacks. Our social analysis will reveal attackers’ motivations and the threats likely to arise in the future.
In this collaborative research, the University of Tokyo (U-TOKYO) team has experience in constructing cybersecurity big data, and the Nara Institute of Science and Technology (NAIST) team has expertise in extracting trends from SNS (Twitter) data. All members draw on their partners’ experience to analyze cybersecurity big data and to develop both reactive and proactive algorithms. Figure 1 shows the procedure of a prototype of this analysis. In the figure, data collection is performed at NAIST, and the data is transferred to and stored in U-TOKYO storage. At U-TOKYO, we analyze the relationship between the SNS data and the traffic data, and then try to predict attacks.
The data mining method is key to extracting knowledge from datasets. We apply a convolutional neural network (CNN) to this problem, as shown in the figure below. For example, we would like to apply deep learning to network traffic analysis and time series prediction. A packet itself has variable length; however, each packet can be summarized as a 5-tuple of information: source address, destination address, source port number, destination port number, and time. This means that we can generate a high-dimensional matrix from the traffic. We will identify outliers, tendencies, and predominant patterns in order to predict potential cyber attacks from cyber threat big data.
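As a rough illustration of how 5-tuples could be turned into matrices suitable as CNN input, the sketch below bins packets into time windows and counts them in a small grid of source-address bucket by destination-port bucket. This is only one possible encoding, not the project's actual pipeline: the `Packet` record, the bucketing functions, the window length, and the grid size are all hypothetical choices made for the example.

```python
import ipaddress
from collections import namedtuple

# Hypothetical record holding the five fields named in the text.
Packet = namedtuple("Packet", "src dst sport dport time")

def bucket_addr(addr, n_buckets):
    """Map an IP address to a row index via its integer value (illustrative choice)."""
    return int(ipaddress.ip_address(addr)) % n_buckets

def bucket_port(port, n_buckets):
    """Map a port number (0-65535) to a column index using equal-width ranges."""
    return port * n_buckets // 65536

def traffic_matrices(packets, window=1.0, rows=4, cols=4):
    """Return {window_index: rows x cols matrix} of packet counts.

    Rows are source-address buckets, columns are destination-port buckets.
    """
    out = {}
    for p in packets:
        w = int(p.time // window)  # index of the time window this packet falls in
        m = out.setdefault(w, [[0] * cols for _ in range(rows)])
        m[bucket_addr(p.src, rows)][bucket_port(p.dport, cols)] += 1
    return out

# Made-up sample traffic using documentation address ranges.
packets = [
    Packet("192.0.2.1", "198.51.100.7", 40000, 53, 0.10),
    Packet("192.0.2.1", "198.51.100.7", 40001, 53, 0.20),
    Packet("192.0.2.2", "198.51.100.7", 40002, 443, 0.90),
    Packet("192.0.2.3", "198.51.100.9", 40003, 50000, 1.50),
]
matrices = traffic_matrices(packets)

for w in sorted(matrices):
    print("window", w)
    for row in matrices[w]:
        print(row)
```

Stacking the per-window matrices in time order gives an image-like sequence. A real encoding would likely use more of the 5-tuple fields and a much finer grid.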
As our first achievement, we have published “Hayabusa”, a simple and fast full-text search engine for massive system datasets. You can get and try the software from this site. It provides the base infrastructure for analyzing massive datasets.