Insider Threat Detection with Data Analytics

Organizations face a serious threat of data breaches caused by company insiders [1,2]. Traditional data loss prevention (DLP) tools and role-based access control (RBAC) can address some of the data leakage risks by analyzing the content of data and protecting against unauthorized access [1,2]. Recent studies have investigated the use of machine learning and big data analytics to predict or detect insider threats using techniques such as anomaly detection and risk analysis [2,3]. This study begins with a discussion of insider threats and current detection methods. The study will then examine recent research into the use of machine learning and data analytics for detecting anomalous user behavior indicative of an insider threat. Finally, the study will explore the benefits of and challenges with using user behavior analytics (UBA) for insider threat detection.

Categorization of Insider Threats

Insider threats to company data and assets can be categorized as intentional or inadvertent [1]. Insiders are users who have or had authorization to legitimately access an organization’s systems or information assets [2]. Intentional insiders act with malice, whereas inadvertent insiders cause accidental harm. Inadvertent insider threats include the accidental publishing of data, errors in system configurations, failure to encrypt data, lack of user awareness, loss of a computer, and assignment of excessive privileges [1,3].

Intentional insiders are malicious users who deliberately extract or exfiltrate data, tamper with the organization’s resources, destroy or delete critical data, eavesdrop with ill intent, or impersonate other users [2,4]. The actions of malicious insiders fall into three broad categories: sabotage, theft, and fraud [5]. Malicious insiders can cause significant damage to an organization because insiders have authorized access to sensitive information, including intellectual property, customer data, marketing plans, and financial records [5,6]. The motivations for intentional insider threats include financial reward, grievances against the employer, and espionage [1,4].

Malicious insiders present significant challenges for traditional detection systems. Perimeter defenses, such as network intrusion detection systems and firewalls, cannot detect the insider since the insider has legitimate access into the network [4]. Perimeter defenses guard against unauthorized access originating external to the network. The malicious insider, however, is already within the security perimeter. Traditional log monitoring can also overlook insider attacks. The actions of a malicious insider often look like normal operations since the insider possesses the rights and privileges necessary to mount the attack [7]. Finally, the diversity of the motivations and methods used by insider threats increases the complexity of threat detection [4]. The actions of a malicious insider motivated by money differ greatly from the actions of a disgruntled insider attempting sabotage.

Current State

Current methods of preventing insiders from leaking sensitive data include basic security practices, web content filtering, and the use of DLP technologies [1,3]. Basic security methods that assist in preventing insider threats include user awareness training, enforcing access rights, and encrypting data at rest and in transit. Organizations can employ web content filtering to assist in preventing users from uploading company data [3]. Data loss prevention technologies can provide a security layer to assist in the detection and prevention of data exfiltration, misuse, or destruction [3]. These traditional security tools and methods rely heavily on signature-based detection methods [8]. Signature-based detection matches activity against defined patterns of malicious behavior [9]. For signature-based detection to work, the attack pattern must be known and defined ahead of time. Signature-based detection cannot detect new or undefined attacks [1,9].

Data Loss Prevention

Most current DLP systems use content-based analysis of monitored data to identify, monitor, and protect sensitive information [1]. Content-based DLP methods protect against unwanted data exposure by scanning data content at rest, in use, and in transit. Content-based DLP relies on data fingerprinting and lexical and statistical analysis of the monitored data [1]. Lexical analysis typically relies on rules written as regular expressions, which are sequences of characters that define a search pattern. Statistical analysis can use collection intersection, which compares collections of data shingles and computes a similarity score. However, protecting irregular or dynamic data, such as proprietary design documents or marketing plans, with static policies and rules is not possible unless the organization has a thorough and enforced data classification scheme [3].
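
The two content-based techniques described above can be illustrated with a minimal sketch. The card-number pattern, the shingle size, and the function names are assumptions chosen for illustration and do not come from any particular DLP product. The similarity score shown is a Jaccard ratio over word shingles, which is one common way to compare shingle collections.

```python
import re

# Hypothetical static rule: 16 digits, optionally grouped in fours
# by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def find_card_like(text: str) -> list[str]:
    """Lexical analysis: return substrings matching the card-number rule."""
    return CARD_PATTERN.findall(text)


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Collection intersection: shared shingles divided by all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A DLP policy built this way would flag outbound content when `find_card_like` returns a match or when `similarity` against a protected document exceeds a chosen threshold.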

Shortcomings of Content-Based Analysis

Content-based DLP approaches can be an effective solution to protect against the inadvertent insider threat but have significant shortcomings in the detection of intentional insider threats [1]. Content-based analytical solutions can effectively prevent accidental data leaks of well-defined plain-text data. Organizations can configure static rules to detect data such as credit card numbers and other well-formatted data. However, content-based DLP technologies rely heavily on human interaction to develop detection rules and configure the tools [3]. Defining the search patterns for static rules can be labor-intensive, tedious, and error-prone. Also, static rules are not effective in identifying dynamic or irregular data [3].

Static rules may prove effective against some inadvertent data leaks, but a malicious insider can easily bypass these rules [1,3]. For example, an insider might bypass a credit card number detection rule by inserting additional digits, transposing numbers, or replacing digits. Additionally, most current DLP solutions cannot identify encrypted or obfuscated information leaks effectively [1]. Therefore, a malicious insider can use encryption to circumvent DLP detection.

Anomaly Detection and Machine Learning

Anomaly detection seeks to detect threats by first defining normal behavior and then detecting deviations from that behavior [8]. Unlike signature-based detection, anomaly-based detection does not require the behavior pattern of malicious activity to be defined in advance. Recent research focuses on profiling user behavior to identify potential insider threats [1]. Context-based UBA approaches leverage data mining and machine learning techniques. The goal of machine learning in UBA is to understand what the users are doing [3]. Data mining and machine learning are used to identify normal behavior for a user or group of users, such as what data the user accesses, what programs the user runs, at what time the user accesses the system, how long the user remains on the system, and what systems the user accesses [3].

Machine-learning approaches can perform data mining to discover outliers without the requirement for the precise description of anomalous activities [1]. Machine learning typically consists of a training phase and a testing phase [8]. Features, or attributes, are defined and extracted from training data. The training data is used to develop a model, which is then applied to the full data set. The predicted values are then measured against the actual values to determine the accuracy of the model. A basic process for applying machine learning to cyber security includes defining the information sources, capturing the data, preprocessing of the data to support analysis, feature extraction, scoring and analyzing the features, and applying the scoring to make a reasoned decision [8].
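
The training and testing phases described above can be sketched in a few lines. The records, feature names, and threshold below are hypothetical, and the model is a deliberately simple per-feature z-score rather than any method from the cited studies.

```python
from statistics import mean, stdev

# Hypothetical sessions: (files_accessed, after_hours_logins, is_malicious)
TRAIN = [(12, 0, 0), (15, 1, 0), (10, 0, 0), (14, 0, 0), (11, 1, 0), (13, 0, 0)]
TEST = [(13, 0, 0), (90, 6, 1), (12, 1, 0), (16, 0, 0)]


def extract_features(record):
    """Feature extraction: keep the numeric attributes, drop the label."""
    files, after_hours, _label = record
    return [float(files), float(after_hours)]


def train(records):
    """Training phase: learn each feature's mean and standard deviation."""
    columns = list(zip(*(extract_features(r) for r in records)))
    return [(mean(c), stdev(c)) for c in columns]


def score(model, features):
    """Scoring: largest absolute z-score across the features."""
    return max(abs(x - m) / s if s else 0.0 for x, (m, s) in zip(features, model))


def predict(model, record, threshold=3.0):
    """Decision: 1 (anomalous) when the score exceeds the threshold."""
    return int(score(model, extract_features(record)) > threshold)


def accuracy(model, records):
    """Testing phase: fraction of predictions that match the actual labels."""
    return sum(predict(model, r) == r[2] for r in records) / len(records)
```

Calling `accuracy(train(TRAIN), TEST)` measures the predicted values against the actual labels, which mirrors the evaluation step in the process above.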

Applying Anomaly Detection to Insider Threats

Anomaly detection solutions seek to identify the intentional insider by looking for anomalous activity. A UBA system can detect both time-based and peer-based anomalies [1,5]. Time-based anomaly detection looks for anomalous activity by comparing a user’s activity over time [1]. A UBA system can compare the current behavior for a user to that user’s normal behavior. Peer-based anomaly detection detects differences in activities between a user and a set of users. User behavior analysis can compare a user’s activity to the activity of the user’s peers, such as teammates, users with the same role, or users performing similar functions [5].
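
The difference between the two comparisons can be shown with a small sketch. The activity counts and the threshold of three standard deviations are assumptions for illustration only. The same check is applied to two different reference sets: the user's own history and the current activity of the user's peers.

```python
from statistics import mean, stdev


def z_score(value, reference):
    """Standard deviations between value and the mean of the reference set."""
    spread = stdev(reference)
    return 0.0 if spread == 0 else (value - mean(reference)) / spread


def is_anomalous(value, reference, threshold=3.0):
    """Flag the value when it lies more than threshold deviations away."""
    return abs(z_score(value, reference)) > threshold


# Hypothetical daily file-download counts.
OWN_HISTORY = [20, 22, 19, 21, 23, 20, 21]   # the user's previous days
PEERS_TODAY = [95, 102, 98, 110, 105]        # users in the same role, today

today = 24
time_based = is_anomalous(today, OWN_HISTORY)   # compare with own past
peer_based = is_anomalous(today, PEERS_TODAY)   # compare with peers
```

In this example the user's count is normal relative to the user's own history but far below the peer group, so only the peer-based comparison raises a flag.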

The typical approach to anomaly detection involves creating a baseline model of normal user activity and then detecting deviations from that baseline [6]. User and role-based profiles can be created by extracting features that define activity and device usage patterns [10]. Researchers have applied several machine learning methods to the problem of detecting insider threats. The methods used include the Hidden Markov Model (HMM), isolation forest, decision tree, self-organizing maps (SOM), and distance vector methods [5,6,8,11].

Principal Component Analysis

Legg et al. used principal component analysis (PCA) decompositions to detect anomalous user activity [5]. Principal component analysis extracts the factor accounting for the largest variability among the measured variables [12]. The researchers used PCA to detect both time-based and peer-based anomalies. Legg et al. compared the user’s activity to the past activity of the user and the consolidation of activity by users in the same role [5]. The profiles contained details including the devices and attributes accessed and the activities performed. The researchers developed daily profiles of each user and combined role activity using tree structures. The researchers extracted 168 features within the following categories: new observations, hourly and daily usage counts, and time-based features for each device. The system then performed PCA decompositions on the daily profile feature sets to determine the degree of variance. The system computed anomaly metrics and generated alerts if a metric was above a specified threshold. The researchers ran 10 scenarios and the best performance achieved was 43% precision with 100% recall.
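
A PCA-based anomaly metric can be sketched as follows. This is not the pipeline used by Legg et al.; it is a minimal illustration that assumes two hypothetical features, finds the first principal axis by power iteration, and uses the distance from that axis as the anomaly metric.

```python
def center(rows):
    """Subtract each feature's mean; return centered rows and the means."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows], means


def first_component(centered, iterations=100):
    """Power iteration on the sample covariance matrix (top principal axis)."""
    n, dims = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    axis = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(dims)) for i in range(dims)]
        norm = sum(x * x for x in nxt) ** 0.5
        axis = [x / norm for x in nxt]
    return axis


def residual(row, means, axis):
    """Distance between a profile and its projection onto the principal axis."""
    c = [x - m for x, m in zip(row, means)]
    t = sum(a * b for a, b in zip(c, axis))
    return sum((a - t * b) ** 2 for a, b in zip(c, axis)) ** 0.5


# Hypothetical daily profiles: (logins, files accessed)
BASELINE = [(5, 50), (6, 61), (4, 39), (5, 52), (6, 58)]
centered, MEANS = center(BASELINE)
AXIS = first_component(centered)
```

A day such as `(12, 50)`, with many logins but an ordinary number of file accesses, falls well off the axis learned from the baseline, so `residual((12, 50), MEANS, AXIS)` is large and would exceed a suitably chosen alert threshold.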

Hidden Markov Model and Distance Measures

The Hidden Markov Model (HMM) is one of the methods researchers most frequently apply to the problem of insider threat detection [4,6,8]. Lo et al. applied the HMM and distance measures to detect differences in a user’s activity compared to the user’s past behavior [8]. The researchers found the HMM capable of detecting insiders within the Carnegie Mellon University (CMU) Computer Emergency Response Team (CERT) dataset. However, other research has demonstrated that the application of an HMM becomes time-consuming and resource-intensive as the number of data variables increases [4,6].
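
One way an HMM can score user activity is to compute how likely a sequence of events is under a model of the user's past behavior. The sketch below uses the scaled forward algorithm. The states, event codes, and probabilities are hypothetical values chosen for illustration; a real system would estimate them from the user's history, and the cited studies may configure their models differently.

```python
import math


def log_likelihood(observations, start, transition, emission):
    """Scaled forward algorithm: log P(observations | model)."""
    states = range(len(start))
    alpha = [start[s] * emission[s][observations[0]] for s in states]
    total = sum(alpha)
    log_prob = math.log(total)
    alpha = [a / total for a in alpha]
    for obs in observations[1:]:
        alpha = [emission[s][obs] * sum(alpha[p] * transition[p][s] for p in states)
                 for s in states]
        total = sum(alpha)            # rescale each step to avoid underflow
        log_prob += math.log(total)
        alpha = [a / total for a in alpha]
    return log_prob


# Hypothetical model with two hidden states and four event codes:
# 0 = logon, 1 = file access, 2 = email, 3 = removable media
START = [0.8, 0.2]
TRANSITION = [[0.9, 0.1], [0.3, 0.7]]
EMISSION = [[0.10, 0.60, 0.29, 0.01],
            [0.30, 0.20, 0.45, 0.05]]

usual = log_likelihood([0, 1, 1, 2, 1], START, TRANSITION, EMISSION)
unusual = log_likelihood([0, 3, 3, 3, 1], START, TRANSITION, EMISSION)
```

A sequence that the model considers unlikely, such as repeated removable-media events here, receives a much lower log-likelihood and can be flagged for review.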

Lo et al. also applied distance measures and compared the results with the HMM approach [8]. Distance measure algorithms measure the similarity between two or more sets of data. The researchers used the Damerau-Levenshtein (DL), Jaccard, and cosine distance algorithms in the comparison. The researchers found that the HMM technique produced the highest detection rate of 69 percent. The distance measures had significantly lower detection rates, with cosine detecting 47 percent, DL detecting 39 percent, and Jaccard detecting 36 percent. However, the researchers noted that each of the distance measures detected some unique insiders not detected by the others. Though the HMM achieved the best detection rate, it was significantly more resource-intensive than the distance measures. The HMM took more than 24 hours to process the dataset, whereas the distance measures each completed in a few minutes.
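
The three distance measures can be sketched over sequences of user events. The event names are hypothetical, and the Damerau-Levenshtein function below implements the restricted variant known as optimal string alignment, which may differ from the exact variant used by Lo et al.

```python
import math
from collections import Counter


def damerau_levenshtein(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def jaccard_distance(a, b):
    """One minus the share of distinct events the two sequences have in common."""
    sa, sb = set(a), set(b)
    return 0.0 if not (sa or sb) else 1 - len(sa & sb) / len(sa | sb)


def cosine_distance(a, b):
    """One minus the cosine of the angle between event-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 if norm == 0 else 1 - dot / norm


yesterday = ["logon", "email", "file", "logoff"]
today = ["logon", "file", "email", "logoff"]
```

For these two days the Damerau-Levenshtein distance is 1 because the order of two events changed, while the Jaccard and cosine distances are 0 because the same events occurred the same number of times. Each measure is sensitive to a different kind of change, which is consistent with the observation that each detected some insiders the others missed.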

Isolation Forests

Isolation forest algorithms detect anomalies using multiple binary tree structures to discover instances that are few and different [13]. Gavai et al. used a modified version of an isolation forest algorithm to discover insider threats based on enterprise social and online activity [11]. The researchers extracted features from social data that could indicate insider threat behavior. The features included patterns and content of email communications, web browsing patterns, email frequency, file access patterns, and system access patterns. The unsupervised approach used anomaly detection methods to identify statistically abnormal behavior with respect to the selected features. The machine learning looked for activity that differed based on the user’s past activity or that differed from peer behaviors. The modified isolation forest algorithm produced a receiver operating characteristic (ROC) curve score of 0.77. Gavai et al. consider the results of their work promising and state that their approach worked well in detecting insider threat activity. However, the researchers acknowledged the unsupervised approach could generate many false positives.
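
The isolation idea can be illustrated with a simplified sketch. It is not the modified algorithm used by Gavai et al., and it differs from a standard implementation in that it does not build reusable trees; instead it draws random splits on the fly for the point being scored. The data, tree count, sample size, and depth limit are assumptions. The scoring formula follows the form given by Liu et al. [13].

```python
import math
import random


def c_factor(n):
    """Average path length used to normalize depths for a sample of size n."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n


def path_length(point, sample, rng, depth=0, limit=10):
    """Number of random axis-aligned splits needed to isolate the point."""
    if depth >= limit or len(sample) <= 1:
        return depth + c_factor(len(sample))
    axis = rng.randrange(len(point))
    low = min(row[axis] for row in sample)
    high = max(row[axis] for row in sample)
    if low == high:
        return depth + c_factor(len(sample))
    split = rng.uniform(low, high)
    # Keep only the rows that fall on the same side of the split as the point.
    side = [row for row in sample if (row[axis] < split) == (point[axis] < split)]
    return path_length(point, side, rng, depth + 1, limit)


def anomaly_score(point, data, trees=100, sample_size=64, seed=0):
    """Score near 1 means easily isolated (anomalous); lower means ordinary."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    if size < 2:
        return 0.0
    depths = [path_length(point, rng.sample(data, size), rng) for _ in range(trees)]
    return 2 ** (-(sum(depths) / trees) / c_factor(size))


# Hypothetical features per user-day: (emails sent, files accessed)
DATA = [(10 + i % 5, 100 + (i * 7) % 11) for i in range(50)] + [(60, 400)]
```

Here `anomaly_score((60, 400), DATA)` is noticeably higher than `anomaly_score((12, 105), DATA)` because the first point is few and different: a single random split usually separates it from the rest.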

Addressing False Positives

The detection of anomalous user activity can result in a high rate of false positives [3,11,14]. Most anomalies are not malicious; therefore, anomaly detection will create many false alarms [7]. If analysts are currently sifting through logs and manually assessing insider threats, then machine-learning anomaly detection can act as a filter and reduce the manual work effort [5]. However, alarms generated from benign anomalous user behavior decrease the effectiveness of many UBA solutions and can increase the burden on security analysts to analyze the alarms [14]. The following sections describe proposed methods to help address the issue of false positives in anomaly detection for insider threats.

Alarm Filtering with Semi-supervised Learning Algorithms

Semi-supervised learning algorithms can be applied to filter the false alarms generated from anomaly detection [7,14]. Yang et al. applied time-based anomaly frequency degree analysis to a baseline anomaly detection [14]. The baseline anomaly detection used feature extractions and classifications to identify anomalous user behavior. The researchers then applied a scenario-driven, time-based alarm filter to distinguish between benign and malicious anomalies. The researchers applied the solution to four scenarios using the CMU CERT dataset.

The true positive rates before applying the alarm filters ranged from 0.90 to 0.95, indicating the baseline anomaly feature extraction and classification performed well [14]. The false positive rates ranged from 0.21 to 0.26. After applying the alarm filters, the false positive rates for each scenario dropped, with three scenarios having a false positive rate of 0.06 or less. The remaining scenario saw a less dramatic decrease in the false positive rate to 0.18. However, there is an expected decrease in true positives when decreasing false positives [9]. Three of the scenarios saw the true positive rate decrease by 0.03 or less, indicating an acceptable tradeoff between true positives and false positives [14]. However, the remaining scenario saw the true positive rate decrease from 0.90 to 0.75.
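
The tradeoff can be made concrete with a short worked example. The counts below are hypothetical and were chosen only so that the resulting rates fall within the ranges reported above; they are not taken from the study.

```python
def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)


def precision(tp, fp):
    """Share of raised alarms that are truly malicious."""
    return tp / (tp + fp)


# Hypothetical counts: 100 malicious and 1,000 benign anomalies.
BEFORE = dict(tp=90, fp=210, tn=790, fn=10)   # baseline anomaly detection
AFTER = dict(tp=88, fp=60, tn=940, fn=12)     # with the alarm filter applied

tpr_before, fpr_before = rates(**BEFORE)
tpr_after, fpr_after = rates(**AFTER)
alarms_before = BEFORE["tp"] + BEFORE["fp"]
alarms_after = AFTER["tp"] + AFTER["fp"]
```

With these counts, the filter lowers the false positive rate from 0.21 to 0.06 at the cost of a drop in the true positive rate from 0.90 to 0.88. The number of alarms an analyst must review falls from 300 to 148, and the share of alarms that are truly malicious roughly doubles.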

Incorporating Visual Analytics

Incorporating visual analytics with machine-learning approaches can alleviate data analytics challenges [15]. Visual analytics use interactive visual interfaces to support human reasoning [16]. Combining visual analytics and machine-learning allows for iterative improvements in machine-based decisions. Analysts can impart their domain expertise as feedback into the system, improving future decisions based on the expert analyst’s input [15]. With visual analytics, the system can reduce false positives by more accurately capturing and applying the human rationale for the decision.

Benefits of User Behavior Analytics

A machine-learning based UBA solution can perform the big data analysis and create the rules based on this analysis using statistical models and probability metrics [3]. The use of predictive analytics within these systems can also make security more proactive. Predicting threats from the data can provide continuous, real-time evaluation and detection [11].

Machine learning can significantly reduce the amount of human interaction needed to configure and maintain monitoring solutions but cannot eliminate the need for human interaction [3,6,17]. Machine learning uses statistical models to detect anomalous or potentially malicious behavior, but human interaction is still required to determine whether the UBA is correct and to avoid creating many false positives [3]. Another critical benefit of using machine learning in a UBA platform for insider threat detection is objectivity [3]. The data analytics behind a UBA platform will not pick favorites or decide not to report a security policy violation. A UBA platform will provide the same result given the same data set.

Challenges with User Behavior Analytics

The objectivity of a machine-learning based UBA that provides consistent results can also pose a challenge [3]. UBA systems and machine learning cannot discern user intention. A user who makes an error or was curious might appear to the UBA and machine learning algorithms the same as a malicious insider. Human analysts are still required to discern whether the alerts from the UBA signify user error, misconfigurations, or malicious activity. Two fundamental considerations of an insider threat detection program are who is responsible for decision-making and whether the system is proactive or reactive [15]. Organizations must determine when and if a human must be involved in the decision-making process. The degree to which the system is proactive can have significant ethical and legal considerations [15]. If the UBA system is proactive, then predictive analytics might be used to predict the probability that a user will conduct an attack. A company taking disciplinary action against an employee to thwart a possible attack before the user conducts the attack could raise ethical and legal concerns.

The lack of training data poses a significant challenge to machine-learning user behavior anomaly detection [1,5]. Very few studies use original real-world data for their analysis [7]. The lack of available real-world data impedes the building and testing of detection models. Legal, business, and litigation issues make organizations reluctant to share incident data related to an insider attack with the research community [2,7]. Without real-world data, researchers are often forced to use synthesized data or subsets of data points when researching insider threat detection [5]. Many past studies used simulated datasets, such as the CMU CERT dataset.


Conclusion

Current DLP solutions, basic security methods, and user awareness training can provide an effective solution to protect against some forms of inadvertent insider threat. Combining DLP with well-defined and enforced RBAC can limit the threat of data breaches. However, these solutions have several shortcomings. Most current DLP solutions are signature-based, requiring predefined search rules. Also, conventional methods are insufficient to protect against the intentional insider, who can circumvent perimeter defenses and content-based DLP solutions. Fortunately, recent research into context-based anomaly detection holds promise.

Companies can use UBA to detect anomalies indicative of insider threats. Machine-learning and big data analytics can enhance the ability of UBA solutions to detect anomalous user behavior. User behavior analytics can use machine-learning to detect anomalous behavior over time or in comparison to a user’s peers. Researchers have applied machine learning methods, including PCA, HMM, isolation forests, distance vector, SOM, and decision trees, to user anomaly detection. Research has shown that UBA and anomaly detection can identify insider threats within simulated or synthesized datasets.

However, building UBA solutions that are effective at detecting the intentional insider remains a challenge. Current anomaly detection methods often result in high false positive rates. Semi-supervised machine learning and data visualizations can assist in filtering false positives generated by baseline UBA detection. However, organizations must understand the tradeoff between true positives and false positives.

Applying UBA to detect insider threats can provide several benefits. Machine-learning can analyze the data and develop rules based on the data instead of relying on predefined, static rules. Developing static rules in traditional DLP solutions is labor-intensive. Machine-learning can significantly reduce the effort to configure and maintain threat detection systems. Also, a UBA solution brings objectivity and consistency to threat detection and prevention decisions. Whereas a human analyst might be pressured into playing favorites, a UBA algorithm will produce consistent results.

Significant challenges remain with UBA and insider threat detection. The lack of real-world training data is a limiting factor in insider threat research. Most past research has relied on simulated or synthesized data. Another challenge facing UBA relates to predictive analysis. A company taking proactive disciplinary action based on predictive insider threat detection could raise significant ethical and legal concerns. Due to the possible severity of disciplinary action against an insider, an organization may decide that a human must be involved in the decision-making. Organizations considering predictive detection should ensure their legal and human resources teams fully understand and agree to such use.

About the author: Donnie Wendt is an information security professional focused on designing and engineering security controls and monitoring solutions. Also, Donnie is an adjunct professor of cybersecurity at Utica College. Donnie is currently pursuing a Doctorate of Science in Computer Science with a research focus on security automation and orchestration.


References

[1] Cheng, L., Liu, F., & Yao, D. (2017). Enterprise data breach: Causes, challenges, prevention, and future directions. WIREs Data Mining and Knowledge Discovery, 7, 1-14. doi:10.1002/widm.1211

[2] Sanzgiri, A., & Dasgupta, D. (2016). Classification of insider threat detection techniques. Cyber & Information Security Research Conference. Oak Ridge, TN: ACM. doi:10.1145/2897795.2897799

[3] Graves, J. (2017). How machine learning is catching up with the insider threat. Cyber Security: A Peer-Reviewed Journal, 1(2), 127-133. Retrieved from

[4] Yuan, F., Cao, Y., Shang, Y., & Liu, Y. (2018). Insider threat detection with deep neural network. International Conference on Computational Science (pp. 43-54). Wuxi, China: Springer. doi:10.1007/978-3-319-93698-7_4

[5] Legg, P. A., Buckley, O., Goldsmith, M., & Creese, S. (2017). Automated insider threat detection system using user and role-based profile assessment. IEEE Systems Journal, 11(2), 503-512. doi:10.1109/JSYST.2015.2438442

[6] Le, D. C., & Zincir-Heywood, A. N. (2018). Evaluating insider threat detection workflow using supervised and unsupervised learning. 2018 IEEE Symposium on Security and Privacy Workshops (pp. 270-275). San Francisco, CA: IEEE. doi:10.1109/SPW.2018.00043

[7] Gheyas, I. A., & Abdallah, A. E. (2016). Detection and prediction of insider threats to cyber security: A systematic literature review and meta-analysis. Big Data Analytics, 1(6), 1-29. doi:10.1186/s41044-016-0006-0

[8] Lo, O., Buchanan, W. J., Griffiths, P., & Macfarlane, R. (2018). Distance measurement methods for improved insider threat detection. Security and Communication Networks, 1-18. doi:10.1155/2018/5906368

[9] Gibson, D., Stewart, J. M., & Chapple, M. (2018). (ISC)2 Certified Information Systems Security Professional: Official Study Guide. Indianapolis, IN: John Wiley & Sons.

[10] Agrafiotis, I. N., Buckley, O., Legg, P., Creese, S., & Goldsmith, M. (2015). Identifying attack patterns for insider threat detection. Computer Fraud & Security, 9-17. doi:10.1016/S1361-3723(15)30066-X

[11] Gavai, G., Sricharan, K., Gunning, D., Hanley, J., Singhal, M., & Rolleston, R. (2015). Supervised and unsupervised methods to detect insider threat from enterprise social and online activity data. Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications, 47-63. doi:10.22667/JOWUA.2015.12.31.047

[12] Green, S. B., & Salkind, N. J. (2017). Using SPSS for Windows and Macintosh. NY: Pearson.

[13] Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1). doi:10.1145/2133360.2133363

[14] Yang, G., Cai, L., Yu, A., & Meng, D. (2018). A general and expandable insider threat detection system using baseline anomaly detection and scenario-driven alarm filters. 7th IEEE International Conference On Trust, Security And Privacy In Computing And Communications (pp. 763-773). New York, NY: IEEE. doi:10.1109/TrustCom/BigDataSE.2018.00110

[15] Legg, P. A. (2017). Human-machine decision support systems for insider threat detection. In I. P. Carrascosa, H. K. Kalutarage, & Y. Huang (Eds.), Data Analytics and Decision Support for Cybersecurity (pp. 33-54). Cham, Switzerland: Springer International Publishing.

[16] Connolly, T. M., & Begg, C. E. (2015). Database systems: A practical approach to design, implementation, and management. Boston, MA: Pearson.

[17] Mayhew, M. J., Atighetchi, M., & Greenstadt, R. (2015). Use of machine learning in big data analytics for insider threat detection. (pp. 1-9). Air Force Research Laboratory. doi:10.1109/MILCOM.2015.7357562