Unlocking insights from cyber security data science

By Rennie Naidoo, Professor of Information Systems, Wits School of Business Sciences.

Johannesburg, 26 Apr 2024

Although data is proliferating at exponential rates in the digital age, knowledge remains elusive. Recent advances in data science and big data analytics hold considerable promise for discovering innovative ways to combat cyber crime, yet cyber security units and their data science teams are inundated with vast amounts of data from a variety of sources that they must use to monitor and respond to cyber incidents and threats.

The presence of ‘noise’ in this information can significantly impair the quality of analysis and lead to misleading conclusions if not addressed properly. The challenges posed by noise in these datasets are compounded by the detection of spurious relationships in data-driven models that lack a solid scientific foundation.

Despite rapid advances in computing power, databases, statistical methods and algorithms capable of processing and analysing large volumes of both structured and unstructured data, some commentators have been premature in declaring the ‘end of theory’ and predicting ‘the death of the scientific method’.

Synergising theory and big data

The concept of theory-guided data science suggests a synergistic approach that combines scientific knowledge with data science, bridging the gap between data scientists and domain experts.

This approach emphasises the importance of leveraging existing scientific concepts, models, measures and hypotheses to interpret data, rather than solely relying on potentially misleading correlations.

At the same time, it's acknowledged that theory-based models can sometimes make assumptions that oversimplify complex phenomena. Nevertheless, theory-guided data science utilises the unparalleled ability of data science models to autonomously learn patterns from large datasets without disregarding the valuable accumulation of scientific knowledge.

This highlights the need for more effective interdisciplinary collaborations between data scientists and domain experts, as well as the combination of theory-based and data science models to enhance the utility of insights derived from complex cyber crime intelligence datasets.

There is a significant opportunity to apply both theory-based models and data science models to cyber crime intelligence datasets in order to detect, prevent and respond to cyber threats and crimes more effectively.

Given the flood of cyber crime data and the high potential for noise in these large, diverse and complex datasets, relying solely on traditional data science and big data analytics may not suffice to provide reliable insights for decision-making.

Scam data, for instance, captures social engineering techniques in which individuals are manipulated into divulging sensitive information. It also includes a considerable amount of noise: content that does not contribute to the success of scams but makes up a significant portion of the overall dataset.

Harnessing theory-guided insights

As a component of theory-guided data science, theory-guided feature selection involves using theoretical advancements from the problem domain to select relevant features for models. By tapping into existing theories, concepts, principles and expert insights, behavioural scientists and data scientists can collaborate to apply theory-guided feature selection to minimise the noise in complex cyber security datasets.

For example, behavioural scientists can utilise theories and principles to identify relevant features in scam content, but pinpointing the pertinent features from vast and diverse cyber crime datasets to develop more effective countermeasures remains a substantial challenge.

In a recent study, we applied social influence concepts as guiding features for data analysis to provide insights into the psychological tactics used by cyber criminals within a noisy cyber security dataset. Our pilot study on intelligence data feeds provided by a global fraud and cyber crime tracking firm shows that combining features developed by behavioural scientists with data science techniques can yield high-quality and actionable insights from noisy scam datasets.

We find that features based on the social influence model can enhance the value and interpretability of our cyber crime intelligence dataset. Cialdini’s social influence principles, rooted in the psychology of compliance, predict that compliance professionals will use tactics such as authority, consistency, liking, scarcity, reciprocity and social proof to trick their targets.
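To make the idea of theory-guided features concrete, here is a minimal sketch of how scam text might be tagged with the six social influence principles. The cue words and the function name principle_features are hypothetical and purely illustrative; they are not the lexicon or the method used in the study.

```python
import re

# Hypothetical cue phrases per principle; illustrative only,
# not the lexicon used in the study.
PRINCIPLE_CUES = {
    "authority": ["bank", "official", "government", "police"],
    "consistency": ["as agreed", "you promised", "confirm your"],
    "liking": ["friend", "dear", "like you"],
    "scarcity": ["limited", "only today", "last chance", "expires"],
    "reciprocity": ["free gift", "in return", "bonus"],
    "social_proof": ["thousands of", "everyone", "others have"],
}


def principle_features(text):
    """Return a 0/1 flag per principle: 1 if any cue phrase appears in the text."""
    lowered = text.lower()
    features = {}
    for principle, cues in PRINCIPLE_CUES.items():
        # Word boundaries stop "bank" from matching inside "embankment".
        hit = any(
            re.search(r"\b" + re.escape(cue) + r"\b", lowered) for cue in cues
        )
        features[principle] = int(hit)
    return features


message = "Official notice from your bank: this offer expires today, last chance!"
print(principle_features(message))
```

In practice, behavioural scientists would define and validate such cues, and a simple keyword match like this one would probably be too crude for real intelligence feeds. The point is only that each principle becomes a named, interpretable column in the dataset.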

For instance, in analysing our dataset, we observed a notable clustering of the scarcity and authority principles. This pattern implies that cyber criminals might be inadvertently or deliberately applying these complex psychological principles, exploiting victims’ perceptions of rare opportunities (scarcity) and manipulating their trust in certain institutions and personalities (authority).
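One simple way to look for this kind of clustering is to count how often pairs of principles appear together in the same scam record. The sketch below assumes each record has already been tagged with the principles it uses; the records and the helper pair_counts are made up for illustration and do not reproduce the study's data or analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged records: each set lists the principles
# flagged in one scam message. Not real data.
tagged_scams = [
    {"scarcity", "authority"},
    {"scarcity", "authority", "social_proof"},
    {"liking"},
    {"authority"},
    {"scarcity", "authority"},
]


def pair_counts(records):
    """Count how many records contain each unordered pair of principles."""
    counts = Counter()
    for principles in records:
        # Sorting gives every pair one canonical order, so
        # (authority, scarcity) and (scarcity, authority) are counted together.
        for pair in combinations(sorted(principles), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(tagged_scams)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Raw co-occurrence counts favour principles that are common overall, so an analyst would likely compare them against how often each principle appears on its own before reading much into a pairing.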

Moreover, we found that cyber criminals leveraging social networking site scams concentrate their efforts on seeming familiar and similar to potential victims to appear as friends, exploiting the principle of liking. These insights are invaluable for strengthening organisations' security posture.

For instance, our theory-guided feature selection can significantly contribute to offender behaviour analytics by enabling analysts to anticipate the types of social engineering activities cyber criminals are likely to employ based on emerging threat patterns, thereby improving cyber threat responses.

Insights using a theory-guided analysis

By focusing on the most significant features at any given time, cyber security experts can devise more effective methods to spot unusual activities and initiate prompt incident responses. Additionally, the insights gained from feature selection can inform the design of cyber security awareness training programmes, making them more relevant and directly applicable to the latest challenges faced by individuals and organisations.

Our findings, for example, suggest that for the scams mentioned above, awareness initiatives could focus on the interplay between scarcity and authority, educating users on common scenarios where these tactics are used, thereby empowering them to recognise and resist manipulative social engineering schemes.

In offender behaviour analytics, feature selection methods can also help identify the social engineering modus operandi that is relevant to cyber crime victimisation for different scam types. We also propose a robust collaboration framework outlining a systematic approach to theory-guided feature selection in cyber crime dataset analysis.

This framework fosters collaboration between domain experts and data scientists, ensuring a comprehensive examination of cyber crime data. Moreover, we propose a phased approach to feature-selected data analysis, including dataset preparation, exploratory analysis to uncover key patterns, and the application of automated selection techniques for model training and evaluation.
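The three phases can be sketched in a few lines of code. This is a toy illustration under stated assumptions: the records, the scam_type label and the choice of mutual information as the automated selection technique are all hypothetical stand-ins, since the specific techniques are not named here.

```python
import math
from collections import Counter

FEATURES = ["scarcity", "authority", "liking"]

# Hypothetical records: theory-guided 0/1 features plus a scam-type label.
records = [
    {"scarcity": 1, "authority": 1, "liking": 0, "scam_type": "investment"},
    {"scarcity": 1, "authority": 1, "liking": 0, "scam_type": "investment"},
    {"scarcity": 0, "authority": 0, "liking": 1, "scam_type": "romance"},
    {"scarcity": 0, "authority": 1, "liking": 1, "scam_type": "romance"},
    {"scarcity": 1, "authority": 0, "liking": 0, "scam_type": None},
]


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Phase 1: dataset preparation (here, just drop unlabelled records).
clean = [r for r in records if r["scam_type"] is not None]

# Phase 2: exploratory analysis (share of records flagged for each feature).
prevalence = {f: sum(r[f] for r in clean) / len(clean) for f in FEATURES}

# Phase 3: automated selection (rank features by mutual information
# with the label and keep the top two).
labels = [r["scam_type"] for r in clean]
scores = {f: mutual_information([r[f] for r in clean], labels) for f in FEATURES}
selected = sorted(FEATURES, key=lambda f: -scores[f])[:2]

print(prevalence)
print(scores)
print(selected)
```

The selected features would then feed model training and evaluation. With only four labelled records the scores mean nothing statistically; the sketch shows the order of the phases, not a realistic result.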

The enduring power of theory in a data-obsessed world

As evidenced in other fields, such as mineralogy and materials science, we believe theory-guided feature selection will enhance the accuracy and efficacy of machine learning models for cyber security data science.

Models trained on theory-guided features are generally expected to outperform traditional data science models when dealing with complex cyber intelligence datasets. Finally, our findings highlight the importance of adopting multifaceted countermeasures in cyber security that integrate technical, psychological and behavioural defences.

Our approach also underscores the importance of fostering collaboration between behavioural scientists and data scientists in cyber security data science teams to enhance data-driven decision-making processes.

We hope more data science teams will explore theory-guided data science and theory-guided feature selection to fruitfully harvest the ever-expanding, increasingly complex and noisy cyber crime and cyber security datasets.

* Based on a conference paper with graduate student Shiven Naidoo, presented at the 19th International Conference on Cyber Warfare and Security (ICCWS 2024), hosted at the University of Johannesburg.

How behavioural science theories remain the key to shaping our cyber security management decisions.
