Predicting Cyber Crimes using Confusion Matrix in Classification

Jun 6, 2021

So, first of all, let’s understand:

What is Cyber Crime?

Cybercrime is criminal activity that either targets or uses a computer, a computer network or a networked device. Most, but not all, cybercrime is committed by cybercriminals or hackers who want to make money. Cybercrime is carried out by individuals or organizations. Some cybercriminals are organized, use advanced techniques and are highly technically skilled. Others are novice hackers.

Rarely, cybercrime aims to damage computers for reasons other than profit. These could be political or personal.

Examples of the different types of cybercrime:

Email and internet fraud.
Identity fraud.
Theft of financial or card payment data.
Theft and sale of corporate data.
Cyberextortion (demanding money to prevent a threatened attack).
Ransomware attacks (a type of cyberextortion).
Cryptojacking (where hackers mine cryptocurrency using resources they do not own).
Cyberespionage (where hackers access government or company data).

Most cybercrime falls under two main categories:

Criminal activity that targets computers.
Criminal activity that uses computers to commit other crimes.

What is a Confusion Matrix?

A confusion matrix is a summary that compares the predicted results with the actual results in a classification problem. This comparison is essential for judging how well a model performs after it has been trained on some training data.

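For a binary classification use case, a confusion matrix is a 2×2 matrix: one axis holds the actual classes, the other holds the predicted classes, and each of the four cells counts one combination of the two.

Let’s understand the four terms in that matrix one by one: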

1. TP (True Positive): The machine predicted that a cyber-attack happened, and this is right: an attack actually happened.

2. TN (True Negative): The machine predicted that a cyber-attack hasn’t happened, and this is right: no attack actually happened.

3. FP (False Positive): The machine predicted that an attack happened, and this is wrong: no cyber-attack actually happened. FP is also called a Type 1 error.

4. FN (False Negative): The machine predicted that an attack hasn’t happened, and this is wrong: a cyber-attack actually happened. FN is also called a Type 2 error.
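As a minimal sketch of how these four counts could be tallied, the snippet below treats "attack" as the positive class (1) and "no attack" as the negative class (0). The function name `confusion_counts` and the ten labels are made up for illustration; they are not taken from any real dataset.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = attack, 0 = no attack)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # predicted attack, attack happened
        elif p == 0 and a == 0:
            tn += 1  # predicted no attack, no attack happened
        elif p == 1 and a == 0:
            fp += 1  # Type 1 error: false alarm
        else:
            fn += 1  # Type 2 error: missed attack
    return tp, tn, fp, fn


# Hypothetical labels for ten network events (illustrative only).
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=4 FP=1 FN=2
```

The four counts always add up to the number of events, which is a quick sanity check on any confusion matrix.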

A confusion matrix exposes two types of errors:

From the definitions above, we can see that False Positive (FP) and False Negative (FN) are the two errors in the matrix. They are also called Type 1 and Type 2 errors, respectively.

From our confusion matrix, we can calculate five different metrics measuring the validity of our model.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Misclassification = (FP + FN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity aka Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
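To make the five formulas concrete, here is a small sketch that evaluates them for one set of counts. The function name `classification_metrics` and the counts (TP=3, TN=4, FP=1, FN=2) are hypothetical, and the code assumes every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics from confusion-matrix counts.

    Assumes each denominator is non-zero (e.g. at least one positive prediction).
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=3, tn=4, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts, accuracy is 7/10 = 0.70 and misclassification is 3/10 = 0.30, so the two always sum to 1. Recall is only 3/5 = 0.60, which shows how a model can look reasonably accurate while still missing a large share of real attacks.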

Why Do We Actually Need ML for Predicting Cyber Crimes in Today’s Era?

Internet usage has grown rapidly, particularly in the last decade. However, as the Internet becomes part of day-to-day activities, cybercrime is also on the rise. According to a 2020 Cybersecurity Ventures report, cybercrime will cost nearly $6 trillion per annum by 2021. Cybercriminals use networked computing devices as their primary means of reaching victims’ devices, and they exploit system vulnerabilities to gain money, publicity and other benefits. Cybercrimes are increasing steadily.

Evaluating cybercrime attacks and providing protective measures manually, using existing technical approaches and investigations, has often failed to control these attacks. Existing literature on cybercrime offenses lacks computational methods for predicting cybercrime, especially on unstructured data. Therefore, this study proposes a flexible computational tool that uses machine learning techniques to analyze state-wise cybercrime rates in a country and to classify cybercrimes. Security analytics combined with data analytics approaches helps analyze and classify offenses from India-based integrated data, which may be either structured or unstructured. The main strength of this work is its test analysis reports, which show the offenses classified with 99 percent accuracy.

At present, no generalized framework is available to categorize cybercrime offenses by extracting features from the cases. In the present work, data analysis and machine learning are combined to build a cybercrime detection and analytics system. The proposed system’s design and implementation use classification, clustering and supervised learning algorithms. Here, naive Bayes is used for classification and k-means is used for clustering. For feature extraction, the TF-IDF (term frequency-inverse document frequency) vectorization process is used. The methodology applies four phases to the data: reconnaissance, preprocessing, data clustering and classification, and prediction analysis.
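The study’s own implementation is not shown here, so the following is only a rough sketch of the TF-IDF feature extraction step, written with the Python standard library. It uses one common textbook form of the weighting (term frequency times log of inverse document frequency); real libraries often apply extra smoothing and normalization. The function name `tf_idf` and the three case descriptions are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical case descriptions (not real case data).
cases = [
    "phishing email stole bank password",
    "ransomware encrypted hospital files",
    "phishing email requested card details",
]

vectors = tf_idf(cases)
print(vectors[0])
```

Terms that appear in only one case, such as "stole", receive a higher weight than terms shared across cases, such as "phishing". Vectors like these are the kind of input that a classifier such as naive Bayes or a clustering algorithm such as k-means would then work on.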