Confusion Matrix and IDS and IPS

Suyash garg
Jun 6, 2021

Introduction

Today the internet has become part of our daily life. We use it for everyday tasks such as shopping, entertainment, finding services, travelling and much more. The best thing about it is that no single company owns the whole internet; it is managed by many different people.

But there is one big problem when nobody owns it: nobody is there to protect us from dangerous and harmful things such as viruses, scams and fake news. Every individual has to take responsibility for their own security. A service provider, however, cannot tell its customers "come to our site, and if you are attacked by a virus we are not responsible". If someone said that, we would never visit the site again and their business would flop. So the solution is simple: a company has to secure its website against attacks and cyber crime.

There are many methods by which a company can secure its website, such as firewalls, honeypots, SSL/TLS, encryption and certificates. In this article I am going to talk about two of them, IDS/IPS (intrusion detection system/intrusion prevention system), and how organizations and security researchers use the confusion matrix to make them more effective.

IDS/IPS

The way I write it may make IDS/IPS look like one thing, but they are two different products. They are almost always used together, so nowadays it has become very common to write them like this.

So what is an IDS? In simple words, it is a tool that detects an intrusion (an attack, a breach, malicious traffic or anything else that is bad for us) and then writes a log entry or raises an alarm. An IPS is like an IDS, but it also blocks the traffic, based on its settings.
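The difference can be sketched in a few lines of Python. This is only an illustration: `handle_connection`, the `mode` argument and the action strings are hypothetical names, not the interface of any real IDS/IPS product.

```python
def handle_connection(is_intrusion: bool, mode: str) -> list:
    """Return the actions taken for one connection (hypothetical sketch).

    mode is "ids" (detect only) or "ips" (detect and block).
    """
    actions = []
    if is_intrusion:
        # Both an IDS and an IPS record the event and raise an alert.
        actions.append("log")
        actions.append("alert")
        if mode == "ips":
            # Only an IPS actively stops the traffic.
            actions.append("block")
    else:
        actions.append("allow")
    return actions


print(handle_connection(True, "ids"))  # ['log', 'alert']
print(handle_connection(True, "ips"))  # ['log', 'alert', 'block']
```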

Looking at the definition it seems very simple, but that is not true. In today's world there is no single definition of an intrusion or of where it comes from; anyone and anything can be an intrusion. The number of hits can go up to 1000 per second, and for sites like Amazon and Facebook it is in the millions, so we cannot rely on humans to analyse each and every connection in less than one second. We have to use tools like IDS/IPS.

To solve this, organizations started using ML (machine learning) to train the IDS/IPS and make it more effective. But there is still a problem with this approach: when we create a model we can never say it is 100% right. There is a saying in the ML world: if your model is 100% accurate, either you have become a god or the model you created is wrong. Nothing in the world is 100% accurate, so we have to test the model and rebuild it again and again to get it close to 100%. The confusion matrix is one of the techniques we use to test the results of an IDS/IPS.

Confusion Matrix

Don't worry, it is not as confusing as its name. So what is a confusion matrix (CM)? To understand it in simple words, go back to the IDS/IPS example above: in the end it generates only two kinds of output, yes (intrusion) and no (not an intrusion). In the ML world this kind of output is known as binary classification.

Binary classification is a type of ML model that produces only two possible answers. Typically the model gives a score between 0 and 1, and we convert that score into true or false.

Generally true means a positive or good result and false means a negative or bad result, but this is up to whoever created the matrix and can be the opposite.
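As a small sketch of that conversion: the function name `to_label` and the threshold of 0.5 are assumptions made for illustration; a real system would tune the threshold for its own traffic.

```python
def to_label(score: float, threshold: float = 0.5) -> bool:
    """Turn a model score in [0, 1] into a yes/no answer.

    True means "intrusion" here. The 0.5 threshold is an assumed default.
    """
    return score >= threshold


scores = [0.05, 0.40, 0.73, 0.98]  # made-up model outputs
labels = [to_label(s) for s in scores]
print(labels)  # [False, False, True, True]
```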

Confusion Matrix for IDS/IPS

So now let's understand the result above in more detail:

  • A confusion matrix for binary classification has 2 rows and 2 columns, which means it is a 2 × 2 matrix with 4 cells
  • In our example the top left corner is true negative (TN)
  • bottom left is false negative (FN)
  • top right is false positive (FP)
  • and bottom right is true positive (TP)
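The layout above can be reproduced by counting labels by hand. In this sketch the labels 0 (legitimate) and 1 (intrusion) and the sample data are assumptions; rows are the actual class and columns are the predicted class, which gives the same corner positions as the list above.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes into [[TN, FP], [FN, TP]].

    Labels: 0 = legitimate, 1 = intrusion (assumed encoding).
    Row index = actual class, column index = predicted class.
    """
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Made-up labels for ten connections.
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
print(cm)  # [[5, 1], [1, 3]]
```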

Remember, as I said, the confusion matrix is used to test the results of binary classification, which comes under supervised learning. This means we divide the dataset into two parts, training and testing. The training part is used to train (create) the model and the testing part is used to test it.
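A minimal sketch of such a split using only the standard library. The helper name `split_dataset`, the 25% test share and the fixed seed are assumptions for illustration, not a recommendation.

```python
import random


def split_dataset(records, test_fraction=0.25, seed=42):
    """Shuffle the records and return (training_part, testing_part)."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for 100 labelled connections
train, test = split_dataset(records)
print(len(train), len(test))  # 75 25
```

The model only ever sees `train`; the confusion matrix is then built from its answers on `test`.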

Now let's dive deeper into the confusion matrix:

  • true negative (TN): the answer is no and our model said no, meaning the connection is legitimate and the model said legitimate
  • true positive (TP): the answer is yes and the model said yes, meaning the connection is an intrusion and the model said intrusion
  • false positive (FP): the answer is no but the model said yes, meaning the connection is legitimate and the model said intrusion; also known as a type I error
  • false negative (FN): the answer is yes but the model said no, meaning the connection is an intrusion and the model said legitimate; also known as a type II error
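The four cells can also be written as a tiny lookup, using the standard definitions (a false positive is a legitimate connection flagged as an intrusion, a false negative is a missed intrusion). The function name `outcome` is just an illustrative choice.

```python
def outcome(actual_intrusion: bool, predicted_intrusion: bool) -> str:
    """Name the confusion matrix cell for a single connection."""
    if predicted_intrusion:
        # The model raised an alarm: correct (TP) or a false alarm (FP).
        return "TP" if actual_intrusion else "FP"
    # The model stayed quiet: a missed attack (FN) or correctly quiet (TN).
    return "FN" if actual_intrusion else "TN"


print(outcome(actual_intrusion=False, predicted_intrusion=True))  # FP
print(outcome(actual_intrusion=True, predicted_intrusion=False))  # FN
```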

Looking at the matrix alone it is hard to draw conclusions. To make things clearer, let's put the results into some formulas and analyse them one by one.

Remember that the meaning of each formula changes as the use case changes; maybe this is the reason it is known as a confusion matrix.

The result of every formula always lies between zero and one, which is why each one is usually multiplied by 100 to give the answer as a percentage.

Sensitivity

Sensitivity means the number of intrusions the model caught out of all the actual intrusions. In my case a higher sensitivity is better, but a value that is too high is suspicious, because there are only two possible cases:

  • The model is actually very good and is giving the right answers
  • or the dataset used for training or testing the model is not balanced

Sensitivity is also known as recall
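In terms of the matrix cells, the standard formula is TP / (TP + FN). A minimal sketch with made-up counts (the numbers are assumptions, not measurements from a real IDS/IPS):

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of actual intrusions that the model flagged: TP / (TP + FN).

    Assumes the test set contains at least one intrusion.
    """
    return tp / (tp + fn)


# Made-up counts: 50 real intrusions, 40 caught and 10 missed.
print(f"{sensitivity(tp=40, fn=10):.0%}")  # 80%
```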

Specificity

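Specificity means the number of legitimate connections the model correctly let through out of all the legitimate connections. Just like sensitivity, a higher value is better, but a value that is too high is suspicious, because there are only two possible cases:

  • The model is actually very good and is giving the right answers
  • or the dataset used for training or testing the model is not balanced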
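The standard formula for specificity is TN / (TN + FP). A minimal sketch, again with made-up counts:

```python
def specificity(tn: int, fp: int) -> float:
    """Share of legitimate connections the model left alone: TN / (TN + FP).

    Assumes the test set contains at least one legitimate connection.
    """
    return tn / (tn + fp)


# Made-up counts: 950 legitimate connections, 945 passed and 5 wrongly flagged.
print(f"{specificity(tn=945, fp=5):.2%}")  # 99.47%
```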

Precision

Precision means the number of correct positive results out of all the results the model predicted as positive.
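Written with the matrix cells, precision is TP / (TP + FP). A short sketch with made-up counts:

```python
def precision(tp: int, fp: int) -> float:
    """Share of raised alarms that were real intrusions: TP / (TP + FP).

    Assumes the model flagged at least one connection.
    """
    return tp / (tp + fp)


# Made-up counts: 45 alarms raised, 40 real intrusions and 5 false alarms.
print(f"{precision(tp=40, fp=5):.2%}")  # 88.89%
```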

Negative Predictive Value

It means the number of correct negative results out of all the results the model predicted as negative. It has no common short name because we generally do not calculate it; the values above are usually more than enough to give us a good idea.
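For completeness, the standard formula is TN / (TN + FN). A short sketch with made-up counts:

```python
def negative_predictive_value(tn: int, fn: int) -> float:
    """Share of "legitimate" verdicts that were correct: TN / (TN + FN).

    Assumes the model passed at least one connection as legitimate.
    """
    return tn / (tn + fn)


# Made-up counts: 955 connections passed, 945 truly legitimate, 10 missed attacks.
print(f"{negative_predictive_value(tn=945, fn=10):.2%}")  # 98.95%
```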

Accuracy

Accuracy means the number of right answers out of all the predictions. From the name it seems clear that high accuracy is good, but that is not always the case. If we have very high accuracy there may be two cases:

  • an accuracy of 0.9, or 90% (if we multiply by 100), may mean that nearly 90% of the data belongs to a single class, either intrusion or legitimate, or that the model really is very accurate
  • and if the accuracy is much lower, say 60%, the dataset is probably closer to 50% intrusion and 50% legitimate

So if the accuracy is 90%, there is a good chance that our model is either very right or very wrong, depending on the dataset.
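A quick sketch shows how an unbalanced dataset can produce a high accuracy with a useless model. The formula (TP + TN) / (TP + TN + FP + FN) is the standard one; the dataset of 900 legitimate connections and 100 intrusions is made up for the example.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that were right."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Share of actual intrusions that were caught: TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical test set: 900 legitimate connections and 100 intrusions.
# A "model" that labels everything as legitimate produces these counts:
tp, fn = 0, 100   # every intrusion is missed
tn, fp = 900, 0   # every legitimate connection is passed

print(accuracy(tp, tn, fp, fn))  # 0.9
print(recall(tp, fn))            # 0.0
```

The accuracy looks excellent, yet not a single intrusion was detected, which is why accuracy should be read together with the other values.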

As a rough guide, an accuracy between 60–70% is generally considered good here, although the right target depends on the dataset.

Why use accuracy at all? It is a good way to check how balanced the dataset is. A balanced dataset gives good results, but in binary classification accuracy can also give a false sense of failing.

Can't we just check the dataset before creating the model? Yes, but what happens if the data arrives in real time and the model is retrained as new data comes in? In that case no, because a manual check takes time, attackers come up with new methods every second, and we do not have time to waste.

F-measure

If you have read this far you might have noticed that I didn't explain precision much. That is simply because the value of precision alone is not enough to give a good result; we have to combine it with something else. Why?

Well, if we have low precision and high recall, it means the positive examples are correctly recognized but there are a lot of false positives. Our model is generating too many logs and blocking many legitimate connections.

And if we have low recall and high precision, it shows that we miss a lot of positive examples, but those we predict as positive are indeed positive. This means we are missing most of the intrusions.

To solve this we introduce a new formula known as the F-measure.

The F-measure always lies closer to the lower of recall and precision, so a high value means that both of them are good.

Generally the F-measure is used instead of accuracy to judge how good the model actually is.
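A minimal sketch, assuming the F-measure meant here is the common F1 score, that is the harmonic mean of precision and recall, 2 × P × R / (P + R). The input values are made up.

```python
def f_measure(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # both are zero, so the combined score is zero as well
    return 2 * precision * recall / (precision + recall)


# Low precision, high recall: lots of false alarms.
p, r = 0.2, 0.9
print(round(f_measure(p, r), 3))  # 0.327
print(round((p + r) / 2, 3))      # 0.55
```

The plain average hides the weak precision, while the F-measure stays close to the lower of the two values.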

Conclusion

In this article I tried to explain what a confusion matrix is and how we can use its different formulas to evaluate predictions. There are many other ways to check whether the answers are right, but they are either very time consuming or very CPU intensive, and not always possible in a fast-moving world.

The confusion matrix, on the other hand, can easily be calculated on the most basic calculator in the world in just a few seconds.

The only problem with the confusion matrix is that the meaning of the results changes with the dataset and with the person who created it, so whoever creates a confusion matrix has to document all the results carefully and with proper explanation.

Contact Details

LinkedIn [https://www.linkedin.com/in/suyash-garg-50245b1b7]

Additional Tags

#worldrecordholder #training #internship #makingindiafutureready #summer #summertraining #python #machinelearning #docker #rightmentor #deepknowledge #linuxworld #vimaldaga #righteducation
