Intrusion Detection
Course on Network Softwarization
Machine Learning for Networking
University of Rome “Tor Vergata”
Lorenzo Bracciale
Flow classification problem
We consider an Intrusion Detection Evaluation Dataset, which we download from the NetML Competition repository.
In this dataset we have several network flows, each one associated with several features, such as the source port or the number of bytes transmitted. These features are automatically derived from a traffic analysis conducted with a tool called Joy.
Each flow is then labelled as either “benign” or “malware”.
We want to learn this classification, so that we can tell whether a flow is “malware” just by analysing its features.
To run this notebook you need to:
- clone the NetML Competition repo in this directory (a quick path check is sketched right below)
- install the usual packages for data analysis (pandas, numpy, sklearn, matplotlib, seaborn)
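Before anything else, we can verify that the repository was cloned where the notebook expects it. This is just a convenience sketch; the path is the same one used in the cells below.
import os

# optional sanity check: make sure the cloned repo is in place before
# running the rest of the notebook
dataset_dir = "./NetML-Competition2020/data/CICIDS2017"
if not os.path.isdir(dataset_dir):
    raise FileNotFoundError(
        f"{dataset_dir} not found: clone the NetML-Competition2020 "
        "repository into this directory first")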
import pandas as pd
import numpy as np
# reading the train dataset
train_filepath = "./NetML-Competition2020/data/CICIDS2017/2_training_set/2_training_set.json.gz"
train = pd.read_json(train_filepath, lines=True)
train.head()
 | src_port | pld_distinct | bytes_out | hdr_mean | num_pkts_out | pld_ccnt | pld_mean | rev_hdr_distinct | hdr_bin_40 | pr | ... | tls_len | tls_svr_len | tls_cs_cnt | tls_ext_cnt | http_host | http_method | http_code | http_uri | http_content_len | http_content_type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56565 | 1 | 58 | 8.0 | 2 | [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | 29.00 | 1 | 0 | 17 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 52995 | 1 | 148 | 8.0 | 4 | [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | 37.00 | 1 | 0 | 17 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 40805 | 1 | 70 | 8.0 | 2 | [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | 35.00 | 1 | 0 | 17 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 31833 | 2 | 146 | 8.0 | 4 | [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | 36.50 | 1 | 0 | 17 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 64062 | 7 | 3272 | 32.8 | 15 | [11, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2] | 218.13 | 3 | 14 | 6 | ... | [112, 262, 1, 48, 1376, 1600] | [80, 1, 48, 4912, 32] | 19.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 62 columns
# reading the annotation dataset
anno = pd.read_json("./NetML-Competition2020/data/CICIDS2017/2_training_annotations/2_training_anno_top.json.gz",
                    typ="series")
anno.head()
3145728 benign
8388611 malware
1048583 malware
3145736 malware
1048585 malware
dtype: object
Preparing the dataset
The dataset cannot be used “as is”, since many columns contain lists (e.g. tls_len).
This is because it comes from a JSON with nested fields, while we need a flat structure to pass to our classifier.
To make things worse, some of the lists have variable lengths and can be interleaved with NaN values.
Let’s transform the dataset, expanding the elements contained in these nested lists into new columns.
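To make the transformation concrete, here is what it does on a tiny made-up column (the values are purely illustrative):
# toy example: a column of (possibly missing) lists becomes one column
# per list position, padded with NaN where a list is short or absent
toy = pd.DataFrame({"tls_len": [[112, 262], None, [80]]})
toy["tls_len"] = toy["tls_len"].apply(lambda d: d if isinstance(d, list) else [])
print(pd.DataFrame(toy["tls_len"].to_list()).add_prefix("tls_len_"))
#    tls_len_0  tls_len_1
# 0      112.0      262.0
# 1        NaN        NaN
# 2       80.0        NaN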
# first we list the columns with the "list problem"
# (note: dtype 'object' also covers plain string columns such as sa or da,
# which the expansion below will simply drop)
col_list = []
for col in train.columns:
    if str(train[col].dtype) == 'object':
        col_list.append(col)
print(f"Columns to clean: {col_list}")
Columns to clean: ['pld_ccnt', 'ack_psh_rst_syn_fin_cnt', 'dns_query_class', 'sa', 'rev_pld_ccnt', 'dns_query_type', 'rev_hdr_ccnt', 'dns_answer_ttl', 'rev_intervals_ccnt', 'hdr_ccnt', 'dns_answer_ip', 'da', 'intervals_ccnt', 'rev_ack_psh_rst_syn_fin_cnt', 'dns_query_name', 'dns_query_name_len', 'tls_ext_types', 'tls_svr_ext_types', 'tls_svr_cs', 'tls_cs', 'tls_len', 'tls_svr_len', 'http_host', 'http_uri', 'http_content_type']
# then we replace NaN with an empty list
for col in col_list:
    train[col] = train[col].apply(lambda d: d if isinstance(d, list) else [])
# now all the list columns always contain lists
# finally we expand the original training set with the new columns taken from the list values
from tqdm import tqdm  # for a progress bar, since this can take a while

for col in tqdm(col_list):
    # one new column per list position, e.g. tls_len_0, tls_len_1, ...
    df = pd.DataFrame(train[col].to_list()).add_prefix(f'{col}_')
    train = train.join(df)
    train.drop(col, axis=1, inplace=True)
# now it's all set
train.head()
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:02<00:00, 4.91s/it]
 | src_port | pld_distinct | bytes_out | hdr_mean | num_pkts_out | pld_mean | rev_hdr_distinct | hdr_bin_40 | pr | rev_hdr_bin_40 | ... | tls_svr_len_171 | tls_svr_len_172 | tls_svr_len_173 | tls_svr_len_174 | tls_svr_len_175 | tls_svr_len_176 | tls_svr_len_177 | tls_svr_len_178 | tls_svr_len_179 | tls_svr_len_180
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56565 | 1 | 58 | 8.0 | 2 | 29.00 | 1 | 0 | 17 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 52995 | 1 | 148 | 8.0 | 4 | 37.00 | 1 | 0 | 17 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 40805 | 1 | 70 | 8.0 | 2 | 35.00 | 1 | 0 | 17 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 31833 | 2 | 146 | 8.0 | 4 | 36.50 | 1 | 0 | 17 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 64062 | 7 | 3272 | 32.8 | 15 | 218.13 | 3 | 14 | 6 | 12 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 734 columns
Aligning the dataset
The annotation and the train dataset are not aligned, i.e. the first annotation does not correspond to the first flow in the train dataset. We sort both by the (unique) flow id to align them, and verify the alignment directly in the cell below.
train = train.sort_values(by='id')
anno = anno.sort_index()
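# sanity check: after sorting, the flow ids of the train dataset should
# match the annotation index one-to-one
assert (train['id'].values == anno.index.values).all()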
train.drop('id', axis=1, inplace=True)
Encoding annotations
The labels are written as strings such as “benign”. We need to encode them as numbers, using the LabelEncoder of sklearn.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(anno)
class_names = le.classes_  # class names, in the same order as the encoded labels
encoded_annotations = le.transform(anno)
# list(le.inverse_transform([1, 1, 0]))  # e.g. to map numbers back to labels
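As a quick check, we can print how the fitted encoder maps labels to integers (classes are sorted alphabetically by default):
# how the encoder maps labels to integers
print(dict(zip(class_names, le.transform(class_names))))
# {'benign': 0, 'malware': 1}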
Splitting train and test set
from sklearn.model_selection import train_test_split
# we use only the first 10 features to speed up training
FEATURE_TO_CONSIDER = 10

X_train, X_test, y_train, y_test = train_test_split(
    train.iloc[:, :FEATURE_TO_CONSIDER].values, encoded_annotations,
    test_size=0.33, random_state=42)
# final check
print(X_train.shape)
print(y_train.shape)
(295547, 10)
(295547,)
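It can also be useful to look at the class balance of the training split (a small optional check):
# class balance of the training split (0 = benign, 1 = malware)
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))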
Classify
We are going to use a decision tree classifier, since it is easy to visualise.
We set the maximum depth of the tree to 3, so that the tree can be plotted effectively.
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
clf = clf.fit(X_train, y_train)
from matplotlib import pyplot as plt

DO_PLOTTING = True

if DO_PLOTTING:
    fig = plt.figure(figsize=(25, 20))
    _ = tree.plot_tree(clf,
                       feature_names=train.columns[:FEATURE_TO_CONSIDER],
                       class_names=class_names,
                       filled=True)
    # fig.savefig("decision_tree.png")
# let's predict the test set
test_pred_decision_tree = clf.predict(X_test)
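The depth limit is mainly there to keep the plot readable. As an optional experiment (a sketch, not part of the original pipeline), we can check how the test accuracy varies when the tree is allowed to grow deeper:
# optional: test accuracy as a function of the maximum tree depth
for depth in [3, 5, 10, None]:
    deep_clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=42)
    deep_clf.fit(X_train, y_train)
    print(depth, deep_clf.score(X_test, y_test))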
Evaluate the results
First with a confusion matrix, to get a quick view.
Then with conventional metrics (accuracy, F1, etc.).
# get the confusion matrix
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt

confusion_matrix = metrics.confusion_matrix(y_test, test_pred_decision_tree)
# turn it into a dataframe
matrix_df = pd.DataFrame(confusion_matrix)

# plot the result (create the figure first, then draw the heatmap on its axes)
sns.set(font_scale=1.3)
fig, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(matrix_df, annot=True, fmt="g", ax=ax, cmap="magma")

# set axis titles and labels
ax.set_title('Confusion Matrix - Decision Tree')
ax.set_xlabel("Predicted label", fontsize=15)
ax.set_xticklabels(class_names)
ax.set_ylabel("True Label", fontsize=15)
ax.set_yticklabels(class_names, rotation=0)
plt.show()
# then with the classification report
from sklearn import metrics

print(metrics.classification_report(y_test, test_pred_decision_tree))
              precision    recall  f1-score   support

           0       0.92      0.97      0.95     65577
           1       0.97      0.93      0.95     79992

    accuracy                           0.95    145569
   macro avg       0.95      0.95      0.95    145569
weighted avg       0.95      0.95      0.95    145569
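If you need single numbers programmatically, the individual scores can also be computed directly (a small convenience sketch; pos_label=1 assumes the “malware” encoding seen above):
# individual metrics, computed directly
print("accuracy:", metrics.accuracy_score(y_test, test_pred_decision_tree))
print("f1 (malware):", metrics.f1_score(y_test, test_pred_decision_tree, pos_label=1))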