Intrusion Detection

Course of Network Softwarization

Machine Learning for Networking

University of Rome “Tor Vergata”

Lorenzo Bracciale

Flow classification problem

We consider an Intrusion Detection Evaluation Dataset, which we download from the NetML Competition repository.

In this dataset we have several network flows, each one associated with several features, such as the source port or the number of bytes transmitted. Such features are automatically derived from a traffic analysis conducted with a tool called Joy.

Then, each flow is labelled as either “benign” or “malware”.

We want to train a classifier to learn this labelling, so as to tell whether a flow is “malware” just by analysing its features.

To launch this notebook you need to:

  • clone the NetML Competition repo in this directory
  • install the usual packages for data analysis (pandas, numpy, sklearn, matplotlib, seaborn)
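For instance, the setup could look like this (the repository URL is an assumption, check the competition page for the official one):

```shell
# clone the NetML Competition repository into the current directory
# (assuming it is hosted at github.com/ACANETS/NetML-Competition2020)
git clone https://github.com/ACANETS/NetML-Competition2020.git

# install the usual data-analysis packages, plus tqdm for progress bars
pip install pandas numpy scikit-learn matplotlib seaborn tqdm
```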
import pandas as pd
import numpy as np
# reading the train dataset
train_filepath = "./NetML-Competition2020/data/CICIDS2017/2_training_set/2_training_set.json.gz"
train = pd.read_json(train_filepath, lines=True)
train.head()

src_port pld_distinct bytes_out hdr_mean num_pkts_out pld_ccnt pld_mean rev_hdr_distinct hdr_bin_40 pr ... tls_len tls_svr_len tls_cs_cnt tls_ext_cnt http_host http_method http_code http_uri http_content_len http_content_type
0 56565 1 58 8.0 2 [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 29.00 1 0 17 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 52995 1 148 8.0 4 [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 37.00 1 0 17 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 40805 1 70 8.0 2 [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 35.00 1 0 17 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 31833 2 146 8.0 4 [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 36.50 1 0 17 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 64062 7 3272 32.8 15 [11, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2] 218.13 3 14 6 ... [112, 262, 1, 48, 1376, 1600] [80, 1, 48, 4912, 32] 19.0 1.0 NaN NaN NaN NaN NaN NaN

5 rows × 62 columns

#reading the annotation dataset 
anno = pd.read_json("./NetML-Competition2020/data/CICIDS2017/2_training_annotations/2_training_anno_top.json.gz",typ="series")
anno.head()
3145728     benign
8388611    malware
1048583    malware
3145736    malware
1048585    malware
dtype: object

Preparing the dataset

The dataset cannot be used “as is”, since many columns contain lists (e.g. tls_len).

This is because it comes from a JSON with nested fields, while we need a flat structure to feed our classifier.

To make things worse, some of the lists have variable lengths and alternate with NaN values.

Let’s transform the dataset, expanding the elements contained in such nested lists into new columns.

# first we list the columns with the "list problem"
col_list = [] 
for col in train.columns:
    if str(train[col].dtype) == 'object':
        col_list.append(col)

print(f"Columns to clean: {col_list}")
Columns to clean: ['pld_ccnt', 'ack_psh_rst_syn_fin_cnt', 'dns_query_class', 'sa', 'rev_pld_ccnt', 'dns_query_type', 'rev_hdr_ccnt', 'dns_answer_ttl', 'rev_intervals_ccnt', 'hdr_ccnt', 'dns_answer_ip', 'da', 'intervals_ccnt', 'rev_ack_psh_rst_syn_fin_cnt', 'dns_query_name', 'dns_query_name_len', 'tls_ext_types', 'tls_svr_ext_types', 'tls_svr_cs', 'tls_cs', 'tls_len', 'tls_svr_len', 'http_host', 'http_uri', 'http_content_type']
# then we replace NaN values with empty lists
for col in col_list:
    train[col] = train[col].apply(lambda d: d if isinstance(d, list) else [])
# now all the list columns contain always lists
# finally we expand the original training set with the new columns taken from the list values
from tqdm import tqdm # for a progress bar, it can take a while
for col in tqdm(col_list):
    df = pd.DataFrame(train[col].to_list()).add_prefix(f'{col}_')
    train = train.join(df)
    train.drop(col, axis=1, inplace=True)
# now it's all set
train.head()
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:02<00:00,  4.91s/it]

src_port pld_distinct bytes_out hdr_mean num_pkts_out pld_mean rev_hdr_distinct hdr_bin_40 pr rev_hdr_bin_40 ... tls_svr_len_171 tls_svr_len_172 tls_svr_len_173 tls_svr_len_174 tls_svr_len_175 tls_svr_len_176 tls_svr_len_177 tls_svr_len_178 tls_svr_len_179 tls_svr_len_180
0 56565 1 58 8.0 2 29.00 1 0 17 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 52995 1 148 8.0 4 37.00 1 0 17 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 40805 1 70 8.0 2 35.00 1 0 17 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 31833 2 146 8.0 4 36.50 1 0 17 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 64062 7 3272 32.8 15 218.13 3 14 6 12 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 734 columns
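The expansion performed above can be seen on a minimal toy frame (column name and values made up for illustration):

```python
import pandas as pd

# a column mixing variable-length lists and NaN, as in the real dataset
df = pd.DataFrame({'tls_len': [[10, 20], None, [5]]})

# pad missing entries with empty lists
df['tls_len'] = df['tls_len'].apply(lambda d: d if isinstance(d, list) else [])

# expand the lists into new columns and drop the original one
expanded = pd.DataFrame(df['tls_len'].to_list()).add_prefix('tls_len_')
df = df.join(expanded).drop('tls_len', axis=1)

print(df.columns.tolist())  # ['tls_len_0', 'tls_len_1']
```

Shorter lists (and the former NaN row) are padded with NaN in the new columns, which is why the real dataset grows from 62 to 734 columns.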

Aligning the dataset

The annotation and the train dataset are not aligned, i.e. the first annotation does not correspond to the first flow described in the train dataset. We sort both by the (unique) flow id to align them.

train = train.sort_values(by='id')
anno = anno.sort_index()
train.drop('id', axis=1, inplace=True)
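A toy example (with made-up ids) of why this works: after the two sorts, row k of the features corresponds to element k of the annotations.

```python
import pandas as pd

# features carry an explicit flow id; annotations are indexed by the same ids
flows = pd.DataFrame({'id': [3, 1, 2], 'bytes_out': [300, 100, 200]})
labels = pd.Series(['malware', 'benign', 'benign'], index=[2, 3, 1])

flows = flows.sort_values(by='id')   # rows now in id order 1, 2, 3
labels = labels.sort_index()         # labels now in id order 1, 2, 3

# positional alignment now holds
print(flows['id'].tolist())   # [1, 2, 3]
print(labels.index.tolist())  # [1, 2, 3]
```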

Encoding annotations

The labels are written as strings such as “benign”. We need to encode them as numbers. We are going to use the LabelEncoder of sklearn.

from sklearn import preprocessing
class_names = anno.unique() # contains the name of the classes
le = preprocessing.LabelEncoder()
le.fit(class_names)
encoded_annotations = le.transform(anno) 
#list(le.inverse_transform([2, 2, 1]))
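A quick sketch of what LabelEncoder does on toy labels (classes are sorted alphabetically, so “benign” maps to 0 and “malware” to 1):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['benign', 'malware'])

encoded = le.transform(['malware', 'benign', 'malware'])
print(list(encoded))  # [1, 0, 1]

# inverse_transform recovers the original strings
print(list(le.inverse_transform(encoded)))  # ['malware', 'benign', 'malware']
```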

Splitting train and test set

from sklearn.model_selection import train_test_split
# we use only the first 10 features to speed up training
FEATURE_TO_CONSIDER = 10

X_train, X_test, y_train, y_test = train_test_split(
     train.iloc[:, :FEATURE_TO_CONSIDER].values, encoded_annotations, test_size=0.33, random_state=42)
# final check
print(X_train.shape)
print(y_train.shape)
(295547, 10)
(295547,)
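One option not used above: the `stratify` parameter of `train_test_split` keeps the class proportions identical in the train and test splits, which matters when classes are imbalanced. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0] * 5 + [1] * 5)    # balanced toy labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(X_tr.shape)       # (8, 2)
print(sorted(y_te))     # [0, 1]: one test sample per class, as in y
```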

Classify

We are going to use a decision tree classifier since it is easy to visualise.

We set the maximum depth of the tree to 3, so that it can be plotted effectively.

from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
clf = clf.fit(X_train, y_train)
from matplotlib import pyplot as plt

DO_PLOTTING = True
if DO_PLOTTING:
    fig = plt.figure(figsize=(25,20))
    _ = tree.plot_tree(clf, 
                       feature_names=train.columns[:FEATURE_TO_CONSIDER],  
                       class_names=class_names,
                       filled=True)
    #fig.savefig("decision_tree.png")

[figure: decision tree plot]
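If rendering the plot is too slow, `tree.export_text` gives a textual view of the same tree; a sketch on the iris toy dataset (used here only because the NetML data may not be at hand):

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf_demo = tree.DecisionTreeClassifier(max_depth=2, random_state=42)
clf_demo.fit(iris.data, iris.target)

# one line per node, with the split threshold and the predicted class
report = tree.export_text(clf_demo, feature_names=iris.feature_names)
print(report)
```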

# let's predict the test set
test_pred_decision_tree = clf.predict(X_test)

Evaluate the results

First, with a confusion matrix to get a quick view.

Then, using conventional metrics (accuracy, F1, etc.).
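As a reminder (with toy numbers, not from this dataset): `metrics.confusion_matrix` is indexed as [true label, predicted label], so the diagonal holds the correct predictions.

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]   row 0: one true 0 predicted as 0, one true 0 predicted as 1
#  [0 2]]  row 1: both true 1 predicted as 1
```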

# get the confusion matrix

from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt


confusion_matrix = metrics.confusion_matrix(y_test,
                                            test_pred_decision_tree)
# turn this into a dataframe
matrix_df = pd.DataFrame(confusion_matrix)
# plot the result: create the figure and the axes together,
# otherwise the heatmap ends up on a stale, empty figure
fig, ax = plt.subplots(figsize=(10, 7))
sns.set(font_scale=1.3)
sns.heatmap(matrix_df, annot=True, fmt="g", ax=ax, cmap="magma")
# set axis titles
ax.set_title('Confusion Matrix - Decision Tree')
ax.set_xlabel("Predicted label", fontsize=15)
ax.set_xticklabels(class_names)
ax.set_ylabel("True label", fontsize=15)
ax.set_yticklabels(class_names, rotation=0)
plt.show()

[figure: confusion matrix heatmap]

# then with the classification report

from sklearn import metrics

print(metrics.classification_report(y_test,
                                    test_pred_decision_tree))
              precision    recall  f1-score   support

           0       0.92      0.97      0.95     65577
           1       0.97      0.93      0.95     79992

    accuracy                           0.95    145569
   macro avg       0.95      0.95      0.95    145569
weighted avg       0.95      0.95      0.95    145569
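The per-class scores in the report can be recomputed directly from the confusion matrix; a sketch with made-up counts:

```python
import numpy as np

cm = np.array([[90, 10],    # rows: true class, columns: predicted class
               [ 5, 95]])

precision = cm.diagonal() / cm.sum(axis=0)  # column-wise: per predicted class
recall    = cm.diagonal() / cm.sum(axis=1)  # row-wise: per true class
f1 = 2 * precision * recall / (precision + recall)
accuracy = cm.diagonal().sum() / cm.sum()

print(accuracy)  # 0.925
```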