Tree-based methods

Isolation Forest

Isolation Forest (IForest) is a density-based method and the best-known tree-based approach to anomaly detection. IForest tries to isolate outliers from the rest of the normal points or subsequences [Liu et al. 2008]. The key idea is that, among mostly normal data, anomalies are easier to isolate than normal instances (i.e., fewer random partitions are needed to separate them from the rest). Under this assumption, it suffices to build a partitioning process whose outcome reflects the degree of isolation (i.e., the degree of anomaly) of each instance.
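To make the isolation idea concrete, here is a minimal one-dimensional sketch in plain Python. It is not the TSB-UAD or scikit-learn implementation: the helper names isolation_depth and average_depth are hypothetical, the data is synthetic, and the real algorithm builds trees over random features and subsamples. The sketch only counts how many random splits are needed before a point is alone in its partition.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # remaining values are identical, no further split possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def average_depth(values, target, n_trees=200, seed=0):
    """Average the isolation depth over several independent random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees


# Synthetic data (an assumption for illustration): 255 normal points and one outlier.
rng = random.Random(42)
normal = [rng.gauss(0.0, 1.0) for _ in range(255)]
outlier = 8.0
data = normal + [outlier]
typical = sorted(normal)[len(normal) // 2]  # a point in the dense region

depth_outlier = average_depth(data, outlier)
depth_typical = average_depth(data, typical)
print("average depth, outlier:", depth_outlier)
print("average depth, typical point:", depth_typical)
```

The outlier should need noticeably fewer splits on average than the point taken from the dense region, which is the quantity an isolation-based score builds on.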

The TSB-UAD implementation of IForest is a wrapper around the scikit-learn implementation of IsolationForest.

class TSB_UAD.models.iforest.IForest(*args: Any, **kwargs: Any)

Wrapper around the scikit-learn IsolationForest class with additional functionality.

Parameters
  • n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.

  • max_samples (int, float or string, optional, default "auto") –

    The number of samples to draw from X to train each base estimator.

    • If int, then draw max_samples samples.

    • If float, then draw max_samples * X.shape[0] samples.

    • If “auto”, then max_samples=min(256, n_samples).

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

  • max_features (int or float, optional (default=1.0)) –

    The number of features to draw from X to train each base estimator.

    • If int, then draw max_features features.

    • If float, then draw max_features * X.shape[1] features.

    • This parameter is of no use for anomaly detection in sensor time series.

  • bootstrap (bool, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.

  • n_jobs (integer, optional (default=1)) – The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
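The max_samples rules listed above can be sketched as a small function. This is only an illustration of the three documented cases; resolve_max_samples is a hypothetical helper, not part of TSB-UAD or scikit-learn, and it omits the input validation a real implementation would perform.

```python
def resolve_max_samples(max_samples, n_samples):
    """Return the number of samples drawn per base estimator (illustrative only)."""
    if max_samples == "auto":
        # "auto": at most 256 samples, fewer if the data set is smaller.
        return min(256, n_samples)
    if isinstance(max_samples, int):
        # int: used directly as a sample count.
        return max_samples
    if isinstance(max_samples, float):
        # float: interpreted as a fraction of the available samples.
        return int(max_samples * n_samples)
    raise ValueError("max_samples must be an int, a float or 'auto'")


print(resolve_max_samples("auto", 1000))
print(resolve_max_samples(0.5, 1000))
```

The value resolved this way corresponds to what the max_samples_ attribute reports after fitting.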

decision_scores_

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

Type

numpy array of shape (n_samples,)

estimators_

The collection of fitted sub-estimators.

Type

list of ExtraTreeRegressor

estimators_samples_

The subset of drawn samples (i.e., the in-bag samples) for each base estimator.

Type

list of arrays

max_samples_

The actual number of samples used to train each base estimator.

Type

integer

threshold_

The threshold is derived from contamination: it is set so that the n_samples * contamination most abnormal samples in decision_scores_ are flagged as outliers. It is used to generate the binary outlier labels.

Type

float

labels_

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Type

int, either 0 or 1
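The relationship between decision_scores_, contamination, threshold_ and labels_ can be illustrated as follows. This is a sketch under assumptions: threshold_and_labels is a hypothetical helper, the scores are made up, and the library may compute the threshold differently (for example with a percentile and interpolation), so values near the boundary can differ.

```python
def threshold_and_labels(scores, contamination=0.1):
    """Pick a threshold so that about n_samples * contamination scores are flagged."""
    n_outliers = max(1, int(round(len(scores) * contamination)))
    ranked = sorted(scores, reverse=True)
    # Score of the n_outliers-th most abnormal sample.
    threshold = ranked[n_outliers - 1]
    # 1 marks an outlier, 0 an inlier.
    labels = [1 if s >= threshold else 0 for s in scores]
    return threshold, labels


# Made-up scores for illustration; higher means more abnormal.
scores = [0.10, 0.20, 0.15, 0.90, 0.30, 0.25, 0.12, 0.18, 0.22, 0.80]
threshold, labels = threshold_and_labels(scores, contamination=0.2)
print(threshold)
print(labels)
```

With ten scores and a contamination of 0.2, the two highest scores are labeled 1 and all others 0.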

decision_function(X)

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X, y=None)

Fit detector. y is ignored in unsupervised methods.

Parameters
  • X (numpy array of shape (n_samples, n_features)) – The input samples. For a time series, each row is one subsequence: n_samples is the number of subsequences and n_features corresponds to the subsequence length.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

self – Fitted estimator.

Return type

object
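To show what a matrix of shape (n_samples, n_features) looks like for a univariate time series, here is a plain-Python sliding-window sketch. It assumes a stride of one and is not the TSB-UAD Window class; sliding_windows is a hypothetical helper used only for illustration.

```python
def sliding_windows(series, window):
    """Turn a 1-D series into overlapping subsequences of length `window` (stride 1)."""
    if window > len(series):
        raise ValueError("window must not exceed the series length")
    return [series[i:i + window] for i in range(len(series) - window + 1)]


series = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
X = sliding_windows(series, window=4)

# n_samples = len(series) - window + 1, n_features = window
print(len(X), len(X[0]))
```

Each row of X would then be treated by the detector as one sample with window features.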

Example

Here is a code snippet that shows how to run Isolation Forest.

import math
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from TSB_UAD.utils.visualisation import plotFig
from TSB_UAD.models.iforest import IForest
from TSB_UAD.models.feature import Window
from TSB_UAD.utils.slidingWindows import find_length
from TSB_UAD.vus.metrics import get_metrics

#Read data
filepath = 'PATH_TO_TSB_UAD/ECG/MBA_ECG805_data.out'
df = pd.read_csv(filepath, header=None).dropna().to_numpy()
name = filepath.split('/')[-1]

data = df[:,0].astype(float)
label = df[:,1].astype(int)

#Pre-processing    
slidingWindow = find_length(data)
X_data = Window(window = slidingWindow).convert(data).to_numpy()


#Run IForest
modelName='IForest'
clf = IForest(n_jobs=1)
x = X_data
clf.fit(x)
score = clf.decision_scores_

# Post-processing
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))

#Plot result
plotFig(data, label, score, slidingWindow, fileName=name, modelName=modelName) 

#Print accuracy
results = get_metrics(score, label, metric="all", slidingWindow=slidingWindow)
for metric in results.keys():
    print(metric, ':', results[metric])

Output:

AUC_ROC : 0.9216216369841076
AUC_PR : 0.6608577550833885
Precision : 0.7342093339374717
Recall : 0.4010891089108911
F : 0.5187770129662238
Precision_at_k : 0.4010891089108911
Rprecision : 0.7486112853253205
Rrecall : 0.3097733542316151
RF : 0.438214653167952
R_AUC_ROC : 0.989123018780308
R_AUC_PR : 0.9435238401582703
VUS_ROC : 0.9734357459251715
VUS_PR : 0.8858037295594041
Affiliation_Precision : 0.9630674176380548
Affiliation_Recall : 0.9809813654809071
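The post-processing step in the example pads the score so that it has one value per time point again: with a window of length w and a stride of one (an assumption here), a series of length n yields n - w + 1 scores, and the padding adds the missing w - 1 values by repeating the first and last score. The sketch below isolates that step; pad_scores is a hypothetical helper, not a TSB-UAD function.

```python
import math


def pad_scores(scores, window):
    """Repeat the edge scores so the output has one value per original time point."""
    front = math.ceil((window - 1) / 2)  # values added before the first score
    back = (window - 1) // 2             # values added after the last score
    return [scores[0]] * front + list(scores) + [scores[-1]] * back


n, window = 10, 4
window_scores = [0.1, 0.2, 0.9, 0.8, 0.2, 0.1, 0.3]  # n - window + 1 = 7 made-up scores
padded = pad_scores(window_scores, window)
print(len(padded))
```

Splitting the padding between both ends roughly centers each window score on the middle of its subsequence, which keeps the score aligned with the labels.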

Result

References

  • [Liu et al. 2008] F. T. Liu, K. M. Ting, and Z.-H. Zhou. 2008. Isolation Forest. In Proceedings of the International Conference on Data Mining (ICDM), pp. 413–422. IEEE. ISBN 978-0-7695-3502-9. DOI:10.1109/ICDM.2008.17.