
Tree-based methods
Isolation Forest
Isolation Forest (IForest) is a density-based method and the best-known tree-based approach for anomaly detection. IForest tries to isolate outliers from the rest of the normal points or subsequences [Liu et al. 2008]. The key idea rests on the fact that anomalies are more likely to be isolated (i.e., they require fewer random partitions to be isolated) than normal instances. Under this assumption, we only need a partitioning process that reliably reflects the isolation degree (i.e., the degree of abnormality) of each instance.
The TSB-UAD implementation of IForest is a wrapper of the scikit-learn implementation of IsolationForest.
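The isolation idea can be illustrated with a minimal, dependency-free sketch. It is not the TSB-UAD or scikit-learn implementation: the helpers `isolation_depth` and `mean_depth` are hypothetical names, the data is one-dimensional, and the toy values are synthetic. The sketch repeatedly splits the value range at a random position and counts how many splits are needed before a given point is alone.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the point.
    if point < split:
        side = [v for v in data if v < split]
    else:
        side = [v for v in data if v >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many independent random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Synthetic toy data (an assumption for illustration): 255 normal values and one far-away point.
rng = random.Random(42)
values = [rng.gauss(0.0, 1.0) for _ in range(255)] + [8.0]
inlier = sorted(values)[len(values) // 2]  # a value in the dense middle of the data

outlier_depth = mean_depth(8.0, values)
inlier_depth = mean_depth(inlier, values)
print("mean depth of outlier:", outlier_depth)
print("mean depth of inlier :", inlier_depth)
```

The far-away point is separated after far fewer random splits than the point in the dense region, which is the property the anomaly score builds on.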
- class TSB_UAD.models.iforest.IForest(*args: Any, **kwargs: Any)
Wrapper of the scikit-learn IsolationForest class with additional functionality.
- Parameters
n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.
max_samples (int, float or string, optional (default="auto")) – The number of samples to draw from X to train each base estimator.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.
If "auto", then max_samples=min(256, n_samples).
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
max_features (int or float, optional (default=1.0)) – The number of features to draw from X to train each base estimator.
If int, then draw max_features features.
If float, then draw max_features * X.shape[1] features.
This parameter is of little use for sensor time series anomaly detection.
bootstrap (bool, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
n_jobs (integer, optional (default=1)) – The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
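The three max_samples rules listed above can be written out as a small sketch. `resolve_max_samples` is a hypothetical helper introduced only for illustration; it is not part of TSB-UAD or scikit-learn, and any extra validation the library performs (for example on out-of-range values) is not reproduced here.

```python
def resolve_max_samples(max_samples, n_samples):
    """Hypothetical helper: map a max_samples setting to a concrete sample count."""
    if max_samples == "auto":
        # "auto": at most 256 samples per base estimator
        return min(256, n_samples)
    if isinstance(max_samples, int):
        # int: used directly as the number of samples
        return max_samples
    if isinstance(max_samples, float):
        # float: interpreted as a fraction of the available samples
        return int(max_samples * n_samples)
    raise ValueError(f"unsupported max_samples value: {max_samples!r}")

print(resolve_max_samples("auto", 1000))  # 256
print(resolve_max_samples("auto", 100))   # 100
print(resolve_max_samples(64, 1000))      # 64
print(resolve_max_samples(0.5, 1000))     # 500
```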
- decision_scores_
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- Type
numpy array of shape (n_samples,)
- estimators_samples_
The subset of drawn samples (i.e., the in-bag samples) for each base estimator.
- Type
list of arrays
- max_samples_
The actual number of samples.
- Type
integer
- threshold_
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- Type
- labels_
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- Type
int, either 0 or 1
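The relationship between contamination, threshold_ and labels_ can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `label_by_contamination` is a hypothetical helper, the scores are made up, and the wrapped detector may compute the threshold with a percentile and handle ties differently.

```python
def label_by_contamination(scores, contamination=0.1):
    """Hypothetical helper: flag the n_samples * contamination highest scores."""
    n_outliers = int(len(scores) * contamination)
    if n_outliers == 0:
        # Nothing to flag: no score can reach an infinite threshold.
        return float("inf"), [0] * len(scores)
    ranked = sorted(scores, reverse=True)
    threshold = ranked[n_outliers - 1]  # lowest score still counted as abnormal
    labels = [1 if s >= threshold else 0 for s in scores]
    return threshold, labels

# Made-up scores for illustration; higher means more abnormal.
scores = [0.1, 0.2, 0.15, 0.9, 0.12, 0.11, 0.13, 0.95, 0.14, 0.16]
threshold, labels = label_by_contamination(scores, contamination=0.2)
print(threshold)  # 0.9
print(labels)     # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

With 10 scores and contamination=0.2, the two highest scores are labeled 1 and all others 0.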
- decision_function(X)
Predict the raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned larger anomaly scores.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
- Returns
anomaly_scores – The anomaly score of the input samples.
- Return type
numpy array of shape (n_samples,)
- fit(X, y=None)
Fit the detector. y is ignored in unsupervised methods.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples (n_samples depends on the time series length). n_features corresponds to the subsequence length.
y (Ignored) – Not used, present for API consistency by convention.
- Returns
self – Fitted estimator.
- Return type
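Since n_features corresponds to the subsequence length, a univariate series has to be turned into a matrix of overlapping subsequences before fitting, and the per-subsequence scores have to be stretched back to the series length afterwards. The sketch below shows this shape bookkeeping in plain Python. `sliding_windows` and `pad_scores` are hypothetical helpers, not the TSB-UAD `Window` class; it is assumed here that a series of length n yields n - window + 1 subsequences, and the mean used as a score is only a placeholder for a real anomaly score.

```python
import math

def sliding_windows(series, window):
    """Hypothetical helper: all overlapping subsequences of length `window`."""
    return [series[i:i + window] for i in range(len(series) - window + 1)]

def pad_scores(scores, window):
    """Repeat the first and last score so there is one score per original point."""
    front = [scores[0]] * math.ceil((window - 1) / 2)
    back = [scores[-1]] * ((window - 1) // 2)
    return front + list(scores) + back

series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
window = 4

X = sliding_windows(series, window)                # 5 rows, 4 columns
window_scores = [sum(row) / window for row in X]   # placeholder, not an anomaly score
padded = pad_scores(window_scores, window)

print(len(X), len(X[0]))         # 5 4
print(len(padded), len(series))  # 8 8
```

The padding uses the same front and back counts as the post-processing step of the example in this section, so the padded score array lines up with the original series.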
Example
Here is a code snippet that shows how to run Isolation Forest.
import math
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from TSB_UAD.utils.visualisation import plotFig
from TSB_UAD.models.iforest import IForest
from TSB_UAD.models.feature import Window
from TSB_UAD.utils.slidingWindows import find_length
from TSB_UAD.vus.metrics import get_metrics

# Read data: column 0 holds the values, column 1 the anomaly labels
filepath = 'PATH_TO_TSB_UAD/ECG/MBA_ECG805_data.out'
df = pd.read_csv(filepath, header=None).dropna().to_numpy()
name = filepath.split('/')[-1]
data = df[:,0].astype(float)
label = df[:,1].astype(int)

# Pre-processing: estimate the subsequence length and build the subsequence matrix
slidingWindow = find_length(data)
X_data = Window(window = slidingWindow).convert(data).to_numpy()

# Run IForest
modelName = 'IForest'
clf = IForest(n_jobs=1)
x = X_data
clf.fit(x)
score = clf.decision_scores_

# Post-processing: scale scores to [0, 1], then pad both ends to match the series length
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))

# Plot result
plotFig(data, label, score, slidingWindow, fileName=name, modelName=modelName)

# Print accuracy
results = get_metrics(score, label, metric="all", slidingWindow=slidingWindow)
for metric in results.keys():
    print(metric, ':', results[metric])
AUC_ROC : 0.9216216369841076
AUC_PR : 0.6608577550833885
Precision : 0.7342093339374717
Recall : 0.4010891089108911
F : 0.5187770129662238
Precision_at_k : 0.4010891089108911
Rprecision : 0.7486112853253205
Rrecall : 0.3097733542316151
RF : 0.438214653167952
R_AUC_ROC : 0.989123018780308
R_AUC_PR : 0.9435238401582703
VUS_ROC : 0.9734357459251715
VUS_PR : 0.8858037295594041
Affiliation_Precision : 0.9630674176380548
Affiliation_Recall : 0.9809813654809071

References
[Liu et al. 2008] F. T. Liu, K. M. Ting, and Z.-H. Zhou. 2008. Isolation Forest. In Proceedings of the International Conference on Data Mining (ICDM), pp. 413–422. IEEE. ISBN 978-0-7695-3502-9. DOI:10.1109/ICDM.2008.17.