
Proximity-based methods
Local Outlier Factor (LOF)
The most commonly used proximity-based approach is the Local Outlier Factor (LOF) [Breunig et al. 2000], which assigns each instance a score measuring its degree of being an outlier. Unlike the previous proximity-based models, which directly compare distances between subsequences, LOF measures how isolated an instance is with respect to its surrounding neighborhood, by comparing its local density to that of its nearest neighbors. The method addresses the outlier detection task, in which an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins' definition [Hawkins 1980]). This definition fits anomaly detection in time series, where the different mechanism can be, for example, an arrhythmia in an electrocardiogram or a failure in a component of an industrial machine.
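The local-density idea can be illustrated with a small brute-force sketch. It is not the TSB-UAD or scikit-learn code: the function name lof_scores is hypothetical, distances are Euclidean, ties in the k-nearest-neighbor set are ignored, and duplicate points (which would give a zero reachability distance) are assumed absent.

```python
import numpy as np

def lof_scores(X, k):
    """Brute-force LOF sketch for a small 2-D array X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances; a point is never its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbors and the k-distance of every point.
    knn = np.argsort(dist, axis=1)[:, :k]
    k_dist = dist[np.arange(n), knn[:, -1]]

    # reach_dist(p, o) = max(k_distance(o), d(p, o)) for each neighbor o of p.
    reach = np.maximum(k_dist[knn], dist[np.arange(n)[:, None], knn])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)

    # LOF: mean density of the neighbors divided by the point's own density.
    return lrd[knn].mean(axis=1) / lrd

# Four points on a unit square plus one isolated point.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]])
scores = lof_scores(X, k=2)
print(scores)  # the isolated point (10, 10) receives the largest score
```

Scores close to 1 indicate a point whose density matches that of its neighbors; values well above 1 indicate a point in a much sparser region than its neighbors.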
The TSB-UAD implementation of LOF is a wrapper around the scikit-learn implementation of LOF.
- class TSB_UAD.models.lof.LOF(*args: Any, **kwargs: Any)
Wrapper around the scikit-learn LOF class with additional functionality.
- Parameters
n_neighbors (int, optional (default=20)) – Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional) – Algorithm used to compute the nearest neighbors:
'ball_tree' will use BallTree
'kd_tree' will use KDTree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size (int, optional (default=30)) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
metric (string or callable, default 'minkowski') – Metric used for the distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used. If 'precomputed', the training input X is expected to be a distance matrix. If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for SciPy's metrics, but is less efficient than passing the metric name as a string. Valid values for metric are:
from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
p (integer, optional (default=2)) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances
metric_params (dict, optional (default=None)) – Additional keyword arguments for the metric function.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting, this is used to define the threshold on the decision function.
n_jobs (int, optional (default=1)) – The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods.
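The effect of the p parameter on the Minkowski metric can be checked with a short sketch. The helper minkowski below is hypothetical and written only for illustration; it is not the function scikit-learn calls internally.

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two 1-D arrays (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a = [0.0, 0.0]
b = [3.0, 4.0]
d1 = minkowski(a, b, p=1)  # Manhattan (l1): |3| + |4|
d2 = minkowski(a, b, p=2)  # Euclidean (l2): sqrt(3^2 + 4^2)
print(d1, d2)
```

With p = 1 the distance sums the absolute coordinate differences, and with p = 2 it is the usual straight-line distance, matching the l1 and l2 cases described for p.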
- decision_scores_
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- Type
numpy array of shape (n_samples,)
- labels_
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- Type
int, either 0 or 1
- threshold_
The threshold is based on contamination. It corresponds to the cut-off separating the n_samples * contamination most abnormal samples in decision_scores_ from the rest. The threshold is calculated for generating binary outlier labels.
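How contamination could translate into threshold_ and labels_ is sketched below. This assumes a percentile-based cut-off, a common convention in detectors of this style; the actual TSB-UAD code may compute it differently, and the helper name threshold_and_labels is hypothetical.

```python
import numpy as np

def threshold_and_labels(decision_scores, contamination=0.1):
    """Derive a cut-off and binary labels from outlier scores (illustrative)."""
    scores = np.asarray(decision_scores, dtype=float)
    # Assumption: the cut-off is the (1 - contamination) quantile of the scores.
    threshold = np.percentile(scores, 100 * (1 - contamination))
    # 1 marks an outlier (score above the cut-off), 0 an inlier.
    labels = (scores > threshold).astype(int)
    return threshold, labels

scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 0.28, 5.0])
threshold, labels = threshold_and_labels(scores, contamination=0.1)
print(threshold, labels)
```

With ten scores and contamination=0.1, only the single largest score ends up above the cut-off and is labeled 1.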
- decision_function(X)
Predict the raw anomaly scores of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned larger anomaly scores.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
- Returns
anomaly_scores – The anomaly score of the input samples.
- Return type
numpy array of shape (n_samples,)
- fit(X, y=None)
Fit detector. y is ignored in unsupervised methods.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples. n_samples relates to the time series length (one row per subsequence), and n_features corresponds to the subsequence length.
y (Ignored) – Not used, present for API consistency by convention.
- Returns
self – Fitted estimator.
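The (n_samples, n_features) layout expected by fit can be pictured with a minimal sliding-window sketch. The helper to_windows is hypothetical and only stands in for the library's Window class, whose exact behavior is assumed here: one row per subsequence, with a stride of 1.

```python
import numpy as np

def to_windows(series, window):
    """Stack every length-`window` subsequence of a 1-D series as one row."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - window + 1
    return np.stack([series[i:i + window] for i in range(n_windows)])

series = np.arange(10)
X = to_windows(series, window=4)
# Rows are subsequences (n_samples), columns are positions within one (n_features).
print(X.shape)
```

A series of length n therefore yields n - window + 1 rows, which is why the scores have to be padded back to length n after fitting.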
Example
import math
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from TSB_UAD.utils.visualisation import plotFig
from TSB_UAD.models.lof import LOF
from TSB_UAD.models.feature import Window
from TSB_UAD.utils.slidingWindows import find_length
from TSB_UAD.vus.metrics import get_metrics
# Read data: column 0 holds the series values, column 1 the anomaly labels
filepath = 'PATH_TO_TSB_UAD/ECG/MBA_ECG805_data.out'
df = pd.read_csv(filepath, header=None).dropna().to_numpy()
name = filepath.split('/')[-1]
data = df[:,0].astype(float)
label = df[:,1].astype(int)
# Pre-processing: estimate the window length and extract the subsequences
slidingWindow = find_length(data)
X_data = Window(window = slidingWindow).convert(data).to_numpy()
# Run LOF
modelName='LOF'
clf = LOF(n_neighbors=20, n_jobs=1)
clf.fit(X_data)
score = clf.decision_scores_
# Post-processing: rescale the scores to [0, 1], then pad both ends so that
# the score has the same length as the original time series
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))
# Plot result
plotFig(data, label, score, slidingWindow, fileName=name, modelName=modelName)
# Print accuracy
results = get_metrics(score, label, metric="all", slidingWindow=slidingWindow)
for metric in results.keys():
    print(metric, ':', results[metric])
AUC_ROC : 0.41096068975774547
AUC_PR : 0.048104473111295544
Precision : 0.21794871794871795
Recall : 0.16831683168316833
F : 0.1899441340782123
Precision_at_k : 0.16831683168316833
Rprecision : 0.3095238095238095
Rrecall : 0.304812834224599
RF : 0.3071502590673575
R_AUC_ROC : 0.6916553096198312
R_AUC_PR : 0.4549204085910081
VUS_ROC : 0.6545868021121983
VUS_PR : 0.35228784121262147
Affiliation_Precision : 0.942248287092041
Affiliation_Recall : 0.978882103900466
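The padding step in the example's post-processing can be checked in isolation. The sketch below uses synthetic scores and a hypothetical helper pad_scores; it only reproduces the arithmetic of that step and assumes one score per length-window subsequence.

```python
import math
import numpy as np

def pad_scores(score, window):
    """Repeat the edge scores so the output has one value per time step."""
    left = math.ceil((window - 1) / 2)   # values prepended at the start
    right = (window - 1) // 2            # values appended at the end
    return np.concatenate([np.full(left, score[0]), score, np.full(right, score[-1])])

n_points, window = 100, 8
# One synthetic score per subsequence: n_points - window + 1 values.
window_scores = np.linspace(0.0, 1.0, n_points - window + 1)
padded = pad_scores(window_scores, window)
print(len(window_scores), len(padded))
```

The two paddings always add up to window - 1 values, so the padded score aligns with the original series and its labels.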

References
[Breunig et al. 2000] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. 2000. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104.
[Hawkins 1980] D. M. Hawkins. 1980. Identification of Outliers. Springer Netherlands, Dordrecht.