First, we need data

Multiple labeled time series datasets exist for anomaly detection. We provide below a detailed list of these datasets, which are included in TSB-UAD benchmarks and can be downloaded here.

Datasets Descriptions

Dataset

Description

Dodgers

is a loop sensor data for the Glendale on-ramp for the 101 North freeway in Los Angeles and the anomalies represent unusual traffic after a Dodgers game.

ECG

is a standard electrocardiogram dataset and the anomalies represent ventricular premature contractions. We split one long series (MBA_ECG14046) with length ∼ 1e7) to 47 series by first identifying the periodicity of the signal.

IOPS

is a dataset with performance indicators that reflect the scale, quality of web services, and health status of a machine.

KDD21

is a composite dataset released in a recent SIGKDD 2021 competition with 250 time series.

MGAB

is composed of Mackey-Glass time series with non-trivial anomalies. Mackey-Glass time series exhibit chaotic behavior that is difficult for the human eye to distinguish.

NAB

is composed of labeled real-world and artificial time series including AWS server metrics, online advertisement clicking rates, real time traffic data, and a collection of Twitter mentions of large publicly-traded companies.

NASA-SMAP and NASA-MSL

are two real spacecraft telemetry data with anomalies from Soil Moisture Active Passive (SMAP) satellite and Curiosity Rover on Mars (MSL). We only keep the first data dimension that presents the continuous data, and we omit the remaining dimensions with binary data.

SensorScope

is a collection of environmental data, such as temperature, humidity, and solar radiation, collected from a typical tiered sensor measurement system.

YAHOO

is a dataset published by Yahoo labs consisting of real and synthetic time series based on the real production traffic to some of the Yahoo production systems.

Daphnet

contains the annotated readings of 3 acceleration sensors at the hip and leg of Parkinson’s disease patients that experience freezing of gait (FoG) during walking tasks.

GHL

is a Gasoil Heating Loop Dataset and contains the status of 3 reservoirs such as the temperature and level. Anomalies indicate changes in max temperature or pump frequency.

Genesis

is a portable pick-and-place demonstrator which uses an air tank to supply all the gripping and storage units.

MITDB

contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the BIH Arrhythmia Laboratory between 1975 and 1979.

OPPORTUNITY (OPP)

is a dataset devised to benchmark human activity recognition algorithms (e.g., classiffication, automatic data segmentation, sensor fusion, and feature extraction). The dataset comprises the readings of motion sensors recorded while users executed typical daily activities.

Occupancy

contains experimental data used for binary classiffication (room occupancy) from temperature, humidity, light, and CO2. Ground-truth occupancy was obtained from time stamped pictures that were taken every minute.

SMD (Server Machine Dataset)

is a 5-week-long dataset collected from a large Internet company. This dataset contains 3 groups of entities from 28 different machines.

SVDB

includes 78 half-hour ECG recordings chosen to supplement the examples of supraventricular arrhythmias in the MIT-BIH Arrhythmia Database.

Datasets characteristics

The following table summarizes different characteristics of the datasets.

Dataset

Count

Avg length

Avg number of anomalies

Avg number of abnormal points

Dodger

1

50400.0

133.0

5612

ECG

53

230351.9

195.6

15634

IOPS

58

102119.2

46.5

2312.3

KDD21

250

77415.06

1.0

196.5

MGAB

10

100000.0

10.0

200.0

NAB

58

6301.7

2.0

575.5

SensorScope

23

27038.4

11.2

6110.4

YAHOO

367

1561.2

5.9

10.7

NASA-MSL

27

2730.7

1.33

286.3

NASA-SMAP

54

8066.0

1.26

1032.4

Daphnet

45

21760.0

7.6

2841.0

GHL

126

200001.0

1.2

388.8

Genesis

6

16220.0

3.0

50.0

MITDB

32

650000.0

210.1

72334.3

OPP

465

31616.9

2.0

1267.3

Occupancy

10

5725.8

18.3

1414.5

SMD

281

25562.3

10.4

900.2

SVDB

115

230400.0

208.0

27144.5

You may find more details (and the references) in this paper.