Data files format¶
The main unit of data this tool works with is a run. A run is data collected in a specific period, and each research proposal given beantime at European XFEL may collect hundreds of runs.
A run is stored as a directory containing HDF5 data files from different sources. These fall into two important categories:
Detector data, from the main X-ray detectors in the various experiments.
Each detector module writes separate files, e.g.
RAW-R0348-AGIPD00-S00000.h5
. The number in the third part of the filename identifies the module (0 in this example).The detectors in use as of April 2018 are LPD and AGIPD in the file names. Each has 16 modules numbered 0–15.
All the other data, such as motor positions, beam measurements, etc., are recorded through a data aggregator, and stored in a file with the letters DA in the name, e.g.
RAW-R0450-DA01-S00000.h5
.
The last part of the file name (e.g. S00000
) is a sequence number. The
data within a run may be broken into a number of sequences. So
RAW-R0450-DA01-S00000.h5
and RAW-R0450-DA01-S00001.h5
will contain data
from the same set of devices, with sequence 1 continuing just after the end of
sequence 0. Though all data within a run may be broken into sequences, different
data sets do not necessarily break at the same point, so the various ‘sequence 0’
data files in a run do not have corresponding data.
HDF5 file structure¶
METADATA¶
The METADATA
group in an HDF5 file contains three datasets, each of which
is a 1D array of strings:
METADATA/dataSourceId
lists data groups in the file. The values are either:CONTROL/
followed by a Karabo device name, e.g.CONTROL/SA1_XTD2_XGM/DOOCS/MAIN
.INSTRUMENT/
followed by a Karabo device name, a colon, the name of the output channel, a slash, and the name of a data group (?), e.g.INSTRUMENT/SA1_XTD2_XGM/DOOCS/MAIN:output/data
METADATA/deviceId
lists the part of each dataSourceId after the first slash.METADATA/root
lists the parts before the first slash, soconcat(root, "/", deviceId) == dataSourceId
.
These three data sets always have the same number of values. They may be padded with empty strings, so empty entries are ignored.
INDEX¶
INDEX/trainId
is a 1D array of uint64, listing the pulse trains which the
file holds data for. This is crucial, since all other data has to be matched up
according to train IDs.
For each entry in METADATA/deviceId
, the INDEX
group contains two
datasets, both uint64 data with the same length as the train IDs:
INDEX/{ deviceId }/count
: for each train ID, how many data samples did this device record. This may be 0 if no data was recorded for this train.INDEX/{ deviceId }/first
: for each train ID, the index at which the corresponding data starts in the arrays for this device.
Thus, to find the data for a given train ID, we could do:
train_index = trainIds.index(train_id)
first = device_firsts[train_index]
count = device_counts[train_index]
train_data = data[first : first+count]
Control data is always (?) recorded once per train, so count is 1 and first counts up from 0 to the number of trains. Instrument data is more variable.
Some older files use a different index format with first/last/status instead of first/count. In this case, a status of 0 means that no data was recorded for that train.
CONTROL and RUN¶
For each CONTROL entry in METADATA/dataSourceId
, there is a group with
that name in the file. This may have further arbitrarily nested subgroups
representing different properties of that device, e.g.
/CONTROL/SA1_XTD2_XGM/DOOCS/MAIN/current/bottom/output
.
The leaves of this tree are pairs of datasets called timestamp
and value
.
Each dataset has one entry per train, and the timestamp
record when the
value was updated, which is typically less than once per train. The value
dataset may have extra dimensions, but in most cases it is 1D.
(Does timestamp update if value is re-read but doesn’t change?)
RUN
holds a complete duplicate of the CONTROL
hierarchy, but each pair
of timestamp
and value
contain only one entry, taken at the start of
the run. There is still a dimension for this, so 2D value datasets in CONTROL
have corresponding 2D datasets in RUN, but the first dimension has length 1.
(Is RUN exactly duplicated in subsequent sequence files?)
INSTRUMENT¶
For each INSTRUMENT entry in METADATA/dataSourceId
, there is a group with
that name in the file. Each such group holds a 1D trainId
dataset, and a
number of other datasets (possibly nested in subgroups). All these datasets have
the same length in the first dimension: this represents the successive readings
taken. The slices defined by the corresponding datasets in INDEX work on
this dimension.
The trainId
dataset for each instrument group thus appears to be redundant
with the information in INDEX.