Reading data with karabo_data

This command creates the sample data files used in the rest of this example. These files contain no real data, but they have the same structure as European XFEL’s HDF5 data files.

[1]:
!python3 -m karabo_data.tests.make_examples
Written examples.

Single files

[2]:
!h5ls fxe_control_example.h5
CONTROL                  Group
INDEX                    Group
INSTRUMENT               Group
METADATA                 Group
RUN                      Group
[3]:
from karabo_data import H5File
f = H5File('fxe_control_example.h5')
[4]:
f.control_sources
[4]:
frozenset({'FXE_XAD_GEC/CAM/CAMERA',
           'SA1_XTD2_XGM/DOOCS/MAIN',
           'SPB_XTD9_XGM/DOOCS/MAIN'})
[5]:
f.instrument_sources
[5]:
frozenset({'FXE_XAD_GEC/CAM/CAMERA:daqOutput',
           'SA1_XTD2_XGM/DOOCS/MAIN:output',
           'SPB_XTD9_XGM/DOOCS/MAIN:output'})

Get data by train

[6]:
for tid, data in f.trains():
    print("Processing train", tid)
    print("beam iyPos:", data['SA1_XTD2_XGM/DOOCS/MAIN']['beamPosition.iyPos.value'])

    break
Processing train 10000
beam iyPos: 0.0
[7]:
tid, data = f.train_from_id(10005)
data['FXE_XAD_GEC/CAM/CAMERA:daqOutput']['data.image.dims']
[7]:
array([1024,  255], dtype=uint64)

These are just a few of the ways to access data. The attributes and methods described below for run directories also work with individual files. We expect that it will normally make sense to access a run directory as a single object, rather than working with the files separately.

Run directories

An experimental run is recorded as a collection of files in a directory.

Another dummy example:

[8]:
!ls fxe_example_run/
RAW-R0450-DA01-S00000.h5   RAW-R0450-LPD04-S00000.h5  RAW-R0450-LPD10-S00000.h5
RAW-R0450-DA01-S00001.h5   RAW-R0450-LPD05-S00000.h5  RAW-R0450-LPD11-S00000.h5
RAW-R0450-LPD00-S00000.h5  RAW-R0450-LPD06-S00000.h5  RAW-R0450-LPD12-S00000.h5
RAW-R0450-LPD01-S00000.h5  RAW-R0450-LPD07-S00000.h5  RAW-R0450-LPD13-S00000.h5
RAW-R0450-LPD02-S00000.h5  RAW-R0450-LPD08-S00000.h5  RAW-R0450-LPD14-S00000.h5
RAW-R0450-LPD03-S00000.h5  RAW-R0450-LPD09-S00000.h5  RAW-R0450-LPD15-S00000.h5
[9]:
from karabo_data import RunDirectory
run = RunDirectory('fxe_example_run/')
[10]:
run.files[:3]   # The objects for the individual files (see above)
[10]:
[FileAccess(<HDF5 file "RAW-R0450-LPD04-S00000.h5" (mode r)>),
 FileAccess(<HDF5 file "RAW-R0450-LPD11-S00000.h5" (mode r)>),
 FileAccess(<HDF5 file "RAW-R0450-LPD15-S00000.h5" (mode r)>)]

What devices were recording in this run?

Control devices are slow data, recording once per train. Instrument devices includes detector data, but also some other data sources such as cameras. They can have more than one reading per train.

[11]:
run.control_sources
[11]:
frozenset({'FXE_XAD_GEC/CAM/CAMERA',
           'FXE_XAD_GEC/CAM/CAMERA_NODATA',
           'SA1_XTD2_XGM/DOOCS/MAIN',
           'SPB_XTD9_XGM/DOOCS/MAIN'})
[12]:
run.instrument_sources
[12]:
frozenset({'FXE_DET_LPD1M-1/DET/0CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/10CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/11CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/12CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/13CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/14CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/15CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/1CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/2CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/3CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/4CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/5CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/6CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/7CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/8CH0:xtdf',
           'FXE_DET_LPD1M-1/DET/9CH0:xtdf',
           'FXE_XAD_GEC/CAM/CAMERA:daqOutput',
           'FXE_XAD_GEC/CAM/CAMERA_NODATA:daqOutput',
           'SA1_XTD2_XGM/DOOCS/MAIN:output',
           'SPB_XTD9_XGM/DOOCS/MAIN:output'})

Which trains are in this run?

[13]:
print(run.train_ids[:10])
[10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009]

See the available keys for a given source:

[14]:
run.keys_for_source('SPB_XTD9_XGM/DOOCS/MAIN:output')
[14]:
{'data.intensityAUXTD',
 'data.intensitySigma.x_data',
 'data.intensitySigma.y_data',
 'data.intensityTD',
 'data.trainId',
 'data.xTD',
 'data.yTD'}

This collects data from across files, including detector data:

[15]:
for tid, data in run.trains():
    print("Processing train", tid)
    print("Detctor data module 0 shape:", data['FXE_DET_LPD1M-1/DET/0CH0:xtdf']['image.data'].shape)

    break  # Stop after the first train to keep the demo short
Processing train 10000
Detctor data module 0 shape: (128, 1, 256, 256)

Train IDs are meant to be globally unique (although there were some glitches with this in the past). A train index is only within this run.

[16]:
tid, data = run.train_from_id(10005)
tid, data = run.train_from_index(5)

Series data to pandas

Data which holds a single number per train (or per pulse) can be extracted to as series (individual columns) and dataframes (tables) for pandas, a widely-used tool for data manipulation.

karabo_data chains sequence files, which contain successive data from the same source. In this example, trains 10000–10399 are in one sequence file (...DA01-S00000.h5), and 10400–10479 are in another (...DA01-S00001.h5). They are concatenated into one series:

[17]:
ixPos = run.get_series('SA1_XTD2_XGM/DOOCS/MAIN', 'beamPosition.ixPos.value')
ixPos.tail(10)
[17]:
trainId
10470    0.0
10471    0.0
10472    0.0
10473    0.0
10474    0.0
10475    0.0
10476    0.0
10477    0.0
10478    0.0
10479    0.0
Name: SA1_XTD2_XGM/DOOCS/MAIN/beamPosition.ixPos, dtype: float32

To extract a dataframe, you can select interesting data fields with glob syntax, as often used for selecting files on Unix platforms.

  • [abc]: one character, a/b/c

  • ?: any one character

  • *: any sequence of characters

[18]:
run.get_dataframe(fields=[("*_XGM/*", "*.i[xy]Pos")])
[18]:
SA1_XTD2_XGM/DOOCS/MAIN/beamPosition.ixPos SA1_XTD2_XGM/DOOCS/MAIN/beamPosition.iyPos SPB_XTD9_XGM/DOOCS/MAIN/beamPosition.ixPos SPB_XTD9_XGM/DOOCS/MAIN/beamPosition.iyPos
trainId
10000 0.0 0.0 0.0 0.0
10001 0.0 0.0 0.0 0.0
10002 0.0 0.0 0.0 0.0
10003 0.0 0.0 0.0 0.0
10004 0.0 0.0 0.0 0.0
10005 0.0 0.0 0.0 0.0
10006 0.0 0.0 0.0 0.0
10007 0.0 0.0 0.0 0.0
10008 0.0 0.0 0.0 0.0
10009 0.0 0.0 0.0 0.0
10010 0.0 0.0 0.0 0.0
10011 0.0 0.0 0.0 0.0
10012 0.0 0.0 0.0 0.0
10013 0.0 0.0 0.0 0.0
10014 0.0 0.0 0.0 0.0
10015 0.0 0.0 0.0 0.0
10016 0.0 0.0 0.0 0.0
10017 0.0 0.0 0.0 0.0
10018 0.0 0.0 0.0 0.0
10019 0.0 0.0 0.0 0.0
10020 0.0 0.0 0.0 0.0
10021 0.0 0.0 0.0 0.0
10022 0.0 0.0 0.0 0.0
10023 0.0 0.0 0.0 0.0
10024 0.0 0.0 0.0 0.0
10025 0.0 0.0 0.0 0.0
10026 0.0 0.0 0.0 0.0
10027 0.0 0.0 0.0 0.0
10028 0.0 0.0 0.0 0.0
10029 0.0 0.0 0.0 0.0
... ... ... ... ...
10450 0.0 0.0 0.0 0.0
10451 0.0 0.0 0.0 0.0
10452 0.0 0.0 0.0 0.0
10453 0.0 0.0 0.0 0.0
10454 0.0 0.0 0.0 0.0
10455 0.0 0.0 0.0 0.0
10456 0.0 0.0 0.0 0.0
10457 0.0 0.0 0.0 0.0
10458 0.0 0.0 0.0 0.0
10459 0.0 0.0 0.0 0.0
10460 0.0 0.0 0.0 0.0
10461 0.0 0.0 0.0 0.0
10462 0.0 0.0 0.0 0.0
10463 0.0 0.0 0.0 0.0
10464 0.0 0.0 0.0 0.0
10465 0.0 0.0 0.0 0.0
10466 0.0 0.0 0.0 0.0
10467 0.0 0.0 0.0 0.0
10468 0.0 0.0 0.0 0.0
10469 0.0 0.0 0.0 0.0
10470 0.0 0.0 0.0 0.0
10471 0.0 0.0 0.0 0.0
10472 0.0 0.0 0.0 0.0
10473 0.0 0.0 0.0 0.0
10474 0.0 0.0 0.0 0.0
10475 0.0 0.0 0.0 0.0
10476 0.0 0.0 0.0 0.0
10477 0.0 0.0 0.0 0.0
10478 0.0 0.0 0.0 0.0
10479 0.0 0.0 0.0 0.0

480 rows × 4 columns

Labelled arrays

Data with extra dimensions can be handled as xarray labelled arrays. These are a wrapper around Numpy arrays with indexes which can be used to align them and select data.

[19]:
xtd2_intensity = run.get_array('SA1_XTD2_XGM/DOOCS/MAIN:output', 'data.intensityTD', extra_dims=['pulseID'])
xtd2_intensity
[19]:
<xarray.DataArray (trainId: 480, pulseID: 1000)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
Coordinates:
  * trainId  (trainId) uint64 10000 10001 10002 10003 ... 10477 10478 10479
Dimensions without coordinates: pulseID

Here’s a brief example of using xarray to align the data and select by train ID. See the examples in the xarray docs for more on what it can do.

In this example data, all the data sources have the same range of train IDs, so aligning them doesn’t change anything. In real data, devices may miss some trains that other devices did record.

[20]:
import xarray as xr
xtd9_intensity = run.get_array('SPB_XTD9_XGM/DOOCS/MAIN:output', 'data.intensityTD', extra_dims=['pulseID'])

# Align two arrays, keep only trains which they both have data for:
xtd2_intensity, xtd9_intensity = xr.align(xtd2_intensity, xtd9_intensity, join='inner')

# Select data for a single train by train ID:
xtd2_intensity.sel(trainId=10004)

# Select data from a range of train IDs.
# This includes the end value, unlike normal Python indexing
xtd2_intensity.loc[10004:10006]
[20]:
<xarray.DataArray (trainId: 3, pulseID: 1000)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
Coordinates:
  * trainId  (trainId) uint64 10004 10005 10006
Dimensions without coordinates: pulseID

You can also specify a region of interest from an array to load only part of the data:

[21]:
from karabo_data import by_index

# Select the first 5 trains in this run:
sel = run.select_trains(by_index[:5])

# Get the whole of this array:
arr = sel.get_array('FXE_XAD_GEC/CAM/CAMERA:daqOutput', 'data.image.pixels')
print("Whole array shape:", arr.shape)

# Get a region of interest
arr2 = sel.get_array('FXE_XAD_GEC/CAM/CAMERA:daqOutput', 'data.image.pixels', roi=by_index[100:200, :512])
print("ROI array shape:", arr2.shape)
Whole array shape: (5, 255, 1024)
ROI array shape: (5, 100, 512)

General information

karabo_data provides a few ways to get general information about what’s in data files. First, from Python code:

[22]:
run.info()
# of trains:    480
Duration:       0:00:47.900000
First train ID: 10000
Last train ID:  10479

16 detector modules (FXE_DET_LPD1M-1)
  e.g. module FXE_DET_LPD1M-1 0 : 256 x 256 pixels
  128 frames per train, 61440 total frames

4 instrument sources (excluding detectors):
  - FXE_XAD_GEC/CAM/CAMERA:daqOutput
  - FXE_XAD_GEC/CAM/CAMERA_NODATA:daqOutput
  - SA1_XTD2_XGM/DOOCS/MAIN:output
  - SPB_XTD9_XGM/DOOCS/MAIN:output

4 control sources:
  - FXE_XAD_GEC/CAM/CAMERA
  - FXE_XAD_GEC/CAM/CAMERA_NODATA
  - SA1_XTD2_XGM/DOOCS/MAIN
  - SPB_XTD9_XGM/DOOCS/MAIN

[23]:
run.detector_info('FXE_DET_LPD1M-1/DET/0CH0:xtdf')
[23]:
{'dims': (256, 256), 'frames_per_train': 128, 'total_frames': 61440}

The lsxfel command provides similar information at the command line:

[24]:
!lsxfel fxe_example_run/RAW-R0450-LPD00-S00000.h5
RAW-R0450-LPD00-S00000.h5 : Raw detector data from LPD module 00
480 trains

256 × 256 pixels
128 frames per train, 61440 total
[25]:
!lsxfel fxe_example_run/RAW-R0450-DA01-S00000.h5
RAW-R0450-DA01-S00000.h5 : Aggregated data
400 trains

4 instrument sources
  -  FXE_XAD_GEC/CAM/CAMERA:daqOutput
  -  FXE_XAD_GEC/CAM/CAMERA_NODATA:daqOutput
  -  SA1_XTD2_XGM/DOOCS/MAIN:output
  -  SPB_XTD9_XGM/DOOCS/MAIN:output

4 control sources
  -  FXE_XAD_GEC/CAM/CAMERA
  -  FXE_XAD_GEC/CAM/CAMERA_NODATA
  -  SA1_XTD2_XGM/DOOCS/MAIN
  -  SPB_XTD9_XGM/DOOCS/MAIN

[26]:
!lsxfel fxe_example_run
fxe_example_run : Run directory

# of trains:    480
Duration:       0:00:47.900000
First train ID: 10000
Last train ID:  10479

16 detector modules (FXE_DET_LPD1M-1)
  e.g. module FXE_DET_LPD1M-1 0 : 256 x 256 pixels
  128 frames per train, 61440 total frames

4 instrument sources (excluding detectors):
  - FXE_XAD_GEC/CAM/CAMERA:daqOutput
  - FXE_XAD_GEC/CAM/CAMERA_NODATA:daqOutput
  - SA1_XTD2_XGM/DOOCS/MAIN:output
  - SPB_XTD9_XGM/DOOCS/MAIN:output

4 control sources:
  - FXE_XAD_GEC/CAM/CAMERA
  - FXE_XAD_GEC/CAM/CAMERA_NODATA
  - SA1_XTD2_XGM/DOOCS/MAIN
  - SPB_XTD9_XGM/DOOCS/MAIN