Release Notes


Data access

  • A new get_dask_array() method to access data as a Dask array (PR #212). Dask is a powerful tool for working with large amounts of data and doing computation in parallel.
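As a rough sketch of what working with the result looks like (the `run.get_dask_array(...)` call in the comment assumes a karabo_data run object and uses placeholder source/key names), a Dask array behaves like a lazy, chunked NumPy array:

```python
import numpy as np
import dask.array as da

# get_dask_array() returns a dask array: computation is deferred and chunked.
# Here we build a small dask array by hand to show the pattern; with real
# data you would write something like (source and key are placeholders):
#   arr = run.get_dask_array('SPB_DET_AGIPD1M-1/DET/0CH0:xtdf', 'image.data')
data = da.from_array(np.arange(100.0).reshape(10, 10), chunks=(5, 5))

# Nothing is read or computed until .compute() is called, so reductions
# over data larger than memory can run chunk by chunk, and in parallel.
mean_per_row = data.mean(axis=1).compute()
print(mean_per_row[0])  # mean of 0..9 -> 4.5
```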

  • open_run() and RunDirectory() now take an optional include= glob pattern to select files to open (PR #221). This can make opening a run faster if you only need to read certain files.
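The include= pattern uses glob-style matching against file names. A minimal sketch of those semantics using only the standard library (the file names below are invented for illustration):

```python
from fnmatch import fnmatch

# Invented file names of the kind found in a run directory:
files = [
    'RAW-R0034-AGIPD00-S00000.h5',
    'RAW-R0034-DA01-S00000.h5',
    'CORR-R0034-DA01-S00000.h5',
]

# A pattern like '*DA01*' keeps only the DA01 files, so e.g.
# RunDirectory(path, include='*DA01*') would skip the detector files.
selected = [f for f in files if fnmatch(f, '*DA01*')]
print(selected)
```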

  • Trying to open a run directory to which you don’t have read access now correctly raises PermissionError (PR #210).

  • stack_detector_data() has a new parameter real_array. Passing real_array=False avoids copying the data into a temporary array on the way to assembling images with detector geometry (PR #196).
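The copy that real_array=False avoids can be pictured with this rough NumPy sketch (not the library's implementation; module numbers and shapes are invented). The default behaviour allocates one big array and copies every module into it:

```python
import numpy as np

# Per-module frames, roughly as stack_detector_data() might receive them
# (module numbers and shapes are invented for illustration).
modules = {0: np.ones((512, 128)), 1: np.full((512, 128), 2.0)}
n_modules = 4  # pretend modules 2 and 3 are missing

# real_array=True behaviour: copy everything into one new array,
# with missing modules filled with NaN.
stacked = np.full((n_modules, 512, 128), np.nan)
for modno, frame in modules.items():
    stacked[modno] = frame

print(stacked.shape)               # (4, 512, 128)
print(np.isnan(stacked[2]).all())  # missing module -> NaN
```

With real_array=False, the modules are instead wrapped in an array-like view, skipping this copy on the way to geometry assembly.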

  • When you open a run directory with open_run() or RunDirectory(), karabo_data tries to cache the metadata describing what data is in each file (PR #206). Once the cache is created, opening the run again should be much faster, as it only needs to open the files containing the data you want. See Cached run data maps for the details of how this works.

  • Importing karabo_data is faster, as packages like xarray and pandas are now only loaded if you use the relevant methods (PR #207).

  • lsxfel and info() are faster in some cases, as they only look in one file for the detector data shape (PR #219).

  • get_array() is slightly faster, as it avoids copying data in memory unnecessarily (PR #209).


  • When you select sources with select() or deselect(), the resulting DataCollection no longer keeps references to files with no selected data. This should make it easier to then combine data with union() in some situations (PR #202).

  • Data validation now checks that indexes have one entry per train ID.
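The index check amounts to comparing lengths. A minimal sketch of the idea (the array names are invented, not the validator's code):

```python
import numpy as np

# In EuXFEL HDF5 files, an index has one (first, count) entry per train ID.
train_ids = np.array([10001, 10002, 10003])
index_first = np.array([0, 5, 12])
index_count = np.array([5, 7, 3])

# The new validation boils down to checks like this:
ok = len(index_first) == len(train_ids) == len(index_count)
print(ok)
```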

Detector geometry

  • plot_data_fast() is much more flexible, e.g. if you want to add a colorbar or draw the image as part of a larger figure (PR #205). See its documentation for the new parameters.


Data access

  • The karabo-bridge-serve-files command now takes --source and --key options to select data to stream. They can be used with exact source names or with glob-style patterns, e.g. --source '*/DET/*' (PR #183).

  • Skip checking that .h5 files in a run directory are HDF5 before trying to open them (PR #187). The error is still handled if they are not.

Detector geometry

  • New class DSSC_Geometry for handling DSSC detector geometry (PR #155).

  • LPD_1MGeometry can now read and write CrystFEL format geometry files, and produce PyFAI distortion arrays (PR #168, PR #129).

  • write_crystfel_geom() (for AGIPD and LPD geometry) now accepts various optional parameters for other details to be written into the geometry file, such as the detector distance (clen) and the photon energy (PR #168).

  • New method get_pixel_positions() to get the physical position of every pixel in a detector, for all of AGIPD, LPD and DSSC (PR #142).

  • New method data_coords_to_positions() to convert data array coordinates to physical positions, for AGIPD and LPD (PR #142).
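The idea behind mapping pixel indices to physical positions can be sketched with plain NumPy: each pixel sits at a tile's corner position plus offsets along the slow-scan (ss) and fast-scan (fs) axes. The corner, vectors, and tile shape below are invented, not any real detector's geometry:

```python
import numpy as np

# Invented tile geometry: corner position and per-pixel step vectors for
# the slow-scan (ss) and fast-scan (fs) axes, in metres.
corner = np.array([0.1, -0.2, 0.0])
ss_vec = np.array([0.0, 2e-4, 0.0])  # one pixel step along slow-scan
fs_vec = np.array([2e-4, 0.0, 0.0])  # one pixel step along fast-scan

ss, fs = np.meshgrid(np.arange(4), np.arange(8), indexing='ij')
# position = corner + ss * ss_vec + fs * fs_vec, for every pixel in the tile
positions = (corner
             + ss[..., None] * ss_vec
             + fs[..., None] * fs_vec)
print(positions.shape)  # (4, 8, 3): x, y, z for each pixel
```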


  • Python 3.5 is now the minimum required version.

  • Fix compatibility with numpy 1.14 (the version installed in Anaconda on the Maxwell cluster).

  • Better error message from stack_detector_data() when passed non-detector data.


New features:


  • stack_detector_data() can handle missing detector modules.

  • Source sets have been changed to frozen sets. Use select() to choose a subset of sources.

  • get_array() now only loads the data for selected trains.

  • get_array() works with data recorded more than once per train.


  • New command karabo-data-validate to check the integrity of data files.

  • New methods to select a subset of data: select(), deselect(), select_trains(), and union().

  • Selected data can be written back to a new HDF5 file with write().

  • RunDirectory() and H5File() are now functions which return a DataCollection object, rather than separate classes. Most code using these should still work, but checking the type with e.g. isinstance() may break.