Release Notes¶

0.7¶

Data access¶

A new get_dask_array() method to access data as a Dask array (PR #212). Dask is a powerful tool for working with large amounts of data and doing computation in parallel.
open_run() and RunDirectory() now take an optional include= glob pattern to select files to open (PR #221). This can make opening a run faster if you only need to read certain files.
Trying to open a run directory to which you don’t have read access now correctly raises PermissionError (PR #210).
stack_detector_data() has a new parameter real_array. Passing real_array=False avoids copying the data into a temporary array on the way to assembling images with detector geometry (PR #196).
When you open a run directory with open_run() or RunDirectory(), karabo_data tries to cache the metadata describing what data is in each file (PR #206). Once the cache is created, opening the run again should be much faster, as it only needs to open the files containing the data you want. See Cached run data maps for the details of how this works.
Importing karabo_data is faster, as packages like xarray and pandas are now only loaded if you use the relevant methods (PR #207).
lsxfel and info() are faster in some cases, as they only look in one file for the detector data shape (PR #219).
get_array() is slightly faster, as it avoids copying data in memory unnecessarily (PR #209)
When you select sources with select() or deselect(), the resulting DataCollection no longer keeps references to files with no selected data. This should make it easier to then combine data with union() in some situations (PR #202).
Data validation now checks that indexes have one entry per train ID.

Detector geometry¶

plot_data_fast() is much more flexible, e.g. if you want to add a colorbar or draw the image as part of a larger figure (PR #205). See its documentation for the new parameters.

0.6¶

Data access¶

The karabo-bridge-serve-files command now takes --source and --key options to select data to stream. They can be used with exact source names or with glob-style patterns, e.g. --source '*/DET/*' (PR #183).
Skip checking that .h5 files in a run directory are HDF5 before trying to open them (PR #187). The error is still handled if they are not.

Detector geometry¶

Assembling detector data into images can now reuse an output array - see position_modules_fast() and output_array_for_position_fast() (PR #186).
CrystFEL format geometry files can now be written for 2D input arrays with the modules arranged along the slow-scan axis, as used by OnDA (PR #191). To do this, pass dims=('frame', 'ss', 'fs') to write_crystfel_geom().
The geometry code has been reworked to use metres internally (PR #193), along with other refactorings in PR #184 and PR #192. These changes should not affect the public API.

0.5¶

Data access¶

New method get_data_counts() to find how many data points were recorded in each train for a given source and key.
Create a virtual dataset for any single dataset with get_virtual_dataset() (PR #162). See Parallel processing with a virtual dataset for how this can be useful.
Write a file with virtual datasets for all selected data with write_virtual() (PR #132).
Data from the supported multi-module detectors (AGIPD, LPD & DSSC) can be exposed in CXI format using a virtual dataset - see write_virtual_cxi() (PR #150, PR #166, PR #173).
New class DSSC for accessing DSSC data (PR #171).
New function open_run() to access a run by proposal and run number rather than path (PR #147).
stack_detector_data() now allows input data where some sources don’t have the specified key (PR #141).
Files in the new 1.0 data format can now be opened (PR #182).

Detector geometry¶

New class DSSC_Geometry for handling DSSC detector geometry (PR #155).
LPD_1MGeometry can now read and write CrystFEL format geometry files, and produce PyFAI distortion arrays (PR #168, PR #129).
write_crystfel_geom() (for AGIPD and LPD geometry) now accepts various optional parameters for other details to be written into the geometry file, such as the detector distance (clen) and the photon energy (PR #168).
New method get_pixel_positions() to get the physical position of every pixel in a detector, for all of AGIPD, LPD and DSSC (PR #142).
New method data_coords_to_positions() to convert data array coordinates to physical positions, for AGIPD and LPD (PR #142).

0.4¶

Python 3.5 is now the minimum required version.
Fix compatibility with numpy 1.14 (the version installed in Anaconda on the Maxwell cluster).
Better error message from stack_detector_data() when passed non-detector data.

0.3¶

New features:

New interfaces for working with AGIPD, LPD & DSSC Geometry.
New interfaces for accessing AGIPD, LPD & DSSC data.
select_trains() can now select arbitrary specified trains, not just a slice.
get_array() can take a region of interest (roi) parameter to select a slice of data from each train.
A newly public keys_for_source() method to list keys for a given source.

Fixes:

stack_detector_data() can handle missing detector modules.
Source sets have been changed to frozen sets. Use select() to choose a subset of sources.
get_array() now only loads the data for selected trains.
get_array() works with data recorded more than once per train.

0.2¶

New command karabo-data-validate to check the integrity of data files.
New methods to select a subset of data: select(), deselect(), select_trains(), union(),
Selected data can be written back to a new HDF5 file with write().
RunDirectory() and H5File() are now functions which return a DataCollection object, rather than separate classes. Most code using these should still work, but checking the type with e.g. isinstance() may break.