Database Specification

Legend

keys or files
shape
data type
<variable>
HDF5 structure
- key: group, or dtype if members have same shape and no identifiers
  - if group, <sub-key-variable>: dtype variable sub-key without type
  - if dtype: shape same shape for all members
- other-key: group <clustering> dtype shape definition for all group members
  - specific-member: properties for some specific member of group

General Data Specification

all data is stored as HDF5
file names with underscores for spaces and dashes as key-separators
all keys are in singular
each data instance is one large file
- input data is only one per dataset and model
- attribution is one per dataset, model and attribution method
- analysis is one per dataset, model, attribution method and analysis topic

Model Input Data

shape for image data is samples x channel x height x width
since preprocessing depends on the model, we supply a file <model>.input.h5 with all preprocessing steps applied
HDF5 structure
- data: group, or float32 if every sample has same dimensions
  - if group: <data-id> float32 channel x height x width if group, <data-id> can be a filename or an identifier
  - if float32: float32 samples x channel x height x width
- label: group, or uint if data is group
  - if group: <data-id> uint 1 for single label, or bool classes for multi label
  - if uint: samples for single label, bool samples x classes for multi label if data is float32
- index: group <data-id> uint 1, optional
  - if data is group, assign indices to keys
  - otherwise natural sort order of keys is assumed

Attribution of Input Data

<attribution-strategy> can be:
- true: for true label
- model: for model prediction
- <integer> for choosing a fixed label
- <else> for something I did not think of
HDF5 structure
- index: uint samples indices of attributed input samples in the input file
- attribution: group or float32 attributions with full channel information
  - if group: <data-id> float32 channel x height x width
  - if float32: samples x channel x height x width
- label group or float32, attribution assigned for the model output, governed by <attribution-strategy>
  - if group: <data-id> samples x {1, <classes>}
  - if uint: samples x {1, <classes>}
- prediction: group or float32
  - if group: <data-id> float32 classes
  - if float32: samples x classes

Analysis Output Data

HDF5 structure
- <analysis-identifier> group with name of the analysis as sub-keys (not necessarily classes!, WordNet-id for class-wise ImageNet Analysis)
  - name: string verbose name of analysis
  - index: uint32 samples sample indices in the input attribution file
  - embedding: group <embedding-id>
    - spectral: group
      - name: string, verbose name of embedding
      - root: float32 samples x eigenvalues Eigenvectors of Eigen Decomposition
      - base: link, if not model input, link to the embedding used
      - region: region reference, if not model input or not full embedding, region reference to the features used as input
      - eigenvalue float32 eigenvalues Eigenvalues for the spectral embedding
    - tsne: group
      - name: string, verbose name of embedding
      - root: float32 samples x 2 t-SNE Embedding
      - base: link, if not model input, link to the embedding used
      - region: region reference, if not model input or not full embedding, region reference to the features used as input
  - clustering: group
    - <clustering>: group label for clusters on an embedding
      - name: string, verbose name of embedding
      - root: uint32 samples labels for clustering on embedding
      - base: link link to the embedding used for clustering
      - region: region reference, if not model input or not full embedding, region reference to the features used as input
      - #clusters: int, optional if not applying, number of clusters for this clustering
      - prototype: group multiple prototypes for each cluster
        
        average: group member average prototypes for all clusters
        
        name: string, verbose name of prototype
        
        root: float32 <#clusters> x channel x height x width prototype payload