Note
This technote describes policy that has been adopted (RFC-741) but not fully implemented.
While the Gen3 Butler provides some intrinsic structure to its data repositories, considerably more is left to convention (often encoded in higher-level packages, like obs_base). This document will be - at least at first - a proposal for how to organize data repositories in detail, focusing on collection naming conventions, filesystem locations, and developer workflows. The immediate focus will be the environment at NCSA, but it is hoped that much of this will hold for the IDF and USDF as well. A core assumption is that there will be a very small number of large shared data repositories for all instruments at NCSA (and each other major facility), for “friendly” use by the DM team; in particular, we will have one data repository for all real (non-simulated) data.
Shared repositories insulate most users from having to worry about raw ingest, calibrations, references, and (now) even skymap definition. In Gen3, including data from multiple instruments in the same data repository opens up processing data from those instruments together (albeit with careful control over configuration, as our usual obs-package overrides may not work), and allows them to directly share reference catalogs and skymaps.
Smaller custom data repositories may also exist (especially for CI), but we hope to ensure that nearly all development work can be performed without the need for developers to ever create their own personal repositories. One or more large shared repositories for science users are also expected to exist, but are explicitly beyond the scope of this proposal.
Simulated data other than DESC DC2 (e.g. the NCSA “test stand”) is also explicitly not covered in this proposal; the author is not sufficiently familiar with the scope and structure of such data to make a proposal. The real-data repository should generally not contain simulations, however; this could easily lead to confusion and conflicts, while seeming to have no benefits at all.
After consultation with other stakeholders and ultimately RFC, at least some of the content here should probably be moved to the DM Developer Guide.
1 Collections¶
The Gen3 Butler organizes datasets primarily via collections, which are groups of datasets defined strictly in the database (while some collections may relate directly to filesystem or other storage locations, this should usually be considered an implementation detail).
There are several types of collections, which are quite different in how they how they are written, but almost identical to readers. A full description of the types of collections can be found in the daf_butler documentation, but a brief summary here should be sufficient for the proposals in this document:
- A
RUN
collection (or just “a run”) is the only type of collection that is intrinsic to a dataset - a dataset is added toRUN
collection when it is first added to a repository, and remains in thatRUN
collection until/unless it is deleted entirely. - A
CHAINED
collection is an ordered list of other collections to search. These can be nested arbitrarily (but cycles are of course not permitted), and may be redefined after creation. A single-elementCHAINED
collection can thus be used as a sort of symbolic link, providing a stable name for a conceptual group of datasets whose definition may change over time. - A
TAGGED
collection is an explict list of individual datasets. - A
CALIBRATION
collection is an explict list of (dataset, validity range) tuples, where the validity range is a timespan.
A dataset is always in exactly one RUN
collection, but may be in any number of TAGGED
and/or CALIBRATION
collections as well.
It is also important to note that collections are always groups of datasets, not data IDs (again, see the daf_butler documentation for more information on these concepts). So, for example, a collection can contain raws or coadds, but not visits or tracts.
In addition, collections are constrained to hold only one dataset with a particular combination of dataset type (e.g. “calexp”) and data ID, with the exception of CALIBRATION
collections, for which there may only be one dataset with a particular dataset type and data ID at any point in time.
Note
Butler nomenclature in both Gen2 and Gen3 uses “dataset” to refer to entities that are of order one file (e.g. “a raw” or “a coadd” is “a dataset”). Informally, we also frequently refer to large sets of these datasets and/or their data IDs (“HSC PDR1” or “DESC DC2”) as datasets, but I will avoid that usage whenever possible here.
The naming patterns for collections proposed here are summarized in Overview of collection naming conventions for real (non-simulated) data. and Overview of collection naming conventions for DESC DC2 data., with details and explanations in the following subsections.
Name Pattern | Type | Description |
---|---|---|
<instrument>/defaults | CHAINED | Recommended raw, calibration, and auxiliary data for <instrument>. |
<instrument>/raw | CHAINED | Recommended raw data for <instrument>. |
<instrument>/raw/<ticket> | TAGGED | Raw data curated to have no problems on <ticket>. |
<instrument>/raw/all | RUN | Where all raw data are originally ingested. |
<instrument>/calib | CHAINED | Recommended calibrations for <instrument>. |
<instrument>/calib/<ticket> | CALIBRATION | Calibrations certified on <ticket>. |
<instrument>/calib/<ticket>/* | unspecified | Calibration production runs. |
[<instrument>/]runs/<target>/<release>/<ticket> | CHAINED | Public outputs of processing data <target> with <release> on <ticket>. |
[<instrument>/]runs/<target>/<release>/<ticket>/* | unspecified | Private intermediates of processing data <target> with <release> on <ticket>. |
refcats | CHAINED | All reference catalogs (distinguished by dataset type). |
refcats/<ticket> | RUN | One reference catalog, ingested and sharded on <ticket>. |
skymaps | RUN | All skymap definition datasets (distinguished by data ID). |
pretrained_models/<model> | RUN | One pretrained neural network, with package name <model>. |
injection/defaults | CHAINED | All required input data for synthetic source injection. |
injection/catalogs | CHAINED | All synthetic source injection input catalogs. |
injection/catalogs/<ticket> | RUN | Synthetic source injection input catalog as defined on <ticket>. |
u/<user>/* | unspecified | Experimental/development processing by <user>. |
Name Pattern | Type | Description |
---|---|---|
<X>.<Y>[ip]/defaults | CHAINED | Recommended raw, calibration, and auxiliary data for a simulation run. |
<X>.<Y>[ip]/raw | CHAINED | Recommended raw data for for a simulation run. |
<X>.<Y>[ip]/raw/<ticket> | TAGGED | Raw data curated to have no problems on <ticket>. |
<X>.<Y>[ip]/raw/all | RUN | Where all raw data are originally ingested. |
<X>.<Y>[ip]/calib | CHAINED | Recommended calibrations for for a simulation run. |
<X>.<Y>[ip]/calib/<ticket> | CALIBRATION | Calibrations certified on <ticket>. |
<X>.<Y>[ip]/calib/<ticket>/* | unspecified | Calibration production runs. |
[<X>.<Y>[ip]/]runs/<target>/<release>/<ticket> | CHAINED | Public outputs of processing data <target> with <release> on <ticket>. |
[<X>.<Y>[ip]/]runs/<target>/<release>/<ticket>/* | unspecified | Private intermediates of processing data <target> with <release> on <ticket>. |
refcats | CHAINED | All reference catalogs (distinguished by dataset type). |
refcats/<ticket> | RUN | One reference catalog, ingested and sharded on <ticket>. |
skymaps | RUN | All skymap definition datasets (distinguished by data ID). |
u/<user>/* | unspecified | Experimental/development processing by <user>. |
1.1 Per-instrument collections¶
Raw and calibration data associated with a particular real-data instrument is organized into collections that start with the “short” instrument name, e.g. “HSC” or “LSSTCam-imSim”, followed by a slash.
The naming conventions for these collections are codified by the lsst.obs.base.Instrument class’s make*Name
methods.
The highest-level collections are always defined as CHAINED
collection “pointers” to other versioned collections.
In the case of raw data (including both science observations and raw calibrations), we propose three levels of collections:
<instrument>/raw/all
: theRUN
collection into which all raws for that instrument are originally ingested.<instrument>/raw/<ticket>
: aTAGGED
collection containing a curated subset of all raws that do not contain problems (e.g. tracking issues, airplanes, etc.), named according to the ticket (e.g.DM-98765
) on which the curation work was done.<instrument>/raw
: aCHAINED
collection that points to the current-best*/<ticket>
collection for this instrument.
The collections for master calibrations follow a similar pattern, but because master calibration datasets are produced by our own pipelines, not ingested, [1] there is no single RUN
collection that holds these all of these datasets directly.
As described further in 1.7 Calibration production, each processing run generates a new RUN
collection.
Certifying these calibration datasets - marking them as acceptable for use in calibrating observations taken in a certain temporal validity range - involves adding them to a CALIBRATION
collection.
These should have names of the form <instrument>/calib/<ticket>
, and we will use single-element CHAINED
collections of the form <instrument>/calib
as pointers to the current best set of calibrations for each instrument.
Note
CALIBRATION
collections that are not candidates for broad use (e.g. because they represent experimental work on a development branch) should instead start with u/<user>
, as described in 1.6 Developer processing outputs.
Finally, for convenience, we will define per-instrument CHAINED
collections with names of the form <instrument>/defaults
that aggregate:
- the recommended raws for that instrument (
<instrument>/raw/good
), - the recommended calibrations for that instrument (
<instrument>/calib
), - and cross-instrument auxiliary collections (
refcats
andskymaps
; see 1.2 Reference catalogs and 1.3 SkyMap definitions, respectively).
[1] | In Gen2, master calibration datasets were ingested, because the data repository in which they were produced was entirely different from the special calibration repository where they were put after certification. Gen3 data repositories are larger, with Gen3 collections corresponding more closely to Gen2 repositories. So certifying a master calibration in Gen3 just involves adding it to a new collection, not ingesting it into a new data repository. |
1.1.1 HSC-only auxiliary data¶
Our HSC processing uses bright object masks produced by external code.
By analogy with raw and calibration data, these will be stored in a HSC/masks/S18A
RUN
collection, with a HSC/masks
single-element CHAINED
collection pointer to the current best version.
S18A
refers to the HSC internal release in which these masks were first used.
While it is somewhat unlikely that we will ever add older mask versions or new masks in the same form to LSST data repositories (LSST processing is moving to a different approach to these masks, and HSC will probably follow suit), this gives us a clear place to put them without naming conflicts.
The top-level HSC/defaults
collection will include HSC/masks
as well.
This of course establishes a precedent for other instrument-specific auxiliary data, but we expect this to be sufficiently rare that new cases probably merit their own RFCs.
1.1.2 Per-instrument collections for DESC DC2¶
For DESC DC2 data repositories, a very similar structure is used, but <instrument>
is replaced here by a <X>.<Y>[ip]
simulation number; while DC2 data repositories may in general have multiple instruments (i.e. ImSim and PhoSim), the simulation version number is also necessary to distinguish between different raws and calibs.
It is assumed that all simulation versions utilize the same observational metadata (i.e. exposure
and visit
records), at least within each of ImSim and PhoSim, or that differences are sufficiently small that one simulation version’s observations can be used with raws from other simulation versions.
When this is not the case, different data repositories must be used for those incompatible simulation versions.
1.2 Reference catalogs¶
External reference catalogs reformatted and sharded by DM code are written to refcats/<ticket>
RUN
collections, where <ticket>
is the ticket on which the reformatting and sharding work was performed.
After a reference catalog has been validated, its RUN
is added to the overall refcats
CHAINED
collection.
Different collections for different reference catalogs are not necessary, as the name of a reference catalog (e.g. ps1_pv3_3pi_20170110
) is used directly as its dataset type name (note that this was not the case in Gen2, where the reference catalog name was instead part of the data ID).
1.3 SkyMap definitions¶
All skymaps must have a globally unique name in Gen3, which is used as part of the data ID for any dataset that is defined on tracts.
The skymap definition datasets (i.e. lsst.skymap.BaseSkyMap
subclass instances in Python) also include this globally unique name in their data IDs, and hence can also all go in a single skymaps
collection.
This is a RUN
collection that holds skymap definition datasets directly.
The existence of different skymap definition datasets for different coadd types (goodSeeingCoadd_skyMap
, etc.) is a relic of Gen2 that will soon be removed entirely from Gen3; all skymap definition datasets will just use the skyMap
dataset type.
The new globally-unique skymap data ID names are both necessary and sufficient for uniqueness in Gen3.
SkyMap registration is something we expect to be rare in Gen3 - much more rare than running makeSkyMap.py
was in Gen2 - because we almost always use one of a few standard SkyMaps, and in Gen3 a SkyMap (a combination of a lsst.skymap.BaseSkyMap
class and its configuration) can only be registered once.
Discrete SkyMaps, which typically cover only a small part of the sky and are conceptually a bit more per-user, may be less rare, but our data model currently does not treat these any differently, and until we can identify the patterns and use cases for creating new SkyMaps (even discrete ones), we propose that any new SkyMap registration in a shared repository be preceded by an RFC.
1.4 Source injection datasets¶
Input catalogs containing synthetic source parameters required for source injection are written to injection/catalogs/<ticket>
RUN
collections, where <ticket>
is the ticket on which the catalog was created.
After a catalog has been ingested, its RUN
is added to the overall injection/catalogs
CHAINED
collection.
All required input data for source injection, including input catalogs, are aggregated in the injection/defaults
CHAINED
collection.
1.6 Developer processing outputs¶
Processing initiated by DM developers that are intended primarily for personal or small-group use must start with u/<user>
(e.g. u/jbosch
), and are strongly encouraged to start with u/<user>/<ticket>
(e.g. u/jbosch/DM-56789
) whenever possible.
Names and structure after this prefix are at user discretion, but we strongly recommend using a combination of CHAINED
collections and RUN
collections to distinguish between “inputs and outputs” collections and “output only” collections, as in 1.5 Shared/official processing outputs.
The pipetask
tool will automatically take care of this if the --output
option is used with or instead of the --output-run
option.
Note
TODO: It’s unclear whether BPS supports this currently, but it should be easy to at least support it under the condition that the RUN
collection be given explicitly as well, instead of generated automatically by appending a timestamp.
1.7 Calibration production¶
Calibration production runs intended for broad use (i.e. outputs will be at least candidates for membership in the recommended calibration collection for this instrument) should output to collections with names that start with <instrument>/calib/<ticket>/
.
Those produced for experimental or development purposes should start with u/<user>/<ticket>/
.
In either case, the RUN
collections that hold output datasets directly will usually require another disambiguating term, mapping roughly to the expected validity range epoch.
Our current plan is to use our initial estimate of the start date of the validity range; this will rarely change (though if it does, the date in the RUN
collection - which cannot be changed after datasets have been written - will not reflect the actual validity range start date).
We could consider using only dates (possibly with human-incremented integer suffixes, as necessary) for RUN
terms while always using full (e.g. second-precision) date-times for actual validity values in order to reduce both confusion and verbosity.
Actual validity ranges are not assigned until datasets are certified (i.e. added to CALIBRATION
collections), and until then, the usual dataset type + data ID constraint applies (i.e. there can only be one bias
for each detector in a particular RUN
collection).
As noted in 1.1 Per-instrument collections, certified calibration products intended for broad use should go in CALIBRATION
collections named just <instrument>/calib/<ticket>
.
CALIBRATION
collections can also of course be nested under u/<user>/<ticket>
, but may not always be necessary for development work, because a RUN
or CHAINED
collection directly containing e.g. new bias
datasets can also be used as an input to a processing run that generates new flat
datasets (as long as only one calibration epoch is in play).
Note
While the middleware can use RUN
collections as inputs to later CPP processing steps, the CPP team may declare in the future that this should not be done as part of official calibration production.
Note
“Curated” calibration datasets that are written from a source-of-truth in an obs_*_data
git repository (rather than generated directly via pipeline processing) are currently written to RUN
collections with names of the form <instrument>/calib/curated/<calibDate>
, which are then ingested directly into an <instrument>/calib
CALIBRATION
collection (which clashes with our proposal earlier to make <instrument>/calib
a CHAINED
collection “pointer”).
The full workflow for curated calibrations is sufficiently unclear that it is unlikely that we will get this right in time for the first long-lived Gen3 repository.
Initially, our proposal is to use RUN
collections of the form <instrument>/calib/<ticket>/curated/<calibDate>
, and a CALIBRATION
collection of the form <instrument>/calib/<ticket>
(which would in general hold non-curated calibrations as well).
This leaves room for multiple curated calibration ingests to coexist, which is necessary because they will improve over time, but we don’t want to assume we can remove old ones.
It does not provide a way to avoid duplication of curated calibration datasets that have not changed.
Calibration collections created by converting the default Gen2 calibration repo for an instrument will use gen2/defaults
instead of <ticket>
, i.e. <instrument>/calib/gen2/defaults
for the CALIBRATION
collection.
1.8 Pretrained models¶
Pretrained neural networks for meas.transiNet.RbTransiNetTask
(and similar tasks that may be developed in the future) are stored in pretrained_models/<model>
RUN
collections, where <model>
is the model package name.
Models cannot be explicitly selected through task connections, but only implicitly by including a specific run in the execution inputs.
Therefore, there is no overall CHAINED
collection, and the models are not included in <instrument>/defaults
.
Collections disambiguated by ticket number are not necessary, as the name of a model (e.g. rbResnet50-DC2
) may include any versioning information, just as refcats are labeled by conversion date.
2 Filesystem locations¶
The main shared data repository for all real-data instruments at NCSA will have a public repository root of /repo/main
, which will be a symlink to a directory of the form /repo/main_<YYYYMMDD>
.
These directories will each contain a butler.yaml
file that points to the appropriate database (with a one-to-one correspondance between databases or database schemas and main_<YYYYMMDD>
directories).
The DC2 shared data repository will use an analogous structure with /repo/DC2
and /repo/DC2_<YYYYMMDD>
paths.
The default (POSIX) datastore will write datasets with templates that begin with the RUN
name, resulting in e.g. the datasets of per-instrument RUN
collections landing in /repo/main_<YYYYMMDD>/<instrument>/
and per-user RUN
collections landing in /repo/main_<YYYYMMDD>/u/<user>
.
Users are discouraged from inspecting these directories (as this will be at least quite different in the IDF or other future cloud-based datastores), and strongly discouraged from modifying them in any way other than via middleware tools.
In many cases, write access will actually be prohibited (see 3 Access Controls).
When migrations are necessary due to changes in the repository format (something that is always preceded by an RFC with explicit CCB approval), a new main_<YYYYMMDD>
(or DC2_<YYYYMMDD>
) directory and database/schema pair will be created, and files will shared via hard links until/unless the old repository is retired.
The existing /datasets
and /lsstdata
paths will remain largely as-is, and may be mounted as fully read-only in any context in which only Gen3-mediated access is needed.
Nested paths within that contain fully-Gen2-managed datasets (such as processing outputs) will be converted to Gen3 via hard-link transfers to the corresponding Gen3 location under /repo
.
These Gen2 locations may be removed when the Gen2 middleware is fully retired.
Files under these paths that are typically symlinked into Gen2 data repositories (such as raw data) will be ingested in-place into the Gen3 data repositories; symlinks would also be possible, but are unnecessary given that Gen3 supports in-place ingest.
These paths may renamed for consistency after Gen2 retirement, as long as the Gen3 database entries are updated accordingly.
2.1 Current paths and migrations describes the migration plan for existing dataset locations at NCSA in detail. New locations (and access controls) are described in 3.1 New paths and access controls.
2.1 Current paths and migrations¶
Path | Definition | Action | Target Gen3 Repo |
---|---|---|---|
/lsstdata | All raw Rubin Obs. data | Ingest contents to Gen3 with in-place transfer. | /repo/main |
/datasets/refcats | Gen2 reference catalog root. | Convert contents to Gen3 with hard-link transfer, retire with Gen2. | /repo/main |
/datasets/hsc/raws | HSC raw data | Convert contents to Gen3 with in-place transfer. Consider moving after Gen2 retirement for consistency. |
/repo/main |
/datasets/hsc/BrightObjectMasks | HSC externally-produced bright object masks | Convert contents to Gen3 with in-place transfer. Consider moving after Gen2 retirement for consistency. |
/repo/main |
/datasets/hsc/calib | HSC externally-produced calibs | Convert contents to Gen3 with in-place transfer. Consider moving after Gen2 retirement for consistency. |
/repo/main |
/datasets/hsc/repo/rerun | HSC Gen2 data repository | Convert contents to Gen3 with hard-link transfer, retire with Gen2. private/$USER subdirectories are skipped, and can be converted manually by users (into u/$USER collections) as desired. Shared reruns will land in collections and subdirectories nested under HSC/runs. |
/repo/main |
/datasets/decam/$DATE | DECam raw data | Convert contents to Gen3 with in-place transfer. Consider moving after Gen2 retirement for consistency. |
/repo/main |
/datasets/des_sn/$DATE | DECam raw data | Convert contents to Gen3 with in-place transfer. Consider moving after Gen2 retirement for consistency. |
/repo/main |
/datasets/decam/rerun/* | DECam Gen2 data repositories | Convert contents to Gen3 with hard-link transfer, retire with Gen2. private/$USER subdirectories are skipped, and can be converted manually by users (into u/$USER collections) as desired. Shared reruns will land in collections and subdirectories nested under DECam/runs. |
/repo/main |
/datasets/des_sn/repo_Y2 | DECam Gen2 data repository | Convert non-raw contents to Gen3 with hard-link transfer, retire with Gen2. private/$USER subdirectories are skipped, and can be converted manually by users (into u/$USER collections) as desired. Shared reruns will land in collections and subdirectories nested under DECam/runs. |
/repo/main |
/datasets/des_sn/repo_Y3 | DECam Gen2 data repository | Convert non-raw contents to Gen3 with hard-link transfer, retire with Gen2. private/$USER subdirectories are skipped, and can be converted manually by users (into u/$USER collections) as desired. Shared reruns will land in collections and subdirectories nested under DECam/runs. |
/repo/main |
/datasets/DC2/DR6 | External transfer of DESC DC2 raws, auxilliary, and processing for DP0. | Convert contents to Gen3 (at /datasets/DC2/repo_$DATE) with in-place transfer. | /repo/DC2 |
/datasets/DC2/DR6/registry.sqlite3 | Gen2 registry database | I don't know what this file is doing here; this is not a Gen2 data repository root. We should delete it or add a README explaining its presence. | |
/datasets/DC2/repo | DC2 Gen2 data repository | Rename to /datasets/DC2/repoRun1.2i to reflect contents, then handle like other directories with that name pattern. | /repo/DC2 |
/datasets/DC2/raw | DC2 raws of various versions | Convert contents to Gen3 with in-place transfer (as needed). Consider moving after Gen2 retirement for consistency. |
/repo/DC2 |
/datasets/DC2/repoRun$X.$Y/rerun | DC2 Gen2 data repository | Convert contents to Gen3 with hard-link transfer (as needed), retire with Gen2. | /repo/DC2 |
/datasets/DC2/repoRun$X.$Y/CALIB | DC2 Gen2 data repository | Ingest contents to Gen3 with in-place transfer (as needed). Consider moving after Gen2 retirement for consistency. |
/repo/DC2 |
/datasets/all-sky | All-sky camera images | Move to /lsstdata/offline/allsky | |
/datasets/ctio0m9 | Early LSST sensor data from CTIO 0.9m | Retire with Gen2 (no Gen3 support for this instrument is planned). | |
/datasets/gapon | Qserv test data | Ignore or relocate. | |
/datasets/lsstCam | Empty placeholder | Remove. | |
/datasets/sdss | SDSS Gen2 data repository and raws | Retire with Gen2 (no Gen3 support for this instrument is planned). | |
/datasets/ts8 | Test stand Gen2 data repository | Retire with Gen2 (if not in /lsstdata already, fix that). | |
/project/datasets | Effectively empty | Remove. | |
/project/test | Effectively empty | Remove. | |
/project/hsc/gen3repo | Temporary Gen3 data repositories | Retire after new shared repo has been up for a while. | |
/project/hsc_rc | Effectively empty | Remove. | |
/project/hsc_rc2_reproc | Effectively empty | Remove. | |
/project/RC | Effectively empty | Remove. | |
/project/shared/hsc | Symlink to /datasets/hsc/repo | Retire with Gen2. |
3 Access Controls¶
The current Gen3 registry architecture does not allow any fine-grained access control in the repository database; we instead rely on “friendly users” being careful and respectful of this shared space.
The access control rules for most users are extremely simple: do not create, modify, or write datasets to any collection unless it starts with u/<user>
.
References to both shared datasets and collection, and other user’s datasets and collections (via TAGGED
and CHAINED
collections) are allowed, but shared collections should not reference personal (u/<user>
) collections
We may be able to add a small number of guards via database access control systems (specifically PostgreSQL’s “row-level security”) in the future, but we do not ever plan to make these exhaustive (the long-term plans for butler access control involve a different approach; see DMTN-163). Our focus will be limited to the most important shared collections and those easiest to accidentally modify, and the details of these guards are beyond the scope of this document.
Warning
NEVER use psql
or other direct-SQL clients (e.g. the Python DBAPI or SQLAlchemy) to perform write operations in the repository database.
These can corrupt the data repository, and we have essentially no way to guard against them.
It should not be necessary in the long term to ever use direct SQL access even for read access; the SQL schema is not considered a public interface - but we recognize that this may be necessary for debugging for a while. This can be ensured by running:
SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY;
at the start of the session.
If you have to do this (and not at the prompting of a middleware team member trying to help diagnose a problem), please also create a ticket explaining what you wanted to do that couldn’t be done with butler tools, so we can address that feature gap.
We do plan to use filesystem access controls to protect shared and per-user files, and we plan to implement some checks in the Butler client itself to make it at least difficult to accidentally cause problems.
This proposal specifies filesystem access controls in terms of small number of groups that mostly grant permission to create subdirectories in files in various paths under /repo
.
How to map these to users, groups, and filesystem, directory, or file-level permissions in detail is something I’d prefer to leave to the system administrators.
All directories in /repo
should always be world-readable.
In addition, all directories in /datasets
and /lsstdata
are expected to be read-only (from the perspective of Gen3 data repositories) and world-readable.
Note
The first version of this document proposed much more extensive changes to the access controls in /datasets
, to enable easier access (i.e. without admin action) to shared datasets by the expert developers that actually oversee them.
Those aspects of the proposal have been dropped because they were a distraction from (and a lower priority than) getting a shared, long-term Gen3 data repository up and available.
Access controls for directories under /repo
are detailed in the table below.
3.1 New paths and access controls¶
Path | Definition | Group | Access Controls | Notes |
---|---|---|---|---|
/repo/main | -> /repo/main_$BEST | Read-only. | This is a symlink to the latest/best data repository for all non-simulated data. | |
/repo/main_$DATE | Gen3 data repository for all non-simulated data. | Read-only. | ||
/repo/main_$DATE/u | Root for personal collections. | All users can create user-owned subdirectories matching their usernames. | If possible, make username subdirectories recursively world-readable by default, and disable creation of non-username subdirectories or files here. | |
/repo/main_$DATE/runs | Root for shared instrument-generic processing outputs. | operators | Group members can create user-owned subdirectories. | If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/main_$DATE/$INSTRUMENT/runs | Root for shared instrument-specific processing outputs. | operators | Group members can create user-owned subdirectories. | If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/main_$DATE/$INSTRUMENT/calib | Root for shared calibration production outputs. | cpp | Group members can create user-owned subdirectories. | If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/main_$DATE/refcats | Root for ingested reference catalogs. | refcats | Group members can create user-owned subdirectories. | No permanent group members; users granted temporary access upon request. If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/main_$DATE/skymaps | Root for SkyMap-definition datasets. | skymaps | Group members can create user-owned files. | No permanent group members; users granted temporary access upon request. If possible, disable creation of directories here. |
/repo/DC2 | -> /repo/DC2_$BEST | Read-only. | This is a symlink to the latest/best data repository for all DC2 data. | |
/repo/DC2_$DATE | Gen3 data repository for DC2. | Read-only. | ||
/repo/DC2_$DATE/u | Root for personal collections. | All users can create user-owned subdirectories matching their usernames. | If possible, make username subdirectories recursively world-readable by default, and disable creation of non-username subdirectories or files here. | |
/repo/DC2_$DATE/$X.$Y[ip]/runs | Root for shared processing outputs (for one simulation version). | operators | Group members can create user-owned subdirectories. | If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/DC2_$DATE/runs | Root for shared processing outputs (mixed simulation versions). | operators | Group members can create user-owned subdirectories. | If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/DC2_$DATE/$X.$Y[ip]/calib | Gen3-produced calibrations to be used with this DC2 version. | cpp | Group members can create user-owned subdirectories. | If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/DC2_$DATE/refcats | Root for ingested reference catalogs. | refcats | Users granted temporary access upon request. | No permanent group members; users granted temporary access upon request. If possible, make created subdirectories group-writeable and world-readable (recursively), and disable creation of files here. |
/repo/DC2_$DATE/skymaps | Root for SkyMap-definition datasets. | skymaps | Group members can create user-owned files. | No permanent group members; users granted temporary access upon request. If possible, disable creation of directories here. |
4 Personal and test-package repositories¶
This proposal is primarily concerned with long-lived, shared data repositories of the sort that will exist not just at NCSA, but at the IDF, SLAC, CCIN2P3, and other major LSST data facilities.
Small repositories (typically backed by SQLite) are also expected to be common, especially for small-scale CI and local development.
These repositories should follow the same naming patterns whenever possible, but will generally not need as many levels of indirection to guard against future changes or collections, and many of the collections defined here as CHAINED
or TAGGED
collections can instead be safely defined directly as RUN
collections instead.
5 Notable omissions and future work¶
5.1 “Collections” of data IDs¶
Collections that represent fields of particular interest or regularly-reprocessed test datasets are not described here, because those are conceptually more groups of data IDs than groups of datasets (e.g. not just raw exposures, but tracts on which to combine them as well).
As in Gen2, we will continue to record the definitions of these groups outside the data repository itself, though we may add support for in-repository storage of data IDs to Gen3 in the future.
It is also worth noting that exposure or visit metadata can sometimes be used to help select some of these data IDs (e.g. visit.target_name='SSP-Wide'
), and these selections are automatically combined with the selection of a <instrument>/raw/good
input collection.
5.2 Naming conventions for dataset types¶
The names for nearly all dataset types in Gen3 have been inherited directly from Gen2, and while these are sorely in need of standardization and cleanup, we have no plans to change to new names until Gen2 has been fully retired. Naming conventions for new dataset types would be welcome before then, but are still beyond the scope of this proposal.
In the meantime, users should be aware that dataset types are global entities with no implicit namespacing, and hence new dataset types should be created with care.
The pipetask
tool’s --register-dataset-types
option is a non-default option for exactly this reason: in a long-lived repository, re-executions of the same pipeline will eventually outnumber executions of new pipelines (especially new pipelines with new datasets), and hence --register-dataset-types
should rarely be needed.
Passing it all the time as a matter of habit is an antipattern, because it makes it easy for a typo to result in long-lived, hard-to-clean-up garbage (dataset types can be removed, but only if there are no datasets of that type).
5.3 Provenance and Reproducibility¶
The plan for provenance in the Gen3 butler is centered around storing the directed acyclic graph (DAG) of datasets and processing “quanta” that is used to drive PipelineTask
execution, after updating it with the unique identifiers of the datasets actually produced and annotating it with information about which input datasets are actually used by the (rare) PipelineTasks
that may not use all predicted inputs.
While some provenance information (e.g. software versions and configurations) are currently associated directly with RUN
collections (and this information, at least, may always be), and CHAINED
collections provide some information about what datasets were used as inputs when creating others (see 1.5 Shared/official processing outputs), these do not carry sufficiently detailed information about the relationships between datasets to meet our needs.
Using fine-grained provenance information to exactly reprocess a DAG will actually be quite different from starting a new run “from scratch”, as it doesn’t involve providing collections or data IDs as inputs - the input datasets are already fully resolved, so there is no need to search for them in collections, and the data IDs are intrinsic to those datasets.
We will also need to provide ways to almost exactly reprocess a DAG, of course - e.g. replacing the initial resolved datasets with new collection + data ID searches, modifying Task
configuration in a way that does not change the DAG (or changes it only in a limited sense), and probably more.
All of this fine-grained provenance is not yet implemented, however, and at present the only way to guarantee reproducibility is for all input collections to have exactly the same state they had when the original run was performed. The standard collections defined in this document are poorly suited for this role, however; we consider it more important for these to track the “current best” (or in the case of raws, recent observations) than it is for them remain immutable. Users should thus be aware that repeated processing runs using the same input collections (and everything else held constant) are not intended to always produce the same results (and this is a feature, not a bug).