[ENH] BEP036 - Phenotypic Data Guidelines #2123
Open
ericearl wants to merge 17 commits into bids-standard:master from surchs:master
3cedc86 Merge pull request #2 from bids-standard/master (ericearl)
11fbb47 Merge pull request #3 from bids-standard/master (ericearl)
0ef9fdf [ENH] Integrate BEP036 - Phenotypic Data Guidelines (surchs)
0a640e6 Update phenotype.md and data-summary-files.md (ericearl)
a19512b [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
5718888 Update data-summary-files.md and phenotypic-and-assessment-data.md (ericearl)
8f54e94 Apply suggestions from code review (ericearl)
94cb476 Apply suggestions from code review (ericearl)
142c460 Apply suggestions from code review (ericearl)
8b78359 Update src/modality-agnostic-files/phenotypic-and-assessment-data.md (ericearl)
60f712a Update mkdocs.yml (ericearl)
e62b5cc Update src/modality-agnostic-files/phenotypic-and-assessment-data.md (ericearl)
ac097aa Update phenotype.md to have a macro table from schema (ericearl)
32fedd0 Update src/schema/rules/tabular_data/modality_agnostic.yaml (ericearl)
aacda9b Update modality_agnostic.yaml (ericearl)
fd5ff2d Update phenotype.md appendix and modality_agnostic.yaml schema (ericearl)
dd65b5e Update modality_agnsotic.yaml (ericearl)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,335 @@ | ||
# Tabular phenotypic data guidelines | ||
|
||
This appendix is a collection of guidelines and examples for creating well-organized aggregated tabular phenotypic data. | ||
|
||
## Guidelines | ||
|
||
These guidelines are all **RECOMMENDED** when preparing | ||
tabular phenotypic data like the | ||
participants file, sessions file, demographics file, | ||
or phenotypic and assessment data. | ||
The language below uses REQUIRED, MUST, and similar keywords to state | ||
what following each of these **RECOMMENDED** guidelines entails. | ||
|
||
### 1. Always pair tabular data with data dictionaries | ||
|
||
Tabular phenotypic data MUST be prepared as one pair of a tabular file | ||
in tab-separated value (TSV) format and a corresponding data dictionary | ||
in JavaScript Object Notation (JSON) format. | ||
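
As an illustration of the pairing rule, the following minimal sketch lists TSV files that have no same-named JSON data dictionary next to them. The function name `find_unpaired_tsvs` is hypothetical and is not part of BIDS or of any validator; the sketch only checks file names, not file contents.

```python
from pathlib import Path


def find_unpaired_tsvs(directory):
    """Return names of TSV files that lack a same-named JSON data dictionary.

    Hypothetical helper for illustration only: it looks at one directory
    (for example `phenotype/`) and compares file names, nothing more.
    """
    directory = Path(directory)
    return sorted(
        tsv.name
        for tsv in directory.glob("*.tsv")
        if not tsv.with_suffix(".json").is_file()
    )
```

Calling `find_unpaired_tsvs("phenotype")` on a dataset that follows this guideline would be expected to return an empty list.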
|
||
### 2. Aggregate data across sessions | ||
|
||
Aggregation refers to the contents of the TSV file. It is REQUIRED | ||
to collect the data of all participants and sessions into one TSV per tabular phenotypic file. | ||
|
||
### 3. Ensure minimal annotation for phenotypic and assessment data | ||
|
||
In phenotypic and assessment data each measurement tool has an independent | ||
aggregated data TSV file in which the user collects all subjects, sessions, | ||
and/or runs of data as one entry per row (with a row defined by | ||
the smallest unit of acquisition). In other words: | ||
|
||
1. Each row MUST start with `participant_id`. | ||
|
||
1. Each TSV file MUST contain a `session_id` column when | ||
multiple [sessions](../glossary.md#session-entities)[^1] are present | ||
in the data set regardless of whether those sessions are in | ||
the `phenotype/` data, `sub-<label>/` data, or a combination of the two. | ||
|
||
1. If the same measurement tool is acquired more than once within | ||
the same `session_id`, a `run` column MUST be added. | ||
|
||
1. To encode the acquisition time for a measurement tool’s `session_id`, | ||
add the `session_id` to the sessions file and | ||
include the OPTIONAL `acq_time` column. | ||
|
||
To summarize this guideline as a table: | ||
|
||
<!-- This block generates a columns table. | ||
The definitions of these fields can be found in | ||
src/schema/rules/tabular_data/*.yaml | ||
and a guide for using macros can be found at | ||
https://github.com/bids-standard/bids-specification/blob/master/macros_doc.md | ||
--> | ||
{{ MACROS___make_columns_table("modality_agnostic.Phenotypes") }} | ||
|
||
Furthermore, if you have to add a `session_id` column to the | ||
tabular phenotypic data, you then MUST also introduce a session directory to the | ||
imaging data, even if only one imaging session has been created. | ||
This rule can be considered as "**if anyone uses sessions, everyone uses sessions**." | ||
And vice versa, if imaging data has session directories, | ||
all imaging data and tabular phenotypic data MUST have sessions. | ||
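
The "if anyone uses sessions, everyone uses sessions" rule can be sketched as a simple all-or-nothing check. This is a minimal sketch under stated assumptions: the function name `sessions_used_consistently` is hypothetical, tabular files are represented only by their header columns, and imaging files are represented by dataset-relative paths written with `/` separators.

```python
def sessions_used_consistently(tsv_headers, imaging_paths):
    """Return True if sessions are used everywhere or nowhere.

    tsv_headers: list of header lists, one per tabular phenotypic file.
    imaging_paths: list of dataset-relative file paths using "/" separators.
    Both inputs are assumptions made for this illustration.
    """
    # A tabular file uses sessions when it has a session_id column.
    tsv_flags = ["session_id" in header for header in tsv_headers]
    # An imaging file uses sessions when one of its path parts is a ses-<label> directory.
    path_flags = [
        any(part.startswith("ses-") for part in path.split("/"))
        for path in imaging_paths
    ]
    flags = tsv_flags + path_flags
    return all(flags) or not any(flags)
```

A dataset that mixes a `session_id` column with `sub-01/anat/` paths (no session directory) would fail this check.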
|
||
This produces a file in which same-participant entries can take up as many rows | ||
as needed according to the smallest unit of acquisition. | ||
The combination of values in the `participant_id`, `session_id`, and `run` (if present) | ||
columns MUST be unique for the entire tabular file. | ||
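
The uniqueness requirement can be checked with the Python standard library alone. The sketch below is illustrative, not an official validator: `duplicate_keys` is a hypothetical name, and the function takes the TSV contents as a string rather than reading a file.

```python
import csv
import io
from collections import Counter


def duplicate_keys(tsv_text):
    """Return the key tuples that appear on more than one row.

    The key is built from participant_id, session_id, and run,
    using only those of the three columns that are present.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    key_columns = [
        column
        for column in ("participant_id", "session_id", "run")
        if column in (reader.fieldnames or [])
    ]
    counts = Counter(tuple(row[column] for column in key_columns) for row in reader)
    return [key for key, count in counts.items() if count > 1]
```

An empty result means every row is uniquely identified; any returned tuple points to rows that would need a `run` column (or a correction) to be told apart.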
|
||
### 4. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool | ||
|
||
Whenever possible, it is RECOMMENDED to add `MeasurementToolMetadata` to | ||
each `phenotype/<measurement_tool_name>.json` data dictionary. | ||
This improves reusability and provides clarity about the measurement tool. | ||
|
||
### 5. Use the demographics file for common variables about participants | ||
|
||
Some studies collect demographics into their own tabular phenotypic data file already. | ||
In these cases, it is RECOMMENDED to house this data in the `phenotype/` directory | ||
as a TSV called `demographics.tsv` and its corresponding data dictionary JSON | ||
called `demographics.json`. | ||
|
||
### 6. Store longitudinal age in the demographics file | ||
|
||
It is RECOMMENDED to use the `age` column to record participant age | ||
at every session in longitudinal or multi-session data sets. | ||
This reduces data duplication across tabular data files. The `Units` of `age` | ||
do not have to be years so long as the units of the age | ||
are written in `phenotype/demographics.json`. | ||
Consider participant privacy or study objectives when selecting | ||
the `Units` of `age` or the accuracy of `age` data. | ||
|
||
### 7. Use the sessions file at the root level | ||
|
||
If there is more than one session for any one participant, then | ||
it is REQUIRED to provide a sessions file at the dataset root. | ||
The sessions file MUST list all sessions for all subjects across | ||
imaging and tabular phenotypic data. | ||
|
||
When a sessions file is in use, you MUST NOT provide additional sessions files | ||
at the participant-level which would otherwise use the inheritance principle. | ||
If a sessions file is provided, then it MUST begin with a `participant_id` column | ||
followed immediately by a `session_id` column. The data dictionary JSON file’s | ||
`session_id` field MUST include `Levels` with the description of each `session_id`. | ||
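
The two structural requirements on the sessions file (column order and described `Levels`) could be checked roughly as follows. This is a sketch under assumptions: `check_sessions_file` is a hypothetical helper, it receives the contents of `sessions.tsv` and `sessions.json` as strings, and it only tests the two rules named in this guideline.

```python
import json


def check_sessions_file(tsv_text, dictionary_text):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    lines = tsv_text.splitlines()
    header = lines[0].split("\t") if lines else []
    # Rule 1: participant_id first, session_id immediately after.
    if header[:2] != ["participant_id", "session_id"]:
        problems.append("first two columns must be participant_id, session_id")
        return problems
    # Rule 2: every session_id used in the table is described under Levels.
    used = {line.split("\t")[1] for line in lines[1:] if line.strip()}
    levels = json.loads(dictionary_text).get("session_id", {}).get("Levels", {})
    for session in sorted(used - set(levels)):
        problems.append(f"session_id Levels lacks a description for {session}")
    return problems
```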
|
||
### 8. Record acquisition time of sessions with `acq_time` | ||
|
||
Whenever possible, it is RECOMMENDED to also collect acquisition time for | ||
tabular phenotypic data and store the time of acquisition[^2] of each row | ||
inside a column named `acq_time` in the sessions file. | ||
This is consistent with how acquisition time is recorded for MRI data | ||
and other time-sensitive measurements (for example systolic blood pressure). | ||
|
||
When needed to preserve participant privacy, you SHOULD record | ||
relative acquisition times with respect to the earliest session. | ||
Relative session acquisition times MAY be listed as durations from | ||
the earliest session (baseline) in days, months, or years | ||
using the `acq_time` column. | ||
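
One way to derive relative acquisition times is sketched below. It assumes that "earliest session" means each participant's own earliest session and that durations are expressed in whole days; both choices, and the name `days_from_baseline`, are illustrative assumptions rather than something this guideline prescribes.

```python
from datetime import datetime


def days_from_baseline(rows):
    """Convert absolute acquisition times to whole days since baseline.

    rows: list of (participant_id, session_id, ISO 8601 datetime string).
    Baseline is taken per participant (an assumption of this sketch).
    Partial days are dropped, so 364 days and 23 hours counts as 364.
    """
    parsed = [(p, s, datetime.fromisoformat(t)) for p, s, t in rows]
    baseline = {}
    for participant, _, time in parsed:
        baseline[participant] = min(time, baseline.get(participant, time))
    return [(p, s, (t - baseline[p]).days) for p, s, t in parsed]
```

Coarser units (months or years) reveal less about the original dates, so the choice of unit is itself a privacy decision.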
|
||
## Summary | ||
|
||
This appendix described eight guidelines for preparing tabular phenotypic data. | ||
The short summary table below describes when to use which files. | ||
|
||
| File | Single session data | Multiple session data | | ||
| :----------------------------- | :------------------ | :-------------------- | | ||
| Participants | RECOMMENDED | RECOMMENDED | | ||
| Phenotypic and assessment data | RECOMMENDED | RECOMMENDED | | ||
| Sessions | OPTIONAL | REQUIRED | | ||
| Demographics | OPTIONAL | RECOMMENDED | | ||
|
||
## Examples | ||
|
||
What follows are a few common use case examples for tabular phenotypic files. | ||
|
||
### 1 participant session with both non-tabular and tabular phenotypic data | ||
|
||
File tree | ||
|
||
```Text | ||
phenotype/ | ||
<measurement_tool_name>.json | ||
<measurement_tool_name>.tsv | ||
sub-01/anat/ | ||
sub-01_T1w.json | ||
sub-01_T1w.nii.gz | ||
``` | ||
|
||
Contents of `phenotype/<measurement_tool_name>.tsv` | ||
|
||
```tsv | ||
participant_id measurement_1 measurement_2 | ||
sub-01 value1 value2 | ||
``` | ||
|
||
### 1 participant with 2 sessions, where 1 session is only tabular phenotype and the other is only imaging | ||
|
||
With only one imaging and one phenotypic session each in this example you might want | ||
to merge both imaging and phenotypic data under one session. But it is more correct to | ||
have separate sessions for the imaging and phenotypic data, especially if | ||
the sessions were collected days, weeks, or months apart. You can denote both sessions | ||
and their acquisition time in the `sessions.tsv` file and have `session_id` `Levels` noted | ||
in the `sessions.json` sidecar. Below are a CORRECT and an INCORRECT example | ||
of prepared data following these guidelines. | ||
|
||
#### CORRECT | ||
|
||
File tree | ||
|
||
```Text | ||
phenotype/ | ||
<measurement_tool_name>.json | ||
<measurement_tool_name>.tsv | ||
sub-01/ses-MRI/anat/ | ||
sub-01_ses-MRI_T1w.json | ||
sub-01_ses-MRI_T1w.nii.gz | ||
``` | ||
|
||
Contents of `phenotype/<measurement_tool_name>.tsv` | ||
|
||
```tsv | ||
participant_id session_id measurement_1 measurement_2 | ||
sub-01 ses-pheno value1 value2 | ||
``` | ||
|
||
#### INCORRECT | ||
|
||
File tree | ||
|
||
```Text | ||
phenotype/ | ||
<measurement_tool_name>.json | ||
<measurement_tool_name>.tsv | ||
sub-01/anat/ | ||
sub-01_T1w.json | ||
sub-01_T1w.nii.gz | ||
``` | ||
|
||
Contents of `phenotype/<measurement_tool_name>.tsv` | ||
|
||
```tsv | ||
participant_id measurement_1 measurement_2 | ||
sub-01 value1 value2 | ||
``` | ||
|
||
A session directory **MUST** be present in the participant directory and | ||
the `session_id` column **MUST** be present in `<measurement_tool_name>.tsv` as well. | ||
Sessions must be used consistently for the combination of tabular and | ||
non-tabular phenotypic data. | ||
|
||
### 2 participants with a mix of tabular phenotypic data and imaging sessions | ||
|
||
File tree | ||
|
||
```Text | ||
phenotype/ | ||
<measurement_tool_name>.json | ||
<measurement_tool_name>.tsv | ||
sub-01/ | ||
ses-MRI1/ | ||
anat/ | ||
sub-01_ses-MRI1_T1w.json | ||
sub-01_ses-MRI1_T1w.nii.gz | ||
ses-MRI2/ | ||
anat/ | ||
sub-01_ses-MRI2_T1w.json | ||
sub-01_ses-MRI2_T1w.nii.gz | ||
sub-02/ | ||
ses-MRI1/ | ||
anat/ | ||
sub-02_ses-MRI1_T1w.json | ||
sub-02_ses-MRI1_T1w.nii.gz | ||
``` | ||
|
||
Contents of `phenotype/<measurement_tool_name>.tsv` | ||
|
||
```tsv | ||
participant_id session_id measurement_1 measurement_2 | ||
sub-01 ses-pheno1 value1 value2 | ||
sub-02 ses-pheno1 value3 value4 | ||
sub-02 ses-pheno2 value5 value6 | ||
``` | ||
|
||
### 3 participants with 3 different kinds of sessions among them | ||
|
||
The `ses-baseline` session collects an MRI and tabular phenotypic data. | ||
|
||
File tree | ||
|
||
```Text | ||
participants.json | ||
participants.tsv | ||
sessions.json | ||
sessions.tsv | ||
phenotype/ | ||
demographics.json | ||
demographics.tsv | ||
... | ||
sub-01/ | ||
ses-baseline/ | ||
ses-followupMRI/ | ||
sub-02/ | ||
ses-baseline/ | ||
sub-03/ | ||
ses-baseline/ | ||
ses-followupMRI/ | ||
``` | ||
|
||
Contents of `sessions.tsv`. | ||
|
||
```tsv | ||
participant_id session_id acq_time | ||
sub-01 ses-baseline 2001-01-01T12:05:00 | ||
sub-01 ses-followupMRI 2001-07-01T13:33:00 | ||
sub-01 ses-interview 2002-01-01T11:21:00 | ||
sub-02 ses-baseline 2001-04-01T11:01:00 | ||
sub-02 ses-interview 2002-04-01T14:08:00 | ||
sub-03 ses-baseline 2001-09-01T11:45:00 | ||
sub-03 ses-followupMRI 2002-03-01T12:17:00 | ||
``` | ||
|
||
Contents of `sessions.json`. Note how the `session_id` `Levels` are clearly described. | ||
|
||
```json | ||
{ | ||
"participant_id": { | ||
"Description": "BIDS participant identifier" | ||
}, | ||
"session_id": { | ||
"Description": "BIDS session identifier", | ||
"Levels": { | ||
"ses-baseline": "Baseline visit for MRI and assessments", | ||
"ses-followupMRI": "6-months after baseline MRI follow-up", | ||
"ses-interview": "1-year after baseline in-person follow-up" | ||
} | ||
}, | ||
"acq_time": { | ||
"Description": "When the data acquisition started" | ||
} | ||
} | ||
``` | ||
|
||
Contents of `participants.tsv`. | ||
|
||
```tsv | ||
participant_id sex | ||
sub-01 M | ||
sub-02 F | ||
sub-03 F | ||
``` | ||
|
||
Contents of `phenotype/demographics.tsv`. Measures or features that can change | ||
from session to session especially belong here. | ||
|
||
```tsv | ||
participant_id session_id age gender race household_income | ||
sub-01 ses-baseline 10 3 4 5 | ||
sub-01 ses-followupMRI 10 3 4 5 | ||
sub-01 ses-interview 11 4 4 6 | ||
sub-02 ses-baseline 9 1 3 3 | ||
sub-02 ses-interview 10 1 7 3 | ||
sub-03 ses-baseline 11 2 10 4 | ||
sub-03 ses-followupMRI 12 5 10 4 | ||
``` | ||
|
||
For more complete examples, see the `pheno00*` | ||
[bids-examples on GitHub](https://github.com/bids-standard/bids-examples/). | ||
|
||
[^1]: A session is any logical grouping of imaging and behavioral data consistent | ||
across participants. A session can (but doesn't have to) be synonymous with a visit | ||
in a longitudinal study. In situations where different data types are obtained over | ||
several visits (for example fMRI on one day followed by DWI the day after) | ||
those can still be grouped in one session. Refer to the | ||
[definition of session](../glossary.md#session-entities) for more details. | ||
|
||
[^2]: Datetime format and the anonymization procedure are | ||
described in [Units](../common-principles.md#units). |
Copying from https://github.com/surchs/bids-specification/pull/1/files#r2103117486
For this section, would it make sense to suggest that demo-like information be prioritized in this file rather than `participants.tsv`, making the latter primarily a list of subject IDs? I haven't seen this explicitly addressed anywhere, though I'm unsure if it's something we want to formalize 😬 Something like this could follow the paragraph?:
I agree. It'd be good to mention this.