
[ENH] BEP036 - Phenotypic Data Guidelines #2123

Open: wants to merge 24 commits into base: master
Changes from 17 commits

Commits
3cedc86
Merge pull request #2 from bids-standard/master
ericearl May 20, 2025
11fbb47
Merge pull request #3 from bids-standard/master
ericearl May 30, 2025
0ef9fdf
[ENH] Integrate BEP036 - Phenotypic Data Guidelines
surchs May 30, 2025
0a640e6
Update phenotype.md and data-summary-files.md
ericearl May 30, 2025
a19512b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 30, 2025
5718888
Update data-summary-files.md and phenotypic-and-assessment-data.md
ericearl May 30, 2025
8f54e94
Apply suggestions from code review
ericearl May 30, 2025
94cb476
Apply suggestions from code review
ericearl May 30, 2025
142c460
Apply suggestions from code review
ericearl May 30, 2025
8b78359
Update src/modality-agnostic-files/phenotypic-and-assessment-data.md
ericearl May 30, 2025
60f712a
Update mkdocs.yml
ericearl May 30, 2025
e62b5cc
Update src/modality-agnostic-files/phenotypic-and-assessment-data.md
ericearl May 30, 2025
ac097aa
Update phenotype.md to have a macro table from schema
ericearl Jun 24, 2025
32fedd0
Update src/schema/rules/tabular_data/modality_agnostic.yaml
ericearl Jun 24, 2025
aacda9b
Update modality_agnostic.yaml
ericearl Jun 24, 2025
fd5ff2d
Update phenotype.md appendix and modality_agnostic.yaml schema
ericearl Jun 24, 2025
dd65b5e
Update modality_agnsotic.yaml
ericearl Jun 24, 2025
f4205e8
add missing column objects, use existing acq column definition (#4)
rwblair Jul 17, 2025
0eba71d
Updates for BEP036
ericearl Jul 17, 2025
d3631a8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 17, 2025
f4939ad
Updates phentoype.md and Guideline 3 in the modality agmpstic section
ericearl Jul 17, 2025
abd5c2b
Update modality_agnostic.yaml
ericearl Jul 17, 2025
ec2c53d
Update phenotypic-and-assessment_data.md
ericearl Jul 17, 2025
7639001
Merge branch 'master' into master
ericearl Jul 17, 2025
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -46,6 +46,7 @@ nav:
 - Quantitative MRI: appendices/qmri.md
 - Arterial Spin Labeling: appendices/arterial-spin-labeling.md
 - Cross modality correspondence: appendices/cross-modality-correspondence.md
+- Phenotypic data guidelines: appendices/phenotype.md
 - Changelog: CHANGES.md
 - The BIDS Starter Kit:
 - Website: https://bids-standard.github.io/bids-starter-kit/
335 changes: 335 additions & 0 deletions src/appendices/phenotype.md
@@ -0,0 +1,335 @@
# Tabular phenotypic data guidelines

This appendix is a collection of guidelines and examples for creating well-organized aggregated tabular phenotypic data.

## Guidelines

These guidelines are all **RECOMMENDED** when preparing
tabular phenotypic data like the
participants file, sessions file, demographics file,
or phenotypic and assessment data.
The language below uses keywords such as REQUIRED and MUST to state
what is needed in order to follow these **RECOMMENDED** guidelines.

### 1. Always pair tabular data with data dictionaries

Tabular phenotypic data MUST be prepared as one pair of a tabular file
in tab-separated value (TSV) format and a corresponding data dictionary
in JavaScript Object Notation (JSON) format.
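
The pairing rule above can be checked mechanically. The following is a minimal sketch, assuming each data dictionary sits next to its TSV file and shares its base name; `find_unpaired_tsvs` is a hypothetical helper name, not part of any BIDS tool.

```python
from pathlib import Path


def find_unpaired_tsvs(phenotype_dir):
    """Return names of TSV files that lack a same-named JSON data dictionary.

    Assumption: the dictionary for `<name>.tsv` is `<name>.json` in the same directory.
    """
    phenotype_dir = Path(phenotype_dir)
    return sorted(
        tsv.name
        for tsv in phenotype_dir.glob("*.tsv")
        if not tsv.with_suffix(".json").is_file()
    )
```

Calling `find_unpaired_tsvs("phenotype")` on a dataset root returns an empty list when every TSV file has its dictionary.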

### 2. Aggregate data across sessions

Aggregation refers to the contents of the TSV file. It is REQUIRED
to collect all participant data into one TSV per tabular phenotypic file.

### 3. Ensure minimal annotation for phenotypic and assessment data

In phenotypic and assessment data each measurement tool has an independent
aggregated data TSV file in which the user collects all subjects, sessions,
and/or runs of data as one entry per row (with a row defined by
the smallest unit of acquisition). In other words:

1. Each row MUST start with `participant_id`.

1. Each TSV file MUST contain a `session_id` column when
multiple [sessions](../glossary.md#session-entities)[^1] are present
in the data set regardless of whether those sessions are in
the `phenotype/` data, `sub-<label>/` data, or a combination of the two.

1. If the same measurement tool is acquired more than once within
the same `session_id`, a `run` column MUST be added.

1. To encode the acquisition time for a measurement tool’s `session_id`,
add the `session_id` to the sessions file and
include the OPTIONAL `acq_time` column.

To summarize this guideline as a table:

<!-- This block generates a columns table.
The definitions of these fields can be found in
src/schema/rules/tabular_data/*.yaml
and a guide for using macros can be found at
https://github.com/bids-standard/bids-specification/blob/master/macros_doc.md
-->
{{ MACROS___make_columns_table("modality_agnostic.Phenotypes") }}

Furthermore, if you have to add a `session_id` column to the
tabular phenotypic data, you then MUST also introduce a session directory to the
imaging data, even if only one imaging session has been created.
This rule can be considered as "**if anyone uses sessions, everyone uses sessions**."
And vice versa, if imaging data has session directories,
all imaging data and tabular phenotypic data MUST have sessions.

This produces a file in which same-participant entries can take up as many rows
as needed according to the smallest unit of acquisition.
The combination of values in the `participant_id`, `session_id`, and `run` (if present)
columns MUST be unique for the entire tabular file.
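
As an illustration of the uniqueness rule, the sketch below reports every key combination that appears on more than one row. It uses only the key columns that are actually present in the header; the function name and the example values are hypothetical.

```python
import csv
import io
from collections import Counter

KEY_COLUMNS = ("participant_id", "session_id", "run")


def duplicate_keys(tsv_text):
    """Return key tuples that appear on more than one row of a phenotype TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Only the key columns present in the header take part in the check.
    present = [name for name in KEY_COLUMNS if name in (reader.fieldnames or [])]
    counts = Counter(tuple(row[name] for name in present) for row in reader)
    return [key for key, count in counts.items() if count > 1]


# Hypothetical data: run 2 of ses-pheno1 is listed twice for sub-01.
example = (
    "participant_id\tsession_id\trun\tscore\n"
    "sub-01\tses-pheno1\t1\t12\n"
    "sub-01\tses-pheno1\t2\t14\n"
    "sub-01\tses-pheno1\t2\t15\n"
)
print(duplicate_keys(example))
```

An empty list means the file satisfies the rule.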

### 4. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool

Whenever possible, it is RECOMMENDED to add `MeasurementToolMetadata` to
each `phenotype/<measurement_tool_name>.json` data dictionary.
This improves reusability and provides clarity about the measurement tool.
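
A small check for this guideline might look like the sketch below. It assumes that `MeasurementToolMetadata` is an object with `Description` and `TermURL` fields, as in the phenotypic and assessment data section of the specification; this appendix does not list the fields itself, so treat that as an assumption. The function name is hypothetical.

```python
import json

# Assumed sub-fields of MeasurementToolMetadata (not stated in this appendix).
EXPECTED_FIELDS = ("Description", "TermURL")


def missing_tool_metadata(dictionary_text):
    """Return the expected MeasurementToolMetadata fields absent from a data dictionary."""
    dictionary = json.loads(dictionary_text)
    metadata = dictionary.get("MeasurementToolMetadata", {})
    return [field for field in EXPECTED_FIELDS if field not in metadata]
```

An empty result means both assumed fields are present; a dictionary without the object returns both field names.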

### 5. Use the demographics file for common variables about participants

Review comment (Contributor):

Copying from https://github.com/surchs/bids-specification/pull/1/files#r2103117486

For this section, would it make sense to suggest that demo-like information be prioritized in this file rather than participants.tsv, making the latter primarily a list of subject IDs? I haven't seen this explicitly addressed anywhere, though I'm unsure if it's something we want to formalize 😬
Something like this could follow the paragraph?:

When all demographic data is stored in phenotype/demographics.tsv, participants.tsv may serve primarily as a minimal listing of subject identifiers with only the participant_id column.

Reply (Contributor):

I agree. It'd be good to mention this.


Some studies collect demographics into their own tabular phenotypic data file already.
In these cases, it is RECOMMENDED to house this data in the `phenotype/` directory
as a TSV called `demographics.tsv` and its corresponding data dictionary JSON
called `demographics.json`.

### 6. Store longitudinal age in the demographics file

It is RECOMMENDED to use the `age` column to record participant age
at every session in longitudinal or multi-session data sets.
This reduces data duplication across tabular data files. The `Units` of `age`
do not have to be years so long as the units of the age
are written in `phenotype/demographics.json`.
Consider participant privacy or study objectives when selecting
the `Units` of `age` or the accuracy of `age` data.

### 7. Use the sessions file at the root level

If there is more than one session for any one participant, then
it is REQUIRED to provide a sessions file at the dataset root.
The sessions file MUST list all sessions for all subjects across
imaging and tabular phenotypic data.

When a root-level sessions file is in use, you MUST NOT provide additional sessions files
at the participant level, which would otherwise rely on the inheritance principle.
If a sessions file is provided, then it MUST begin with a `participant_id` column
followed immediately by a `session_id` column. The data dictionary JSON file’s
`session_id` field MUST include `Levels` with the description of each `session_id`.
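
The column order and `Levels` requirements above can be sketched as a simple check. The function below is a hypothetical helper that takes the text of the sessions TSV and of its data dictionary; it is a sketch under those assumptions, not a validator.

```python
import csv
import io
import json


def check_sessions_file(tsv_text, dictionary_text):
    """Return a list of problems found in a sessions TSV and its data dictionary."""
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    if not rows or rows[0][:2] != ["participant_id", "session_id"]:
        return ["first two columns must be participant_id and session_id"]
    # Every session_id used in the table should be described under Levels.
    levels = json.loads(dictionary_text).get("session_id", {}).get("Levels", {})
    used = {row[1] for row in rows[1:]}
    return [f"Levels is missing {session}" for session in sorted(used - set(levels))]
```

An empty list means the header order is as required and every `session_id` value has a `Levels` entry.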

### 8. Record acquisition time of sessions with `acq_time`

Whenever possible, it is RECOMMENDED to also collect acquisition time for
tabular phenotypic data and store the time of acquisition[^2] of each row
inside a column named `acq_time` in the sessions file.
This is consistent with how acquisition time is recorded for MRI data
and other time-sensitive measurements (for example systolic blood pressure).

When needed to preserve participant privacy, you SHOULD record
relative acquisition times with respect to the earliest session.
Relative session acquisition times MAY be listed as durations from
the earliest session (baseline) in days, months, or years
using the `acq_time` column.
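
One way to derive relative acquisition times is sketched below. It converts absolute timestamps into whole days since each participant's earliest session, comparing calendar dates only. The function name is hypothetical, and how the resulting durations and their units are written to the sessions file is left to the dataset author.

```python
from datetime import datetime


def days_since_baseline(sessions):
    """Map (participant_id, session_id) to whole days since that participant's earliest session.

    `sessions` maps (participant_id, session_id) to an ISO 8601 timestamp string.
    Only calendar dates are compared, so the time of day does not affect the result.
    """
    dates = {key: datetime.fromisoformat(stamp).date() for key, stamp in sessions.items()}
    baseline = {}
    for (participant, _), day in dates.items():
        baseline[participant] = min(day, baseline.get(participant, day))
    return {
        (participant, session): (day - baseline[participant]).days
        for (participant, session), day in dates.items()
    }
```

Using calendar dates rather than full timestamps avoids a follow-up that happens earlier in the day than the baseline being counted as one day short.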

## Summary

This appendix described eight guidelines for well-organized tabular phenotypic data.
The short summary table below describes when to use which files.

| File | Single session data | Multiple session data |
| :----------------------------- | :------------------ | :-------------------- |
| Participants | RECOMMENDED | RECOMMENDED |
| Phenotypic and assessment data | RECOMMENDED | RECOMMENDED |
| Sessions | OPTIONAL | REQUIRED |
| Demographics | OPTIONAL | RECOMMENDED |

## Examples

What follows are a few common use case examples for tabular phenotypic files.

### 1 participant session with both non-tabular and tabular phenotypic data

File tree

```Text
phenotype/
    <measurement_tool_name>.json
    <measurement_tool_name>.tsv
sub-01/anat/
    sub-01_T1w.json
    sub-01_T1w.nii.gz
```

Contents of `phenotype/<measurement_tool_name>.tsv`

```tsv
participant_id measurement_1 measurement_2
sub-01 value1 value2
```

### 1 participant with 2 sessions, where 1 session is only tabular phenotype and the other is only imaging

With only one imaging session and one phenotypic session in this example, you might want
to merge the imaging and phenotypic data under one session. However, it is more correct to
keep separate sessions for the imaging and phenotypic data, especially if
the sessions were collected days, weeks, or months apart. You can record both sessions
and their acquisition times in the `sessions.tsv` file and describe the `session_id` `Levels`
in the `sessions.json` sidecar. Below are a CORRECT and an INCORRECT way
to prepare this data under these guidelines.

#### CORRECT

File tree

```Text
phenotype/
    <measurement_tool_name>.json
    <measurement_tool_name>.tsv
sub-01/ses-MRI/anat/
    sub-01_ses-MRI_T1w.json
    sub-01_ses-MRI_T1w.nii.gz
```

Contents of `phenotype/<measurement_tool_name>.tsv`

```tsv
participant_id session_id measurement_1 measurement_2
sub-01 ses-pheno value1 value2
```

#### INCORRECT

File tree

```Text
phenotype/
    <measurement_tool_name>.json
    <measurement_tool_name>.tsv
sub-01/anat/
    sub-01_T1w.json
    sub-01_T1w.nii.gz
```

Contents of `phenotype/<measurement_tool_name>.tsv`

```tsv
participant_id measurement_1 measurement_2
sub-01 value1 value2
```

A session directory **MUST** be present in the participant directory and
the `session_id` column **MUST** be present in `<measurement_tool_name>.tsv` as well.
Sessions must be used consistently for the combination of tabular and
non-tabular phenotypic data.
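
The consistency requirement can be sketched as a structural check. The hypothetical function below compares whether each `sub-*` directory contains `ses-*` directories with whether the phenotype TSV header contains `session_id`. It only detects mixed usage. A dataset that omits sessions everywhere, like the INCORRECT example above, cannot be flagged from the file structure alone, because the missing session is only known from the study design.

```python
from pathlib import Path


def sessions_used_consistently(dataset_root, phenotype_header):
    """Check the 'if anyone uses sessions, everyone uses sessions' rule.

    `phenotype_header` is the list of column names of a phenotype TSV file.
    """
    root = Path(dataset_root)
    subject_dirs = [d for d in root.glob("sub-*") if d.is_dir()]
    # A subject directory uses sessions if it has at least one ses-* child directory.
    flags = [any(child.is_dir() for child in d.glob("ses-*")) for d in subject_dirs]
    flags.append("session_id" in phenotype_header)
    # Consistent means all flags agree: sessions everywhere or nowhere.
    return all(flags) or not any(flags)
```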

### 2 participants with a mix of tabular phenotypic data and imaging sessions

File tree

```Text
phenotype/
    <measurement_tool_name>.json
    <measurement_tool_name>.tsv
sub-01/
    ses-MRI1/
        anat/
            sub-01_ses-MRI1_T1w.json
            sub-01_ses-MRI1_T1w.nii.gz
    ses-MRI2/
        anat/
            sub-01_ses-MRI2_T1w.json
            sub-01_ses-MRI2_T1w.nii.gz
sub-02/
    ses-MRI1/
        anat/
            sub-02_ses-MRI1_T1w.json
            sub-02_ses-MRI1_T1w.nii.gz
```

Contents of `phenotype/<measurement_tool_name>.tsv`

```tsv
participant_id session_id measurement_1 measurement_2
sub-01 ses-pheno1 value1 value2
sub-02 ses-pheno1 value3 value4
sub-02 ses-pheno2 value5 value6
```

### 3 participants with 3 different kinds of sessions among them

The `ses-baseline` session collects an MRI and tabular phenotypic data.

File tree

```Text
participants.json
participants.tsv
sessions.json
sessions.tsv
phenotype/
    demographics.json
    demographics.tsv
    ...
sub-01/
    ses-baseline/
    ses-followupMRI/
sub-02/
    ses-baseline/
sub-03/
    ses-baseline/
    ses-followupMRI/
```

Contents of `sessions.tsv`.

```tsv
participant_id session_id acq_time
sub-01 ses-baseline 2001-01-01T12:05:00
sub-01 ses-followupMRI 2001-07-01T13:33:00
sub-01 ses-interview 2002-01-01T11:21:00
sub-02 ses-baseline 2001-04-01T11:01:00
sub-02 ses-interview 2002-04-01T14:08:00
sub-03 ses-baseline 2001-09-01T11:45:00
sub-03 ses-followupMRI 2002-03-01T12:17:00
```

Contents of `sessions.json`. Note how the `session_id` `Levels` are clearly described.

```json
{
    "participant_id": {
        "Description": "BIDS participant identifier"
    },
    "session_id": {
        "Description": "BIDS session identifier",
        "Levels": {
            "ses-baseline": "Baseline visit for MRI and assessments",
            "ses-followupMRI": "6-months after baseline MRI follow-up",
            "ses-interview": "1-year after baseline in-person follow-up"
        }
    },
    "acq_time": {
        "Description": "When the data acquisition started"
    }
}
```

Contents of `participants.tsv`.

```tsv
participant_id sex
sub-01 M
sub-02 F
sub-03 F
```

Contents of `phenotype/demographics.tsv`. Measures or features that can change
from session to session especially belong here.

```tsv
participant_id session_id age gender race household_income
sub-01 ses-baseline 10 3 4 5
sub-01 ses-followupMRI 10 3 4 5
sub-01 ses-interview 11 4 4 6
sub-02 ses-baseline 9 1 3 3
sub-02 ses-interview 10 1 7 3
sub-03 ses-baseline 11 2 10 4
sub-03 ses-followupMRI 12 5 10 4
```
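
Because the sessions file must list every session across imaging and tabular phenotypic data, each participant and session pair in a phenotype file such as `demographics.tsv` should also appear in `sessions.tsv`. The following sketch, with hypothetical function names, reports pairs that are missing from the sessions file.

```python
import csv
import io


def session_pairs(tsv_text):
    """Collect the (participant_id, session_id) pairs listed in a TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["participant_id"], row["session_id"]) for row in reader}


def missing_from_sessions(phenotype_tsv, sessions_tsv):
    """Return pairs present in the phenotype TSV but absent from the sessions TSV."""
    return sorted(session_pairs(phenotype_tsv) - session_pairs(sessions_tsv))
```

For the `sessions.tsv` and `phenotype/demographics.tsv` contents shown in this example, the result would be an empty list, since both files list the same seven pairs.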

For more complete examples, see the `pheno00*`
[bids-examples on GitHub](https://github.com/bids-standard/bids-examples/).

[^1]: A session is any logical grouping of imaging and behavioral data consistent
across participants. A session can (but doesn't have to) be synonymous with a visit
in a longitudinal study. In situations where different data types are obtained over
several visits (for example fMRI on one day followed by DWI the day after)
those can still be grouped in one session. Refer to the
[definition of session](../glossary.md#session-entities) for more details.

[^2]: Datetime format and the anonymization procedure are
described in [Units](../common-principles.md#units).
8 changes: 7 additions & 1 deletion src/common-principles.md
@@ -470,7 +470,7 @@ NIfTI header.

 ### Tabular files

-Tabular data MUST be saved as plain-text, tab-delimited values (TSV) files
+Tabular data MUST be saved as plain-text, tab-separated values (TSV) files
 (with [extension `.tsv`](glossary.md#tsv-extensions)),
 that is, [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values) where commas are replaced by tab characters.
 Tabs MUST be true tab characters and MUST NOT be a series of space characters.
@@ -532,6 +532,12 @@ Note that if a field name included in the data dictionary matches a column name
 then that field MUST contain a description of the corresponding column,
 using an object containing the following fields:

+!!! success "Guideline 1"
+
+    For [best tabular phenotypic data](./appendices/phenotype.md):
+    Each tabular phenotypic data TSV file MUST be accompanied by
+    a corresponding data dictionary JSON file.
+
 <!-- This block generates a metadata table.
 The definitions of these fields can be found in
 src/schema/objects/metadata.yaml