add data to the package #6

Closed
bvreede opened this issue Mar 30, 2023 · 5 comments

Comments

@bvreede
Collaborator

bvreede commented Mar 30, 2023

We will be using the Santa Barbara Corpus of Spoken American English to attach to the package. The data is licensed under CC-BY-ND, which means we are not allowed to distribute a derivative; unfortunately, an R object with a dataset is a derivative, so we need to distribute only the raw data and create the R object on the fly.

Perhaps it is worthwhile to contact [email protected] and ask for specific permission? The original author named in the license (John W. DuBois) no longer seems to be connected to the department (his page is empty).

@mdingemanse
Contributor

I have not succeeded in getting in touch with the corpus creator. It also occurs to me that creating an R object on the fly would replicate the work done in scikit-talk (unless it can be done by invoking that package, e.g. through reticulate).

I think we should look more seriously into the GPL-licensed IFADV corpus — it was specifically designed for free use including commercial use so it would be very painful if it were not possible. The 2008 LREC paper by the corpus creators is very clear about this:
https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/Documents/LREC2008RvSetal.pdf

A freely available annotated corpus is presented, gratis and libre, of high quality video recordings of face-to-face conversational speech. Within the bounds of the law, everything has been done to remove copyright and use restrictions.
and:

All our participants were asked to sign copyright transfer forms that allow the use of the recordings in a very broad range of activities, including unlimited distribution over the Internet.

The paper foresees some of the implications of using the GPL (see §6) and specifies the "source" that must be included to make a derivative version GPL-proof:

The GPLv2 allows unlimited use and distribution of the licensed materials. There is however a condition to (re-)distributing adapted or changed versions of the “works”. Whenever such changes fall under copyright laws, ie, when they create a derivative work in the sense of of the law, they must be distributed under the same license, ie, the GPLv2. And that license requires the release of the “source” behind the works
This condition raises the question of what the source of a corpus recording or annotation is. The short answer is, everything needed to reproduce the changes in whatever format is customary for making changes. Examples would be Praat TextGrid or ELAN EAF files.

On one reading, then, it seems we could use a portion of the IFADV data in the package if we also include the original TextGrid or EAF files for that portion. However, it is not clear to me whether this ultimately comes down to the same issue as with the English data, and whether we would still need to create an R object on the fly. Given the express goal of the IFADV project to allow any use, including unlimited distribution, that would be a shame.

@bvreede
Collaborator Author

bvreede commented Mar 30, 2023

I had come to similar conclusions.

This does mean we need to change our license to GPL, but that should be OK; especially since we have good reason to do so. (I will nevertheless double check this.)

The source behind the works in this context means code or software; this is openly available in our case, so not a worry. As for manipulations done on the source material, R actually has a neat convention for including any additional code that was required to create the derivative objects we ship with our package: we can include a script in the data-raw/ folder of the package. This way, datasets can be recreated if that becomes necessary when the package is updated.
To cover all our bases and work reproducibly, we can use this convention to clearly document our workflow from the original data to our included data objects.
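
As a rough illustration of that workflow, here is a minimal sketch of what such a script could look like, assuming the raw annotations are available as a tab-separated file. The script name, the file name `ifadv_sample.tsv`, the object name `ifadv_sample`, and the helper `read_raw_annotations()` are all hypothetical; `data-raw/` is the directory name that `usethis::use_data_raw()` creates, and `usethis::use_data()` is the usual way to write the resulting object into `data/`.

```r
# data-raw/ifadv_sample.R  (hypothetical script name)
# Rebuilds the shipped data object from the unmodified raw file, so the
# path from original data to package data stays documented.

# Read a tab-separated export of the raw annotations into a data frame.
# The file layout is an assumption made for illustration only.
read_raw_annotations <- function(path) {
  utils::read.delim(path, header = TRUE, stringsAsFactors = FALSE)
}

raw_path <- file.path("data-raw", "ifadv_sample.tsv")  # hypothetical raw file

# Only build the object when the raw file is actually present.
if (file.exists(raw_path)) {
  ifadv_sample <- read_raw_annotations(raw_path)
  # Writes data/ifadv_sample.rda inside the package source tree.
  usethis::use_data(ifadv_sample, overwrite = TRUE)
}
```

Keeping the raw file next to the script means anyone can rerun it from the package root and obtain the same object.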

@bvreede
Collaborator Author

bvreede commented Apr 4, 2023

Another proposal:

Instead of including the IFADV dataset in this package, we can package it separately for R users. Given the sheer amount of material, and the fact that our code does not depend on it (and can easily use the dataset once it is packaged), this would make sense and would keep the talkr package lighter.

This would also allow us to stay outside the GPL for our code (or at the very least avoid complex dual-licensing issues), which is worthwhile (see e.g. the explanations in this discussion on the ggplot2 GitHub). The tidyverse/ggplot2 folks made an extensive effort to relicense away from the GPL, and we want to contribute to their ecosystem; for me that is another argument to stick with the more permissive Apache license.

Do you agree, @mdingemanse? If so, I will close this issue and create a new GPL-licensed data package that includes the IFADV data (we can simply call it ifadv).

@mdingemanse
Contributor

I'm totally happy with that!

@bvreede
Collaborator Author

bvreede commented Apr 4, 2023

The data package is here: https://github.com/elpaco-escience/ifadv

@bvreede bvreede closed this as completed Apr 4, 2023

2 participants