Update NCSLGR #79

Open
cleong110 opened this issue Jun 12, 2024 · 10 comments · May be fixed by #99

Comments

@cleong110
Contributor

cleong110 commented Jun 12, 2024

NCSLGR is used by SignBLEU (see cleong110#21). They say:

We use the ELAN version of Boston University’s The National Center for Sign Language and Gesture Resources corpus (NCSLGR) (Neidle and Sclaroff, 2012)

Carol Neidle and Stan Sclaroff. 2012. National Center for Sign Language and Gesture Resources (NCSLGR) corpus. Boston University. ISLRN, American Sign Language Linguistic Research Project (ASLLRP), ISLRN 833-505-711-564-4.

Which links to https://www.islrn.org/resources/833-505-711-564-4/, which links to https://www.bu.edu/asllrp/ncslgr.html as the source.

Currently we have an entry for NCSLGR; it points to dataset:databases2007volumes, aka

Databases, NCSLGR. 2007. “Volumes 2–7.” American Sign Language Linguistic Research Project (Distributed on CD-ROM ….

and it's got some TODOs

https://www.bu.edu/asllrp/ncslgr-for-download/download-info.html

@cleong110
Contributor Author

http://asl.cs.depaul.edu/corpus/index.html actually might be the "ELAN Version" they mention in SignBLEU

@cleong110
Contributor Author

@cleong110
Contributor Author

But I still don't know the precise citation for the Corpus itself? It says cite the corpus AND this publication. ???

@cleong110
Contributor Author

https://www.bu.edu/asllrp/publications.html doesn't have a paper called "The National Center for Sign Language and Gesture Resources (NCSLGR) Corpus".

@cleong110
Contributor Author

I think I'll just... cite this:

@inproceedings{Vogler2012ANW,
  title={A new web interface to facilitate access to corpora: development of the ASLLRP data access interface},
  author={Christian Vogler and C. Neidle},
  year={2012},
  url={https://api.semanticscholar.org/CorpusID:58305327}
}

@cleong110
Contributor Author

And maybe add a custom citation like this:

@misc{dataset:Neidle_2020_NCSLGR_ISLRN,
  type = {Languageresource},
  title = {National Center for Sign Language and Gesture Resources (NCSLGR) corpus. ISLRN 833-505-711-564-4},
  author = {Carol Neidle and Stan Sclaroff},
  year = {2012},
  publisher = {Boston University},
  url = {https://www.islrn.org/resources/833-505-711-564-4/}
}

@cleong110
Contributor Author

Previously the JSON pointed to

databases2007volumes

@cleong110
Contributor Author

In index.md that is cited only here:

###### Continuous sign corpora {-}
contain parallel sequences of signs and spoken language.
Available continuous sign corpora are extremely limited, containing 4-6 orders of magnitude fewer sentence pairs than similar corpora for spoken language machine translation [@arivazhagan2019massively].
Moreover, while automatic speech recognition (ASR) datasets contain up to 50,000 hours of recordings [@pratap2020mls], the most extensive continuous sign language corpus contains only 1,150 hours, and only 50 of them are publicly available [@dataset:hanke-etal-2020-extending].
These datasets are usually synthesized [@dataset:databases2007volumes;@dataset:Crasborn2008TheCN;@dataset:ko2019neural;@dataset:hanke-etal-2020-extending] or recorded in studio conditions [@dataset:forster2014extensions;@cihan2018neural], which does not account for noise in real-life conditions. Moreover, some contain signed interpretations of spoken language rather than naturally-produced signs, which may not accurately represent native signing since translation is now a part of the discourse event.

@cleong110
Contributor Author

As for JSON updates:
going off of https://www.bu.edu/asllrp/ncslgr-for-download/download-info.html, it seems there is:

  • Linguistic
  • gloss
  • video

Also

    Most of these data are from four native signers of ASL.

    This dataset includes 1,866 distinct canonical signs (i.e., grouping together very slight variants in production). The total number of sign tokens is 11,854.

    Restricting consideration to signs other than gestures and classifiers, there are 1,278 distinct canonical signs, and a total of 10,719 tokens.

    1,002 of the utterances in this collection are part of short spontaneous narratives (19). The remaining 885 utterances were elicited to illustrate a variety of constructions and sentence types.
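
For the JSON update, the figures quoted above could be captured roughly as follows. This is only a sketch: the key names (`pub`, `features`, `stats`, and everything under them) are made up for illustration and are not the repo's actual schema. Only the numbers, the feature labels, the year, and the URL come from the pages cited in this thread.

```python
import json

# Hypothetical entry: key names are illustrative, NOT the repo's actual JSON schema.
# Values are taken from the download-info page and the ISLRN citation quoted above.
ncslgr_entry = {
    "pub": {
        "name": "NCSLGR",
        "year": 2012,
        "url": "https://www.bu.edu/asllrp/ncslgr.html",
    },
    "language": "American Sign Language",
    "features": ["linguistic", "gloss", "video"],
    "stats": {
        "distinct_signs": 1866,
        "sign_tokens": 11854,
        "distinct_signs_excl_gestures_classifiers": 1278,
        "sign_tokens_excl_gestures_classifiers": 10719,
        "narrative_utterances": 1002,
        "elicited_utterances": 885,
    },
}

# Sanity check: narrative + elicited utterances give the collection total.
stats = ncslgr_entry["stats"]
total_utterances = stats["narrative_utterances"] + stats["elicited_utterances"]
print(total_utterances)  # 1887

# Serialize the entry the way it might be written to a dataset file.
print(json.dumps(ncslgr_entry, indent=2))
```

Summing the two utterance counts (1,002 + 885 = 1,887) gives a single "number of utterances" figure if the JSON only has room for one.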

@cleong110
Contributor Author

cleong110 commented Jun 20, 2024

Licensing is the big one: https://www.bu.edu/asllrp/data-credits.html

The data available from these pages can be used for research and education purposes, but cannot be redistributed without permission.

Commercial use, without explicit permission, is not allowed, nor are any patents and copyrights based on this material.

Those making use of these data must, in resulting publications or presentations, cite: The National Center for Sign Language and Gesture Resources (NCSLGR) Corpus and this publication:

    Carol Neidle and Christian Vogler [2012] "A New Web Interface to Facilitate Access to Corpora: Development of the ASLLRP Data Access Interface," Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC 2012, Istanbul, Turkey.

and also include the following URL's: http://www.bu.edu/asllrp// and http://secrets.rutgers.edu/dai/queryPages/.
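
The attribution requirement above lists four things a publication has to mention. A small checker along these lines could flag anything missing from a draft acknowledgement. This is a sketch only: `REQUIRED_ATTRIBUTIONS` and `missing_attributions` are hypothetical names, and plain substring matching is an assumption about what counts as "citing".

```python
# Required items, copied from the data-credits text quoted above.
REQUIRED_ATTRIBUTIONS = [
    "National Center for Sign Language and Gesture Resources (NCSLGR) Corpus",
    "A New Web Interface to Facilitate Access to Corpora",
    "http://www.bu.edu/asllrp/",
    "http://secrets.rutgers.edu/dai/queryPages/",
]


def missing_attributions(text: str) -> list[str]:
    """Return the required items that do not appear in `text` (case-insensitive)."""
    lowered = text.lower()
    return [item for item in REQUIRED_ATTRIBUTIONS if item.lower() not in lowered]


# Example: a draft that names the corpus but omits the paper and both URLs.
draft = "We use the National Center for Sign Language and Gesture Resources (NCSLGR) Corpus."
for item in missing_attributions(draft):
    print("missing:", item)
```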

@cleong110 cleong110 linked a pull request Jun 20, 2024 that will close this issue