“A” Universal Corpus not “The” Universal Corpus


by Monica G
A universal corpus would be a collection of 30+ languages along with their written text, annotations (including the sentence structure that results from translation), and related material. It would be a central, shared database that holds and processes linguistic information. The text and annotations would be contributed by the scholarly community, accompanied by proper citations. The goal is to gather as much language information as possible without requiring complex, in-depth analysis of the linguistic evidence. It is not meant to be the universal corpus data set, nor to capture an in-depth understanding of spoken language; its primary collection goal is annotations. The data collected is placed in tables that allow for efficient searches and updates. The information would be organized as “document aligned text,” meaning that translations of the same document are stored together; “sentence aligned text,” which breaks the material down into parallel sentences; and “translation dictionaries,” which translate a word according to the sentence in which it originally appeared. If none of these forms applies, then “analyzed text” comes into play. Because the Universal Corpus is intended to be used as a web service, a multidimensional database is being proposed.
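To make the table idea more concrete, here is a minimal sketch, in Python with SQLite, of how “document aligned” and “sentence aligned” text and a small translation-dictionary table might be stored and searched. The table and column names are my own assumptions for illustration, not the schema the authors actually propose.

```python
# Minimal sketch (illustrative only): storing aligned text in simple tables
# so that translations of the same material can be looked up efficiently.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Document aligned text: translations of the same document share a doc_id.
cur.execute("""CREATE TABLE document_aligned (
    doc_id TEXT, lang TEXT, text TEXT,
    PRIMARY KEY (doc_id, lang))""")

# Sentence aligned text: parallel sentences share (doc_id, sent_no).
cur.execute("""CREATE TABLE sentence_aligned (
    doc_id TEXT, sent_no INTEGER, lang TEXT, text TEXT,
    PRIMARY KEY (doc_id, sent_no, lang))""")

# Translation dictionary: a word pair tied to the sentence it came from.
cur.execute("""CREATE TABLE translation_dict (
    src_lang TEXT, tgt_lang TEXT, src_word TEXT, tgt_word TEXT,
    doc_id TEXT, sent_no INTEGER)""")

# Insert a tiny (hypothetical) parallel example.
cur.executemany("INSERT INTO sentence_aligned VALUES (?, ?, ?, ?)", [
    ("doc1", 1, "eng", "The house is small."),
    ("doc1", 1, "deu", "Das Haus ist klein."),
])
cur.execute("INSERT INTO translation_dict VALUES (?, ?, ?, ?, ?, ?)",
            ("eng", "deu", "house", "Haus", "doc1", 1))

# Retrieve every available translation of sentence 1 of doc1 -- the kind of
# aligned lookup the proposed tables are meant to make fast.
for lang, text in cur.execute(
        "SELECT lang, text FROM sentence_aligned "
        "WHERE doc_id=? AND sent_no=? ORDER BY lang", ("doc1", 1)):
    print(lang, text)
```

The point of the sketch is simply that each kind of alignment becomes its own table keyed by document (and sentence) identifiers, so searches and community updates stay cheap.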

In essence the reading is a database proposal. It describes a multi-table implementation for language data. It clearly states that it is not the “universal corpus” but rather a convenient alternative. It uses the data model as a jumping-off point, reassuring users that other extensive applications are out there but that this one is a bit different; it therefore reuses information that is already available.

This research is interesting because it asks for community input by inviting scholarly patrons to enter information, and because, even as a starting point, it shows different ways data can be stored and analyzed. There are a lot of other similar services already on the web, but this one sounds more user-friendly. However, one aspect that was not mentioned was monetary charges or gains for the creators. Would there be a service fee for accessing the information? The citation process is also not very clear: patrons would receive credit for their work, but where would that evidence be stored?

Citation: Abney, S., & Bird, S. (2010). Towards a data model for the Universal Corpus. The Human Language Project: building a universal corpus of the world’s languages. Stroudsburg, Pennsylvania: Association for Computational Linguistics. Retrieved October 21, 2011, from http://0-delivery.acm.org.opac.library.csupomona.edu/10.1145/2030000/2024257/p120-abney.pdf?ip=134.71.59.187&acc=OPEN&CFID=62393618&CFTOKEN=46653563&__acm__=1319245343_0b937ac9fcc1989934a95300c2da9033