Linguistic Linked Open Data
In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has since become a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.
Definition and Development
Linguistic Linked Open Data describes the publication of data for linguistics and natural language processing using the following principles:
- Data should be openly licensed using licenses such as the Creative Commons licenses.
- The elements in a dataset should be uniquely identified by means of a URI.
- The URI should resolve, so users can access more information using web browsers.
- Resolving an LLOD resource should return results using web standards such as the Resource Description Framework.
- Links to other resources should be included to help users discover new resources and provide semantics.
The main benefits of publishing linguistic data in this way are commonly given as:
- Representation: Linked graphs are a more flexible representation format for linguistic data.
- Interoperability: Common RDF models can easily be integrated.
- Federation: Data from multiple sources can trivially be combined.
- Ecosystem: Tools for RDF and linked data are widely available under open source licenses.
- Expressivity: Existing vocabularies help express linguistic resources.
- Semantics: Common links express what you mean.
- Dynamicity: Web data can be continuously improved.
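The publication principles listed above (URIs for every element, RDF as the exchange format, links to other resources) can be illustrated with a minimal sketch. It uses only the Python standard library and writes N-Triples by hand; the `http://example.org/lexicon/` namespace and the entry `cat` are hypothetical and chosen purely for illustration. In practice an RDF library would normally handle serialization.

```python
# Hypothetical namespace for illustration; not a real LLOD dataset.
BASE = "http://example.org/lexicon/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str, lang: str) -> str:
    """Build a language-tagged literal, escaping backslashes and quotes only."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) strings, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


entry = iri(BASE + "cat")  # every element gets its own URI
triples = [
    (entry, iri(RDFS_LABEL), literal("cat", "en")),
    # A link to an external resource lets users discover related data.
    (entry, iri(RDFS_SEE_ALSO), iri("http://dbpedia.org/resource/Cat")),
]
print(to_ntriples(triples))
```

The sketch covers identification, representation and linking; making the URI resolve and attaching an open license are deployment concerns that the code does not address.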
LLOD vocabularies
Aside from gathering metadata and generating the LLOD cloud diagram, the LLOD community is driving the development of community standards with respect to vocabularies, metadata and best practice recommendations. According to the state-of-the-art overview by Cimiano et al., these include:
- for modelling lexical resources
- *OntoLex-Lemon, community standard for lexical resources
- for modelling linguistic annotations
- *Web Annotation, a W3C standard for the annotation of web resources
- *NLP Interchange Format, a community standard for the grammatical annotation of text
- *CoNLL-RDF, a NIF-based vocabulary for the RDF representation of corpora in conventional TSV formats
- *POWLA, a vocabulary for generic linguistic data structures that can be used to complement NIF, CoNLL-RDF or Web Annotation
- for linguistic data categories
- *Ontologies of Linguistic Annotation for linguistic annotation
- *lexinfo for grammatical and other features in lexical resources
- for language identification
- *as language-tagged strings using IETF BCP 47 language tags
- *with ISO 639-3 URIs provided by lexvo.org
- *with Glottolog URIs for language varieties not covered by ISO 639
- for metadata
- *Dublin Core, a community standard of terms that can be used to describe web resources
- *Data Catalog Vocabulary, a W3C standard for data catalogs published on the web
- *METASHARE-OWL, vocabulary for language resource metadata
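As a rough illustration of how some of the vocabularies above combine, the following sketch generates a Turtle description of a single lexical entry. It assumes the OntoLex-Lemon terms `ontolex:LexicalEntry`, `ontolex:canonicalForm`, `ontolex:Form` and `ontolex:writtenRep`, a BCP 47 language tag on the written form, and a lexvo.org ISO 639-3 URI attached with Dublin Core `dct:language`. The entry URI under `http://example.org/` and the helper name `lexical_entry_turtle` are hypothetical, and the string is not escaped, so this is a sketch rather than a general-purpose serializer.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
DCT = "http://purl.org/dc/terms/"
LEXVO = "http://lexvo.org/id/iso639-3/"


def lexical_entry_turtle(entry_iri: str, written_rep: str,
                         bcp47: str, iso639_3: str) -> str:
    """Return a Turtle snippet for one entry with one canonical form.

    Assumes written_rep contains no quotes or backslashes (no escaping is done).
    """
    form_iri = entry_iri + "#form"  # hypothetical naming scheme for the form
    return "\n".join([
        f"@prefix ontolex: <{ONTOLEX}> .",
        f"@prefix dct: <{DCT}> .",
        "",
        f"<{entry_iri}> a ontolex:LexicalEntry ;",
        f"    dct:language <{LEXVO}{iso639_3}> ;",
        f"    ontolex:canonicalForm <{form_iri}> .",
        "",
        f"<{form_iri}> a ontolex:Form ;",
        f'    ontolex:writtenRep "{written_rep}"@{bcp47} .',
    ])


turtle = lexical_entry_turtle(
    "http://example.org/lexicon/Katze", "Katze", "de", "deu"
)
print(turtle)
```

The language is stated twice on purpose: the BCP 47 tag travels with the string itself, while the lexvo.org URI links the entry to further information about the language.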
Community
The LLOD cloud diagram has been developed and is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, an open and interdisciplinary group of experts in language resources. The OWLG organizes community events, coordinates LLOD developments, and facilitates interdisciplinary communication among LLOD contributors and users.
Several W3C Business and Community Groups focus on specialized aspects of LLOD:
- The W3C Ontology-Lexica Community Group develops and maintains specifications for machine-readable dictionaries in the LLOD cloud.
- The W3C Best Practices for Multilingual Linked Open Data Community Group gathers information on best practices for producing multilingual linked open data.
- The W3C Linked Data for Language Technology Community Group assembles use cases and requirements for language technology applications that use Linked Data.
Regular community events include:
- Linked Data in Linguistics, annual scientific workshop, started 2012
- Multilingual Linked Open Data for Enterprises, bi-annual community meeting
- Summer Datathon on Linguistic Linked Open Data, bi-annual datathon, since 2015
Applications of LLOD
- In all areas of empirical linguistics, computational philology, and natural language processing, linguistic annotation and linguistic markup represent central elements of analysis. However, progress in this field is being hampered by interoperability challenges, most notably differences in vocabularies and annotation schemes used for different resources and tools. Using Linked Data to connect language resources and ontologies/terminology repositories facilitates re-using shared vocabularies and interpreting them against a common basis.
- In corpus linguistics and computational philology, overlapping markup represents a notorious problem for conventional XML formats. Hence, graph-based data models have been suggested since the late 1990s. These are traditionally represented by means of multiple, interlinked XML files (standoff XML), which are poorly supported by off-the-shelf XML technology. Modeling such complex annotations as Linked Data provides a formalism semantically equivalent to standoff XML that eliminates the need for special-purpose technology and instead relies on the existing RDF ecosystem.
- Multilingual issues, including the linking of lexical resources such as WordNet as performed in the Interlingual Index of the Global WordNet Association and interconnecting heterogeneous resources such as WordNet and Wikipedia, as was done in BabelNet.
- Providing forums for standardization of linguistic resource information
- best practices for linking lexical data on the web
- best practices for creating annotations on the web
- best practices for modelling and sharing textual resources with overlapping markup
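The overlapping-markup problem mentioned above can be made concrete with a small sketch. Two annotations "cross" when they overlap without one containing the other; such a pair cannot be written as properly nested inline XML elements, but it poses no difficulty when each annotation is a separate node that points at character offsets. The class `Span`, the function `crosses`, the labels and the example sentence are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A standoff annotation: a label over character offsets [start, end)."""
    label: str
    start: int
    end: int


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap but neither one contains the other."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


text = "I saw the man with the telescope"
# Two hypothetical annotation layers over the same text.
line = Span("line", 0, 13)            # "I saw the man"
phrase = Span("noun_phrase", 6, 32)   # "the man with the telescope"

for span in (line, phrase):
    print(span.label, repr(text[span.start:span.end]))
print("crossing:", crosses(line, phrase))
```

In an RDF rendering, each `Span` would become a resource with its own URI and properties for the offsets, so that both layers can coexist in one graph.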
Selected research projects
- LOD2. Creating Knowledge out of Interlinked Data
- MONNET. Multilingual Ontologies for Networked Knowledge
- LIDER. Linked Data as an enabler of cross-media and multilingual content analytics for enterprises across Europe
- QTLeap. Quality Translation by Deep Language Engineering Approaches
- LiODi. Linked Open Dictionaries
- FREME. Open Framework of E-Services for Multilingual and Semantic Enrichment of Digital Content
- POSTDATA. Poetry Standardization and Linked Open Data
- Linking Latin
- Pret-a-LLOD
- NexusLinguarum. European network for Web-centred linguistic data science
Selected resources
- The Ontologies of Linguistic Annotation provide reference terminology for linguistic annotations and grammatical metadata;
- WordNet, a lexical database for English and pivot for developing similar databases for other languages, with several editions;
- DBpedia, a multilingual knowledge base of general world knowledge, based on Wikipedia;
- lexinfo.net provides reference terminology for lexical resources;
- BabelNet, a multilingual lexicalized semantic network, based on the aggregation of various other resources, most notably WordNet and Wikipedia;
- lexvo.org provides language identifiers and other language-related data. Most importantly, lexvo provides an RDF representation of ISO 639-3 3-letter codes for language identifiers and information about these languages;
- The ISO 12620 Data Category Registry (ISOcat) provides a semistructured repository for various language-related terminology. ISOcat is hosted by The Language Archive, that is, the DOBES project, at the Max Planck Institute for Psycholinguistics, but is currently in transition to CLARIN;
- UBY, a lexical network for English, aggregated from various lexical resources;
- Glottolog provides fine-grained language identifiers for low-resource languages, in particular, many not covered by lexvo.org;
- Wiktionary-DBpedia links, Wiktionary-based lexicalizations for DBpedia concepts.
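To illustrate how a reference terminology such as the Ontologies of Linguistic Annotation can support interoperability between annotation schemes, the sketch below maps tags from two tagsets to shared concept URIs and compares them through those concepts. The mapping table is hand-written for illustration and deliberately coarse; it is not taken from the actual OLiA linking models, and the namespace `http://purl.org/olia/olia.owl#` together with the concept names is an assumption that should be checked against the published ontology.

```python
# Assumed namespace of the OLiA reference model (verify before use).
OLIA = "http://purl.org/olia/olia.owl#"

# Hand-written, illustrative mappings; not the official OLiA linking models.
TAG_TO_CONCEPT = {
    ("penn", "NN"): OLIA + "Noun",
    ("penn", "VB"): OLIA + "Verb",
    ("upos", "NOUN"): OLIA + "Noun",
    ("upos", "VERB"): OLIA + "Verb",
}


def same_category(tag_a: tuple, tag_b: tuple) -> bool:
    """Compare two (tagset, tag) pairs via their shared reference concept."""
    concept_a = TAG_TO_CONCEPT.get(tag_a)
    concept_b = TAG_TO_CONCEPT.get(tag_b)
    # Unknown tags never match, not even each other.
    return concept_a is not None and concept_a == concept_b


print(same_category(("penn", "NN"), ("upos", "NOUN")))
print(same_category(("penn", "NN"), ("upos", "VERB")))
```

The point of the design is that tools never compare tag strings directly; they compare the URIs the tags are linked to, so a new tagset only needs its own mapping to the shared terminology.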
Discussions
Linguistic Data: Scope and Classification
Aside from resources used in and created for linguistic research, the LLOD cloud diagram also includes ontologies, terminologies and general knowledge bases whose development was not originally driven by interest in language sciences or language technology, e.g., DBpedia. As a criterion for inclusion in the LLOD diagram, the OWLG requires "linguistic relevance": "[A] dataset is linguistically relevant if it provides or describes language data that can be used for the purpose of linguistic research or natural language processing." This includes linguistic resources in a strict sense, but also resources "that can be used for annotating, enriching, retrieving or classifying language resources... can be verified by the existence of links between a resource and resources fulfilling condition ".
A related issue is the classification of linguistically relevant datasets. The OWLG developed the following classification for the LLOD cloud diagram:
- corpora: linguistically analyzed collection of language data
- lexicons: lexical-conceptual data
- *lexical resources: lexicons and dictionaries
- *term bases: terminologies, thesauri and knowledge bases
- metadata
- * linguistic resource metadata
- * linguistic data categories
- * typological databases
- other
Open Data: Availability
LLOD is defined in relation to Linked Open Data, and LLOD resources should thus conform to licenses in accordance with the Open Definition. For generating the LLOD cloud diagram, however, this does not yet seem to be enforced, so that the technical criterion is availability over the web and a metadata entry. In the OWLG, it has been repeatedly discussed whether non-commercial resources could be included, with a general consensus of admitting them for the moment but enforcing stricter requirements as the LLOD cloud grows. As of January 2018, it had not yet been agreed when this move would happen. As of January 2020, machine-readable license metadata was available for 86 LLOD resources; of these, 82 adopted open licenses and 4 adopted non-commercial licenses.
In a broader sense, the term LLOD technology can also be used to refer to the technology independently of whether open resources are actually involved, e.g., in the name of the EU project Pret-a-LLOD, which features several commercial business cases. This is justified for applications that consume open data, but also when linked data technology and other LLOD conventions are adopted in order to facilitate the seamless integration of LLOD resources.
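A count such as the one reported for January 2020 (82 of 86 resources with open licenses, roughly 95%) depends on license metadata being machine-readable. The sketch below shows one way such metadata could be bucketed, assuming licenses are given as Creative Commons URLs. The function `classify_license` and the sample list are hypothetical; the rule that non-commercial and no-derivatives licenses do not count as open follows the Open Definition, while the URL parsing is a simplification.

```python
from collections import Counter
from urllib.parse import urlparse

CC_HOSTS = {"creativecommons.org", "www.creativecommons.org"}


def classify_license(url: str) -> str:
    """Return 'open', 'non-commercial', or 'other' for a license URL."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc not in CC_HOSTS or len(parts) < 2:
        return "other"
    if parts[0] == "publicdomain":      # e.g. /publicdomain/zero/1.0/
        return "open"
    if parts[0] != "licenses":
        return "other"
    elements = parts[1].split("-")      # "by-nc-sa" -> ["by", "nc", "sa"]
    if "nc" in elements:
        return "non-commercial"
    if "nd" in elements:                # no-derivatives is not an open license
        return "other"
    return "open"


# Hypothetical sample of license entries, for illustration only.
sample = [
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/licenses/by-sa/3.0/",
    "https://creativecommons.org/licenses/by-nc/4.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
]
counts = Counter(classify_license(u) for u in sample)
print(dict(counts))
```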
The abbreviation "LLOD" can be used to refer to either LLOD technology or LLOD resources. For disambiguation, the terms "LLOD resources" and "LLOD technology" can be used. To emphasize application or applicability to non-open resources, "LLD" has also been used. A possible compromise is the acronym "LLD" for the technology. A "Licensed Linguistic Linked Data" cloud that contains non-open resources does not currently exist.
Linked Data: Formats
The definition of Linked Data requires the application of RDF or related standards. This includes the W3C recommendations SPARQL, Turtle, JSON-LD, RDF/XML, RDFa, etc. In language technology and the language sciences, however, other formalisms are currently more popular, and the inclusion of such data into the LLOD cloud diagram has been occasionally requested. For several such formalisms, W3C-standardized wrapping mechanisms exist, but according to the current definition of Linked Data only the generated data would qualify for LLOD cloud inclusion, not the source data.
Selected literature
An exhaustive description of the state of the art in LLOD is provided by:
- Cimiano, Philipp; Chiarcos, Christian; McCrae, John P.; Gracia, Jorge. Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing.
- Chiarcos, Christian, Hellmann, Sebastian, and Nordhoff, Sebastian. Towards a Linguistic Linked Open Data cloud: The Open Linguistics Working Group. TAL, 52, 245-275.
- Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann. Linked Data in Linguistics. Representing and Connecting Language Data and Language Metadata. Springer, Heidelberg.
- Christian Chiarcos, Steven Moran, Pablo N. Mendes, Sebastian Nordhoff, and Richard Littauer. Building a Linked Open Data cloud of linguistic resources: Motivations and developments. In Iryna Gurevych and Jungi Kim, The People’s Web Meets NLP. Collaboratively Constructed Language Resources. Springer, Heidelberg, 2013.
- Christian Chiarcos, John McCrae, Philipp Cimiano, and Christiane Fellbaum. Towards open data for linguistics: Lexical Linked Data. In Alessandro Oltramari, Piek Vossen, Lu Qin, and Eduard Hovy, New Trends of Research in Ontologies and Lexical Resources. Springer, Heidelberg, 2013.
- Jorge Gracia, Elena Montiel-Ponsoda, Philipp Cimiano, Asunción Gómez-Pérez, Paul Buitelaar, and John McCrae. Challenges for the multilingual Web of Data. Journal of Web Semantics, vol. 11, pp. 63–71. Elsevier B.V., 2012.
- Pareja-Lora, Antonio; Lust, Barbara; Blume, Maria; Chiarcos, Christian. Development of Linguistic Linked Open Data Resources for Collaborative Data-Intensive Research in the Language Sciences. The MIT Press