Sketch Engine is a corpus manager and text analysis tool developed by Lexical Computing Limited since 2003. It lets people who study language behaviour search large text collections with complex, linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. It supports and provides corpora in more than 90 languages.
History of development
Sketch Engine is a product of Lexical Computing Limited, a company founded in 2003 by the lexicographer and research scientist Adam Kilgarriff. He began collaborating with Pavel Rychlý, a computer scientist working at the Natural Language Processing Centre at Masaryk University and the developer of Manatee and Bonito, and introduced the concept of word sketches. Since then, Sketch Engine has been commercial software. However, all the core features of Manatee and Bonito that had been developed by 2003 are freely available under the GPL within the NoSketch Engine suite.
Features
Word sketches – one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour
Word sketch difference – compares and contrasts two words by analysing their collocations
Distributional Thesaurus – an automatically generated thesaurus that finds words with a similar meaning or words appearing in the same or similar contexts
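To give a rough feel for what a corpus-derived collocational summary involves, the following is a minimal sketch in Python. It is not Sketch Engine's implementation: real word sketches group collocates by grammatical relation, whereas this toy version simply counts neighbours in a fixed window. The function name `collocates`, the `window` parameter and the choice of the logDice association score are all assumptions made for illustration.

```python
import math
from collections import Counter


def collocates(tokens, node, window=2):
    """Rank words that co-occur with `node` within +/- `window` tokens.

    Illustrative only: scores each collocate with logDice,
    14 + log2(2 * f_xy / (f_x + f_y)), chosen here as an assumption.
    """
    freq = Counter(tokens)   # corpus frequency of every word
    pair = Counter()         # co-occurrence counts with the node word

    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair[tokens[j]] += 1

    scores = {
        w: 14 + math.log2(2 * f_xy / (freq[node] + freq[w]))
        for w, f_xy in pair.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    corpus = "strong tea and strong coffee but weak tea".split()
    for word, score in collocates(corpus, "tea", window=1):
        print(f"{word}\t{score:.2f}")
```

On a corpus this small the scores carry no linguistic weight. The point is only the shape of the computation: count co-occurrences, normalise by the frequencies of both words, and rank.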
Sketch Engine consists of three main components: an underlying database management system called Manatee, a web interface search front-end called Bonito and a web interface for corpus building and management called Corpus Architect.
Manatee
Manatee is a database management system designed specifically for efficient indexing of large text corpora. It is based on the idea of inverted indexing and has been used to index text corpora comprising tens of billions of words. Corpora indexed by Manatee are searched by formulating queries in the Corpus Query Language. Manatee is written in C++ and offers an API for a number of other programming languages, including Python, Java, Perl and Ruby. Recently, it was rewritten in Go for faster processing of corpus queries.
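The idea of inverted indexing can be illustrated with a small, self-contained Python sketch. It does not reflect Manatee's actual data structures or API. The helpers `build_index` and `find_phrase` are hypothetical names introduced here, and the example only shows the general principle: map each word to the positions where it occurs, so that a query can be answered by looking up postings instead of scanning the whole text.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each token to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index


def find_phrase(index, words):
    """Return start positions where `words` occur as a contiguous sequence."""
    if not words:
        return []
    # Candidate starts are the positions of the first word.
    starts = set(index.get(words[0], []))
    # Keep a candidate only if each following word sits at the right offset.
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        starts = {s for s in starts if s + offset in positions}
    return sorted(starts)


if __name__ == "__main__":
    corpus = "the cat sat on the mat".split()
    idx = build_index(corpus)
    print(find_phrase(idx, ["the", "mat"]))
```

A two-token query of this kind roughly corresponds to what a simple Corpus Query Language expression such as `[word="the"] [word="mat"]` asks for. A production system would additionally index attributes such as lemma and part-of-speech tag and store postings in a compressed on-disk form.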
Bonito
Bonito is a web interface for Manatee providing access to corpus search. In the client–server model, Manatee is the server and Bonito plays the client part. It is written in Python.
Corpus Architect
Corpus Architect is a web interface providing corpus building and management features. It is also written in Python.