Wikipedia Monolingual Corpora: more than 5 billion tokens of text - TopicsExpress



          

Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23 languages, extracted from Wikipedia. The corpora are annotated with article and paragraph boundaries, the number of incoming links for each article, the anchor texts used to refer to each article (textlinks) and their frequencies, cross-language links, categories, and more (linguatools.org/tools/corpora/wikipedia-monolingual-corpora/). There is also a script that extracts domain-specific sub-corpora from a list of categories you provide.
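To illustrate the idea of category-based sub-corpus extraction, here is a minimal Python sketch. It is not the linguatools script and does not read the actual corpus files. The `Article` record, the `extract_subcorpus` and `count_tokens` functions, and the sample data are all hypothetical, assumed only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical in-memory record (assumption): the real corpus file
    # format is not described in this post.
    title: str
    categories: set
    paragraphs: list = field(default_factory=list)


def extract_subcorpus(articles, wanted_categories):
    """Yield articles that belong to at least one wanted category."""
    # Compare category names case-insensitively.
    wanted = {c.casefold() for c in wanted_categories}
    for article in articles:
        if wanted & {c.casefold() for c in article.categories}:
            yield article


def count_tokens(articles):
    """Count whitespace-separated tokens; a crude stand-in for real tokenization."""
    return sum(len(p.split()) for a in articles for p in a.paragraphs)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    corpus = [
        Article("Aspirin", {"Medicine"}, ["Aspirin is a drug ."]),
        Article("Paris", {"Cities"}, ["Paris is a city ."]),
    ]
    medical = list(extract_subcorpus(corpus, ["medicine"]))
    print([a.title for a in medical], count_tokens(medical))
```

With real data, the list of articles would come from whatever parser reads the downloaded corpus files, and the category list would name the domain of interest.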
Posted on: Thu, 27 Nov 2014 12:10:42 +0000




