Wikipedia Monolingual Corpora: more than 5 billion tokens of text

Michael Pace-Sigge

Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23 languages, extracted from Wikipedia. The corpora are annotated with article and paragraph boundaries, the number of incoming links for each article, the anchor texts used to refer to each article (textlinks) and their frequencies, cross-language links, categories, and more (linguatools.org/tools/corpora/wikipedia-monolingual-corpora/). There is also a script that lets you extract domain-specific sub-corpora if you provide a list of desired categories.
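To give a rough idea of what category-based extraction involves, here is a minimal sketch in Python. It is not the linguatools script, and the markup is an assumption: the element and attribute names (`article`, `category`, `p`, `name`), the sample data, and the function `extract_subcorpus` are all hypothetical, so check them against the actual corpus files before reusing any of this.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data. The real corpus files may use different
# element and attribute names; adjust the tag names below to match them.
SAMPLE = """<corpus>
  <article name="Guitar">
    <category name="Musical instruments"/>
    <p>The guitar is a fretted instrument.</p>
    <p>It usually has six strings.</p>
  </article>
  <article name="Paris">
    <category name="Capitals in Europe"/>
    <p>Paris is the capital of France.</p>
  </article>
</corpus>"""


def extract_subcorpus(source, wanted_categories):
    """Yield (title, paragraphs) for articles tagged with any wanted category."""
    wanted = set(wanted_categories)
    # iterparse streams the file, so a multi-gigabyte corpus is never
    # loaded into memory all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        categories = {c.get("name") for c in elem.iter("category")}
        if categories & wanted:
            paragraphs = [p.text or "" for p in elem.iter("p")]
            yield elem.get("name"), paragraphs
        elem.clear()  # drop the finished article's children to free memory


subcorpus = list(
    extract_subcorpus(io.BytesIO(SAMPLE.encode("utf-8")), ["Musical instruments"])
)
for title, paragraphs in subcorpus:
    print(title, len(paragraphs))
```

For a real corpus file, pass its path in place of the `io.BytesIO` object. Streaming matters at this scale, since a corpus of several billion tokens will not fit comfortably in memory as a single parsed tree.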

Posted on: Thu, 27 Nov 2014 12:10:42 +0000
