Brown Corpus

TL;DR

Freely available corpus of part-of-speech tagged documents, converted into big-ass JavaScript files. Try perusing the docs here.

Background

I have a long history of interest in natural language processing. In July 2016 I decided to dabble a bit with the Brown Corpus (BC) of tagged documents, as many have before me, to see what I could learn about the statistics of how we (in a limited scope) use English, and whether I could write software that uses the knowledge embedded within the BC to better parse free-form English prose.

For simplicity, I decided to program in JavaScript. It took a while and some clever programming just to parse the source documents into JSON (technically full JavaScript) files. But then it is easy to just include those scripts in test HTML pages. It may sound crazy to include about 50MB of JavaScript files on a web page, but most modern computers -- including probably your smartphone -- can handle this pretty easily. And it's suddenly easy to explore the data set with JavaScript code and see the results as plain HTML.

Try perusing the documents here.

The data

Packing the database into pure JavaScript files makes it easy to share my work with you for your own research. Here are the data files:

Name                        Size     Summary
BrownCorpus.js              18 B     Declares the global brownCorpus object all the other scripts fill in
BrownCorpus_TagTypes.js     43 KB    Lookup lists for all the fiddly part-of-speech tags used and their basic PoS
BrownCorpus_DocToc.js       241 KB   Table of contents with titles and copyrights for each document
BrownCorpus_Docs.js         23.2 MB  All 500 documents, broken down by paragraph, sentence, and word
BrownCorpus_Words.js        2.7 MB   All 50k unique words found in the BC, with alternative PoS and frequency stats
BrownCorpus_WordTriples.js  17.3 MB  All three-word tuples found in sentences, plus frequencies and naive guess sets
BrownCorpus_PosTriples.js   590 KB   All three-basic-part-of-speech combinations, along with naive guess sets
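
To give a feel for how these get used, here's a minimal sketch of a test HTML page that loads a few of the files above. One assumption worth stating: since BrownCorpus.js declares the global object the other scripts fill, I load it first.

    <!DOCTYPE html>
    <html>
    <body>
    <!-- BrownCorpus.js declares the global brownCorpus object, so load it
         before any of the data files that fill it in. -->
    <script src="BrownCorpus.js"></script>
    <script src="BrownCorpus_TagTypes.js"></script>
    <script src="BrownCorpus_Words.js"></script>
    <script>
        // The data is now available as plain JavaScript objects.
        document.write("Top-level keys: " + Object.keys(brownCorpus).join(", "));
    </script>
    </body>
    </html>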

Messy part-of-speech tags

I took pains to make sure that every word found in brownCorpus.docs has a "p" attribute that matches up with a brownCorpus.pos entry. The catch is that these tags sometimes end in "-hl" and/or "-tl", indicating that the word is part of a section headline or of a person's title (e.g., "Dr."), respectively. You'll need to strip those suffixes off before searching for a match, but once you do, every word.p attribute has a proper match. Here's how I handled this in my PeruseDocs.html page, given a variable "word" pointing to some place in brownCorpus.docs:

    let posKey = word.p;       // raw tag, possibly suffixed with -tl and/or -hl
    let titleTag = false;
    let headlineTag = false;
    let match = null;

    // Strip a trailing "-tl" (word is part of a person's title), if present.
    match = posKey.match(/^(.*)-tl$/);
    if (match) {
        posKey = match[1];
        titleTag = true;
    }

    // Strip a trailing "-hl" (word is part of a section headline), if present.
    match = posKey.match(/^(.*)-hl$/);
    if (match) {
        posKey = match[1];
        headlineTag = true;
    }

    // With the suffixes removed, posKey always matches a pos entry.
    let pos = brownCorpus.pos[posKey];

Triples

The word and PoS triples data give you a trivial way to guess what the Nth (1st, 2nd, or 3rd) word of a three-word window will be, based on nothing but frequency. Let's say the parser came across a sentence like "Mary explained the consequences to John." One of the triples (3-tuples) will be "Mary explained the". Another will be "explained the consequences". And so on. The corresponding part-of-speech tuples will be noun+verb+article and verb+article+noun, respectively. If "explained the consequences" appeared 20 times throughout the corpus, there would be one entry covering all 20 occurrences, with that number as its global count, which is a useful statistic.
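
To make the windowing concrete, here's a tiny illustration (my own toy code, not the corpus-building script) of sliding a three-word window across a tokenized sentence:

    // Each position of a 3-word sliding window yields one triple.
    let words = ["Mary", "explained", "the", "consequences", "to", "John"];
    for (let i = 0; i + 2 < words.length; i++) {
        console.log(words[i] + " " + words[i + 1] + " " + words[i + 2]);
    }
    // => Mary explained the
    //    explained the consequences
    //    the consequences to
    //    consequences to John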

In addition to the brownCorpus.wordTriples.all dictionary object, you also get the brownCorpus.wordTriples.guessFirst, .guessSecond, and .guessThird dictionaries. If you had a Mad Libs-type application, you could use these to guess a missing word based purely on what was found (if anything) in the BC and which option was most common there. You might look up brownCorpus.wordTriples.guessThird["explained|the"] and find that "consequences", "situation", and "meaning" are some of the third words found in the BC following those two earlier words.
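
Here's a minimal sketch of that kind of lookup. I'm assuming each guess dictionary maps a "word1|word2" key to an object of candidate words and their counts; check the actual shape in BrownCorpus_WordTriples.js before relying on this:

    // Hypothetical helper: return the most frequent third word for a pair.
    // Assumes guessThird["w1|w2"] is an object of { candidateWord: count }.
    function guessThirdWord(first, second) {
        let candidates = brownCorpus.wordTriples.guessThird[first + "|" + second];
        if (!candidates) {
            return null;    // this word pair never appeared in the BC
        }
        let best = null;
        let bestCount = 0;
        for (let word in candidates) {
            if (candidates[word] > bestCount) {
                bestCount = candidates[word];
                best = word;
            }
        }
        return best;
    }

    // e.g., guessThirdWord("explained", "the") might return "consequences"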

The same concept applies to parts of speech in the BrownCorpus_PosTriples.js file, which is structured pretty much the same way. It does not, however, use all the nitpicky part-of-speech tags found in the BC, only the 15 "basic" parts of speech (e.g., verb and noun). You can use it to guess the part of speech of whatever word your parser is considering now, given that you know the PoS of the words around it.

The guessing game for both specific words and general parts of speech can be played with several overlapping triples if you use all three guess sets (first, second, and third). If you want to find the likely part of speech of the word at position 10, consider using guess-third with words 8 and 9 as input, then guess-second with words 9 and 11, and finally guess-first with words 11 and 12. The math and the algorithm are a little trickier, but the results should generally be better, since this gives you a way to look two words to the left and right of any word in the middle of a sentence.
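
Here's a hedged sketch of that overlapping-window idea for parts of speech. Two assumptions worth flagging: that the PoS data lands in a brownCorpus.posTriples object mirroring the wordTriples structure, and that the key order is "first|third" for guessSecond and "second|third" for guessFirst; verify both against BrownCorpus_PosTriples.js:

    // Guess the basic PoS at position i by pooling counts from the three
    // overlapping triples that contain it. posTags is an array of the
    // (known or already-guessed) basic PoS for each word in the sentence.
    function guessPosAt(posTags, i) {
        let t = brownCorpus.posTriples;    // assumed object name
        let lookups = [
            t.guessThird[posTags[i - 2] + "|" + posTags[i - 1]],   // (i-2)(i-1)[i]
            t.guessSecond[posTags[i - 1] + "|" + posTags[i + 1]],  // (i-1)[i](i+1)
            t.guessFirst[posTags[i + 1] + "|" + posTags[i + 2]]    // [i](i+1)(i+2)
        ];
        let votes = {};
        for (let dict of lookups) {
            if (!dict) continue;    // window runs off the sentence, or never seen
            for (let pos in dict) {
                votes[pos] = (votes[pos] || 0) + dict[pos];
            }
        }
        let best = null;
        let bestCount = 0;
        for (let pos in votes) {
            if (votes[pos] > bestCount) {
                bestCount = votes[pos];
                best = pos;
            }
        }
        return best;    // null if none of the three windows matched
    }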

You can also download a ZIP file with all the source data files I parsed to create my JavaScript files.

The fine print

As the owner of this site, I am not affiliated with the Brown Corpus project or any other organization that curates or works with it. I do not hold any copyrights on the data. To the best of my knowledge, the data is freely available for private research projects like my own, but given how widely it has already been distributed, I'm not sure what its current copyright status is.

I encourage you to research what usage and redistribution rights you might have. Caveat emptor!