Technically Wrong, by Sara Wachter-Boettcher, is the best popular-press book about misclassification and algorithms.
Its chapters cover common problems that should not be happening.
From Chapter 10, “Technically Dangerous”:
Software is designed and coded by people who do not represent the general population. “… The narrower those people’s perspectives are, the more they design and code like themselves and shrug off any responsibility for outcome, the more inequality, insensitivity and hate can thrive.”
People die from classification errors.
Small choices matter.
Text Analysis with R for Students of Literature by Matthew L. Jockers, published by Springer.
This is a well-written book on text analysis. It gives enough information to get you started with R, followed by easy-to-understand details about text analysis.
Chapter 6 covers the type-token ratio (TTR).
Chapter 7 covers hapax legomena, words that appear only once in a text.
Chapter 8 covers keyword in context (KWIC), including how to make a corpus.
Chapter 11 covers clustering. Chapter 12 covers classification, shows how to do crosstabs with the xtabs function, and introduces support vector machines (SVM).
Chapter 13 covers topic modeling.
This is a good book to have if you are doing text analysis.
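The measures from Chapters 6 through 8 are simple enough to sketch in a few lines. The book works in R; the version below is a rough Python illustration, not the book's code. The tokenizer, the function names and the sample sentence are all assumptions made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Distinct words (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def hapax_legomena(tokens):
    """Words that occur exactly once in the token list, sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)

def kwic(tokens, keyword, window=2):
    """Each occurrence of keyword with up to `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Made-up sample text, only for illustration.
sample = "the cat sat on the mat and the dog sat by the door"
tokens = tokenize(sample)

print(round(type_token_ratio(tokens), 3))
print(hapax_legomena(tokens))
for line in kwic(tokens, "sat"):
    print(line)
```

A real corpus would need a more careful tokenizer (punctuation, hyphens, accented characters), but the counting logic stays the same.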
I did a session at BarCamp 7 in Portland. I brought a plastic bin of toys and asked the question: is there a cat in here? I talked through and demoed how we would go about answering it. Inspecting each item to verify whether it is a cat is very slow. First, how would we know if we had a cat? We concluded that a cat has four legs, a head and fur. We took samples out of the bin and classified them into groups. I showed different types of classification trees, including a discussion of red-black trees. Members of the group discussed their own big data issues and sorts, such as coming up with inspection criteria that let you make large cuts at the beginning and never look at that data again. We got through 80% of the toys and concluded that there wasn’t a cat in the bin.
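The "large cuts at the beginning" idea can be sketched as a chain of filters, where each check only sees the items that survived the previous one. This is a toy illustration rather than what we ran in the session: the `Toy` class, the order of the checks and the items in the bin are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Toy:
    # Hypothetical features, matching the criteria the group agreed on.
    name: str
    legs: int
    has_head: bool
    has_fur: bool

# Checks are applied in order; anything that fails one is never looked at again.
CHECKS = [
    ("four legs", lambda t: t.legs == 4),
    ("head", lambda t: t.has_head),
    ("fur", lambda t: t.has_fur),
]

def find_cat_candidates(toys):
    """Return the toys that pass every check, discarding failures as early as possible."""
    remaining = list(toys)
    for label, check in CHECKS:
        remaining = [t for t in remaining if check(t)]
        print(f"after '{label}' check: {len(remaining)} left")
    return remaining

# Made-up bin contents.
bin_of_toys = [
    Toy("ball", 0, False, False),
    Toy("robot", 2, True, False),
    Toy("plastic horse", 4, True, False),
    Toy("plush dog", 4, True, True),
]

candidates = find_cat_candidates(bin_of_toys)
print([t.name for t in candidates])
```

In this made-up bin a plush dog passes all three checks, so whatever survives the cuts still needs a closer look before anyone calls it a cat.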