Tag Archives: Data Science

Python Data Science Essentials

Authors: Alberto Boschetti and Luca Massaron, published by Packt, April 2015.

I am a Data Scientist who usually codes in R. It was a challenge to get comfortable enough in Python to review the book. Python comes in a lot of flavors. I used Anaconda Launcher to run Jupyter notebooks. The code is on the publisher’s page.

With broad strokes in six chapters, it covers the fundamentals of Data Science using Python. The pretty blue mosaic tile swirl on the cover catches your eye.

My favorite chapter is chapter five on Social Network Analysis. I like the table on graph types, nodes and edges. For example, Twitter is a directed graph: people are nodes and follower relationships are edges. It is a very useful table for writing code.
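To make the Twitter row of that table concrete, here is a minimal Python sketch of a directed follower graph using only the standard library. The account names and the helper functions are made up for illustration; the book's own notebooks may well use a graph library instead.

```python
from collections import defaultdict

# Hypothetical follow relationships: (follower, followed).
follows = [("ann", "bob"), ("bob", "cara"), ("cara", "ann"), ("dan", "ann")]

# Directed graph as an adjacency list: node -> set of nodes it points to.
graph = defaultdict(set)
for follower, followed in follows:
    graph[follower].add(followed)

def followers_of(user):
    """Return everyone with an edge pointing at `user` (in-neighbors)."""
    return sorted(src for src, targets in graph.items() if user in targets)

def following(user):
    """Return everyone `user` points at (out-neighbors)."""
    return sorted(graph.get(user, set()))

print(followers_of("ann"))  # ['cara', 'dan']
print(following("ann"))     # ['bob']
```

Because the edges are directed, "ann follows bob" does not imply "bob follows ann", which is why the two helpers return different lists.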

Get the code, run the notebooks, have fun.



Mastering Social Media Mining with R

By Sharan Kumar Ravindran, published by Packt, September 2015.

A useful R book that covers current social media and data science techniques.

My favorite library in this book is from chapter six, SocialMediaMineR.

The function get_facebook from the SocialMediaMineR package takes a URL and returns a data frame of shares, likes, etc. The function is easy to use. You do not need OAuth, just a link. It works like this:

> library(SocialMediaMineR)
> get_facebook("https://www.packtpub.com")
trying URL ‘http://curl.haxx.se/ca/cacert.pem’
Content type ‘…’ length 256338 bytes (250 Kb)
opened URL
downloaded 250 Kb

                       url            normalized_url
1 https://www.packtpub.com https://www.packtpub.com/
  share_count like_count comment_count total_count click_count
1         432        361           155         948           0
      comments_fbid commentsbox_count
1 10150745127795008                 0

This one function could keep you occupied for a long time.

But there are other useful libraries in this book: ROAuth for OAuth, twitteR for Twitter, Rfacebook for Facebook, and rgithub for GitHub.

The book covers exploratory data analysis (EDA) in the chapter on GitHub, and sentiment analysis in the chapter on Twitter.

The book briefly covers a lot. There are many other books that cover a single topic in more detail. Read this book to discover what you want to explore.


Building a Recommendation System with R

Written by Suresh K. Gorakala and Michele Usuelli, published by Packt, 2015.

This is a whole book on a topic that is often only a single chapter in a book. It is a book for people who already know R and machine learning.

The book uses math equations, not just code, to teach the concepts.

It covers the confusion matrix for classification, along with sensitivity and specificity. There are lots of details about type one and type two errors. This clearly written section will help you understand what the two types of error are and why you don’t want either of them.
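As a rough illustration of those terms, the Python sketch below counts the four cells of a binary confusion matrix and derives the two rates from them. The labels are invented for the example and the function name is hypothetical; this is not code from the book, which works in R.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # type one errors
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # type two errors
    return tp, tn, fp, fn

# Made-up labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # share of actual positives that were caught
specificity = tn / (tn + fp)  # share of actual negatives that were cleared

print(tp, tn, fp, fn)          # 3 4 2 1
print(sensitivity)             # 0.75
print(round(specificity, 3))   # 0.667
```

A type one error (false positive) lowers specificity, while a type two error (false negative) lowers sensitivity, so the two rates track the two kinds of mistake separately.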

Classification similarity measures include Euclidean Distance, Cosine Distance and Pearson Correlation.
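For readers who want to see the three measures side by side, here is a minimal Python sketch using only the standard library. The rating vectors are made up, and the function names are illustrative rather than taken from the book or from recommenderlab.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    """One minus the cosine of the angle between the vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical ratings two users gave to the same four items.
user_a = [5, 3, 4, 4]
user_b = [3, 1, 2, 3]

print(euclidean_distance(user_a, user_b))
print(cosine_distance(user_a, user_b))
print(pearson_correlation(user_a, user_b))
```

Euclidean distance is sensitive to one user rating everything higher than another, while cosine distance and Pearson correlation mostly respond to whether the ratings move in the same direction.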

Dimensionality reduction techniques include Principal Component Analysis.

Data Mining techniques include K-means clustering and Support Vector Machine.

Recommender system techniques include collaborative filtering and content-based filtering.


R package for the book is recommenderlab.

recommenderlab: Lab for Developing and Testing Recommender Algorithms by Michael Hahsler at http://CRAN.R-project.org/package=recommenderlab

Other packages used are lsa, e1071, cluster.


Beginning Data Science with R


Beginning Data Science with R, written by Manas A. Pathak, published by Springer, 2014.
ISBN 978-3-319-12065-2

Code examples at extras.springer.com

This book is written for people who already know how to code and want to learn R for data science.

The book covers how to install and use R, but not an IDE like RStudio.

Chapter 2 includes control structures and functions. It explains that functions in R are treated as first-class objects, a fundamental property of functional programming languages.
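The idea of first-class functions is not specific to R. As a small sketch of what it means in practice, here it is in Python rather than the book's R; the function names are invented for the example.

```python
def apply_twice(func, value):
    """Call `func` on `value`, then call it again on the result."""
    return func(func(value))

def make_scaler(factor):
    """Build and return a new function that multiplies its input by `factor`."""
    def scale(x):
        return x * factor
    return scale

double = make_scaler(2)                       # a function stored in a variable
transforms = {"double": double, "abs": abs}   # functions stored in a dict

print(apply_twice(double, 3))    # 12
print(transforms["abs"](-5))     # 5
```

Functions are passed as arguments, returned from other functions, and stored in data structures, which is what "first-class" refers to.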

Chapter 3 is on getting data into R. How to get data into R is a common question. Years ago I was puzzled about getting data into R. I didn’t want to type it all into an array. You don’t have to type in the data; R will read, pull from, and connect to all sorts of data sources.

Chapter 4 is a nice overview of data visualization.

The book goes on to cover necessary topics and techniques in Data Science. What I want to point out is that Chapter 7.3.1 on nearest neighbors uses a package that I haven’t used before, kknn. The package is straightforward to use. The author, Pathak, has written an easy-to-grasp explanation of the technique.
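For a feel of the nearest neighbors technique itself, here is a bare-bones Python sketch: plain, unweighted k-nearest-neighbor classification by majority vote. It is not the kknn package (which is R) and is not taken from the book; the training points and the function name are made up.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two made-up clusters of 2-D points, labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((4.0, 4.2), "b"), ((4.1, 3.9), "b"), ((3.8, 4.0), "b"),
]

print(knn_predict(train, (1.1, 1.0)))  # a
print(knn_predict(train, (3.9, 4.1)))  # b
```

There is no training step: the method just keeps the labeled points and looks up the closest ones when a new point arrives.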

This is a good book to get you started coding in R for data science.

Books That I used In My Data Science Talk

This is a list of books, most of which I have previously reviewed.

Bonnie, who guards the books.

Algorithms, Robert Sedgewick, Kevin Wayne 2011
The Black Belt Memory Jogger, Six Sigma Academy 2002
Six Sigma with R, E. Cano, J. Moguerza, A. Redchuk 2012
Graphical Models with R, S. Hojsgaard
The Cartoon Guide to Statistics, L. Gonick & W. Smith
R for Business Analytics, A. Ohri
The Art of R Programming, N. Matloff
The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman

Graphical Models with R


Graphical Models with R by Soren Hojsgaard, David Edwards and Steffen Lauritzen, published by Springer.

I have been to a lot of talks lately on graphing social networks. Most of the code has been in Python. I was happy to find this book written in R.
gRbase is the package for the book.
Most of the packages used in the book are on CRAN. The few missing dependencies are on bioconductor.org. Links to the site are on the gRbase page on CRAN. With some fiddling I got everything to work correctly.

Chapter 2.3.5 covers hypothesis testing with graphical models.

I have been busy working through all the examples in the book, making lots of big spidery graphs that make sense. I am pleased with the mix of theory and code in this book.

I am thinking about what data to use with the code for an upcoming talk that I am giving.


Thinking with Data by Max Shron

Max Shron wrote Thinking with Data: How to Turn Information into Insights, published by O’Reilly.

I requested a review copy of this book because it looked interesting: math and philosophy meet data science.
There is no code in this book. It is worth reading because it goes over the concepts concisely, covering reasoning and arguments. The examples are timely, such as: does being close to mass transportation increase the cost of renting an apartment?

The book concludes with the author’s hope that in several years the material will be obvious to data scientists and a clear place to start.

R Statistical Application Development by Example Beginner’s Guide

R Statistical Application Development by Example Beginner’s Guide by Prabhanjan Narayanachar Tattar, ISBN 1849519447, published by packtpub.com, 2013.

This book doesn’t do everything for you. It gets you started on the topics covered in each chapter, then gives you open-ended problems to solve. It took me a while to work through the book. The “Time for action” exercises are worth the effort to puzzle through and play with. The start of the book is good for beginners. The rest of the book has more advanced topics, like CART and ridge regression.

Doing Data Science

Rachel Schutt and Cathy O’Neil wrote Doing Data Science: Straight Talk from the Frontline, published by O’Reilly, 2013.

The book describes and prescribes how to do Data Science. It isn’t a how-to manual, and it isn’t for beginners. In the book there are plenty of references to good beginner materials, many of which are reviewed in this blog. The R and Python code provides examples of how to go about doing data science.

I received a review copy of this book. I am very pleased to have read it. Doing Data Science succinctly describes topics that I have been trying to get across to people. Chapter 2 has excellent information about Populations and Samples in Big Data. Chapter 16 covers Next-generation Data Scientist, Hubris and Ethics, a good topic to include.

The book came out of a class. I would have liked to have been in the class.

R By Example


R by Example by Jim Albert and Maria Rizzo, published by Springer, 2013.

The thoughtfulness of this book demonstrates the authors’ statement that this book was written to answer students’ questions.

The data sets used are varied, both old and newer. They include horse kicks to Prussian army officers (my great-great-grandpa Peter was in the Prussian Army) and, in Chapter 13.1, estimating when Sam will meet Annie from Sleepless in Seattle, using the Monte Carlo method for computing intervals.

Chapter 3.4 shows how to make a contingency table in R, something that I wish there was a good package for.

Chapter 11 on Simulating Experiments tells where the term Monte Carlo came from, then continues on to show by example how to implement the code.

Section 11.5 on Patterns of dependence in a sequence has good information and R code for computing the significance of a streak, demonstrated with winning streaks in baseball.
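One common way to put a number on a streak is a permutation-style Monte Carlo test: shuffle the same wins and losses many times and see how often chance alone produces a streak at least as long as the one observed. The Python sketch below illustrates that idea under those assumptions. The season record is made up, the function names are hypothetical, and this is not the book's R code.

```python
import random

def longest_streak(results):
    """Length of the longest run of consecutive wins (True values)."""
    best = current = 0
    for won in results:
        current = current + 1 if won else 0
        best = max(best, current)
    return best

def streak_p_value(results, n_sims=10_000, seed=0):
    """Estimate how often a random ordering matches the observed streak.

    Returns (observed longest streak, share of shuffles with a streak
    at least that long).
    """
    rng = random.Random(seed)   # fixed seed so runs are repeatable
    observed = longest_streak(results)
    shuffled = list(results)
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)   # same wins and losses, random order
        if longest_streak(shuffled) >= observed:
            hits += 1
    return observed, hits / n_sims

# Made-up season: W = win, L = loss.
season = [c == "W" for c in "WWLWWWWWWLLWLWWLLLWW"]
observed, p = streak_p_value(season)
print(observed, p)
```

A small share suggests the streak is longer than shuffling alone would usually produce; a large share suggests the streak is unremarkable for a team with that win-loss record.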

Appendix A covers arrays, vectors and matrices.