
## Streaming from zip to bz2

2016-07-08

I’ve got a massive bunch of zip archives, each of which contains only a single file. And the name of the enclosed file varies. Dealing with these data is painful. It’d be a lot more convenient if the files were compressed with gzip or bzip2 and had a consistent naming convention. How would you go about making that conversion without actually unpacking the zip archive, finding the name of the enclosed file and then recompressing? Read more »
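A minimal sketch of one way to do the conversion, in Python rather than at the shell, and not necessarily the approach taken in the full post. It assumes each archive holds exactly one file and that naming the output after the archive itself is an acceptable convention. The function name `zip_to_bz2` is made up for illustration. The enclosed file is read as a stream and recompressed chunk by chunk, so nothing is unpacked to disk and its name never needs to be known in advance.

```python
import bz2
import shutil
import zipfile
from pathlib import Path


def zip_to_bz2(zip_path, out_dir=None):
    """Stream the single member of a zip archive into <archive stem>.bz2.

    Hypothetical helper: the output name is derived from the archive name,
    not from the (unpredictable) name of the enclosed file.
    """
    zip_path = Path(zip_path)
    out_dir = Path(out_dir) if out_dir is not None else zip_path.parent
    out_path = out_dir / (zip_path.stem + ".bz2")

    with zipfile.ZipFile(zip_path) as archive:
        # Ignore directory entries; insist on exactly one real file.
        members = [m for m in archive.infolist() if not m.is_dir()]
        if len(members) != 1:
            raise ValueError(
                f"{zip_path} holds {len(members)} files, expected exactly 1"
            )
        # Decompress and recompress in chunks; nothing is extracted to disk.
        with archive.open(members[0]) as src, bz2.open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    return out_path
```

Looping over a directory is then something like `for z in Path("archives").glob("*.zip"): zip_to_bz2(z)`, where `archives` is a placeholder directory name.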

## Major League Baseball Birth Months

2016-07-05

“The cutoff date for almost all nonschool baseball leagues in the United States is July 31, with the result that more major league players are born in August than in any other month.” (Malcolm Gladwell, Outliers)

A quick analysis to confirm Gladwell’s assertion above, using data scraped from www.baseball-reference.com. Read more »

## Upgrading Ubuntu 16.04 to Linux Kernel 4.4.12

2016-06-04

I’ve had a few minor hardware issues with the default kernel in Ubuntu 16.04. For example, hibernate does not work on my laptop. So, in an effort to resolve these problems, I upgraded from the 4.4.0 version of the kernel to 4.4.12. Nothing tricky involved, but here’s the process.

Grab the headers and image.

`$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.12-xenial/linux-headers-4.4.12-040412-generic_4.4.12-040412.201606011712_amd64.deb`

`$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.12-xenial/linux-headers-4.4.12-040412_4.4.12-040412.201606011712_all.deb`

`$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.12-xenial/linux-image-4.4.12-040412-generic_4.4.12-040412.201606011712_amd64.deb`

Then, become root and install the kernel. Read more »

## satRday in Cape Town

2016-05-26

We are planning to host one of the three inaugural satRday conferences in Cape Town during 2017. The [R Consortium](https://www.r-consortium.org/) has committed to funding three of these events: one will be in Hungary, another will be somewhere in the USA and the third will be at an international destination. At present Cape Town is dicing it out with Monterrey (Mexico) for the third location. We just need your votes to make Cape Town’s plans a reality. Read more »

2016-05-18

Great video bringing back some good memories. Read more »

2016-05-12

2016-04-13

## The Next Rembrandt

2016-04-06

Creating The Next Rembrandt: using data to touch the human soul. How a team from ING, Microsoft, TU Delft, Mauritshuis and Rembrandthuis used technology to synthesise a painting in the style of the Dutch master, Rembrandt, almost 350 years after his death.

2016-03-19

## International Open Data Day

2016-03-05

As part of International Open Data Day we spent the morning with a bunch of like-minded people poring over some open Census South Africa data. Excellent initiative, @opendatadurban. I’m very excited to see where this is all going and look forward to contributing to the journey! The data above show the distribution of ages in a segment of the South African population who have either no schooling (blue) or have completed Grade 12 (orange). Read more »

## R, HDF5 Data and Lightning

2016-02-23

I used to spend an inordinate amount of time digging through lightning data. These data came from a number of sources, the World Wide Lightning Location Network (WWLLN) and LIS/OTD being the most common. I recently needed to work with some Hierarchical Data Format (HDF) data. HDF is something of a niche format and, since that was the format used for the LIS/OTD data, I went to review those old scripts. It was very pleasant rediscovering work I did some time ago. Read more »

## GPS Doodling

2016-02-22

Stephen Lund combines two of my passions: technology and exercise. Awesome. Durban Doodles coming soon. Read more »

2016-02-12

## Automating R scripts under Windows

2016-02-11

Setting up an automated job under Linux is a cinch thanks to cron. Doing the same under Windows is a little trickier, but still eminently doable. Read more »

2016-02-10

2016-02-08

2016-02-08

2016-01-25

## Kaggle: Santa's Stolen Sleigh

2016-01-22

This morning I read Wendy Kan’s interesting post on Creating Santa’s Stolen Sleigh. I hadn’t really thought too much about the process of constructing an optimisation competition, but Wendy gave some interesting insights on the considerations involved in designing a competition which was both fun and challenging but still computationally feasible without military grade hardware. This seems like an opportune time to jot down some of my personal notes and also take a look at the results. I know that this sort of discussion is normally the prerogative of the winners and I trust that my ramblings won’t be viewed as presumptuous. Read more »

## Casting a Wide (and Sparse) Matrix in R

2016-01-19

I routinely use melt() and cast() from the reshape2 package as part of my data munging workflow. Recently I’ve noticed that the data frames I’ve been casting are often extremely sparse. Stashing these in a dense data structure just feels wasteful. And the dismal drone of page thrashing is unpleasant. So I had a look around for an alternative. As it turns out, it’s remarkably easy to cast a sparse matrix using sparseMatrix() from the Matrix package. Read more »

## Kaggle: Walmart Trip Type Classification

2016-01-15

Walmart Trip Type Classification was my first real foray into the world of Kaggle and I’m hooked. I previously dabbled in What’s Cooking but that was as part of a team and the team didn’t work out particularly well. As a learning experience the competition was second to none. My final entry put me at position 155 out of 1061 entries which, although not a stellar performance by any means, is just inside the top 15% and I’m pretty happy with that. Below are a few notes on the competition. Read more »

## MongoDB: Installing on Windows 7

2016-01-13

It’s not my personal choice, but I have to spend a lot of my time working under Windows. Installing MongoDB under Ubuntu is a snap. Getting it going under Windows seems to require jumping through a few more hoops. Here are my notes. I hope that somebody will find them useful. Read more »

2016-01-11

## Review: Learning Shiny

2016-01-05

I was asked to review Learning Shiny (Hernán G. Resnizky, Packt Publishing, 2015). I found the book to be useful, motivating and generally easy to read. I’d already spent some time dabbling with Shiny, but the book helped me graduate from paddling in the shallows to wading out into the Shiny sea. Read more »

## Using Checksum to Guess Message Length: Not a Good Idea!

2015-12-22

A question posed by one of my colleagues: can a checksum be used to guess message length? My immediate response was negative and, as it turns out, a simple simulation supported this knee-jerk reaction. Read more »
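A rough sketch of what such a simulation might look like, in Python. Everything here is an assumption made for illustration, not the checksum or setup from the full post: the checksum is a toy (byte sum modulo 256), the messages are uniformly random bytes, and the function names are invented. For random bytes the sum modulo 256 is uniformly distributed whatever the length, so the checksum should carry no information about how long the message was.

```python
import random


def checksum(message):
    """Toy checksum (an assumption): sum of the bytes, modulo 256."""
    return sum(message) % 256


def simulate(n_messages=5000, max_length=64, seed=1):
    """Return (length, checksum) pairs for random messages of random length."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_messages):
        length = rng.randint(1, max_length)
        message = bytes(rng.randrange(256) for _ in range(length))
        pairs.append((length, checksum(message)))
    return pairs


def pearson(pairs):
    """Pearson correlation between the two columns of pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    r = pearson(simulate())
    print(f"correlation between length and checksum: {r:+.3f}")
```

A correlation near zero only rules out a linear relationship, so this is a sanity check rather than a proof. Real messages are not random bytes either, and a different checksum could behave differently.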

## Goto

2015-12-21

For a moment this morning I was regretting the fact that R doesn’t have a goto statement, but then… Read more »

## Making Sense of Logarithmic Loss

2015-12-14

Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric in Kaggle competitions. Since success in these competitions hinges on effectively minimising the Log Loss, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted. Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is basically equivalent to maximising the accuracy of the classifier, but there is a subtle twist which we’ll get to in a moment. Read more »
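For reference, here is a small Python sketch of the usual binary form of the metric, the mean of -[y log(p) + (1 - y) log(1 - p)] over all observations. The function name, the clipping constant `eps` and the toy data are choices made for this illustration, and competitions with more than two classes use the multi-class generalisation instead.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 1]
# Both sets of predictions get 3 of 4 right at a 0.5 threshold.
hedged = [0.6, 0.4, 0.6, 0.4]          # modest confidence throughout
confident = [0.99, 0.01, 0.99, 0.01]   # very sure, including on the miss

if __name__ == "__main__":
    print("hedged:   ", log_loss(labels, hedged))
    print("confident:", log_loss(labels, confident))
```

Both sets of predictions have the same accuracy, yet the confident set ends up with the larger Log Loss because its single mistake was made with near certainty. That sensitivity to confidence is something plain accuracy does not capture.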

2015-12-09

## 2015 Data Science Salary Survey

2015-12-04

The recently published 2015 Data Science Salary Survey conducted by O’Reilly takes a look at the salaries received, tools used and other interesting facts about Data Scientists around the world. It’s based on a survey of over 600 respondents from a variety of industries. The entire report is well worth a read, but I’ve picked out some highlights below. The majority (67%) of the respondents in the survey were from the United States. Read more »

2015-11-23