
## Clustering Lightning Discharges to Identify Storms

A short talk that I gave at the LIGHTS 2013 Conference (Johannesburg, 12 September 2013). The slides are relatively devoid of text because I like the audience to hear the content rather than read it. The central message of the presentation is that clustering lightning discharges into storms is not a trivial task, but still a worthwhile challenge because it can lead to some very interesting science!
Read more »

## Clustering the Words of William Shakespeare

In my previous post I used the tm package to do some simple text mining on the Complete Works of William Shakespeare. Today I am taking some of those results and using them to generate word clusters.
Preparing the Data I will start with the Term Document Matrix (TDM) consisting of 71 words commonly used by Shakespeare.
```r
> inspect(TDM.common[1:10, 1:10])
A term-document matrix (10 terms, 10 documents)

Non-/sparse entries: 94/6
Sparsity           : 6%
Maximal term length: 6
Weighting          : term frequency (tf)

        Docs
Terms     1  2  3  4  5  6  7  8  9 10
  act     1  4  7  9  6  3  2 14  1  0
  art    53  0  9  3  5  3  2 17  0  6
  away   18  5  8  4  2 10  5 13  1  7
  call   17  1  4  2  2  1  6 17  3  7
  can    44  8 12  5 10  6 10 24  1  5
  come   19  9 16 17 12 15 14 89  9 15
  day    43  2  2  4  1  5  3 17  2  3
  enter   0  7 12 11 10 10 14 87  4  6
  exeunt  0  3  8  8  5  4  7 49  1  4
  exit    0  6  8  5  6  5  3 31  3  2
```
This matrix is first converted from a sparse data format into a conventional matrix.
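The clustering itself needs nothing beyond base R once the matrix is dense. Here is a minimal sketch using a small toy term matrix as a stand-in for the real `TDM.common` (the toy data are an assumption; only a corner of the real matrix is shown above):

```r
# Toy stand-in for the dense term-document matrix (terms x documents).
M <- matrix(c( 1, 4,  7,  9,
              53, 0,  9,  3,
              18, 5,  8,  4,
               0, 7, 12, 11),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("act", "art", "away", "enter"),
                            paste0("doc", 1:4)))

d <- dist(M)                           # Euclidean distances between terms
hc <- hclust(d, method = "ward.D2")    # agglomerative clustering
groups <- cutree(hc, k = 2)            # cut the dendrogram into 2 clusters
```

Calling `plot(hc)` then draws the dendrogram of word clusters.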
Read more »

## MetaTrader Time Zones

Time zones on MetaTrader can be slightly confusing. There are two important time zones: the time zone of the broker’s server and your local time zone. And these need not be the same.
Read more »

## Text Mining the Complete Works of William Shakespeare

I am starting a new project that will require some serious text mining. So, in the interests of bringing myself up to speed on the tm package, I thought I would apply it to the Complete Works of William Shakespeare and just see what falls out.
The first order of business was getting my hands on all that text. Fortunately it is available from a number of sources. I chose to use Project Gutenberg.
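Before reaching for the tm package, the basic idea can be sketched in base R: split the text into words and tabulate frequencies. The sample line below stands in for the full Gutenberg text, which is an assumption for illustration:

```r
# A minimal word-frequency count; tm wraps steps like these in its
# Corpus/TermDocumentMatrix machinery.
text <- "To be or not to be that is the question"
words <- strsplit(tolower(text), "\\s+")[[1]]
freq <- sort(table(words), decreasing = TRUE)
head(freq)
```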
Read more »

## What can be learned from 5 million books

This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. The associated article is also well worth checking out: Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182.
Read more »

## Presenting Conformance Statistics

A client came to me with some conformance data. She was having a hard time making sense of it in a spreadsheet. I had a look at a couple of ways of presenting it that would bring out the important points.
The Data The data came as a spreadsheet with multiple sheets. Each of the sheets had a slightly different format, so the easiest thing to do was to save each one as a CSV file and then import them individually into R.
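Importing a directory of CSV files can be sketched as below (the function name and the assumption of identical columns are mine; the real sheets differed slightly, so a production script would reconcile column names first):

```r
# Read every CSV in a directory and stack them into one data frame.
read_all_csv <- function(path) {
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, read.csv))
}
```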
Read more »

## The Wonders of foreach

Writing code from scratch to do parallel computations can be rather tricky. However, the packages providing parallel facilities in R make it remarkably easy. One such package is foreach. I am going to document my trail of discovery with foreach, which began some time ago, but has really come to fruition over the last few weeks.
First we need a reproducible example. Preferably something which is numerically intensive.
> max.
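The snippet above is truncated, so the following is only a plausible reconstruction of a numerically intensive example (the function name and details are my assumption, not necessarily the post's actual code): finding the largest eigenvalue of random matrices.

```r
set.seed(42)
# Largest absolute eigenvalue of a random n x n matrix: cheap to
# state, expensive to repeat, so a good candidate for foreach.
max_eig <- function(n) {
  max(abs(eigen(matrix(rnorm(n * n), nrow = n),
                only.values = TRUE)$values))
}

# Serial version; with foreach and a registered backend this becomes
#   foreach(i = 1:50, .combine = c) %dopar% max_eig(100)
results <- sapply(1:50, function(i) max_eig(100))
```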
Read more »

## Fitting a Model by Maximum Likelihood

Maximum-Likelihood Estimation (MLE) is a statistical technique for estimating model parameters. It basically sets out to answer the question: what model parameters are most likely to characterise a given set of data? First you need to select a model for the data. And the model must have one or more (unknown) parameters. As the name implies, MLE proceeds to maximise a likelihood function, which in turn maximises the agreement between the model and the data.
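The recipe above can be sketched with `optim()` and simulated data (a normal model here, as an illustration; parameterising the standard deviation on the log scale keeps it positive during the search):

```r
set.seed(1)
# Simulated data from a normal model with unknown mean and sd.
x <- rnorm(1000, mean = 5, sd = 2)

# Negative log-likelihood; par = c(mean, log(sd)).
nll <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))

# Maximising the likelihood = minimising the negative log-likelihood.
fit <- optim(c(0, 0), nll)
est <- c(mean = fit$par[1], sd = exp(fit$par[2]))
```

The estimates land close to the true values of 5 and 2.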
Read more »

## Finding Correlations in Data with Uncertainty: Classical Solution

Following up on my previous post as a result of an excellent suggestion from Andrej Spiess. The data are indeed very heteroscedastic! Andrej suggested that an alternative way to attack this problem would be to use weighted correlation with weights being the inverse of the measurement variance.
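Base R already has `cov.wt()`, which computes a weighted correlation directly; a sketch with synthetic data and made-up per-point variances (both assumptions for illustration):

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
v <- runif(n, 0.5, 2)   # hypothetical measurement variances

# Weight each point by the inverse of its measurement variance.
w <- (1 / v) / sum(1 / v)
r_w <- cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor[1, 2]
```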
Read more »

## Finding Correlations in Data with Uncertainty: Bootstrap Solution

A week or so ago a colleague of mine asked if I knew how to calculate correlations for data with uncertainties. Now, if we are going to be honest, then all data should have some level of experimental or measurement error. However, I suspect that in the majority of cases these uncertainties are ignored when considering correlations. To what degree are uncertainties important? A moment’s thought would suggest that if the uncertainties are large enough then they should have a rather significant effect on correlation, or more properly, the uncertainty measure associated with the correlation.
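One bootstrap approach (a sketch of the general idea, not necessarily the post's exact procedure) is to resample the points and jitter each one according to its quoted uncertainty, then look at the resulting distribution of correlations:

```r
set.seed(13)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n, sd = 0.5)
sx <- rep(0.2, n)   # hypothetical standard errors on x
sy <- rep(0.2, n)   # hypothetical standard errors on y

# Each replicate resamples the points and perturbs them by their errors.
boot_cor <- replicate(2000, {
  i <- sample(n, replace = TRUE)
  cor(x[i] + rnorm(n, sd = sx[i]), y[i] + rnorm(n, sd = sy[i]))
})
quantile(boot_cor, c(0.025, 0.5, 0.975))
```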
Read more »

## Finding Your MetaTrader Log Files

Debugging an indicator or expert advisor (EA) can be a tricky business, especially when you are doing the debugging remotely. So I write my MQL code to include copious amounts of debugging information to log files. The contents of these log files can be used to diagnose any problems. This article tells you where you can find those files.
Testing Logs When you are running an EA under the strategy tester, the log files are written to the tester\logs directory (see the red rectangle in the directory tree above).
Read more »

## A Chart of Recent Comrades Marathon Winners

Continuing on my quest to document the Comrades Marathon results, today I have put together a chart showing the winners of both the men’s and ladies’ races since 1980. Click on the image below to see a larger version.
The analysis started off with the same data set that I was working with before, from which I extracted only the records for the winners.
> winners = subset(results, gender.position == 1, select = c(year, name, gender, race.
Read more »

## Modelling the Age of the Oldest Person You Know

The blog post How old is the oldest person you know? by Arthur Charpentier was inspired by Prudential’s stickers campaign which asks you to record the age of the oldest person you know by placing a blue sticker on a number line. The result is a histogram of ages. The original experiment was carried out using 400 real stickers in a park in Austin.
Read more »

## Comrades Marathon Inference Trees

Following up on my previous posts regarding the results of the Comrades Marathon, I was planning on putting together a set of models which would predict likelihood to finish and probable finishing time. Along the way I got distracted by something else that is just as interesting and which produces results which readily yield to qualitative interpretation: Conditional Inference Trees as implemented in the R package party.
Just to recall what the data look like:
Read more »

## Optimising a Noisy Objective Function

I am busy with a project where I need to calibrate the Heston Model to some Asian options data. The model has been implemented as a function which executes a Monte Carlo (MC) simulation. As a result, the objective function is rather noisy. There are a number of algorithms for dealing with this sort of problem, and here I simply give a brief overview of some of them.
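The simplest of those remedies is averaging: repeat the noisy evaluation several times per call so the optimiser sees a smoother surface. A toy sketch (the quadratic-plus-noise objective stands in for the real MC simulation, which is an assumption):

```r
set.seed(99)
# A noisy proxy for an MC-based objective: quadratic plus noise.
noisy_obj <- function(p) (p - 3)^2 + rnorm(1, sd = 0.5)

# Blunt remedy: average several evaluations per call, shrinking the
# noise standard deviation by sqrt(reps).
smoothed <- function(p, reps = 100) mean(replicate(reps, noisy_obj(p)))

fit <- optimize(smoothed, interval = c(-10, 10))
```

The recovered minimum sits close to the true value of 3, at the cost of many more function evaluations.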
Read more »

## Tutorial: Compiling Indicators and Expert Advisors from Source

When you receive the code for an expert advisor or indicator which we have developed for you, it will come in a package consisting of include files (with a .mqh extension) and source code files (with a .mq4 extension). So, what do you do with them?
Read more »

## Are Green Number Runners More Likely to Bail?

Comrades Marathon runners are awarded a permanent green race number once they have completed 10 journeys between Durban and Pietermaritzburg. For many runners, once they have completed the race a few times, achieving a green number becomes a possibility. And once the idea takes hold, it can become something of a compulsion. I can testify to this: I am thoroughly compelled! For runners with this goal in mind, every finish is one step closer to a green number.
Read more »

## The Green Number Effect

Following up on a suggestion from my previous post, here are the statistics for medal count versus age. Every point on the plot is the number (see colour legend on right) of athletes who have achieved a given number of medals by a particular age.
Read more »

## Age Distribution of Comrades Marathon Athletes

I can clearly remember watching the end of the 1989 Comrades Marathon on television and seeing Wally Hayward coming in just before the final gun, completing the epic race at the age of 80! I was in awe.
Since I have been delving into the Comrades Marathon data, this got me thinking about the typical age distribution of athletes taking part. The plot below indicates the ages of athletes who finished the race, going all the way back to 1984. You can clearly spot the two years when Wally Hayward ran (1988 and 1989). My data indicates that he was only 79 on the day of the 1989 Comrades Marathon, but I am not going to quibble over a year and I am more than happy to accept that he was 80!
Read more »

## Kagi Chart Indicator

In addition to a range of data analysis services, Exegetic Analytics also implements algorithms for automated FOREX trading. I am currently developing an expert advisor (EA) for a client. The strategy was developed on the ProRealTime charting software using Kagi Charts. My client wants to automate the strategy and implement it in MQL on the MetaTrader platform. One snag: Kagi Charts are independent of time. Or, more accurately, they do not have a uniform time axis. Charts in MetaTrader are of the classical variety with a nice linear time axis. So my first problem was to implement something analogous to the Kagi Chart under MetaTrader.
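The core of a Kagi construction can be sketched in a few lines of R (an illustration of the reversal logic only, not the client's MQL implementation; the reversal threshold and starting direction are assumptions): track the running extreme and record a swing point whenever price retraces from it by at least the reversal amount.

```r
# Return the sequence of Kagi swing points for a price series.
kagi_points <- function(price, reversal) {
  pts <- numeric(0)
  extreme <- price[1]
  dir <- 1                      # assume an initial up-swing
  for (p in price[-1]) {
    if (dir == 1) {
      if (p > extreme) extreme <- p                # new high extends the line
      else if (extreme - p >= reversal) {          # retrace => reverse down
        pts <- c(pts, extreme); dir <- -1; extreme <- p
      }
    } else {
      if (p < extreme) extreme <- p                # new low extends the line
      else if (p - extreme >= reversal) {          # rally => reverse up
        pts <- c(pts, extreme); dir <- 1; extreme <- p
      }
    }
  }
  c(pts, extreme)
}
```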
Read more »

## Medal Allocations at the Comrades Marathon

## Comrades Marathon Attrition Rate

It is a bit of a mission to get the complete data set for this year’s Comrades Marathon. The full results are easily accessible, but come as an HTML file. Embedded in this file are links to the splits for individual athletes. So with a bit of scripting wizardry it is also possible to download the HTML files for each of the individual athletes. Parsing all of these yields the complete result set, which is the starting point for this analysis.
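The link-extraction step can be done with base R regular expressions; the HTML fragment and URL pattern below are made up for illustration (the real results page markup is not shown here):

```r
# Pull athlete links out of a results page with base R alone.
html <- '<a href="athlete.php?id=101">Smith</a> <a href="athlete.php?id=102">Jones</a>'
links <- regmatches(html, gregexpr('athlete\\.php\\?id=[0-9]+', html))[[1]]
```

Each extracted link can then be downloaded and parsed in turn to assemble the full result set.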
Read more »

## Analysis of Cable Morning Trade Strategy

A couple of years ago I implemented an automated trading algorithm for a strategy called the “Cable Morning Trade”. The basis of the strategy is the range of GBPUSD during the interval 05:00 to 09:00 London time. Two buy stop orders are placed 5 points above the highest high for this period; two sell stop orders are placed 5 points below the lowest low. All orders have a protective stop at 40 points. When either the buy or sell orders are filled, the other orders are cancelled. Of the filled orders, one exits at a profit equal to the stop loss, while the other is left to run until the close of the London session.
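The arithmetic for the order levels is straightforward; a sketch with illustrative prices (the session high and low below are invented, only the 5- and 40-point offsets come from the strategy description):

```r
point <- 0.0001                        # one point for GBPUSD
high <- 1.2745                         # illustrative 05:00-09:00 high
low  <- 1.2690                         # illustrative 05:00-09:00 low

buy_stop  <- high + 5 * point          # entries straddle the range
sell_stop <- low  - 5 * point
buy_sl    <- buy_stop  - 40 * point    # 40-point protective stops
sell_sl   <- sell_stop + 40 * point
buy_tp    <- buy_stop  + 40 * point    # first exit: profit equal to stop loss
```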
Read more »

## Package MatchIt: Balancing experimental data

A balanced experimental design is one in which the distribution of the covariates is the same in both the control and treatment groups. However, although achievable in an experimental scenario, for observational data this ideal is seldom attained. The MatchIt package provides a means of pre-processing data so that the treated and control groups are as similar as possible, minimising the dependence between the treatment variable and the other covariates.
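What MatchIt automates can be sketched by hand in base R: estimate propensity scores with a logistic regression, then pair each treated unit with the control whose score is nearest (a bare-bones sketch on synthetic data, matching with replacement; MatchIt's own methods are considerably more sophisticated):

```r
set.seed(2013)
n <- 200
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(x))   # treatment assignment depends on x

# Propensity scores from a logistic regression of treatment on x.
ps <- fitted(glm(treat ~ x, family = binomial))

# Nearest-neighbour match (with replacement) for each treated unit.
treated <- which(treat == 1)
control <- which(treat == 0)
match_idx <- sapply(treated, function(i) {
  control[which.min(abs(ps[control] - ps[i]))]
})
```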
Read more »

## xkcd Style Bubble Plot

A package was recently released to generate plots in the style of xkcd using R. Being a big fan of the cartoon, I could not resist trying it out. So I set out to produce something like one of Hans Rosling’s bubble plots.
Read more »

## Swing Alert Indicator

I’ve just finished coding a swing alert indicator for a client. The rules are rather straightforward and it all depends on two simple moving averages (by default with periods of 25 and 5). The indicator generates alerts via
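The moving-average machinery behind such an indicator is easy to sketch in R (the indicator itself is MQL; this is only an illustration of the two-SMA crossover idea on synthetic prices):

```r
# Simple moving average via a one-sided filter.
sma <- function(x, n) stats::filter(x, rep(1 / n, n), sides = 1)

set.seed(5)
price <- cumsum(rnorm(200)) + 100   # synthetic price series
fast <- sma(price, 5)
slow <- sma(price, 25)

# Bars where the fast average crosses above the slow: a typical alert.
cross_up <- which(diff(fast > slow) == 1) + 1
```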
Read more »

## Package party: Conditional Inference Trees

I am going to be using the party package for one of my projects, so I spent some time today familiarising myself with it. The details of the package are described in Hothorn, T., Hornik, K., & Zeileis, A. (1999). “party: A Laboratory for Recursive Partytioning” which is available from CRAN.
Read more »

## Plotting categorical variables

In the previous installment we generated a few plots using numerical data straight out of the National Health and Nutrition Examination Survey. This time we are going to incorporate some of the categorical variables into the plots. Although going from raw numerical data to categorical data bins (like we did for age and BMI) does give you less precision, it can make drawing conclusions from plots a lot easier.
We will start off with a simple plot of two numerical variables: age against BMI.
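A plot along those lines, coloured by a categorical variable, can be sketched with base graphics alone (the data below are synthetic stand-ins; the real values come from the NHANES survey):

```r
set.seed(8)
age <- sample(18:80, 300, replace = TRUE)
bmi <- rnorm(300, mean = 26, sd = 4)
sex <- factor(sample(c("Male", "Female"), 300, replace = TRUE))

# Colour each point by the factor level; base graphics maps factor
# levels onto the palette automatically.
plot(age, bmi, col = sex, pch = 19, xlab = "Age", ylab = "BMI")
legend("topright", legend = levels(sex), col = seq_along(levels(sex)), pch = 19)
```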
Read more »

## Plotting numerical variables

In the previous installment we generated some simple descriptive statistics for the National Health and Nutrition Examination Survey data. Now we are going to move on to an area in which R really excels: making plots and visualisations. There are a variety of systems for plotting in R, but we will start off with base graphics.
Read more »

## Descriptive Statistics