Hadoop moves forward. Amazon marches with it. And so must RHIPE. This blog post is about RHIPE working with the new Amazon Elastic MapReduce images. The R code found here starts an Elastic MapReduce cluster running R 3.1 and RHIPE. This should start your cluster in roughly 6-7 minutes. The command is

 emrMakeCluster(name=sprintf("%s's EMR Cluster", Sys.getenv("USER")), instances=3, loguri, keyname, bootstrap=NULL)

which starts a cluster with a given name, a number of instances, a URI at which to keep the logs, your AWS key name, and URLs of any bootstrap scripts on S3. This is not ready for you, the eager reader, to copy, paste, and run. Please edit the source code to change the locations of s3n://rhipeemr/kickstartrhipe.sh and s3://rhipeemr/final.step.sh. Those files can be found here and here. You'll need to download them, store them on S3, and update the R code. This function returns a job ID, say J, which you can pass to emrDescribeCluster or emrWaitForStart (which will wait for the cluster to start).
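A minimal sketch of the intended workflow, using the function names above; the log URI and key name shown here are placeholders, not values from the post:

```r
## Hypothetical values -- replace with your own S3 bucket and EC2 key pair
loguri  <- "s3://my-bucket/emr-logs/"   # assumption: an S3 location you own
keyname <- "my-ec2-keypair"             # assumption: your AWS key name

## Start a 3-node cluster and wait for it to come up
J <- emrMakeCluster(name = sprintf("%s's EMR Cluster", Sys.getenv("USER")),
                    instances = 3, loguri = loguri, keyname = keyname,
                    bootstrap = NULL)
emrWaitForStart(J)     # blocks until the cluster is running
emrDescribeCluster(J)  # inspect the cluster's state
```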

I promise you these scripts work! But you need to modify the source slightly. You also need 'aws' (a Python command line tool from Amazon) and a valid ~/.aws/{config,credentials} pair of files. The cluster also runs RStudio Server on port 8787 (username: metrics, password: metrics), but you'll need an SSH SOCKS proxy running (see aws socks help). I'm sure you have questions; post them to the RHIPE Google group.
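For reference, a minimal pair of AWS CLI config files might look like the following; the region and the placeholder keys are assumptions, so substitute your own credentials:

```
# ~/.aws/config
[default]
region = us-east-1

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

For reaching RStudio, the usual approach is an SSH SOCKS tunnel along the lines of `ssh -i your-key.pem -N -D 8157 hadoop@<master-public-dns>` (the local port here is an arbitrary choice; `hadoop` is the default EMR login user), with your browser configured to use the SOCKS proxy.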

Meanwhile, in the cold, cold, cold, empty streets of Washington DC.