This is an update of the second Big Data Lab for the cloud. As in previous versions, this document describes how you can create a Big Data Lab in the cloud on Amazon EMR.
Besides major upgrades to the newest Amazon Hadoop AMI (3.6.0), Spark (1.3.0), and R, it now also lets you use Python in the browser, with the same functionality as in R: you can use Hadoop M/R, Spark, and SparkSQL in Python from the browser. Like R, Python has gained attention among data scientists.
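To give a flavour of the SQL-style analysis that SparkSQL exposes to Python, here is a minimal sketch. Since a real SparkSQL session needs a running cluster, it uses Python's stdlib sqlite3 as a stand-in for the query engine; the table, columns, and data are made up for illustration.

```python
import sqlite3

# Stand-in for a SparkSQL query from a Python notebook: the same
# SELECT/GROUP BY pattern, run here against stdlib sqlite3 instead of a
# SparkSQL context (which would need a live cluster).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("alice", "big data"), ("bob", "spark"), ("alice", "hadoop")],
)

# In SparkSQL the equivalent would be a sqlContext.sql("SELECT ...") call.
rows = conn.execute(
    "SELECT user, COUNT(*) FROM tweets GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```

The point is that a data scientist in the browser writes plain SQL over the data; only the execution backend changes when the same query runs on the cluster.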
You can find the newest version here.
In future blog posts, I will write how you can use some of the open datasets on the Internet in the Big Data lab.
Recently, I presented the first version of the Big Data Lab in the cloud. I have now extended this version, keeping most of the features of the previous one while upgrading important software components. It still runs on Amazon EMR, but with the newest Amazon AMI (including Amazon Linux). It now features Hadoop 2.4, Spark 1.1.1, R 3, and, for the first time, SparkR, so you can do in-memory analytics in R by leveraging your whole Big Data cluster.
You can find the new version here.
Attention: It may not yet work in all regions, but it has been tested successfully in the Ireland region.
In future blog posts, I will show how to write R scripts that distribute machine learning computation in R libraries to different nodes in your Big Data cluster by leveraging Apache Spark in-memory analytics.
Recently, I published several example Java projects on github.com for using various NoSQL technologies:
- cassandra-tutorial : Apache Cassandra (column-oriented database)
- mongodb-tutorial : MongoDB (document database)
- neo4j-tutorial : Neo4j (graph database)
- redis-tutorial : Redis (key/value store)
- solr-tutorial : Apache SolrCloud (search technology)
Other example Java projects aim at standardized big data processing platforms:
- MapReduce: A simple Hadoop MapReduce job that counts the number of tweets in a text file on HDFS
- SparkStreaming: A simple Spark Streaming job that counts the number of tweets sent by a simple network server
- tweet-server : A simple server that sends tweets to any client connecting to port 1234 on localhost. It is a shell script using netcat.
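The tweet-counting MapReduce job above can be sketched in a few lines of Python in the style of Hadoop Streaming. This is an illustrative simulation of the map and reduce phases, not the repository's actual Java code; the function names and sample data are made up.

```python
def mapper(lines):
    """Map phase: emit a ("tweet", 1) pair for every non-empty line."""
    for line in lines:
        if line.strip():
            yield ("tweet", 1)

def reducer(pairs):
    """Reduce phase: sum the emitted counts per key."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

tweets = ["hello world", "", "big data is fun", "spark vs hadoop"]
print(reducer(mapper(tweets)))  # {'tweet': 3} -- counts the non-empty tweets
```

In the real job, Hadoop runs the mapper on HDFS blocks in parallel and shuffles the emitted pairs to the reducers; the per-phase logic is the same.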
You can use them in lectures or courses for teaching these topics. The projects use Gradle, so you can make sure that students always use the right libraries and thus avoid conflicts between dependencies. Each project is a basic skeleton for one technology. Students can easily extend them, and you avoid a long bootstrapping phase until everybody has created the right project with the right libraries for your environment. Unit tests can easily be added.
This first blog post is about creating your own Big Data lab in the Cloud using Amazon EMR. Follow my instructions here.
Within 15 minutes, these instructions allow you to do the following:
- Use the analytics language R in a browser to access the full functionality of Hadoop/Spark, Hive/Shark (data warehouse), Rhipe (MapReduce for R), and RMR (MapReduce for R)
- Leverage the unlimited data and computing power of the Amazon Elastic Map Reduce cloud
- Create reports about your analytics results that you can distribute in any format
- Data Scientists simply use their browser to work with the data
- They can develop new models based on your organization's data to enhance your business processes and applications, for example:
- Improved personalized advertisement
- Improved sales targeting
- Predictive Maintenance for your assets
- User preference learning
- Resilience: Detect disasters in your software systems before they happen