Big Data Lab in the Cloud with Hadoop+Spark+R+Python

This is an update of the second big data lab for the cloud. Similar to previous versions, this document describes how you can create a Big Data Lab in the cloud on Amazon EMR.

Besides major upgrades to the newest Amazon Hadoop AMI (3.6.0), Spark (1.3.0) and R, it now also lets you use Python in the browser. There, Python offers the same functionality as R: you can use Hadoop M/R, Spark and SparkSQL in Python from the browser. Like R, Python has gained attention among data scientists.

You can find the newest version here.

In future blog posts, I will describe how you can use some of the open datasets on the Internet in the Big Data Lab.

Example projects for using various NoSQL and Big Data technologies

Recently, I published several example Java projects for using various NoSQL technologies:

  • cassandra-tutorial : Apache Cassandra tutorial (Column-oriented database)
  • mongodb-tutorial : MongoDB tutorial (Document database)
  • neo4j-tutorial : Neo4j tutorial (Graph database)
  • redis-tutorial : Redis tutorial (Key/value store)
  • solr-tutorial : Apache SolrCloud (Search technology)

Other example Java projects aim at standardized big data processing platforms:

  • MapReduce : A simple Hadoop MapReduce job that counts the number of tweets in a text file on HDFS
  • SparkStreaming : A simple Spark Streaming job that counts the number of tweets sent by a simple network server
  • tweet-server : A simple server that sends tweets to any client connecting to port 1234 on localhost. It is a shell script using netcat.
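
To give an idea of how the tweet server and a counting client fit together, here is a minimal Python sketch. It is not the code from the projects: the original tweet-server is a shell script using netcat and the counting is done with Hadoop or Spark Streaming in Java. The names SAMPLE_TWEETS, TweetHandler, start_server and count_tweets are hypothetical, the sample tweets are made up, and the count is done in plain Python in a map/reduce style.

```python
import socket
import socketserver
import threading
from functools import reduce

# Hypothetical sample data; the original tweet-server uses its own input.
SAMPLE_TWEETS = [
    "first example tweet",
    "second example tweet",
    "third example tweet",
]


class TweetHandler(socketserver.BaseRequestHandler):
    """Send every sample tweet as one line; the server then closes the connection."""

    def handle(self):
        for tweet in SAMPLE_TWEETS:
            self.request.sendall((tweet + "\n").encode("utf-8"))


def start_server(host="127.0.0.1", port=1234):
    """Start the tweet server in a background thread and return it."""
    server = socketserver.TCPServer((host, port), TweetHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def count_tweets(host, port):
    """Connect, read lines until the server closes the connection, and count them."""
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("r", encoding="utf-8") as lines:
            # "Map" every non-empty line to 1, then "reduce" by summing.
            ones = map(lambda line: 1, filter(str.strip, lines))
            return reduce(lambda a, b: a + b, ones, 0)
```

Calling start_server() and then count_tweets("127.0.0.1", 1234) returns the number of lines the server sent. Passing port=0 to start_server lets the operating system pick a free port, which you can read from server.server_address.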

You can use them in lectures or courses to teach these topics. The projects use Gradle, so you can make sure that students always use the right libraries and thus avoid conflicts between dependent libraries. Each project is a basic skeleton for using one technology. Students can easily extend them, and you avoid a long bootstrapping phase until everybody has created the right project with the right libraries for your environment. Unit tests can easily be added.