Big Data Lab in the Cloud with Hadoop+Spark+R+Python

This is an update of the second Big Data Lab for the cloud. Like the previous versions, this document describes how you can create a Big Data Lab in the cloud on Amazon EMR.

Besides major upgrades to the newest Amazon Hadoop AMI (3.6.0), Spark (1.3.0) and R, it now also lets you use Python in the browser. There, you have the same functionality as in R: you can use Hadoop M/R, Spark and SparkSQL in Python from the browser. Like R, Python has gained considerable attention among data scientists.

You can find the newest version here.

In future blog posts, I will describe how you can use some of the open datasets on the Internet in the Big Data Lab.

Enabling WebRTC in modern Java Enterprise Web Applications

I recently started a small project to create a sample enterprise Big Data web application using Spring.

You can find the source code here and a demonstration here.

One feature of this application is WebRTC. I have been working with WebRTC since its introduction around 2011/2012. It has since become a W3C standard and has been implemented in nearly all popular browsers, such as Mozilla Firefox, Google Chrome or Opera. Basically, it offers secure video/voice chat, screen sharing and peer to peer data exchange in your browser. If you want a simple online demonstration of WebRTC in general, you can try it out here.

All major browsers support WebRTC on mobile devices as well as on desktop computers. Gateway software exists to connect a WebRTC client to SIP and thus to the “standard” phone network. STUN and TURN servers help you deal correctly with firewalls.

You do not need any additional plugins in your browser to enable all of this. You can compare the functionality with Skype – except that it is possible in web applications without plugins. Hence, it works as well on smartphones and tablets, where you usually cannot install plugins for your browser.

WebRTC in Enterprise Applications

Communication between people is certainly an important aspect of enterprise web applications. Hence, the WebRTC standard is interesting and relevant for them. Although WebRTC is at its core a peer to peer solution, the developer of an enterprise solution needs to provide a “signaling channel”. This channel lets the people participating in a WebRTC exchange, such as a video/voice chat, find each other, and it lets their browsers exchange information on how to connect to each other, either directly or via a gateway.

Basically, this signaling channel needs to transmit JSON objects:

  1. Between all users in a conversation so they can contact each other directly
  2. Between two users so they can have a secure connection to each other.

It should be noted that point 2 is also needed in a group chat, because a peer to peer connection is always established between two users. This means in a group chat consisting of three users, “user 1” has a peer to peer connection to “user 2” and another one to “user 3”. Additionally there is one between “user 2” and “user 3”. This is illustrated in the following figure.

[Figure: WebRTC peer to peer connections in a group chat of three users]

The signaling channel does not transmit any video, voice or other data. It only serves to establish and maintain the direct connection between two peers.
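The pairwise nature of these connections can be made concrete with a minimal sketch. It only enumerates which pairs of users need a peer to peer connection in a group chat; the function name `mesh_connections` is a hypothetical helper for illustration and not part of the example application or of any WebRTC API.

```python
from itertools import combinations


def mesh_connections(users):
    """Return every unordered pair of users.

    Each pair corresponds to one peer to peer connection that has to be
    negotiated over the signaling channel.
    """
    return list(combinations(users, 2))


# The three-user group chat described above.
for first, second in mesh_connections(["user 1", "user 2", "user 3"]):
    print(f"{first} <-> {second}")
```

For n users this yields n * (n - 1) / 2 connections, so the number of signaling exchanges grows quadratically with the size of the group.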

Implementing a WebRTC signaling channel in a Web Enterprise Application

A signaling channel for an enterprise application has to deliver messages securely, scalably and reliably via message-oriented middleware, without requiring any additional plugins in the web browser. Basically, you can implement such a channel as follows:

  1. The web application sends signaling messages to the backend using the WebSocket protocol, or a fallback for older browsers (SockJS).
  2. The Streaming Text Oriented Messaging Protocol (STOMP) is used to send signaling messages to a topic and to the private queues of the users within a message-oriented middleware connected to the web application backend. This ensures that messages are delivered properly.
  3. The backend is connected to a message-oriented middleware, such as RabbitMQ or JBoss HornetQ, or to any JMS-capable middleware via the Kaazing WebSocket Gateway. This can be configured flexibly in the example application, because it uses the Spring Messaging interface.
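To give an idea of what travels over this channel, the following is a minimal sketch of a signaling message wrapped in a STOMP SEND frame (a command line, headers, a blank line, the body and a terminating NUL byte). The destination `/topic/conversation.42`, the JSON fields `type`, `from`, `to` and `sdp`, and the helper names are assumptions for illustration only; they are not taken from the example application. Header escaping is omitted for brevity.

```python
import json

NULL = "\x00"  # every STOMP frame is terminated by a NUL byte


def build_send_frame(destination, payload):
    """Wrap a JSON-serializable payload in a STOMP SEND frame."""
    body = json.dumps(payload)
    headers = {
        "destination": destination,
        "content-type": "application/json",
        "content-length": str(len(body.encode("utf-8"))),
    }
    header_lines = "".join(f"{key}:{value}\n" for key, value in headers.items())
    # Command line, headers, blank line, body, NUL terminator.
    return f"SEND\n{header_lines}\n{body}{NULL}"


def parse_frame(frame):
    """Split a frame into its command, header dict and decoded JSON body."""
    head, _, body = frame.rstrip(NULL).partition("\n\n")
    command, *header_lines = head.split("\n")
    headers = dict(line.split(":", 1) for line in header_lines)
    return command, headers, json.loads(body)


# Hypothetical offer from "user 1" to "user 2"; the SDP text is a placeholder.
offer = {"type": "offer", "from": "user 1", "to": "user 2", "sdp": "<SDP offer>"}
frame = build_send_frame("/topic/conversation.42", offer)
print(parse_frame(frame))
```

In a browser, a STOMP client library would normally build these frames for you; the sketch only shows that the signaling payload is plain JSON routed by its destination header.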

Those technologies have been integrated in the example enterprise web application.

WebRTC: Next Generation Communication

WebRTC has other exciting use cases, such as E-Learning, E-Health, Sales Support, Customer-Relationship Management (CRM), CoBrowsing or becoming the default protocol for the Internet of Things to link people and things. Its adoption keeps growing: a lot of startups have emerged recently, and big companies are starting to support WebRTC in their communication software.

Update: Next Generation Big Data Lab V2 in the Cloud

Recently, I presented the first version of the Big Data Lab in the cloud. I have now extended it, keeping most of the features of the previous version while upgrading important software components. It still runs on Amazon EMR, but with the newest Amazon AMI (including Amazon Linux). It now features Hadoop 2.4, Spark 1.1.1, R 3 and, for the first time, SparkR, so you can do in-memory analytics in R by leveraging your whole Big Data cluster.

You can find the new version here.

Attention: It may not yet work in all AWS regions, but it has been tested successfully in Ireland.

In future blog posts, I will show how to write R scripts that distribute machine learning computation in R libraries to different nodes in your Big Data cluster by leveraging Apache Spark in-memory analytics.

Example projects for using various NoSQL and Big Data technologies

Recently, I published several example Java projects on github.com for using various NoSQL technologies:

  • cassandra-tutorial : Apache Cassandra tutorial (Column-oriented database)
  • mongodb-tutorial : MongoDB tutorial (Document database)
  • neo4j-tutorial : Neo4j (Graph Database)
  • redis-tutorial : Redis (Key/Value Store)
  • solr-tutorial : Apache SolrCloud (Search technology)

Other example Java projects aim at standardized big data processing platforms:

  • MapReduce: A simple Hadoop MapReduce job to count the number of tweets in a text file on HDFS
  • SparkStreaming: A simple Spark Streaming job to count the number of tweets sent by a simple network server
  • tweet-server : A simple server for sending tweets to any client connecting to port 1234 on localhost. It is a shell script using netcat.
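The idea behind the tweet-counting job can be sketched in a few lines. This is not the code of the Java projects above; it is a plain Python illustration of the map and reduce phases, under the assumption that the input contains one tweet per non-empty line. The function names are hypothetical.

```python
def map_tweet(line):
    """Map phase: emit ("tweets", 1) for every non-empty line."""
    return [("tweets", 1)] if line.strip() else []


def reduce_counts(pairs):
    """Reduce phase: sum the emitted values per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


def count_tweets(lines):
    """Run both phases locally over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_tweet(line)]
    return reduce_counts(pairs).get("tweets", 0)


sample = ["first tweet", "", "second tweet", "third tweet"]
print(count_tweets(sample))  # prints 3
```

On a cluster, the map calls run in parallel on different blocks of the file and the framework groups the emitted pairs by key before the reduce step; the local version only mirrors that data flow.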

You can use them in lectures or courses for teaching these topics. The projects use Gradle, so you can make sure that the students always use the right libraries and thus avoid conflicts between dependent libraries. Each project is a basic skeleton for using one piece of software. Students can easily extend them, and you avoid a long bootstrapping phase until everybody has created the right project with the right libraries for your environment. Unit tests can easily be added.

Creating a Big Data lab in the Cloud using Amazon EMR

This first blog post is about creating your own Big Data lab in the Cloud using Amazon EMR. Follow my instructions here.

By following these instructions, you get the following within 15 minutes:

  • You can use the analytics language R in a browser to access the full functionality of Hadoop/Spark, Hive/Shark (data warehouse), Rhipe (MapReduce for R) and RMR (MapReduce for R)
  • You can leverage the scalable data storage and computing power of the Amazon Elastic MapReduce cloud
  • Create reports about your analytics results that you can distribute in any format
  • Data Scientists simply use their browser to work with the data
  • They can come up with new models based on your data in the organization to enhance your business processes and applications
    • Improved personalized advertisement
    • Improved sales targeting
    • Predictive Maintenance for your assets
    • User preference learning
    • Gamification
    • Resilience: Detect disasters in your software systems before they happen