HadoopCryptoLedger library a vision for the coming Years

The first commit of the HadoopCryptoLedger has been on 26th March of 2016. Since then a lot of new functionality has been added, such as support for major Big Data platforms including Hive / Flink / Spark. Furthermore, besides Bitcoin, Altcoins based on Bitcoin (e.g. Namecoin, Litecoin or Bitcoin Cash) and Ethereum (including Altcoins) have been implemented for analytics.

Since the library integrates seamlessly with Big Data platforms you can join blockchain data with any other data you may have, such as currency exchange rates from various platforms.

Blockchain analytics is getting more and more attention by industry, policy makers and research. This is not surprising, because one of the key element is that blockchains should be transparent for everyone – even for the normal citizen.

Given that background I foresee two major directions for 2018 and the following years:

  • Streaming: Streaming has become a hot topic in Big Data platforms, virtually all Big Data platforms, such as Apache Flink or Apache Spark, move towards streaming has the default way to process streaming and non-streaming data in general. The idea here is to stream blockchain data directly from blockchain networks, such as Bitocin and Ethereum, into your Big Data platform for direct analysis. This would also offer the possibility of some further interesting analytics, such as how many bad blocks/transactions are spammed into the network, when did forks happen, how many forks/subnetworks are established, what is the percentage of nodes piggybacking on the network (cf. merged mining for Bitcoin) and many other interesting data based on the blockchain network metadata.

  • Business & Conceptual Aspects of Blockchain Analytics: Surprisingly one finds very little research and investigations on business and conceptual aspects of blockhain (cf. here), especially analytics. Most of them describe only technical concepts of implementing block chain technology (see here). The idea here is to establish some basic framework, such as interesting metrics and how to efficiently calculate them, finding interesting patterns using machine learning algorithm or to derive them by joining other datasets (e.g. currency exchange rates). Another aspect is security and validity of analysis results. Of course this theoretical/conceptual work needs to be validated with practical investigations using the HadoopCryptoLedger library.

Some other topics supporting the aforementioned two topics are:

  • Contract Analytics: Virtually all blockchain technologies allow more or less powerful definition of contracts. The goal here is to find out 1) how can express contracts formally and find flaws in their definition 2) find evidence for these flaws actually been exploited/abused in the blockchain data. Furthermore, this will also enable linking contract data with other datasets.

  • Cloud Deployment: We want to create a cloud deployment in docker container format that is open to everyone, so everyone can deploy the analytics chain including download of the blockchain data within their preferred cloud solution. Of course, we would use this also to do more advanced integration tests of our analytic solution and showcase some of the aforementioned business analytics concepts.

  • Quality Assurance: Also 2018 will be characterized by lifting up quality assurance – increasing unit test coverage is a key element. This also includes getting rid of legacy stuff, such as supporting already outdated platform versions.

  • More Currencies: Although we support already a wide range of currencies by offering support for Bitcoin & Altcoins (Namecoin, Litecoin, Bitcoin Cash and many more) as well as Ethereum and Altcoins (Ethereum Classic etc.), there are further interesting blockchain concepts based payment networks/practical byzantine fault tolerance, proof-of-burn and direct acyclic graph based blockchains that are worth to analyse.

  • New research: QuantumChains (not to be confused with Quantum Money). QuantumChains is a rather new concept to explore quantum computing for representing blockchains. The advanced would be not only that those may get rid of some current issues with blockchains (proof of work, instant payment, large storage needs), but also make blockchains easier and faster to analyze for anyone – not only the biggest player with all the computing power. The How? may not be answered in 2018, but we hope to have some interesting conceptual Gedankenexperimente (though experiments) on how this could really work.

This is a pretty ambitious agenda for 2018, but it should be also seen that it will be further explored in the coming years.


Big Data Lab in the Cloud with Hadoop+Spark+R+Python

This is an update of the second big data lab for the cloud. Similar to previous versions, this document described how you can create a Big Data Lab in the cloud on Amazon EMR.

Besides some major upgrades to the newest Amazon Hadoop AMI (3.6.0) Spark (1.3.0) and R, it includes now also the possibility to use Python in the browser. There, you have the same functionality as in R. This means you can use Hadoop M/R, Spark and SparkSQL in Python from the browser. Similar to R, Python has gained attention by data scientists.

You can find the newest version here.

In future blog posts, I will write how you can use some of the open datasets on the Internet in the Big Data lab.

Update: Next Generation Big Data Lab V2 in the Cloud

Recently, I presented the first version of the Big Data Lab in the cloud. Now I extended this version and kept most of the features of the previous version. However, I provide upgrades for important software components. It still runs on Amazon EMR, but with the newest Amazon AMI (including Amazon Linux). It now features Hadoop 2.4, Spark 1.1.1, R 3 and for the first time SparkR, so you can do in-memory  analytics in R by leveraging your whole Big Data cluster.

You can find the new version here.

Attention: It may not yet work in all availability zones, but has been tested successfully in Ireland.

In future blog posts, I will show how to write R scripts that distribute machine learning computation in R libraries to different nodes in your Big Data cluster by leveraging Apache Spark in-memory analytics.

Creating a Big Data lab in the Cloud using Amazon EMR

This first blog post is about creating your own Big Data lab in the Cloud using Amazon EMR. Follow my instructions here.

These instructions allow you within 15 minutes the following:

  • You can use the analytics language R in a browser to access the full functionality of Hadoop/Spark, Hive/Shark (data warehouse), Rhipe (MapReduce for R), RMR (Map Reduce for R)
  • Leverage the unlimited data and computing power of the Amazon Elastic Map Reduce cloud
  • Create reports about your analytics results that you can distribute in any format
  • Data Scientists simply use their browser to work with the data
  • They can come up with new models based on your data in the organization to enhance your business processes and applications
    • Improved personalized advertisement
    • Improved sales targeting
    • Predictive Maintenance for your assets
    • User preference learning
    • Gamification
    • Resilience: Detect disasters in your software systems before they happen