Big Data Analytics on Bitcoin’s first Altcoin: Namecoin

This blog post is about analyzing the Namecoin blockchain using different Big Data technologies based on the HadoopCryptoLedger library. Currently, this library enables you to analyze the Bitcoin blockchain and Altcoins based on Bitcoin (incl. segregated witness), such as Namecoin, Litecoin and Zcash, on Big Data platforms such as Hadoop, Hive, Flink and Spark. Support for Ethereum is planned for the near future.

However, back to Namecoin and why it is interesting:

  • It was one of the first Altcoins based on Bitcoin
  • It supports a decentralized domain name and identity system based on blockchain technology – no central actor can censor it
  • It is one of the first systems that supports merged mining/AuxPOW. Merged mining enables normal Bitcoin mining pools to mine Altcoins (such as Namecoin) without any additional effort while mining normal Bitcoins. Hence, they can earn more revenue and at the same time support Altcoins, which would be much weaker or would not exist at all without the support of Bitcoin mining pools. It can be expected that many Altcoins based on Bitcoin will eventually switch to merged mining.

HadoopCryptoLedger has supported Altcoins based on Bitcoin from the beginning. Usually, Big Data applications based on the HadoopCryptoLedger library are expected to implement the analytics they are interested in themselves. However, sometimes we add specific functionality to make this much easier. For instance, we provide Hive UDFs to simplify certain analyses in Hive, and utility functions for MapReduce, Flink and Spark that make tasks requiring more detailed know-how accessible to beginners in blockchain technology.

We have also done this for Namecoin by providing functionality to:

  • determine the type of name operation (new, firstupdate, update)
  • get more information about available domains, subdomains, IP addresses, identities
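To give an idea of what determining the type of a name operation involves, here is a minimal Python sketch. It is not the HadoopCryptoLedger implementation: the function name and the dictionary are hypothetical. It relies on the fact that Namecoin reuses the opcodes OP_1, OP_2 and OP_3 as markers for name_new, name_firstupdate and name_update at the start of an output script. Looking only at the first byte is a simplification, because other scripts (for example bare multisig) can also start with OP_1, so a robust check would also have to validate the rest of the script.

```python
from typing import Optional

# Namecoin reuses the small-integer opcodes OP_1..OP_3 as name-operation markers.
OP_NAME_NEW = 0x51          # OP_1
OP_NAME_FIRSTUPDATE = 0x52  # OP_2
OP_NAME_UPDATE = 0x53       # OP_3

# Hypothetical lookup table from the leading opcode to the operation type.
NAME_OPERATIONS = {
    OP_NAME_NEW: "new",
    OP_NAME_FIRSTUPDATE: "firstupdate",
    OP_NAME_UPDATE: "update",
}


def name_operation_type(script: bytes) -> Optional[str]:
    """Return 'new', 'firstupdate' or 'update', or None if the script
    does not start with a name-operation opcode.

    Simplification: only the first byte of the script is inspected.
    """
    if not script:
        return None
    return NAME_OPERATIONS.get(script[0])
```

In an actual analysis you would apply such a function to the output script of every transaction and keep only the outputs for which it returns a value.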

With this functionality, analysis is rather easy, because Namecoin requires you to update the information at least every 35,999 blocks (roughly 200 to 250 days); otherwise the information is considered expired. You simply take the most recent information for a given domain or identity: it contains a full update of all information, and there is no need to merge it with information from previous transactions. However, sometimes additional information is provided in so-called delegate entries, and in this case you need to combine the information.
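The "take the most recent information" rule can be sketched in a few lines of Python. This is an illustration only and not code from the library: the NameRecord type and the latest_active_records function are hypothetical. The sketch assumes that a name counts as expired once more than 35,999 blocks have passed since its last update, and it does not resolve delegate entries.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Maximum number of blocks between two updates, taken from the text above.
EXPIRY_BLOCKS = 35_999


@dataclass(frozen=True)
class NameRecord:
    """Hypothetical, simplified view of one name operation."""
    name: str    # e.g. a domain or identity name
    height: int  # height of the block containing the operation
    value: str   # the full value stored with the operation


def latest_active_records(
    records: Iterable[NameRecord], current_height: int
) -> Dict[str, NameRecord]:
    """Keep the most recent record per name and drop expired names."""
    latest: Dict[str, NameRecord] = {}
    for record in records:
        seen = latest.get(record.name)
        if seen is None or record.height > seen.height:
            latest[record.name] = record
    # Assumption: a name is expired if its last update is more than
    # EXPIRY_BLOCKS blocks behind the current chain height.
    return {
        name: record
        for name, record in latest.items()
        if current_height - record.height <= EXPIRY_BLOCKS
    }
```

On a cluster the same idea maps to a group-by on the name followed by a maximum over the block height, which is why no merge with older transactions is needed.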

Finally, we provide additional support for reading blockchains based on Bitcoin and the merged mining/AuxPOW specification. You can enable it with a single configuration option.

Of course, we provide examples, e.g. for Spark and Hive. Nevertheless, all major Big Data platforms (including Flink) are supported.

Find here what we plan next, amongst others:

  • Support analytics on Ethereum
  • Run analytics based on HadoopCryptoLedger in the cloud (e.g. Amazon EMR and/or Microsoft Azure) and provide real time aggregations as a simple HTML page
  • … much more

Let us know via GitHub issues what you are interested in!


Leverage the Power of Apache Flink to analyze the Bitcoin Blockchain

The hadoopcryptoledger library has been enhanced with a datasource for Apache Flink. This means you can use the Big Data processing framework Apache Flink to analyze the Bitcoin Blockchain.

It also includes an example that counts the total number of transactions in the Bitcoin blockchain. Of course, given the power of Apache Flink, you can think about more complex analysis applications, such as:

  • Graph analysis on the Bitcoin transaction graph, e.g. to identify clusters or connected components to find out close interactions between Bitcoin addresses
  • Trace money flows through the Bitcoin network
  • Predict power of mining pools, difficulty of block processing, impact of changes on the Bitcoin protocol or rules
  • Join it with other data to make predictions on prices, criminal activity and economics
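As an illustration of the first idea, the following Python sketch finds connected components in a small address graph using a union-find structure. It is a plain in-memory sketch and not Flink code. The function name and the edge format (pairs of addresses that appear together in a transaction) are assumptions made for this example. On real blockchain data you would use a distributed graph library instead.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def connected_components(
    edges: Iterable[Tuple[Hashable, Hashable]]
) -> List[Set[Hashable]]:
    """Group nodes (e.g. addresses) that are linked by at least one path."""
    parent: Dict[Hashable, Hashable] = {}

    def find(node: Hashable) -> Hashable:
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    groups: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())
```

Each returned set is one cluster of addresses that interact directly or indirectly, which is a starting point for the clustering described in the list above.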

In the future, we want to work on the following things:

  • Support for other cryptoledgers, e.g. Ethereum
  • Provide examples for analyzing other currencies based on the Bitcoin Blockchain, such as Litecoin and Namecoin
  • A Flume data source to stream Bitcoin Blockchain data directly into your cluster
  • Support selected blockchains provided via the Hyperledger Framework


Big Data Lab in the Cloud with Hadoop+Spark+R+Python

This is an update of the second big data lab for the cloud. Similar to previous versions, this document describes how you can create a Big Data Lab in the cloud on Amazon EMR.

Besides some major upgrades to the newest Amazon Hadoop AMI (3.6.0), Spark (1.3.0) and R, it now also includes the possibility to use Python in the browser. There, you have the same functionality as in R. This means you can use Hadoop M/R, Spark and SparkSQL in Python from the browser. Similar to R, Python has gained attention among data scientists.

You can find the newest version here.

In future blog posts, I will describe how you can use some of the open datasets on the Internet in the Big Data lab.

Example projects for using various NoSQL and Big Data technologies

Recently, I published several example Java projects for using various NoSQL technologies:

  • cassandra-tutorial : Apache Cassandra tutorial (Column-oriented database)
  • mongodb-tutorial : MongoDB tutorial (Document database)
  • neo4j-tutorial : Neo4j (Graph Database)
  • redis-tutorial : Redis (Key/Value Store)
  • solr-tutorial : Apache SolrCloud (Search technology)

Other example Java projects aim at standardized big data processing platforms:

  • MapReduce: A simple Hadoop MapReduce job to count the number of tweets in a text file on HDFS
  • SparkStreaming: A simple Spark Streaming job to count the number of tweets sent by a simple network server
  • tweet-server : A simple server for sending tweets to any client connecting to port 1234 on localhost. It is a shell script using netcat.

You can use them in lectures or courses for teaching these topics. The projects use Gradle, so you can make sure that the students always use the right libraries and thus avoid conflicts between dependent libraries. Each project is a basic skeleton for using one piece of software. Students can easily extend them, and you can avoid a long bootstrapping phase until everybody has created the right project with the right libraries for your environment. Unit testing can easily be added.