HadoopCryptoLedger library: a vision for the coming years

The first commit to the HadoopCryptoLedger library was made on 26 March 2016. Since then, a lot of new functionality has been added, such as support for major Big Data platforms including Hive, Flink and Spark. Furthermore, besides Bitcoin, Altcoins based on Bitcoin (e.g. Namecoin, Litecoin or Bitcoin Cash) and Ethereum (including its Altcoins) are supported for analytics.

Since the library integrates seamlessly with Big Data platforms, you can join blockchain data with any other data you may have, such as currency exchange rates from various platforms.

Blockchain analytics is getting more and more attention from industry, policy makers and researchers. This is not surprising: one of the key ideas is that blockchains should be transparent for everyone – even for the ordinary citizen.

Given that background I foresee two major directions for 2018 and the following years:

  • Streaming: Streaming has become a hot topic in Big Data; virtually all Big Data platforms, such as Apache Flink or Apache Spark, are moving towards streaming as the default way to process both streaming and non-streaming data. The idea here is to stream blockchain data directly from blockchain networks, such as Bitcoin and Ethereum, into your Big Data platform for direct analysis. This would also enable further interesting analytics, such as how many bad blocks/transactions are spammed into the network, when forks happened, how many forks/subnetworks are established, what percentage of nodes piggybacks on the network (cf. merged mining for Bitcoin) and many other insights based on blockchain network metadata.

  • Business & Conceptual Aspects of Blockchain Analytics: Surprisingly, one finds very little research on the business and conceptual aspects of blockchains, especially analytics; most publications describe only technical concepts of implementing blockchain technology. The idea here is to establish a basic framework, such as interesting metrics and how to calculate them efficiently, finding interesting patterns using machine learning algorithms, or deriving them by joining other datasets (e.g. currency exchange rates). Another aspect is the security and validity of analysis results. Of course, this theoretical/conceptual work needs to be validated with practical investigations using the HadoopCryptoLedger library.

Some other topics supporting these two directions are:

  • Contract Analytics: Virtually all blockchain technologies allow the definition of more or less powerful contracts. The goal here is to find out 1) how contracts can be expressed formally and how flaws in their definition can be found, and 2) how to find evidence in the blockchain data that these flaws have actually been exploited/abused. Furthermore, this will also enable linking contract data with other datasets.

  • Cloud Deployment: We want to create a cloud deployment in Docker container format that is open to everyone, so that anyone can deploy the analytics chain, including the download of the blockchain data, within their preferred cloud solution. Of course, we would also use this for more advanced integration tests of our analytics solution and to showcase some of the aforementioned business analytics concepts.

  • Quality Assurance: 2018 will also be characterized by lifting quality assurance to the next level – increasing unit test coverage is a key element. This also includes getting rid of legacy code, such as support for already outdated platform versions.

  • More Currencies: Although we already support a wide range of currencies by offering support for Bitcoin & Altcoins (Namecoin, Litecoin, Bitcoin Cash and many more) as well as Ethereum and its Altcoins (Ethereum Classic etc.), there are further interesting blockchain concepts worth analyzing, such as payment networks, practical Byzantine fault tolerance, proof-of-burn and blockchains based on directed acyclic graphs.

  • New research: QuantumChains (not to be confused with Quantum Money). QuantumChains are a rather new concept that explores quantum computing for representing blockchains. The advantage would be not only that they may get rid of some current issues of blockchains (proof of work, instant payment, large storage needs), but also that they could make blockchains easier and faster to analyze for anyone – not only for the biggest players with all the computing power. The ‘how’ may not be answered in 2018, but we hope to have some interesting conceptual Gedankenexperimente (thought experiments) on how this could really work.

This is a pretty ambitious agenda for 2018, but it should also be seen as the starting point of work that will be explored further in the coming years.


Leverage the Power of Apache Flink to analyze the Bitcoin Blockchain

The hadoopcryptoledger library has been enhanced with a data source for Apache Flink. This means you can use the Big Data processing framework Apache Flink to analyze the Bitcoin Blockchain.

It also includes an example that counts the total number of transactions in the Bitcoin blockchain; a minimal sketch of such a job follows the list below. Of course, given the power of Apache Flink, you can think of more complex analysis applications, such as:

  • Graph analysis on the Bitcoin transaction graph, e.g. identifying clusters or connected components to uncover close interactions between Bitcoin addresses
  • Trace money flows through the Bitcoin network
  • Predict the power of mining pools, the difficulty of block processing, and the impact of changes to the Bitcoin protocol or rules
  • Join it with other data to make predictions on prices, criminal activity and economics
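
As a minimal sketch, the transaction-count job could look as follows in Scala, wired up via Flink's Hadoop compatibility layer. The class and package names (BitcoinBlockFileInputFormat, BitcoinBlock) follow the library's naming but should be verified against its wiki, and the input path is a placeholder:

import org.apache.flink.api.scala._
import org.apache.hadoop.io.BytesWritable
import org.zuinnote.hadoop.bitcoin.format.mapred.BitcoinBlockFileInputFormat
import org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlock

object FlinkBitcoinTransactionCounter {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Read raw Bitcoin blocks via the library's Hadoop input format (mapred flavour);
    // the HDFS path below is a placeholder for wherever the blockchain data is stored
    val blocks = env.readHadoopFile(new BitcoinBlockFileInputFormat(),
      classOf[BytesWritable], classOf[BitcoinBlock], "hdfs:///user/blockchain/bitcoin")
    // Count the transactions of each block, then sum over all blocks
    val totalTransactions = blocks.map(_._2.getTransactions.size.toLong).reduce(_ + _)
    totalTransactions.print()
  }
}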

In the future, we want to work on the following things:

  • Support for other cryptoledgers, e.g. Ethereum
  • Provide examples for analyzing other currencies based on the Bitcoin Blockchain, such as Litecoin and Namecoin
  • A Flume data source to stream Bitcoin Blockchain data directly into your cluster
  • Support selected blockchains provided via the Hyperledger Framework


Spark+Scala+GraphX: Analyzing the Bitcoin Transaction Graph

The hadoopcryptoledger library now provides an example of how to generate a Bitcoin transaction graph using the Big Data graph analysis technologies Spark, Scala and GraphX. Basically, it demonstrates how to read the Bitcoin Blockchain from HDFS and transform it into a graph with Bitcoin addresses as vertices and transactions between them as edges. The example returns the top 5 Bitcoin addresses with the most input transactions. A high number of input transactions could indicate that an address belongs to a mixing service that tries to obfuscate transactions between two addresses. The following figure shows an example graph of four vertices with transactions between them:

[Figure: transaction graph – four Bitcoin addresses as vertices with transactions between them as edges]

Of course, this is just one example. You can think of numerous other analyses of this graph using algorithms such as strongly connected components or PageRank, particularly if you connect the graph with other data that you collect related to the blockchain. You can also use the graph for visual analytics.
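
Returning to the top-5 example above, a minimal GraphX sketch could look as follows, assuming the (source address, destination address) pairs have already been extracted from the raw blocks and hashed to Long vertex ids as GraphX requires; the extraction step itself is elided:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// transfers: one (sourceAddressId, destinationAddressId) pair per payment relation;
// deriving these pairs from the raw blockchain data is elided here
def topInputAddresses(transfers: RDD[(VertexId, VertexId)]): Array[(VertexId, Int)] = {
  val edges: RDD[Edge[Int]] = transfers.map { case (src, dst) => Edge(src, dst, 1) }
  val graph = Graph.fromEdges(edges, defaultValue = 0)
  // The in-degree of a vertex is the number of incoming transactions of that address;
  // return the 5 addresses with the most input transactions
  graph.inDegrees.sortBy(_._2, ascending = false).take(5)
}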

In the coming weeks, further extensions are planned to be published:

  • Some common analytics pattern to analyze the Bitcoin economy

  • Some technical patterns, such as Bitcoin block validation

  • A Flume source for receiving new Bitcoin blocks including economic and technical consensus (storing and accessing them in the Hadoop ecosystem, e.g. in HBase)

  • Adding support for more crypto ledgers, such as Ethereum

Hive & Bitcoin: Analytics on Blockchain data with SQL

You can now analyze the Bitcoin Blockchain using Hive and the hadoopcryptoledger library with the new HiveSerde plugin.

Basically you can link any data that you loaded in Hive with Bitcoin Blockchain data. For example, you can link Blockchain data with important events in history to determine what causes Bitcoin exchange rates to increase or decrease.

The site provides several examples of how to use SQL in Hive to do calculations on Blockchain data, such as the following (a code sketch for these queries is given after the list):

  • Number of blocks in the blockchain
  • Number of transactions in the blockchain
  • Total sum of all outputs of all transactions in the blockchain
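
As an illustration, the queries behind these numbers might look as follows, here sketched from Spark with Hive support so they stay consistent with the other Scala examples. The table name BitcoinBlockchain and the nested column names (transactions, listofoutputs, value) are assumptions – check the hadoopcryptoledger wiki for the actual schema:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BitcoinHiveQueries").enableHiveSupport().getOrCreate()

// Number of blocks: the SerDe exposes one table row per block
spark.sql("SELECT COUNT(*) FROM BitcoinBlockchain").show()
// Number of transactions: sum the size of the (assumed) transactions array of each block
spark.sql("SELECT SUM(size(transactions)) FROM BitcoinBlockchain").show()
// Total sum of all outputs of all transactions: explode the nested arrays first
spark.sql("""SELECT SUM(o.value) FROM BitcoinBlockchain
             LATERAL VIEW explode(transactions) exploded_t AS t
             LATERAL VIEW explode(t.listofoutputs) exploded_o AS o""").show()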

Of course, you can calculate nearly anything you can imagine using the Bitcoin Blockchain data as input. Furthermore, you can link the data with other data.

Although accessing Bitcoin blockchain data is rather fast for analytics, you can optimize your analytics by extracting often-used data from the blockchain and storing it in a format optimized for analytics, such as the columnar format ORC in Hive.

The following simple example shows how you can do this. Assume that the Bitcoin Blockchain data is represented by the table “BitcoinBlockchain” and you want to copy the Merkle root hash of each Bitcoin block, the block size and the version number into the table “BlockAnalytics”, which is optimized for analytics:

CREATE TABLE BlockAnalytics STORED AS ORC AS SELECT hashmerkleroot, blocksize, version FROM BitcoinBlockchain;

Of course, you can access the tables in Hive with analytical and visual analytics tools, such as Tableau, Matlab, SAS, R, SAP Lumira, D3.js etc.

In the coming weeks, further extensions are planned to be published:

  • Some common analytics pattern to analyze the Bitcoin economy (e.g. similar to the ones shown on https://blockchain.info/)

  • Some technical patterns, such as Bitcoin block validation

  • A Flume source for receiving new Bitcoin blocks including economic and technical consensus (storing and accessing them in the Hadoop ecosystem, e.g. in HBase)

  • Adding support for more crypto ledgers, such as Ethereum

Using Apache Spark to Analyze the Bitcoin Blockchain

The hadoopcryptoledger library now provides a simple example of how you can analyze the Bitcoin Blockchain with Apache Spark. Previously, I described how you can use Hadoop MR or any other Hadoop-ecosystem-compatible application to analyze it.

Basically, it leverages the HadoopRDD API to read the Hadoop File Format of the hadoopcryptoledger library. Afterwards you can apply any transformation to it or combine it with other data loaded with Spark.
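
A minimal sketch of this read path in Scala could look as follows; the package and class names (the mapreduce flavour of BitcoinBlockFileInputFormat, BitcoinBlock) are assumptions based on the library's naming, and the HDFS path is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.BytesWritable
import org.apache.spark.{SparkConf, SparkContext}
import org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockFileInputFormat
import org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlock

val sc = new SparkContext(new SparkConf().setAppName("BitcoinTransactionCounter"))
// Read raw Bitcoin blocks through the library's Hadoop file format
val blocks = sc.newAPIHadoopFile("hdfs:///user/blockchain/bitcoin",
  classOf[BitcoinBlockFileInputFormat], classOf[BytesWritable], classOf[BitcoinBlock],
  new Configuration())
// Extract early: keep only a simple Long per block instead of the heavy block object
val txPerBlock = blocks.map { case (_, block) => block.getTransactions.size.toLong }
println("Total transactions: " + txPerBlock.reduce(_ + _))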

You can apply the following generic Spark optimization techniques:

  • In the map step, extract only the data you need, as simple data types or arrays of simple data types.

  • If you reuse the data often, you might want to store it in a format optimized for analytics, such as ORC or Parquet (see the sketch after this list).

  • Extract data as vectors that you process as vectors, e.g. in the Bitcoin Blockchain you can use the granularity of all transactions in one block (usually between 1,000 and 2,000 transactions). This enables you to leverage JVM optimizations, such as java.util.Arrays.parallel*, SIMD (Single Instruction, Multiple Data) or Streams (both JDK8), and reduces overhead. Additionally, use concurrent data structures, such as CopyOnWriteArrayList, *Queues, ConcurrentMaps and ConcurrentSets. However, use only the data of the transactions that you really need.
  • Use DataFrames or Datasets instead of RDDs as the serialization format. This means the data is stored more compactly in memory and can thus be processed as well as transferred faster.
    • Additionally, think about encoding information as bits (especially doubles are very costly for storing pricing information), dates as int, or timestamps as long.

  • … and many more
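
To make the first two techniques concrete, here is a small sketch that extracts a few simple fields per block into a case class, converts them to a Dataset and stores them as Parquet for cheap reuse. It assumes the blocks RDD from the read example above; the getters mirror the library's block model but should be treated as assumptions:

import org.apache.spark.sql.SparkSession

// Illustrative, not the library's schema: a narrow per-block summary
case class BlockSummary(blockSize: Long, version: Long, txCount: Int)

val spark = SparkSession.builder().appName("BlockSummaries").getOrCreate()
import spark.implicits._

val summaries = blocks.map { case (_, b) =>
  BlockSummary(b.getBlockSize.toLong, b.getVersion.toLong, b.getTransactions.size)
}.toDS()
// Stored columnar, the extracted fields can be re-read cheaply for further analytics
summaries.write.parquet("hdfs:///user/blockchain/block_summaries")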

In the coming weeks, further extensions are planned to be published:

  • Integration of Blockchain data into Hive to enable end users to use SQL queries to analyze the Blockchain

  • Some common analytics pattern to analyze the Bitcoin economy

  • Some technical patterns, such as Bitcoin block validation

  • A Flume source for receiving new Bitcoin blocks including economic and technical consensus (storing and accessing them in the Hadoop ecosystem, e.g. in HBase)

  • Adding support for more crypto ledgers, such as Ethereum

Analyzing the Bitcoin Blockchain using the Hadoop Ecosystem – A first Approach

Bitcoin and other cryptocurrencies have drawn a lot of attention from companies, public organizations and individuals. While many use cases exist, there is still a long road ahead to make them part of everybody’s life.

The recently released first version of the open source hadoopcryptoledger library is a first attempt to make this happen. It currently allows analyzing the Bitcoin blockchain together with any other data using Hadoop ecosystem tools. The Bitcoin blockchain is a distributed ledger containing all transactions executed over the Bitcoin network.

Hence, virtually all use cases related to analysis of the Bitcoin blockchain are possible. Some examples:

  • Predict Bitcoin exchange prices by analyzing the Blockchain together with pricing information from Bitcoin exchanges
  • Explore relationships between counterparties in the blockchain
  • Explore the impact of Bitcoin miners on the Bitcoin ecosystem
  • Trace Bitcoin money flows around the network
  • Link news events with Bitcoin blockchain data
  • Link economic data with Blockchain transactions

Currently the library provides a Hadoop File Format to analyze the Blockchain with any Hadoop application. For example, one can develop Hadoop MapReduce, Spark or Tez jobs; a minimal MapReduce wiring sketch is shown below.
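
As a minimal sketch, wiring the file format into a plain MapReduce job could look as follows (Scala for consistency with the other examples; the input format class name is an assumption to be checked against the library's documentation, and mapper/reducer are elided):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockFileInputFormat

val job = Job.getInstance(new Configuration(), "bitcoin-analysis")
// The library's input format turns the raw blockchain files into BitcoinBlock records
job.setInputFormatClass(classOf[BitcoinBlockFileInputFormat])
FileInputFormat.addInputPath(job, new Path("/user/blockchain/bitcoin")) // placeholder
FileOutputFormat.setOutputPath(job, new Path("/user/blockchain/out"))   // placeholder
// job.setMapperClass(...) / job.setReducerClass(...) plug in the actual analysis here
job.waitForCompletion(true)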

There are several enhancements planned for the library over the coming weeks, such as:

  • Provide an example how to use Spark with the hadoopcryptoledger library
  • Integration of Blockchain data into Hive to enable end users to use SQL queries to analyze the blockchain
  • A flume source for receiving new Bitcoin blocks
  • Adding support for more crypto ledgers, such as Ethereum