Big Data Analytics on Bitcoin’s first Altcoin: Namecoin

This blog post is about analyzing the Namecoin blockchain using different Big Data technologies based on the HadoopCryptoLedger library. Currently, this library enables you to analyze the Bitcoin blockchain and Altcoins based on Bitcoin (including segregated witness), such as Namecoin, Litecoin and Zcash, on Big Data platforms such as Hadoop, Hive, Flink and Spark. Support for Ethereum is planned for the near future.

However, back to Namecoin and why it is interesting:

  • It was one of the first Altcoins based on Bitcoin
  • It supports a decentralized domain name and identity system based on blockchain technology, so no central actor can censor it
  • It is one of the first systems that supports merged mining/AuxPOW. Merged mining enables normal Bitcoin mining pools to mine Altcoins (such as Namecoin) without any additional effort while mining normal Bitcoins. Hence, they can earn more revenue and at the same time support Altcoins, which would be much weaker or would not exist at all without the support of Bitcoin mining pools. It can be expected that many Altcoins based on Bitcoin will eventually switch to it.

HadoopCryptoLedger has supported Altcoins based on Bitcoin from the beginning. Usually, Big Data applications based on the HadoopCryptoLedger library are expected to implement the analytics they are interested in themselves. However, sometimes we add specific functionality to make this much easier. For instance, we provide Hive UDFs to simplify certain analyses in Hive, and we provide utility functions for MapReduce, Flink and Spark so that tasks requiring more detailed know-how are accessible to beginners in blockchain technology.

We have also done this for Namecoin by providing functionality to:

  • determine the type of name operation (new, firstupdate, update)
  • get more information about available domains, subdomains, IP addresses and identities

With this, analysis is rather easy, because Namecoin requires you to update the information at least every 35,999 blocks (roughly 200 to 250 days); otherwise the information is considered expired. You simply take the most recent information for a given domain or identity: it contains a full update of all information, so there is no need to merge it with information from previous transactions. However, sometimes additional information is provided in so-called delegate entries, and in this case you need to combine the information.
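To make the "most recent record wins" rule more concrete, here is a minimal sketch in plain Python. It does not use the HadoopCryptoLedger API: `NameRecord` and `current_names` are hypothetical names, the records are assumed to be already decoded firstupdate/update operations, and the exact handling of the expiry boundary is an assumption based on the 35,999 blocks mentioned above. Delegate entries are not handled.

```python
from dataclasses import dataclass

# Taken from the text: a name must be updated at least every 35,999 blocks.
# Treating "exactly 35,999 blocks old" as still valid is an assumption.
MAX_BLOCKS_WITHOUT_UPDATE = 35_999


@dataclass(frozen=True)
class NameRecord:
    """Hypothetical, already-decoded firstupdate/update operation."""
    name: str     # e.g. "d/example" for a domain, "id/alice" for an identity
    height: int   # height of the block containing the operation
    value: str    # raw value stored for the name (typically JSON)


def current_names(records, tip_height):
    """Return the newest, non-expired record per name as of tip_height."""
    latest = {}
    for rec in records:
        if rec.height > tip_height:
            continue  # ignore operations after the height of interest
        best = latest.get(rec.name)
        if best is None or rec.height > best.height:
            latest[rec.name] = rec  # newer record fully replaces the older one
    return {
        name: rec
        for name, rec in latest.items()
        if tip_height - rec.height <= MAX_BLOCKS_WITHOUT_UPDATE
    }


# Illustrative data only, not taken from the real Namecoin blockchain.
records = [
    NameRecord("d/example", 100_000, '{"ip": "192.0.2.1"}'),
    NameRecord("d/example", 120_000, '{"ip": "192.0.2.2"}'),
    NameRecord("id/alice", 50_000, '{"email": "alice@example.org"}'),
]
live = current_names(records, tip_height=130_000)
```

In this example only "d/example" survives, with the value from block 120,000, because the last update of "id/alice" is more than 35,999 blocks old. On a cluster the same logic would typically be a group-by-name followed by a max-by-height.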

Finally, we provided additional support to read blockchains based on Bitcoin and the merged mining/AuxPOW specification. You can enable it with a single configuration option.

Of course, we provide examples, e.g. for Spark and Hive. Nevertheless, all major Big Data platforms (including Flink) are supported.

Here is what we plan next, amongst others:

  • Support analytics on Ethereum
  • Run analytics based on HadoopCryptoLedger in the cloud (e.g. Amazon EMR and/or Microsoft Azure) and provide real time aggregations as a simple HTML page
  • … much more

Let us know via GitHub issues what you are interested in!

Leverage the Power of Apache Flink to analyze the Bitcoin Blockchain

The hadoopcryptoledger library has been enhanced with a datasource for Apache Flink. This means you can use the Big Data processing framework Apache Flink to analyze the Bitcoin Blockchain.

It also includes an example that counts the total number of transactions in the Bitcoin blockchain. Of course, given the power of Apache Flink, you can think of more complex analysis applications, such as:

  • Graph analysis on the Bitcoin transaction graph, e.g. to identify clusters or connected components to find out close interactions between Bitcoin addresses
  • Trace money flows through the Bitcoin network
  • Predict power of mining pools, difficulty of block processing, impact of changes on the Bitcoin protocol or rules
  • Join it with other data to make predictions on prices, criminal activity and economics

In the future, we want to work on the following things:

  • Support for other cryptoledgers, e.g. Ethereum
  • Provide examples for analyzing other currencies based on the Bitcoin Blockchain, such as Litecoin and Namecoin
  • A Flume data source to stream Bitcoin Blockchain data directly into your cluster
  • Support selected blockchains provided via the Hyperledger Framework

 

Spark+Scala+Graphx: Analyzing the Bitcoin Transaction Graph

The hadoopcryptoledger library now provides an example of how you can generate a Bitcoin transaction graph using the Big Data graph analysis technologies Spark+Scala+Graphx. Basically, it demonstrates how to read the Bitcoin Blockchain from HDFS and transform it into a graph with Bitcoin addresses as vertices and transactions between them as edges. The example returns the top 5 Bitcoin addresses with the most input transactions. This could indicate that they belong to mixing services that try to obfuscate transactions between two addresses. The following figure illustrates such a graph with four vertices and transactions between them:

[Figure: transactiongraph]

Of course, this is just one example. You can think of numerous other analyses on this graph using algorithms such as strongly connected components or PageRank, particularly if you connect it with other data that you collect related to the blockchain. You can also use this graph for visual analytics.
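The ranking by input transactions can be sketched in a few lines. This is not the Spark+Scala+Graphx example from the library but a plain Python illustration of the same idea; the edge format and the function name `top_receivers` are assumptions.

```python
from collections import Counter


def top_receivers(edges, n=5):
    """Return the n addresses with the most incoming edges.

    edges: iterable of (source_address, destination_address) pairs,
    one pair per transaction edge in the graph.
    """
    in_degree = Counter(dst for _src, dst in edges)
    return in_degree.most_common(n)


# Illustrative graph with four vertices (A, B, C, D).
edges = [("A", "B"), ("C", "B"), ("A", "D"), ("D", "B"), ("B", "C")]
busiest = top_receivers(edges, n=1)
```

Here "B" receives three of the five edges and therefore ranks first. In a graph framework the same result is usually obtained by computing the in-degree of every vertex and taking the largest values.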

In the coming weeks, further extensions are planned to be published:

  • Some common analytics patterns to analyze the Bitcoin economy

  • Some technical patterns, such as Bitcoin block validation

  • A Flume source for receiving new Bitcoin blocks, including economic and technical consensus (storing and accessing it in the Hadoop ecosystem, e.g. in HBase)

  • Adding support for more crypto ledgers, such as Ethereum

Hive & Bitcoin: Analytics on Blockchain data with SQL

You can now analyze the Bitcoin Blockchain using Hive and the hadoopcryptoledger library with the new HiveSerde plugin.

Basically you can link any data that you loaded in Hive with Bitcoin Blockchain data. For example, you can link Blockchain data with important events in history to determine what causes Bitcoin exchange rates to increase or decrease.

The site provides several examples of how to use SQL in Hive to do calculations on Blockchain data, such as

  • Number of blocks in the blockchain
  • Number of transactions in the blockchain
  • Total sum of all outputs of all transactions in the blockchain
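To give an idea of what such aggregations look like, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. The flattened tables `blocks`, `txs` and `outputs` and their columns are hypothetical and chosen only for this illustration; the table exposed by the HiveSerde is structured differently, so the actual Hive queries differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, flattened schema used only for this illustration.
    CREATE TABLE blocks  (block_id INTEGER PRIMARY KEY);
    CREATE TABLE txs     (tx_id INTEGER PRIMARY KEY, block_id INTEGER);
    CREATE TABLE outputs (tx_id INTEGER, value_satoshi INTEGER);

    INSERT INTO blocks  VALUES (1), (2);
    INSERT INTO txs     VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO outputs VALUES
        (10, 5000000000), (11, 120000), (11, 30000), (12, 5000000000);
""")


def scalar(sql):
    """Run a query that returns exactly one value and return that value."""
    return conn.execute(sql).fetchone()[0]


block_count = scalar("SELECT COUNT(*) FROM blocks")
tx_count = scalar("SELECT COUNT(*) FROM txs")
total_output = scalar("SELECT SUM(value_satoshi) FROM outputs")
```

With the sample rows this yields 2 blocks, 3 transactions and a total output value of 10,000,150,000 (in the smallest currency unit).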

Of course, you can calculate nearly anything you can imagine using the Bitcoin Blockchain data as input. Furthermore, you can link the data with other data.

Although accessing Bitcoin blockchain data is rather fast for analytics, you can optimize your analytics by extracting frequently used data from the blockchain and storing it in a format optimized for analytics, such as the columnar format ORC in Hive.

The following simple example shows how you can do this. I assume that the Bitcoin Blockchain data is represented as the table “BitcoinBlockchain” and you want to copy the Merkle root hash of each Bitcoin block, the block size and the version number into the table “BlockAnalytics”, which is optimized for analytics:

CREATE TABLE BlockAnalytics STORED AS ORC AS SELECT hashmerkleroot, blocksize, version FROM BitcoinBlockchain;

Of course, you can access the tables in Hive with analytical and visual analytics tools, such as Tableau, Matlab, SAS, R, SAP Lumira, D3.js etc.

In the coming weeks, further extensions are planned to be published:

  • Some common analytics patterns to analyze the Bitcoin economy (e.g. similar to the ones shown on https://blockchain.info/)

  • Some technical patterns, such as Bitcoin block validation

  • A Flume source for receiving new Bitcoin blocks, including economic and technical consensus (storing and accessing it in the Hadoop ecosystem, e.g. in HBase)

  • Adding support for more crypto ledgers, such as Ethereum

Using Apache Spark to Analyze the Bitcoin Blockchain

The hadoopcryptoledger library now provides a simple example of how you can analyze the Bitcoin Blockchain with Apache Spark. Previously, I described how you can use Hadoop MapReduce or any other Hadoop ecosystem-compatible application to analyze it.

Basically, it leverages the HadoopRDD API to read the Hadoop File Format of the hadoopcryptoledger library. Afterwards you can apply any transformation on it or combine it with other data loaded with Spark.

You can apply the following generic Spark optimization techniques:

  • Extract in the map step only the data you need as simple data types or arrays of simple data types.

  • If you reuse the data often, you might want to store it in a format optimized for analytics, such as ORC or Parquet.

  • Extract data as vectors that you process as vectors. In the Bitcoin Blockchain, for example, you can use the granularity of all the transactions in one block (usually between 1000 and 2000 transactions). This enables you to leverage JVM optimizations, such as java.util.Arrays.parallel*, SIMD (Single Instruction, Multiple Data) or Streams (both JDK8), and reduces overhead. Additionally, use concurrent data structures, such as CopyOnWriteArrayList, concurrent queues, ConcurrentHashMap or ConcurrentSkipListSet. However, use only the data of the transaction that you really need.
  • Use DataFrames or Datasets instead of RDDs. The data is then stored more compactly in memory and can thus be processed and transferred faster.
    • Additionally, think about encoding information compactly: doubles in particular are very costly for storing pricing information, dates can be stored as int and timestamps as long.

  • … many more
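The last point about compact encodings can be illustrated with a short sketch. It is plain Python rather than Spark code, and the helper names as well as the choice of cents as the price unit are assumptions for this example; negative prices are not handled.

```python
from array import array
from datetime import date, datetime, timezone


def price_to_cents(price: str) -> int:
    """Parse a non-negative decimal price such as "431.07" into integer cents."""
    whole, _, frac = price.partition(".")
    frac = (frac + "00")[:2]  # pad or cut to two decimal places
    return int(whole or "0") * 100 + int(frac)


def date_to_days(d: date) -> int:
    """Encode a date as the number of days since 1970-01-01."""
    return (d - date(1970, 1, 1)).days


def timestamp_to_seconds(ts: datetime) -> int:
    """Encode a timezone-aware timestamp as seconds since the Unix epoch."""
    return int(ts.timestamp())


# Typed arrays store the raw numbers without per-element object overhead.
# "i" is a C int (typically 4 bytes), "q" a signed 8-byte integer.
prices = array("i", (price_to_cents(p) for p in ["431.07", "0.1", "0.2"]))
days = array("i", [date_to_days(date(2009, 1, 3))])
seconds = array("q", [
    timestamp_to_seconds(datetime(2009, 1, 3, 18, 15, 5, tzinfo=timezone.utc))
])
```

Besides being smaller than a double on typical platforms, integer cents also avoid rounding surprises: 10 + 20 cents is exactly 30 cents, whereas 0.1 + 0.2 is not exactly 0.3 in floating point.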

In the coming weeks, further extensions are planned to be published:

  • Integration of Blockchain data into Hive to enable end users to use SQL queries to analyze the Blockchain

  • Some common analytics patterns to analyze the Bitcoin economy

  • Some technical patterns, such as Bitcoin block validation

  • A Flume source for receiving new Bitcoin blocks, including economic and technical consensus (storing and accessing it in the Hadoop ecosystem, e.g. in HBase)

  • Adding support for more crypto ledgers, such as Ethereum

Analyzing the Bitcoin Blockchain using the Hadoop Ecosystem – A first Approach

Bitcoin and other cryptocurrencies have drawn a lot of attention from companies, public organizations and individuals. While many use cases exist, there is still a long road ahead to make them part of everybody’s life.

The recently released first version of the open source hadoopcryptoledger library is a first attempt to make this happen. It currently allows analyzing the Bitcoin blockchain together with any other data using Hadoop ecosystem tools. The Bitcoin blockchain is a distributed ledger containing all transactions executed over the Bitcoin network.

Hence, virtually all use cases related to analysis of the Bitcoin blockchain are possible. Some examples:

  • Predict Bitcoin exchange prices by analysing the Blockchain together with pricing information from Bitcoin exchanges
  • Explore relationships between counterparties in the blockchain
  • Explore impact of Bitcoin miners on the Bitcoin ecosystem
  • Trace Bitcoin money flows around the network
  • Link news events with Bitcoin blockchain data
  • Link economic data with Blockchain transactions

Currently, the library provides a Hadoop File Format to analyze the Blockchain with any Hadoop application. For example, one can develop a Hadoop MapReduce, Spark or Tez job.

There are several enhancements planned for the library over the coming weeks, such as

  • Provide an example of how to use Spark with the hadoopcryptoledger library
  • Integration of Blockchain data into Hive to enable end users to use SQL queries to analyze the blockchain
  • A Flume source for receiving new Bitcoin blocks
  • Adding support for more crypto ledgers, such as Ethereum