Using Apache Spark to Analyze the Bitcoin Blockchain

The hadoopcryptoledger library provides now a simple example how you can analyze the Bitcoin Blockchain with Apache Spark. Previously, I described how you can use Hadoop MR or any other Hadoop ecosystem-compatible application to analyze it.

Basically, it leverages the HadoopRDD API to read the Hadoop File Format of the hadoopcryptoledger library. Afterwards you can apply any transformation on it or combine it with other data loaded with Spark.

You can apply the following generic Spark optimization techniques:

  • Extract in the map step only the data you need as simple data types or arrays of simple data types.

  • If you reuse the data more often then you might want to store it in a format optimized for analytics, such as ORC or Parquet.

  • Extract data as vectors that you process as vectors, e.g. in the Bitcoin Blockchain you can use the granularity of all the transactions in one block (usually between 1000-2000 transactions). This enables you to leverage JVM optimizations, such as java.util.Arrays.parallel*, SIMD (Single Instruction Multiple Data Values) or Streams (both JDK8) and reduces overhead. Additionally, use concurrent data structures, such as CopyOnWriteArrayList, *Queues, ConcurrentMaps, ConcurrentSets. However, use only the data of the transaction that you really need.
  • Use as serialization format Dataframes or Datasets instead of RDD. This means that the data is stored more compact in memory and thus can be processed as well as transferred faster.
    • Additionally think about encoding information as bits (especially doubles are very costly for storing pricing information), dates as int or timestamps as long.

  • …. many more

In the coming weeks, further extensions are planned to be published:

  • Integration of Blockchain data into Hive to enable end users to use SQL queries to analyze the Blockchain

  • Some common analytics pattern to analyze the Bitcoin economy

  • Some technical patterns, such as Bitcoin block validation

  • A flume source for receiving new Bitcoin blocks including Economic and technical consensus (storing and accessing it in the Hadoop ecosystem, e.g. in Hbase)

  • Adding support for more crypto ledgers, such as Ethereum