Although it may look like only a minor release, version 1.0.4 of the HadoopOffice library adds many new features for reading/writing Excel files:
- Templates, so you can define complex documents with diagrams or other features in MS Excel and fill them with data or formulas from your Big Data platform in Hadoop, Spark & Co
- Low footprint mode – this mode leverages the Apache POI event and streaming APIs. It significantly reduces CPU and memory consumption at the expense of certain features (e.g. evaluation of formulas, which is only supported in standard mode). This mode supports reading old MS Excel (.xls) and new MS Excel (.xlsx) documents and writing new MS Excel (.xlsx) documents
- New features in the Spark 2 datasource:
  - Inference of the DataFrame schema consisting of simple Spark SQL DataTypes (Boolean, Date, Byte, Short, Integer, Long, Decimal, String) based on the data in the Excel file
  - Improved writing of a DataFrame based on a schema with simple Spark SQL DataTypes
  - Interpreting the first row of an Excel file as column names for the DataFrame when reading (“header”)
  - Writing the column names of a DataFrame as the first row of an Excel file (“header”)
- Support for Spark 2.0.1, 2.1, 2.2
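To make the idea of schema inference with a header row more concrete, here is a minimal, self-contained Python sketch. It is not the HadoopOffice implementation and uses none of its API: the function names (`infer_cell_type`, `merge_types`, `infer_schema`), the ISO date pattern and the widening rules are assumptions chosen purely for illustration. Each cell is mapped to the narrowest of the simple types listed above, and the types in a column are then widened until one type fits every value.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Integer types from narrowest to widest with their signed ranges.
INTEGER_TYPES = [
    ("Byte", -2**7, 2**7 - 1),
    ("Short", -2**15, 2**15 - 1),
    ("Integer", -2**31, 2**31 - 1),
    ("Long", -2**63, 2**63 - 1),
]
# Numeric types can be widened along this chain (assumption for this sketch).
NUMERIC_ORDER = ["Byte", "Short", "Integer", "Long", "Decimal"]


def infer_cell_type(text):
    """Return the narrowest simple type name that can hold one cell value."""
    value = text.strip()
    if value.lower() in ("true", "false"):
        return "Boolean"
    try:
        number = int(value)
    except ValueError:
        pass
    else:
        for name, low, high in INTEGER_TYPES:
            if low <= number <= high:
                return name
        return "Decimal"  # integer too large for Long
    try:
        # Assumption: dates are formatted as YYYY-MM-DD.
        datetime.strptime(value, "%Y-%m-%d")
        return "Date"
    except ValueError:
        pass
    try:
        if Decimal(value).is_finite():
            return "Decimal"
    except InvalidOperation:
        pass
    return "String"


def merge_types(left, right):
    """Widen two cell types to one type that can represent both."""
    if left == right:
        return left
    if left in NUMERIC_ORDER and right in NUMERIC_ORDER:
        return max(left, right, key=NUMERIC_ORDER.index)
    return "String"  # incompatible types fall back to String


def infer_schema(rows, header=True):
    """Infer (column name, type) pairs from rows of cell strings."""
    if header:
        names, data = rows[0], rows[1:]
    else:
        names, data = ["c%d" % i for i in range(len(rows[0]))], rows
    schema = []
    for index, name in enumerate(names):
        column_type = None
        for row in data:
            cell_type = infer_cell_type(row[index])
            if column_type is None:
                column_type = cell_type
            else:
                column_type = merge_types(column_type, cell_type)
        schema.append((name, column_type or "String"))
    return schema


rows = [
    ["id", "price", "active", "joined", "note"],
    ["1", "9.99", "true", "2017-01-31", "ok"],
    ["300", "12", "false", "2017-02-01", "n/a"],
]
schema = infer_schema(rows, header=True)
```

With the header option enabled, the first row supplies the column names and only the remaining rows take part in type inference. A real datasource would additionally have to deal with locales, empty cells and Excel's own cell types, which this sketch ignores.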
Of course, the existing features remain available, such as metadata reading/writing, encryption/decryption, linked workbooks, support for Hadoop MapReduce, support for Spark 2 datasources and support for Spark 1.
What is next?
- Support for Apache Flink for reading/writing Excel files
- Support for Apache Hive (Hive SerDe) for reading/writing Excel files
- Support for digitally signing/verifying signature(s) of Excel files
- Support for reading Access files
- … many more
The hadoopcryptoledger library has been enhanced with a datasource for Apache Flink. This means you can use the Big Data processing framework Apache Flink to analyze the Bitcoin Blockchain.
It also includes an example that counts the total number of transactions in the Bitcoin blockchain. Of course, given the power of Apache Flink, you can think about more complex analysis applications, such as:
- Graph analysis on the Bitcoin transaction graph, e.g. to identify clusters or connected components to find out close interactions between Bitcoin addresses
- Trace money flows through the Bitcoin network
- Predict power of mining pools, difficulty of block processing, impact of changes on the Bitcoin protocol or rules
- Join it with other data to make predictions on prices, criminal activity and economics
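As a rough illustration of the connected-components idea from the list above, the following Python sketch groups Bitcoin addresses into clusters with a union-find structure. It is plain Python, not Flink and not part of the hadoopcryptoledger library. The input model is an assumption made for this example: each transaction is represented simply as a list of the address strings it involves, and the function name `cluster_addresses` is hypothetical.

```python
def cluster_addresses(transactions):
    """Group addresses that are linked through shared transactions.

    `transactions` is a list of lists of address strings (hypothetical
    input model). Two addresses end up in the same cluster if a chain of
    transactions connects them.
    """
    parent = {}

    def find(address):
        # Locate the representative of the set containing `address`.
        parent.setdefault(address, address)
        root = address
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[address] != root:
            parent[address], address = root, parent[address]
        return root

    def union(left, right):
        left_root, right_root = find(left), find(right)
        if left_root != right_root:
            parent[right_root] = left_root

    for addresses in transactions:
        if not addresses:
            continue
        first = addresses[0]
        find(first)  # register single-address transactions as well
        for other in addresses[1:]:
            union(first, other)

    clusters = {}
    for address in parent:
        clusters.setdefault(find(address), []).append(address)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


transactions = [["A", "B"], ["B", "C"], ["D"], ["E", "F"]]
clusters = cluster_addresses(transactions)
```

On a real blockchain the same grouping would be expressed with a distributed graph library rather than an in-memory dictionary, and which addresses of a transaction should count as "linked" is itself a modelling decision.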
In the future, we want to work on the following:
- Support for other cryptoledgers, e.g. Ethereum
- Provide examples for analyzing other currencies based on the Bitcoin Blockchain, such as Litecoin and Namecoin
- A Flume data source to stream Bitcoin Blockchain data directly into your cluster
- Support selected blockchains provided via the Hyperledger Framework
Reading/writing office documents, such as Excel, has always been challenging on Big Data platforms. Although many libraries exist for reading/writing office documents, they have never been properly integrated into Hadoop or Spark, which leads to a lot of development effort.
There are several use cases for using office documents jointly with Big Data technologies:
- Enabling the full customer-centric data science lifecycle: Within your Big Data platform you crunch numbers for complex models. However, you have to make them accessible to your customers. Let us assume you work in the insurance industry. Your Big Data platform calculates various models of insurance products tailored to your customers. Your sales staff receives the models in Excel format. They can now explore different parameters together with the customers, e.g. retirement age, individual risks etc. They may also come up with a different proposal that is more suitable for a customer, and you want to feed it back into your Big Data platform to see if it is feasible.
- You still have a lot of data in Excel files related to your computation, be it code lists, manually collected data, or data from existing systems that simply only support this format.
Hence, the HadoopOffice library was created and the first version has just been released!
- A Hadoop FileFormat for reading/writing Excel files using the Apache POI library, so that nearly all Hadoop ecosystem components can read/write them
- Excel files can be in .xls or .xlsx format, encrypted/not encrypted, with linked workbooks, be filtered based on metadata, with formulas, comments etc.
- mapred.* and mapreduce.* API supported
- A Spark2 datasource for reading/writing Excel files enabling comfortable integration of the HadoopOffice library into Spark2. It is available on Spark-packages.
Of course, further releases are planned:
- Support for signing Excel documents and verifying their signatures
- Going beyond Excel with further office formats, such as ODF Calc
- A Hive SerDe for querying and writing Excel documents directly in Hive
- Further examples including one for Apache Flink