Today we have released HadoopOffice v1.1.0 with major enhancements:
Of course the usual features of the HadoopOffice library are still supported, such as Spark2 datasource/datasink support, encryption, linked workbooks, templates, low footprint mode etc.
This activity is part on the delivery of the overall HadoopOffice vision.
HadoopOffice is already since more than a year available (first commit: 16.10.2016). Currently it supports Excel formats based on the Apache POI parsers/writers. Meanwhile a lot of functionality has been added, such as:
- Support for .xlsx and .xls formats – reading and writing
- Encryption/Decryption Support
- Support for Hadoop mapred.* and mapreduce.* APIs
- Support for Spark 1.x (via mapreduce.*) and Spark 2.x (via data source APIs)
- Low footprint mode to use less CPU and memory resources to parse and write Excel documents
- Template support – add complex diagrams and other functionality in your Excel documents without coding
Within 2018 and the coming years we want to go beyond this functionality:
- Add further security functionality: Signing and verification of signatures of new Excel files (in XML format via XML signature) / Store credentials for encryption, decryption, signing in keystores
- Apache Hive Support
- Apache Flink Support
- Add support for reading/writing Access based on the Jackcess library including encryption/decryption support
- Add support for dbase formats
- Develop a new spreadsheet format suitable for the Big Data world: There is currently a significant gap in the Big Data world. There are formats optimized for data exchange, such as Apache Avro, and for large scale analytics queries, such as Apache ORC or Apache Parquet. These formats have been proven as very suitable in the Big Data world. However, they only store data, but not formulas. This means every time simple data calculation need to be done they have to be done in dedicated ETL/batch processes varying on each cluster or software instance. This makes it very limiting to exchange data, to determine how data was calculated, compare calculations or flexible recalculate data – one of the key advantages of Spreadsheet formats, such as Excel. However, Excel is not designed for Big Data processing. Hence, the goal is to find a SpreadSheet format suitable for Big Data processing and as flexible as Excel/LibreOffice Calc. Finally, a streaming SpreadSheet format should be supported.
HadoopOffice aims at supporting legacy office formats (Excel, Access etc.) in a secure manner on Big Data platforms but also paving the way for a new spreadsheet format suitable for the Big Data world.
HadoopCryptoLedger release 1.1.0 added support for another well-known cryptocurrency: Ethereum and its Altcoins. Of course similar to its Bitcoin & Altcoin support you can use the library with many different frameworks in the Hadoop ecosystem:
Furthermore, you can use it with various other frameworks due to its implementation as a FileInputFormat for Big Data platforms.
Analysing blockchains, such as Ethereum or Bitcoin, using Big Data platforms brings you several advantages. You can analyse its liquidity, how many people are using it or if it is a valid blockchain at all. Furthermore, you can cross-integrate any other data source, such as currency rates, news or events from Internet of Things platforms (e.g. payments using hardware wallets). Governments can integrate it with various other sources to investigate criminal activity.
All and all it enables transparency for everyone.
This blog post is about analyzing the Namecoin Blockchain using different Big Data technologies based on the HadoopCryptoLedger library. Currently, this library enables you to analyze the Bitcoin blockchain and Altcoins based on Bitcoin (incl. segregated witness), such as Namecoin, Litecoin, Zcash etc., on Big Data platforms, such as Hadoop, Hive, Flink and Spark. A support for Ethereum is planned for the near future.
However, back to Namecoin and why it is interesting:
- It was one of the first Altcoin based on Bitcoin
- It supports a decentralized domain name and identity system based on blockchain technology – no central actor can censor it
- It is one of the first systems that supports merged mining/AuxPOW. Merged mining enables normal Bitcoin mining pools to mine Altcoins (such as Namecoin) without any additional effort while mining normal Bitcoins. Hence, they can make more revenue and at the same time they support Altcoins, which would be without Bitcoin mining pool support much weaker or not existing. It can be expected that many Altcoins based on Bitcoin will switch to it eventually.
HadoopCryptoLedger supported already from the beginning Altcoins based on Bitcoin. Usually it is expected that Big Data applications based on the HadoopCryptoLedger library implement the analytics they are interested in. However, sometimes we add specific functionality to make it much easier, for instance we provide the Hive UDFs to make certain analysis easier in Hive or we provide certain utility functions for MapReduce, Flink and Spark to make some things which require more detailed know how more easily available to beginners in block chain technology.
We have done this also for Namecoin by providing functionality to:
- determine the type of name operation (new, firstupdate, update)
- get more information about available domains, subdomains, ip addresses, identities
With this analysis is rather easy, because Namecoin requires you to update the information at least every 35,999 blocks (roughly between 200 to 250 days) or information are considered as expired. You just have to take simply the most recent information for a given domain or identity – it contains a full update of all information and there is no need to merge it with information from previous transactions. However, sometimes additional information are provided in so-called delegate entries and in this case you need to combine information.
Finally, we provided additional support to read blockchains based on Bitcoin and the merged mining/AuxPOW specification. You can enable it with a single configuration option.
Of course, we provide examples, e.g. for Spark and Hive. Nevertheless all major Big Data platforms (including Flink) are supported.
Find here what we plan next – amongst others:
- Support analytics on Ethereum
- Run analytics based on HadoopCryptoLedger in the cloud (e.g. Amazon EMR and/or Microsoft Azure) and provide real time aggregations as a simple HTML page
- … much more
Let us know via Github issues what you are interested in!
Although it seems to be that it was only a small improvement, version 1.0.4 of the HadoopOffice library has a lot of new features for reading/writing Excel files:
- Templates, so you can define complex documents with diagrams or other features in MSExcel and fill it with data or formulas from your Big Data platform in Hadoop, Spark & Co
- Low footprint mode – this mode leverages the Apache POI event and streaming APIs. It saves CPU and memory consumption significantly at the expense of certain features (e.g. evaluation of formulas which is only supported in standard mode). This mode supports reading old MS Excel (.xls)/new MS Excel (.xlsx) and writing new MS Excel (.xlsx) documents
- New features in the Spark 2 datasource:
- Inferring of the DataFrame schema consisting of simple Spark SQL DataTypes (Boolean, Date, Byte, Short, Integer, Long, Decimal, String) based on the data in the Excel file
- Improved writing of a DataFrame based on a schema with simpel Spark SQL DataTypes
- Interpreting the first row of an Excel file as column names for the DataFrame for reading (“header”)
- Writing column names of a DataFrame as the first row of an Excel file (“header”)
- Support for Spark 2.0.1, 2.1, 2.2
Of course still other features are still usable, such as metadata reading/writing, encryption/decryption or linked workbooks, support for Hadoop MapReduce, support for Spark2 datasources and support for Spark 1.
What is next?
- Support for Apache Flink for reading/writing Excel files
- Support for Apache Hive (Hive SerDe) for reading/writing Excel files
- Support for digitally signing/verifying signature(s) of Excel files
- Support for reading access files
- … many more
Reading/Writing office documents, such as Excel, has been always challenging on Big data platforms. Although many libraries exist for reading/writing office documents, they have never been really integrated in Hadoop or Spark and thus lead to a lot of development efforts.
There are several use cases for using office documents jointly with Big data technologies:
- Enabling the full customer-centric data science lifecycle: Within your Big Data platform you crunch numbers for complex models. However, you have to make them accessible to your customers. Le us assume you work in the insurance industry. Your Big Data platform calculates various models focused on your customer for insurance products. Your sales staff receives the models in Excel format. They can now play together with the customers on the different parameters, e.g. retirement age, individual risks etc. They may also come up with a different proposal more suitable for your customer and you want to feed it back into your Big Data platform to see if it is feasible.
- You still have a lot of data in Excel files related to your computation. Let it be code lists, data collected manually or your existing systems simply support this format.
Hence, the HadoopOffice library was created and the first version has just been released!
- A Hadoop FileFormat for reading/writing Excel files using the Apache POI library, so that nearly all Hadoop ecosystem components can read/write them
- Excel files can be in .xls or .xlsx format, encrypted/not encrypted, with linked workbooks, be filtered based on metadata, with formulas, comments etc.
- mapred.* and mapreduce.* API supported
- A Spark2 datasource for reading/writing Excel files enabling comfortable integration of the HadoopOffice library into Spark2. It is available on Spark-packages.
Of course, further releases are planned:
- Support for signing and verification of signature of Excel documents
- Going beyond Excel with further office formats, such as ODF Calc
- A Hive Serde for querying and writing Excel documents directly in Hive
- Further examples including one for Apache Flink
I present in this blog post the sneak preview of the hadoopoffice library that will enable you to process Office files, such as MS Excel, using the Hadoop Ecosystem including Hive/Spark.
It currently contains only an ExcelInputFormat, which is based on Apache POI.
Additionally, it contains an example that demonstrates how an Excel input file on HDFS can be converted into a simple CSV file on HDFS.
Finally, you may want to look at this wiki page that explains how you can improve the performance for processing a lot of small files, such as Office documents, on Hadoop.
Of course this is only the beginning. The following things are planned for the near future:
- Support of other office formats as input: ODF Spreadsheets, ODF Database, MS Access, Dbase, MS Word….
- Support of other office formats as output
- A HiveSerde to query office documents in Hive using SQL
- An official release on Maven Central
- An example for Apache Spark