Reading/Writing Excel documents with the HadoopOffice library on Hadoop and Spark – First release

Reading and writing office documents, such as Excel, has always been challenging on Big Data platforms. Although many libraries exist for reading and writing office documents, they have never been properly integrated into Hadoop or Spark, which leads to a lot of development effort.

There are several use cases for using office documents jointly with Big data technologies:

  • Enabling the full customer-centric data science lifecycle: Within your Big Data platform you crunch numbers for complex models. However, you have to make them accessible to your customers. Let us assume you work in the insurance industry. Your Big Data platform calculates various customer-specific models for insurance products. Your sales staff receives the models in Excel format. They can now explore different parameters together with the customers, e.g. retirement age, individual risks etc. They may also come up with a different proposal that is more suitable for your customer, and you want to feed it back into your Big Data platform to see if it is feasible.
  • You still have a lot of data in Excel files related to your computation, be it code lists, manually collected data, or data from existing systems that simply only support this format.

Hence, the HadoopOffice library was created and the first version has just been released!

It features:

Of course, further releases are planned:

  • Support for signing Excel documents and verifying their signatures
  • Going beyond Excel with further office formats, such as ODF Calc
  • A Hive Serde for querying and writing Excel documents directly in Hive
  • Further examples including one for Apache Flink

Sneak Preview – HadoopOffice: Processing Office documents using the Hadoop Ecosystem – The example of Excel files

In this blog post I present a sneak preview of the hadoopoffice library, which will enable you to process Office files, such as MS Excel, using the Hadoop Ecosystem, including Hive and Spark.
It currently contains only an ExcelInputFormat, which is based on Apache POI.

Additionally, it contains an example that demonstrates how an Excel input file on HDFS can be converted into a simple CSV file on HDFS.
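To give an idea of what the conversion step involves, here is a minimal sketch of how one row of cell values could be turned into one CSV line. It is not the code of the example mentioned above and does not use the hadoopoffice or Apache POI APIs: the class `CsvRowSketch` and its methods are hypothetical names, and it assumes the cells of a row have already been read and are available as plain strings (with `null` standing for an empty cell). Quoting follows the usual CSV conventions (RFC 4180).

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of hadoopoffice or Apache POI.
final class CsvRowSketch {
    private CsvRowSketch() {}

    // Escapes a single cell value: values containing a comma, a double quote
    // or a line break are wrapped in double quotes, and embedded double
    // quotes are doubled. Empty cells (null) become empty fields.
    static String escapeCell(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped cells of one spreadsheet row into one CSV line.
    static String toCsvLine(List<String> cells) {
        return cells.stream()
                .map(CsvRowSketch::escapeCell)
                .collect(Collectors.joining(","));
    }
}
```

In a MapReduce or Spark job, a function like `toCsvLine` would be applied to every row delivered by the input format, and the resulting lines would be written as text to HDFS.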

Finally, you may want to look at this wiki page that explains how you can improve the performance for processing a lot of small files, such as Office documents, on Hadoop.

Of course this is only the beginning. The following things are planned for the near future:

  • Support of other office formats as input: ODF Spreadsheets, ODF Database, MS Access, dBase, MS Word etc.
  • Support of other office formats as output
  • A HiveSerde to query office documents in Hive using SQL
  • An official release on Maven Central
  • An example for Apache Spark