HadoopOffice – A Vision for the coming Years

HadoopOffice is already since more than a year available (first commit: 16.10.2016). Currently it supports Excel formats based on the Apache POI parsers/writers. Meanwhile a lot of functionality has been added, such as:

  • Support for .xlsx and .xls formats – reading and writing
  • Encryption/Decryption Support
  • Support for Hadoop mapred.* and mapreduce.* APIs
  • Support for Spark 1.x (via mapreduce.*) and Spark 2.x (via data source APIs)
  • Low footprint mode to use less CPU and memory resources to parse and write Excel documents
  • Template support – add complex diagrams and other functionality in your Excel documents without coding

Within 2018 and the coming years we want to go beyond this functionality:

  • Add further security functionality: Signing and verification of signatures of new Excel files (in XML format via XML signature) / Store credentials for encryption, decryption, signing in keystores
  • Apache Hive Support
  • Apache Flink Support
  • Add support for reading/writing Access based on the Jackcess library including encryption/decryption support
  • Add support for dbase formats
  • Develop a new spreadsheet format suitable for the Big Data world: There is currently a significant gap in the Big Data world. There are formats optimized for data exchange, such as Apache Avro, and for large scale analytics queries, such as Apache ORC or Apache Parquet. These formats have been proven as very suitable in the Big Data world. However, they only store data, but not formulas. This means every time simple data calculation need to be done they have to be done in dedicated ETL/batch processes varying on each cluster or software instance. This makes it very limiting to exchange data, to determine how data was calculated, compare calculations or flexible recalculate data – one of the key advantages of Spreadsheet formats, such as Excel. However, Excel is not designed for Big Data processing. Hence, the goal is to find a SpreadSheet format suitable for Big Data processing and as flexible as Excel/LibreOffice Calc. Finally,  a streaming SpreadSheet format should be supported.


HadoopOffice aims at supporting legacy office formats (Excel, Access etc.) in a secure manner on Big Data platforms but also paving the way for a new spreadsheet format suitable for the Big Data world.


One thought on “HadoopOffice – A Vision for the coming Years

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s