HadoopOffice has been available for more than a year (first commit: 16.10.2016). It currently supports Excel formats based on the Apache POI parsers/writers. In the meantime, a lot of functionality has been added, such as:
- Support for .xlsx and .xls formats – reading and writing
- Encryption/Decryption Support
- Support for Hadoop mapred.* and mapreduce.* APIs
- Support for Spark 1.x (via mapreduce.*) and Spark 2.x (via data source APIs)
- Low footprint mode to use less CPU and memory resources to parse and write Excel documents
- Template support – add complex diagrams and other functionality in your Excel documents without coding
In 2018 and the coming years, we want to go beyond this functionality:
- Add further security functionality: signing and verifying signatures of new Excel files (in XML format, via XML signature) and storing credentials for encryption, decryption and signing in keystores
- Apache Hive Support
- Apache Flink Support
- Add support for reading/writing Access databases based on the Jackcess library, including encryption/decryption support
- Add support for dBase formats
- Develop a new spreadsheet format suitable for the Big Data world: There is currently a significant gap in the Big Data world. There are formats optimized for data exchange, such as Apache Avro, and for large-scale analytics queries, such as Apache ORC or Apache Parquet. These formats have proven very suitable in the Big Data world. However, they only store data, not formulas. This means that every time even simple calculations are needed, they have to be implemented in dedicated ETL/batch processes that vary from cluster to cluster or from software instance to software instance. This makes it very hard to exchange data, to determine how data was calculated, to compare calculations or to flexibly recalculate data, which is one of the key advantages of spreadsheet formats such as Excel. However, Excel is not designed for Big Data processing. Hence, the goal is to find a spreadsheet format that is suitable for Big Data processing and as flexible as Excel/LibreOffice Calc. Finally, a streaming spreadsheet format should be supported.
HadoopOffice aims to support legacy office formats (Excel, Access etc.) in a secure manner on Big Data platforms, while also paving the way for a new spreadsheet format suitable for the Big Data world.
This blog post is about analyzing the Namecoin Blockchain using different Big Data technologies based on the HadoopCryptoLedger library. Currently, this library enables you to analyze the Bitcoin blockchain and Altcoins based on Bitcoin (incl. segregated witness), such as Namecoin, Litecoin, Zcash etc., on Big Data platforms, such as Hadoop, Hive, Flink and Spark. Support for Ethereum is planned for the near future.
However, back to Namecoin and why it is interesting:
- It was one of the first Altcoins based on Bitcoin
- It supports a decentralized domain name and identity system based on blockchain technology – no central actor can censor it
- It is one of the first systems that supports merged mining/AuxPOW. Merged mining enables normal Bitcoin mining pools to mine Altcoins (such as Namecoin) without any additional effort while mining normal Bitcoins. Hence, they can make more revenue and at the same time support Altcoins, which would be much weaker or would not exist at all without Bitcoin mining pool support. It can be expected that many Altcoins based on Bitcoin will switch to it eventually.
HadoopCryptoLedger has supported Altcoins based on Bitcoin from the beginning. Usually, Big Data applications built on the HadoopCryptoLedger library are expected to implement the analytics they are interested in themselves. However, we sometimes add specific functionality to make this much easier. For instance, we provide Hive UDFs that simplify certain analyses in Hive, and utility functions for MapReduce, Flink and Spark that make tasks requiring more detailed know-how accessible to beginners in blockchain technology.
We have also done this for Namecoin by providing functionality to:
- determine the type of name operation (new, firstupdate, update)
- get more information about available domains, subdomains, IP addresses, identities
With this, analysis is rather easy, because Namecoin requires you to update the information at least every 35,999 blocks (roughly 200 to 250 days); otherwise the information is considered expired. You simply have to take the most recent information for a given domain or identity: it contains a full update of all information, so there is no need to merge it with information from previous transactions. However, additional information is sometimes provided in so-called delegate entries, and in this case you need to combine information.
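The "take the most recent entry" rule can be sketched in a few lines of Python. This is a minimal illustration on hand-made records and not the HadoopCryptoLedger API: the `NameOp` record, its field names and the helpers `current_names` and `blocks_to_days` are hypothetical. The expiry window is hard-coded to the 35,999 blocks mentioned above, the 10-minute block interval is an assumption of the sketch, and delegate entries are not handled.

```python
from dataclasses import dataclass

# Expiry window in blocks, taken from the text above.
EXPIRY_BLOCKS = 35_999


@dataclass(frozen=True)
class NameOp:
    """Hypothetical record for one Namecoin name operation."""
    name: str          # e.g. "d/example" (domain) or "id/alice" (identity)
    op_type: str       # "firstupdate" or "update"
    block_height: int  # height of the block that contains the operation
    value: str         # raw value attached to the name


def blocks_to_days(blocks, minutes_per_block=10):
    """Convert a block count to days (the block interval is an assumption)."""
    return blocks * minutes_per_block / (60 * 24)


def current_names(ops, chain_height):
    """Return the most recent, non-expired operation for each name."""
    latest = {}
    for op in ops:
        known = latest.get(op.name)
        # Keep only the newest operation per name; it is a full update.
        if known is None or op.block_height > known.block_height:
            latest[op.name] = op
    # Drop names whose newest operation is older than the expiry window.
    return {
        name: op
        for name, op in latest.items()
        if chain_height - op.block_height <= EXPIRY_BLOCKS
    }
```

Calling `current_names(ops, chain_height)` with all name operations read so far yields one entry per name that is still valid at the given chain height. In a real job the same grouping would be expressed with the aggregation primitives of Spark, Flink or Hive.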
Finally, we provided additional support to read blockchains based on Bitcoin and the merged mining/AuxPOW specification. You can enable it with a single configuration option.
Of course, we provide examples, e.g. for Spark and Hive. Nevertheless, all major Big Data platforms (including Flink) are supported.
Here is what we plan next, among other things:
- Support analytics on Ethereum
- Run analytics based on HadoopCryptoLedger in the cloud (e.g. Amazon EMR and/or Microsoft Azure) and provide real time aggregations as a simple HTML page
- … much more
Let us know via GitHub issues what you are interested in!
The hadoopcryptoledger library has been enhanced with a datasource for Apache Flink. This means you can use the Big Data processing framework Apache Flink to analyze the Bitcoin Blockchain.
It also includes an example that counts the total number of transactions in the Bitcoin blockchain. Of course, given the power of Apache Flink, you can think about more complex analysis applications, such as:
- Graph analysis on the Bitcoin transaction graph, e.g. to identify clusters or connected components to find out close interactions between Bitcoin addresses
- Trace money flows through the Bitcoin network
- Predict the power of mining pools, the difficulty of block processing, and the impact of changes to the Bitcoin protocol or rules
- Join it with other data to make predictions on prices, criminal activity and economics
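As a toy illustration of the first idea, the following Python sketch groups addresses into connected components with a union-find structure. It treats two addresses as connected whenever they appear in the same transaction, which is a simplifying assumption made only for this example. The input format (a list of address lists) and the function names are hypothetical and not part of hadoopcryptoledger or Flink.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the components that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def address_clusters(transactions):
    """Group addresses that are linked through shared transactions.

    `transactions` is an iterable of address lists, one list per transaction
    (a hypothetical, simplified input format).
    """
    parent = {}
    for addresses in transactions:
        if not addresses:
            continue
        first = addresses[0]
        find(parent, first)  # register transactions with a single address too
        for other in addresses[1:]:
            union(parent, first, other)
    clusters = defaultdict(set)
    for address in parent:
        clusters[find(parent, address)].add(address)
    return list(clusters.values())
```

For example, `address_clusters([["A", "B"], ["B", "C"], ["D", "E"]])` puts `A`, `B` and `C` into one cluster and `D` and `E` into another. On the full blockchain this logic would have to run as a distributed job rather than in a single Python process.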
In the future, we want to work on the following things:
- Support for other cryptoledgers, e.g. Ethereum
- Provide examples for analyzing other currencies based on the Bitcoin Blockchain, such as Litecoin and Namecoin
- A Flume data source to stream Bitcoin Blockchain data directly into your cluster
- Support selected blockchains provided via the Hyperledger Framework
This is an update of the second big data lab for the cloud. Similar to previous versions, this document describes how you can create a Big Data Lab in the cloud on Amazon EMR.
Besides major upgrades to the newest Amazon Hadoop AMI (3.6.0), Spark (1.3.0) and R, it now also includes the possibility to use Python in the browser. There, you have the same functionality as in R: you can use Hadoop M/R, Spark and SparkSQL in Python from the browser. Similar to R, Python has gained attention among data scientists.
You can find the newest version here.
In future blog posts, I will describe how you can use some of the open datasets on the Internet in the Big Data lab.
Recently, I published several example Java projects on github.com for using various NoSQL technologies:
- cassandra-tutorial : Apache Cassandra tutorial (Column-oriented database)
- mongodb-tutorial : MongoDB tutorial (Document database)
- neo4j-tutorial : Neo4j (Graph Database)
- redis-tutorial : Redis (Key/Value Store)
- solr-tutorial : Apache SolrCloud (Search technology)
Other example Java projects aim at standardized big data processing platforms:
- MapReduce: A simple Hadoop MapReduce job to count the number of tweets in a text file on HDFS
- SparkStreaming: A simple Spark Streaming job to count the number of tweets sent by a simple network server
- tweet-server : A simple server for sending tweets to any client connecting to port 1234 on localhost. It is a shell script using netcat.
You can use them in lectures or courses for teaching these topics. The projects use Gradle, so you can make sure that the students always use the right libraries and thus avoid conflicts with dependent libraries. Each project is a basic skeleton for using one piece of software. Students can easily extend them, and you can avoid a long bootstrapping phase until everybody has created the right project with the right libraries for your environment. Unit testing can easily be added.
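To give a rough idea of what the tweet-counting examples compute, here is a small Python sketch of a client for the tweet-server. It is not taken from the Java projects: the function names are hypothetical, and it assumes that the server sends one tweet per line, which the description above does not state. Only the port (1234 on localhost) comes from the text.

```python
import socket


def count_tweets(lines):
    """Count non-empty lines, treating each line as one tweet (an assumption)."""
    return sum(1 for line in lines if line.strip())


def count_from_server(host="localhost", port=1234, limit=100):
    """Read at most `limit` lines from the tweet server and count the tweets."""
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("r", encoding="utf-8", errors="replace") as stream:
            # zip stops after `limit` lines or when the server closes the stream.
            lines = (line for _, line in zip(range(limit), stream))
            return count_tweets(lines)
```

With the tweet-server running, `count_from_server()` returns the number of tweets received among the first 100 lines. The same `count_tweets` function also works on an open text file, which mirrors the batch example on HDFS on a much smaller scale.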