Backing up Elasticsearch with snapshots

Backing up data is a challenge for every big data store. Unlike traditional databases, the data volume is so large that it cannot fit on a single storage system, and you have to create another cluster just to hold the backup data. An acceptable solution can be backing up to cloud storage or to…

Dealing with HDFS small files problem using Hadoop archives

Hadoop was built to handle very large files. Its default block size is 128 MB and it’s all about throughput. It has a hard time handling many small files. The memory footprint of the namenodes becomes high, as they have to keep track of many small blocks, and the performance of scans goes down. The best way…
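To make the block-tracking cost concrete, here is a rough arithmetic sketch. It assumes only the 128 MB default block size quoted above and the fact that an HDFS block belongs to exactly one file; the file counts and sizes are hypothetical, chosen purely for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # default block size quoted in the excerpt


def blocks_needed(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count the blocks the namenode must track for files of positive size.

    A block is never shared between files, so even a 1 MB file
    occupies a block entry of its own.
    """
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)


# Hypothetical workload: 10,000 files of 1 MB each.
small_files = [1] * 10_000
# The same 10,000 MB of data stored as a single large file.
merged_file = [sum(small_files)]

print(blocks_needed(small_files))  # 10000
print(blocks_needed(merged_file))  # 79
```

The same data needs 10,000 block entries as small files but only 79 as one large file, which is the gap that packing small files together is meant to close.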

Setting default resource pool for JDBC connections

This is a quick tip about connecting to Hive or Impala via JDBC. Accessing Hive or Impala using their JDBC drivers is very convenient. Client programs like Beeline or JetBrains DataGrip use it as the main way of accessing Hive/Impala, and many people also use it in programs they write themselves. Things get a…

Apache Livy – a REST gateway for Spark

Apache Livy is an open source server that exposes Spark as a service. Its backend connects to a Spark cluster while the frontend exposes a REST API. This makes it possible to run it as the organization’s Spark gateway and even to run it in Docker containers. Not only does it enable running Spark jobs from anywhere, but it also enables…
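As a minimal sketch of what "Spark jobs from anywhere" can look like, the snippet below builds (but does not send) a batch submission against Livy's `POST /batches` endpoint using only the Python standard library. The host name, jar path, class name, and argument are hypothetical placeholders; port 8998 is assumed here because it is Livy's usual default.

```python
import json
import urllib.request


def build_batch_request(livy_url, jar_path, class_name, args=()):
    """Build (without sending) a request that submits a Spark batch job to Livy."""
    payload = {
        "file": jar_path,         # application jar, reachable from the cluster
        "className": class_name,  # main class to run
        "args": list(args),       # command-line arguments for the job
    }
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, jar path, and class name, for illustration only.
request = build_batch_request(
    "http://livy.example.com:8998",
    "hdfs:///jobs/example-job.jar",
    "com.example.ExampleJob",
    args=["2024-01-01"],
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually submit the job. Because the client only needs HTTP access to the Livy server, it does not need a local Spark installation.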