Category Archives: Hadoop

Visualizing Kafka with Cloudera messaging manager (SMM)

Streams Messaging Manager (SMM) is not new to Hortonworks users, but since I mostly used Cloudera, I never had the opportunity to try it. Following the merger of Cloudera and Hortonworks in early 2019, many good products that were originally part of HDP finally made their way into the Cloudera platform, including SMM. Cloudera’s… Read More »

Dealing with HDFS small files problem using Hadoop archives

Hadoop was built to handle very large files. Its default block size is 128 MB, and it is all about throughput. It has a hard time handling many small files. The memory footprint of the NameNodes grows because they have to keep track of many small blocks, and scan performance drops. The best way… Read More »

Setting default resource pool for JDBC connections

This is a quick tip about connecting to Hive or Impala via JDBC. Accessing Hive or Impala through their JDBC drivers is very convenient. Client programs like Beeline or JetBrains DataGrip use it as their main way of accessing Hive/Impala, and many people also use it in programs they write themselves. Things get a… Read More »

Apache Livy – a REST gateway for Spark

Apache Livy is an open source server that exposes Spark as a service. Its backend connects to a Spark cluster, while its frontend exposes a REST API. This makes it possible to run it as the organization’s Spark gateway, and even to run it in Docker containers. Not only does it enable running Spark jobs from anywhere, but it also enables… Read More »