Category Archives: Hive

Dealing with HDFS small files problem using Hadoop archives

Hadoop was built to handle very large files. Its default block size is 128 MB and it is all about throughput. It has a hard time handling many small files: the memory footprint of the NameNode becomes high, since it has to keep track of many small blocks, and the performance of scans goes down. The best way… Read More »
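One standard remedy (the topic of the full post) is to pack small files into a Hadoop archive (HAR). A minimal sketch, assuming a cluster is available and using hypothetical paths `/user/etl/logs` and `/user/etl/archived`:

```shell
# Pack the small files under /user/etl/logs into one archive.
# -p gives the parent path; "logs" is the source relative to it.
hadoop archive -archiveName logs.har -p /user/etl logs /user/etl/archived

# The packed files stay readable through the har:// scheme:
hadoop fs -ls har:///user/etl/archived/logs.har/logs
```

The archive job runs as MapReduce, so the NameNode ends up tracking one HAR index instead of thousands of tiny blocks.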

Setting default resource pool for JDBC connections

This is a quick tip about connecting to Hive or Impala via JDBC. Accessing Hive or Impala through their JDBC drivers is very convenient. Client programs like Beeline or JetBrains DataGrip use JDBC as their main way of accessing Hive/Impala, and many people also use it in programs they write themselves. Things get a… Read More »
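For HiveServer2, configuration settings can be passed in the JDBC URL after the `?` separator, which is one way to pin a connection to a resource pool. A sketch, assuming a Tez-backed cluster and a hypothetical pool named `root.analytics`:

```shell
# Hive on Tez: set the YARN queue for this session via the hive conf list
# in the URL (use mapreduce.job.queuename instead for the MR engine).
beeline -u "jdbc:hive2://hs2-host:10000/default?tez.queue.name=root.analytics"
```

The same URL works from DataGrip or any JDBC client, since the conf list is part of the Hive JDBC URL syntax rather than a Beeline feature.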

Setting up Oracle external Metastore for Hive

When installing CDH, the installer asks you for a database where it can build the Hive Metastore. You can choose your own external database (MySQL, Oracle or PostgreSQL), or an embedded PostgreSQL database that the Cloudera installer sets up for you. Many people choose the embedded database because it is the quickest and easiest choice. However, this database is less… Read More »
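The core of pointing Hive at an external Oracle Metastore is two connection properties in `hive-site.xml` plus a schema initialization with Hive's `schematool`. A sketch with a hypothetical Oracle host, service name, and credentials:

```shell
# hive-site.xml must point the Metastore at Oracle, e.g.:
#   javax.jdo.option.ConnectionURL        = jdbc:oracle:thin:@//ora-host:1521/HIVEDB
#   javax.jdo.option.ConnectionDriverName = oracle.jdbc.OracleDriver
# (the Oracle JDBC jar must be on the Metastore classpath)

# Then create the Metastore schema in that database:
schematool -dbType oracle -initSchema -userName hive -passWord hive_pw
```

In a CDH cluster these properties are usually managed through Cloudera Manager rather than edited by hand, but the underlying settings are the same.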

Data replication across Hadoop clusters using Cloudera manager – part II

In the last post I showed how to replicate HDFS files and directories. Now we will continue where we left off and set up Hive replication. The Hive and HDFS replication setups have many similarities, and some of the steps are identical in both processes. The official Cloudera documentation recommends enabling snapshots on the directory where Hive… Read More »
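Enabling snapshots on the Hive data directory is done with plain HDFS commands. A sketch, assuming the common default warehouse path (adjust to your cluster):

```shell
# Mark the warehouse directory as snapshottable (requires HDFS superuser).
hdfs dfsadmin -allowSnapshot /user/hive/warehouse

# Take a named snapshot before configuring the replication schedule.
hdfs dfs -createSnapshot /user/hive/warehouse before-replication
```

With snapshots enabled, Cloudera Manager's replication jobs can copy from a consistent point-in-time view even while tables are being written.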