The upcoming Hive 0.12 is set to bring some great new advancements in the storage layer in the form of higher compression and better query performance.

ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings, and bitmap encoding.

This focus on efficiency leads to some impressive compression ratios. The accompanying chart (not reproduced here) shows the sizes of the TPC-DS dataset at Scale 500 in various encodings. This dataset contains randomly generated data, including strings, floating point, and integer data.

Data stored in ORCFile can be read or written through HCatalog, so any Pig or Map/Reduce process can play along seamlessly. We’ve already seen customers whose clusters are maxed out from a storage perspective move to ORCFile as a way to free up space while remaining 100% compatible with existing jobs.

Hive 12 builds on these impressive compression ratios and delivers deep integration at the Hive and execution layers to accelerate queries, both by handling larger datasets and by delivering lower latencies.

SQL queries generally have some number of WHERE conditions that can be used to eliminate rows from consideration early. In older versions of Hive, rows were read out of the storage layer and only later eliminated by SQL processing, which carries a lot of wasteful overhead. Hive 12 optimizes this by allowing predicates to be pushed down and evaluated in the storage layer itself. This requires a reader smart enough to understand the predicates; fortunately, ORC has the corresponding improvements to allow predicates to be pushed into it, and it takes advantage of its inline indexes to deliver performance benefits.

For example, if you have a SQL query like:

    SELECT COUNT(*) FROM CUSTOMER WHERE CUSTOMER.state = 'CA'

the ORCFile reader will now only return rows that actually match the WHERE predicates and skip customers residing in any other state. The more columns you read from the table, the more data marshaling you avoid and the greater the speedup.

**A Word on ORCFile Inline Indexes**

Before we move to the next section, we need to spend a moment talking about how ORCFile breaks rows into row groups and applies columnar compression and indexing within these row groups. ORC’s predicate pushdown consults these inline indexes to identify when entire blocks can be skipped all at once.

Sometimes your dataset will naturally facilitate this; in other instances you may need to give things a kick by sorting the data. For instance, if your data comes as a time series with a monotonically increasing timestamp, a WHERE condition on that timestamp will let ORC skip a lot of row groups.
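To make the row-group skipping idea concrete, here is a minimal sketch in plain Python. This is not the actual ORC reader; all function names and the group size are illustrative. It shows how a per-group min/max inline index lets a reader discard whole row groups when evaluating a predicate like `timestamp >= t`, and why sorted data makes that so effective:

```python
# Illustrative sketch of ORC-style row-group skipping (not the real ORC reader).
# Each row group stores a min/max "inline index" for a column, so a predicate
# can rule a whole group in or out without touching its rows.

def build_row_groups(values, group_size):
    """Split a column into row groups, each carrying a min/max index."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def count_ge(groups, threshold):
    """COUNT(*) WHERE col >= threshold, using the index to skip groups."""
    count = 0
    skipped = 0
    for g in groups:
        if g["max"] < threshold:       # entire group fails the predicate
            skipped += 1
            continue
        if g["min"] >= threshold:      # entire group passes: no per-row checks
            count += len(g["rows"])
            continue
        count += sum(1 for v in g["rows"] if v >= threshold)  # partial group
    return count, skipped

# Sorted, time-series-like data clusters values, so most groups are skippable.
timestamps = list(range(100))          # monotonically increasing "timestamps"
groups = build_row_groups(timestamps, 10)
matched, skipped = count_ge(groups, 70)
print(matched, skipped)  # 30 rows match; 7 of 10 groups skipped entirely
```

With sorted input, each group covers a narrow, non-overlapping value range, so most groups are eliminated by the index alone; shuffle the same values and the min/max ranges of every group widen until almost nothing can be skipped, which is the effect the sorting advice above is aiming at.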