Use Case: When unsorted data is stored in CarbonData, min/max-based pruning tends to produce false positives. For example, a blocklet might contain the integer values 3, 5, and 8, giving min=3 and max=8. If a user's filter expression carries the value 4, pruning falsely reports that the blocklet may contain matching data, so the read flow has to decompress the page and scan its contents. This leads to unnecessary IO and, ultimately, degraded query performance.
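As a minimal sketch (illustrative only, not CarbonData's internal implementation), the min/max check behaves like this:

// Illustrative sketch of min/max pruning; BlockletStats is a made-up type.
case class BlockletStats(min: Int, max: Int)

// A blocklet survives pruning if the filter value could lie within [min, max].
def mightContain(stats: BlockletStats, filterValue: Int): Boolean =
  filterValue >= stats.min && filterValue <= stats.max

val blocklet = BlockletStats(min = 3, max = 8) // actual values stored: 3, 5, 8
mightContain(blocklet, 4) // true: a false positive, so the blocklet is scanned in vain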
To improve query performance, the Secondary Index has been designed on top of the existing min/max architecture: it is essentially a reverse index from each distinct value to the blocklets in which it appears. This gives the exact location of the data, so false-positive scenarios during pruning are minimized.
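Conceptually (a simplified sketch, not the on-disk index format), the secondary index acts as a map from each distinct value to the blocklets containing it:

// Simplified reverse index: distinct value -> blocklets that contain it.
val secondaryIndex: Map[Int, Set[String]] = Map(
  3 -> Set("blocklet-0"),
  5 -> Set("blocklet-0"),
  8 -> Set("blocklet-0")
)
secondaryIndex.getOrElse(4, Set.empty[String]) // Set(): nothing to scan, no false positive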
Example:
spark.sql("""CREATE TABLE sales (order_time timestamp, user_id string, xingbie string, country string, quantity int, price bigint) STORED AS carbondata """) spark.sql("""CREATE INDEX index_sales ON TABLE sales(user_id) AS 'carbondata' """)
For more usage examples, see:
https://github.com/apache/carbondata/blob/master/index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestCreateIndexTable.scala
6. Support CDC merge functionality
In today's data warehouse world, slowly changing dimensions (SCD) and change data capture (CDC) are very common scenarios. Legacy systems such as RDBMSs handle them well because they support transactions.
To keep pace with existing database technologies, CarbonData now supports CDC and SCD functionality.
Example:
import org.apache.spark.sql.SaveMode

// initframe holds the initial target-table data (prepared earlier in the example).
initframe.write
  .format("carbondata")
  .option("tableName", "order")
  .mode(SaveMode.Overwrite)
  .save()

val dwframe = sqlContext.read.format("carbondata").option("tableName", "order").load()
val dwSelframe = dwframe.as("A")

// Maps target columns (alias A) to source expressions (alias B);
// the cast is needed because updateExpr expects a Map[Any, Any].
val updateMap = Map(
  "id" -> "A.id",
  "name" -> "B.name",
  "c_name" -> "B.c_name",
  "quantity" -> "B.quantity").asInstanceOf[Map[Any, Any]]
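The example then merges the change data into the target table. Below is a hedged sketch of that continuation, based on the merge/whenMatched/updateExpr/execute chain described in CarbonData's SCD and CDC guide; odsframe (the DataFrame holding the change data) and the exact join condition are assumptions here:

// Sketch only: assumes odsframe is the change-data DataFrame, that the
// CarbonData DataFrame merge implicits are in scope, and that Spark's
// col function is imported (org.apache.spark.sql.functions.col).
val odsSelframe = odsframe.as("B")
dwSelframe.merge(odsSelframe, col("A.id").equalTo(col("B.id")))
  .whenMatched()
  .updateExpr(updateMap)
  .execute()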