Use Case: When unsorted data is stored in CarbonData, min/max-based pruning tends to produce false positives. For example, a blocklet might contain the integer values 3, 5, and 8, giving min=3 and max=8. If a user's filter expression carries the value 4, pruning falsely reports that the blocklet may contain matching data, so the read flow has to decompress the page and scan its contents. This leads to unnecessary IO and, ultimately, degraded query performance.
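As a minimal sketch (illustrative only, not CarbonData's internal implementation), the min/max check behaves like this:

// Illustrative sketch of min/max pruning; BlockletStats is a made-up type.
case class BlockletStats(min: Int, max: Int)

// A blocklet survives pruning if the filter value could lie within [min, max].
def mightContain(stats: BlockletStats, filterValue: Int): Boolean =
  filterValue >= stats.min && filterValue <= stats.max

val blocklet = BlockletStats(min = 3, max = 8) // actual values stored: 3, 5, 8
mightContain(blocklet, 4) // true: a false positive, so the blocklet is scanned in vain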
To improve query performance, the Secondary Index has been designed on top of the existing min/max architecture: it is essentially a reverse index from each distinct value to the blocklets in which it appears. This gives the exact location of the data, so false-positive scenarios during pruning are minimized.
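Conceptually (a simplified sketch, not the on-disk index format), the secondary index acts as a map from each distinct value to the blocklets containing it:

// Simplified reverse index: distinct value -> blocklets that contain it.
val secondaryIndex: Map[Int, Set[String]] = Map(
  3 -> Set("blocklet-0"),
  5 -> Set("blocklet-0"),
  8 -> Set("blocklet-0")
)
secondaryIndex.getOrElse(4, Set.empty[String]) // Set(): nothing to scan, no false positive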
Example:
spark.sql("""CREATE TABLE sales (order_time timestamp, user_id string, xingbie string, country string, quantity int, price bigint) STORED AS carbondata """) spark.sql("""CREATE INDEX index_sales ON TABLE sales(user_id) AS 'carbondata' """)
For more usage examples, see:
https://github.com/apache/carbondata/blob/master/index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestCreateIndexTable.scala
6. Support CDC merge functionality
In today's data warehouse world, slowly changing dimensions (SCD) and change data capture (CDC) are very common scenarios. Legacy systems such as RDBMSs handle them well because they support transactions.
To keep pace with existing database technologies, CarbonData now supports CDC and SCD functionality.
Example:
import org.apache.spark.sql.SaveMode

// initframe holds the initial target-table data (prepared earlier in the example).
initframe.write
  .format("carbondata")
  .option("tableName", "order")
  .mode(SaveMode.Overwrite)
  .save()

val dwframe = sqlContext.read.format("carbondata").option("tableName", "order").load()
val dwSelframe = dwframe.as("A")

// Maps target columns (alias A) to source expressions (alias B);
// the cast is needed because updateExpr expects a Map[Any, Any].
val updateMap = Map(
  "id" -> "A.id",
  "name" -> "B.name",
  "c_name" -> "B.c_name",
  "quantity" -> "B.quantity").asInstanceOf[Map[Any, Any]]
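The example then merges the change data into the target table. Below is a hedged sketch of that continuation, based on the merge/whenMatched/updateExpr/execute chain described in CarbonData's SCD and CDC guide; odsframe (the DataFrame holding the change data) and the exact join condition are assumptions here:

// Sketch only: assumes odsframe is the change-data DataFrame, that the
// CarbonData DataFrame merge implicits are in scope, and that Spark's
// col function is imported (org.apache.spark.sql.functions.col).
val odsSelframe = odsframe.as("B")
dwSelframe.merge(odsSelframe, col("A.id").equalTo(col("B.id")))
  .whenMatched()
  .updateExpr(updateMap)
  .execute()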