
Apache CarbonData 2.0 Preview (key features previewed ahead of release)

Source: | Author: 华为云折扣网 | Published: 2020-12-20 | 4254 views
CarbonData is a high-performance big data storage solution that has been deployed in production at more than 100 enterprises, with the largest single cluster holding several trillion rows of data.

Usage: By default, the prestosql profile is selected for the Maven build. Users can select the prestodb profile instead to build CarbonData for prestodb.

12) Insert into performance improvement

Use Case: Previously, the insert and load commands shared a common code flow, which added unnecessary overhead to the insert command, because features such as bad-record handling are not required by insert.

The load and insert flows have now been separated, and additional optimizations have been applied to the insert command, such as:

    1. Rearranging projections (columns) instead of individual rows.

    2. Using Spark's internal row representation instead of the Row object.

These optimizations were observed to yield a 40% insert performance improvement on TPC-H data.
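The idea behind optimization 1 can be illustrated with a small conceptual sketch (hypothetical code, not CarbonData's actual implementation): compute the column reordering once per load, then apply the precomputed index mapping to every row, instead of resolving the column order row by row.

```python
# Conceptual sketch (hypothetical, not CarbonData internals): rearrange
# projections once, instead of rearranging each row independently.

def make_projection(source_cols, target_cols):
    """Compute, once, the index of each target column in the source schema."""
    pos = {name: i for i, name in enumerate(source_cols)}
    return [pos[name] for name in target_cols]

def project_rows(rows, projection):
    """Apply the precomputed index mapping to every row."""
    return [tuple(row[i] for i in projection) for row in rows]

source_cols = ["intField", "stringField", "shortField"]
target_cols = ["stringField", "intField", "shortField"]
rows = [(1, "a", 7), (2, "b", 8)]

projection = make_projection(source_cols, target_cols)
print(project_rows(rows, projection))  # [('a', 1, 7), ('b', 2, 8)]
```

The per-row work reduces to a plain index lookup, which is the kind of saving that matters when inserting millions of rows.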

13) Optimize Bucket Table

Use Case: The bucketing feature distributes/organizes table or partition data into multiple files so that similar records land in the same file. Join operations on large datasets cause a large volume of data shuffling, making them quite slow; this shuffle can be avoided when joining on bucket columns. Bucket tables have now been made consistent with Spark's bucketing, improving join performance by skipping the shuffle on bucket columns.

Example:

spark.sql("""
  CREATE TABLE bucket_table1 (stringField string, intField int, shortField short)
  STORED AS carbondata
  TBLPROPERTIES ('BUCKET_NUMBER'='10', 'BUCKET_COLUMNS'='stringField')
""")

spark.sql("""
  CREATE TABLE bucket_table2 (stringField string, intField int, shortField short)
  STORED AS carbondata
  TBLPROPERTIES ('BUCKET_NUMBER'='10', 'BUCKET_COLUMNS'='stringField')
""")

For more usage examples, see: https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/spark/carbondata/bucketing/TableBucketingTestCase.scala
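Why matching bucket definitions avoid the shuffle can be sketched conceptually (hypothetical Python, not Spark/CarbonData internals): when both tables are bucketed on the join key with the same bucket count, equal keys hash to the same bucket index in both tables, so the join can proceed bucket-by-bucket with no cross-node data movement.

```python
# Conceptual sketch (hypothetical): bucket-by-bucket join of two tables
# bucketed on the same column with the same bucket count.

NUM_BUCKETS = 10

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash for illustration; Spark actually uses Murmur3 hashing.
    return sum(ord(c) for c in key) % num_buckets

def bucketize(rows, key_index):
    """Place each row into the bucket determined by its join key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets

table1 = [("apple", 1), ("pear", 2), ("plum", 3)]
table2 = [("apple", "x"), ("plum", "y")]

b1, b2 = bucketize(table1, 0), bucketize(table2, 0)

# Join each pair of co-located buckets locally -- no cross-bucket traffic.
joined = [(l[0], l[1], r[1])
          for left, right in zip(b1, b2)
          for l in left
          for r in right
          if l[0] == r[0]]
print(sorted(joined))  # [('apple', 1, 'x'), ('plum', 3, 'y')]
```

Because every key can only ever meet its match inside one bucket pair, the expensive all-to-all shuffle step disappears.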

14) PyCarbon support

Use Case: CarbonData now provides a Python API (PyCarbon) for integrating with AI frameworks such as TensorFlow, PyTorch, and MXNet. Using PyCarbon, an AI framework can read training data faster by leveraging CarbonData's indexing and caching abilities. Since CarbonData is a columnar store, AI developers can also apply projection and filtering to efficiently pick only the data required for training.
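The benefit of columnar projection and filtering for a training pipeline can be shown with a minimal sketch (hypothetical reader function, not the actual PyCarbon API): only the requested columns are materialized, and the filter keeps untouched columns from ever being read.

```python
# Conceptual sketch (hypothetical, not the PyCarbon API): projection and
# filter pushdown over a columnar store.

def read_columnar(columns_by_name, wanted, predicate):
    """Read only the requested columns, keeping rows matching the predicate."""
    pred_col, pred_fn = predicate
    # Evaluate the predicate against a single column...
    keep = [i for i, v in enumerate(columns_by_name[pred_col]) if pred_fn(v)]
    # ...then materialize only the wanted columns, only at surviving rows.
    return {name: [columns_by_name[name][i] for i in keep] for name in wanted}

store = {
    "image_path": ["a.png", "b.png", "c.png"],
    "label": [0, 1, 1],
    "metadata": ["...", "...", "..."],   # never touched below
}

batch = read_columnar(store, wanted=["image_path", "label"],
                      predicate=("label", lambda v: v == 1))
print(batch)  # {'image_path': ['b.png', 'c.png'], 'label': [1, 1]}
```

In a real columnar format the unrequested columns are skipped at the I/O level, which is where the training-data speedup comes from.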