Usage: By default, the prestosql profile is selected for the Maven build. Users can enable the prestodb profile to build CarbonData for PrestoDB.
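As a sketch, profile selection follows standard Maven profile activation (the exact set of additional flags may vary by CarbonData version):

```shell
# Default build: the prestosql profile is active
mvn clean package -DskipTests

# Activate the prestodb profile instead to build for PrestoDB
mvn clean package -DskipTests -Pprestodb
```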
12) Insert into performance improvement
Use Case: Previously, the insert and load commands shared a common code flow, which added overhead to the insert command because features such as bad-record handling are not required by it.
The load and insert flows have now been separated, and additional optimizations have been applied to the insert command, such as:
1. Rearrange projections instead of rows.
2. Use Spark's InternalRow instead of the Row object.
These optimizations resulted in roughly a 40% insert performance improvement on TPC-H data.
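For illustration, the optimized path is taken by the plain SQL INSERT command, while LOAD DATA keeps the full-featured load flow (the table names below are hypothetical):

```scala
// Hypothetical tables: INSERT INTO goes through the separated, optimized
// insert flow; LOAD DATA retains bad-record handling and other load features.
spark.sql("CREATE TABLE insert_target (id INT, name STRING) STORED AS carbondata")
spark.sql("INSERT INTO insert_target SELECT id, name FROM staging_source")
```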
13) Optimize Bucket Table
Use Case: The bucketing feature distributes/organizes table or partition data into multiple files so that similar records land in the same file. Join operations normally cause a large volume of data shuffling, making them quite slow; this shuffle can be avoided when joining on bucket columns. Bucket tables have been made consistent with Spark's bucketing, improving join performance by avoiding the shuffle on the bucket column.
Example:
spark.sql("""CREATE TABLE bucket_table1 (stringField string, intField int, shortField short) STORED AS carbondata TBLPROPERTIES ('BUCKET_NUMBER'='10', 'BUCKET_COLUMNS'='stringField')""")
spark.sql("""CREATE TABLE bucket_table2 (stringField string, intField int, shortField short) STORED AS carbondata TBLPROPERTIES ('BUCKET_NUMBER'='10', 'BUCKET_COLUMNS'='stringField')""")
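With matching bucket specs on the join key, a join like the following can skip the shuffle (a sketch; inspect the physical plan to confirm the absence of Exchange nodes on the bucketed sides):

```scala
// Join on the bucket column stringField; since both tables share the same
// BUCKET_NUMBER and BUCKET_COLUMNS, Spark can plan the join without
// shuffling either side.
spark.sql("""
  SELECT t1.stringField, t1.intField, t2.shortField
  FROM bucket_table1 t1
  JOIN bucket_table2 t2 ON t1.stringField = t2.stringField
""").explain()
```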
For more usage examples, see: https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/spark/carbondata/bucketing/TableBucketingTestCase.scala
14) PyCarbon support
Use Case: CarbonData now provides a Python API (PyCarbon) for integration with AI frameworks such as TensorFlow, PyTorch, and MXNet. Using PyCarbon, AI frameworks can read training data faster by leveraging CarbonData's indexing and caching abilities. Since CarbonData is a columnar store, AI developers can also apply projection and filtering to efficiently pick only the data required for training.
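As a hypothetical sketch only (PyCarbon builds on Petastorm, and the exact entry point, parameter names, and dataset URL below are assumptions, not verified API):

```python
# Hypothetical PyCarbon usage: read a CarbonData dataset for training.
# make_reader, schema_fields, and the dataset path are assumed here.
from pycarbon.reader import make_reader

# Projection (schema_fields) pushes column selection down to CarbonData,
# so only the columns needed for training are read from storage.
with make_reader('file:///path/to/carbon/dataset',
                 schema_fields=['feature', 'label']) as reader:
    for row in reader:
        train_step(row.feature, row.label)  # train_step is a placeholder
```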