package dataobject
Type Members
- class DeltaLakeModulePlugin extends ModulePlugin
- case class DeltaLakeTableDataObject(id: DataObjectId, path: Option[String] = None, partitions: Seq[String] = Seq(), options: Map[String, String] = Map(), schemaMin: Option[GenericSchema] = None, table: Table, constraints: Seq[Constraint] = Seq(), expectations: Seq[Expectation] = Seq(), preReadSql: Option[String] = None, postReadSql: Option[String] = None, preWriteSql: Option[String] = None, postWriteSql: Option[String] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, allowSchemaEvolution: Boolean = false, retentionPeriod: Option[Int] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalTableDataObject with CanMergeDataFrame with CanEvolveSchema with CanHandlePartitions with HasHadoopStandardFilestore with ExpectationValidation with CanCreateIncrementalOutput with Product with Serializable
DataObject of type DeltaLakeTableDataObject. Provides an Action with the details needed to access tables in Delta format.
The Delta format maintains a transaction log in a separate _delta_log subfolder. The schema is registered in the Metastore by DeltaLakeTableDataObject.
The following anomalies might occur:
- Table is registered in the Metastore but the path does not exist -> the table is dropped from the Metastore.
- Table is registered in the Metastore but the path is empty -> an error is thrown. Delete the path to clean up.
- Table is registered and the path contains parquet files, but the _delta_log subfolder is missing -> the path is converted to Delta format.
- Table is not registered but the path contains parquet files and a _delta_log subfolder -> the table is registered.
- Table is not registered but the path contains parquet files without a _delta_log subfolder -> the path is converted to Delta format and the table is registered.
- Table is not registered and the path does not exist -> the table is created on write.
The conversion cases rely on Delta Lake's convert-to-delta operation, sketched after this list.
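A minimal sketch of the convert-to-delta operation using the public delta-spark API; the paths and partition column are hypothetical, and SDL's internal call may differ in detail:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Convert a plain parquet directory in place; this creates the _delta_log subfolder.
DeltaTable.convertToDelta(spark, "parquet.`/data/my_table`")

// For a partitioned layout, the partition schema must be given explicitly.
DeltaTable.convertToDelta(spark, "parquet.`/data/my_partitioned_table`", "dt STRING")
```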
DeltaLakeTableDataObject implements:
- CanMergeDataFrame by using the DeltaTable.merge API.
- CanEvolveSchema by using the mergeSchema option.
- Overwriting partitions, implemented with the replaceWhere option in one transaction.
These mechanisms are sketched below.
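The following sketch shows the underlying Delta Lake mechanisms using the public delta-spark API. The table name, source DataFrame, join key id, and partition column dt are hypothetical; SDL drives these operations internally from the configured table and saveMode.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().getOrCreate()
val updatesDf: DataFrame = spark.table("db.staging_updates") // hypothetical source

// CanMergeDataFrame: upsert via the DeltaTable.merge API
DeltaTable.forName(spark, "db.my_table").as("t")
  .merge(updatesDf.as("s"), "t.id = s.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

// CanEvolveSchema: the mergeSchema option adds new columns on write
updatesDf.write.format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("db.my_table")

// Partition overwrite: replaceWhere replaces only the matching partitions in one transaction
updatesDf.write.format("delta")
  .mode("overwrite")
  .option("replaceWhere", "dt = '2024-01-01'")
  .saveAsTable("db.my_table")
```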
- id
Unique name of this data object
- path
Optional hadoop directory for this table. If path is not defined, the table is handled as a managed table. If the path doesn't contain a scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define a scheme and authority, the default scheme and authority are applied.
- partitions
Partition columns for this data object
- options
Options for Delta Lake tables. See https://docs.delta.io/latest/delta-batch.html and org.apache.spark.sql.delta.DeltaOptions.
- schemaMin
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing. Define the schema with a DDL-formatted string, i.e. a comma-separated list of field definitions, e.g., a INT, b STRING.
- table
DeltaLake table to be written by this output
- constraints
List of row-level Constraints to enforce when writing to this data object.
- expectations
List of Expectations to enforce when writing to this data object. Expectations are checks based on aggregates over all rows of a dataset.
- preReadSql
SQL statement to be executed in the exec phase before reading the input table. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.
- postReadSql
SQL statement to be executed in the exec phase after reading the input table and before the action is finished. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.
- preWriteSql
SQL statement to be executed in the exec phase before writing the output table. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.
- postWriteSql
SQL statement to be executed in the exec phase after writing the output table. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.
- saveMode
SDLSaveMode to use when writing files; default is Overwrite. Currently Overwrite, Append and Merge are supported.
- allowSchemaEvolution
If set to true, schema evolution is applied automatically when writing to this DataObject with a different schema; otherwise SDL stops with an error.
- retentionPeriod
Optional Delta Lake retention threshold in hours. Files required to read table versions younger than retentionPeriod are preserved; all others are deleted (see the vacuum sketch after this parameter list).
- acl
Override the connection permissions for files created in the table's hadoop directory with this connection
- connectionId
Optional id of an io.smartdatalake.workflow.connection.HiveTableConnection
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false, e.g. elements['yourColName'] > 2017. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write, e.g. to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- metadata
Metadata of this data object
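retentionPeriod corresponds to Delta Lake's vacuum semantics. A minimal sketch of the underlying operation (the table name is hypothetical; SDL's exact invocation may differ):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Remove files no longer required to read table versions younger than the threshold.
// The argument is the retention threshold in hours, e.g. 168 hours = 7 days.
DeltaTable.forName(spark, "db.my_table").vacuum(168)
```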
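For illustration, a minimal programmatic construction of this DataObject. This is a sketch only: SDL normally instantiates DataObjects from HOCON configuration via the companion object's FromConfigFactory, the import paths and Table signature should be verified against your SDL version, and the id, table, and partition column are hypothetical.

```scala
import io.smartdatalake.config.InstanceRegistry
import io.smartdatalake.config.SdlConfigObject.DataObjectId
import io.smartdatalake.definitions.SDLSaveMode
import io.smartdatalake.workflow.dataobject.{DeltaLakeTableDataObject, Table}

implicit val instanceRegistry: InstanceRegistry = new InstanceRegistry

val dataObject = DeltaLakeTableDataObject(
  id = DataObjectId("my-delta-table"),       // unique name of this data object
  table = Table(db = Some("default"), name = "my_table",
    primaryKey = Some(Seq("id"))),           // merge requires a primary key on the table
  partitions = Seq("dt"),                    // hypothetical partition column
  saveMode = SDLSaveMode.Merge,              // uses the DeltaTable.merge API on write
  allowSchemaEvolution = true                // evolve the schema instead of failing
)
```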
Value Members
- object DeltaLakeTableDataObject extends FromConfigFactory[DataObject] with Serializable