Packages

package dataobject

Ordering
  1. Alphabetic
Visibility
  1. Public
  2. Protected

Type Members

  1. class DeltaLakeModulePlugin extends ModulePlugin
  2. case class DeltaLakeTableDataObject(id: DataObjectId, path: Option[String] = None, partitions: Seq[String] = Seq(), options: Map[String, String] = Map(), schemaMin: Option[GenericSchema] = None, table: Table, constraints: Seq[Constraint] = Seq(), expectations: Seq[Expectation] = Seq(), preReadSql: Option[String] = None, postReadSql: Option[String] = None, preWriteSql: Option[String] = None, postWriteSql: Option[String] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, allowSchemaEvolution: Boolean = false, retentionPeriod: Option[Int] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalTableDataObject with CanMergeDataFrame with CanEvolveSchema with CanHandlePartitions with HasHadoopStandardFilestore with ExpectationValidation with CanCreateIncrementalOutput with Product with Serializable

    DataObject of type DeltaLakeTableDataObject. Provides details to access Tables in delta format to an Action.

    Delta format maintains a transaction log in a separate _delta_log subfolder. The schema is registered in Metastore by DeltaLakeTableDataObject.

    The following anomalies might occur:
      - Table is registered in the metastore but the path does not exist -> the table is dropped from the metastore.
      - Table is registered in the metastore but the path is empty -> an error is thrown; delete the path to clean up.
      - Table is registered and the path contains parquet files, but the _delta_log subfolder is missing -> the path is converted to delta format.
      - Table is not registered but the path contains parquet files and a _delta_log subfolder -> the table is registered.
      - Table is not registered but the path contains parquet files without a _delta_log subfolder -> the path is converted to delta format and the table is registered.
      - Table is not registered and the path does not exist -> the table is created on write.

    DeltaLakeTableDataObject implements:
      - CanMergeDataFrame by using the DeltaTable.merge API.
      - CanEvolveSchema by using the mergeSchema option.
      - Overwriting partitions via the replaceWhere option in one transaction.
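    The merge behavior listed above can be sketched with the Delta Lake API directly. This is a minimal illustration of the DeltaTable.merge pattern, not SDL's actual implementation; the table path and the join key "id" are illustrative assumptions.

    ```scala
    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Sketch of the merge pattern behind CanMergeDataFrame.
    // Path and join condition are illustrative assumptions.
    def mergeIntoDeltaTable(spark: SparkSession, updates: DataFrame): Unit = {
      DeltaTable.forPath(spark, "/data/my_table")
        .as("existing")
        .merge(updates.as("updates"), "existing.id = updates.id")
        .whenMatched().updateAll()     // update rows whose key already exists
        .whenNotMatched().insertAll()  // insert rows with new keys
        .execute()
    }
    ```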

    id

    unique name of this data object

    path

    Optional hadoop directory for this table. If path is not defined, the table is handled as a managed table. If the path doesn't contain a scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.

    partitions

    partition columns for this data object

    options

    Options for Delta Lake tables. See https://docs.delta.io/latest/delta-batch.html and org.apache.spark.sql.delta.DeltaOptions.

    schemaMin

    An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing. Define schema by using a DDL-formatted string, which is a comma separated list of field definitions, e.g., a INT, b STRING.
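    As a sketch of what such a DDL-formatted string corresponds to, Spark can parse it into a schema with StructType.fromDDL. Note that in SDL the schemaMin value is a GenericSchema; the snippet only illustrates the DDL format itself.

    ```scala
    import org.apache.spark.sql.types.StructType

    // Parse a DDL-formatted schema string, as used for schemaMin.
    val minimalSchema: StructType = StructType.fromDDL("a INT, b STRING")
    // The parsed schema has two fields: "a" (integer) and "b" (string).
    ```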

    table

    DeltaLake table to be written by this output

    constraints

    List of row-level Constraints to enforce when writing to this data object.

    expectations

    List of Expectations to enforce when writing to this data object. Expectations are checks based on aggregates over all rows of a dataset.

    preReadSql

    SQL-statement to be executed in exec phase before reading input table. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.

    postReadSql

    SQL-statement to be executed in exec phase after reading input table and before action is finished. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.

    preWriteSql

    SQL-statement to be executed in exec phase before writing output table. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.

    postWriteSql

    SQL-statement to be executed in exec phase after writing output table. If the catalog and/or schema are not explicitly defined, the ones present in the configured "table" object are used.

    saveMode

    SDLSaveMode to use when writing files, default is "overwrite". Overwrite, Append and Merge are supported for now.

    allowSchemaEvolution

    If set to true, schema evolution will automatically occur when writing to this DataObject with a different schema; otherwise SDL will stop with an error.

    retentionPeriod

    Optional delta lake retention threshold in hours. Files required by the table for reading versions younger than retentionPeriod will be preserved; the rest will be deleted.

    acl

    override connection permissions for files created in the table's hadoop directory with this connection

    connectionId

    optional id of io.smartdatalake.workflow.connection.HiveTableConnection

    expectedPartitionsCondition

    Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.

    housekeepingMode

    Optional definition of a housekeeping mode applied after every write. E.g. it can be used to cleanup, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.

    metadata

    metadata of this data object

    Annotations
    @Scaladoc()
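    Given the constructor signature above, a programmatic instantiation might look like the following minimal sketch. The id, path, partition column, database and table names are illustrative assumptions, and in practice this data object is usually defined via the SDL configuration rather than constructed directly.

    ```scala
    // All names below are illustrative assumptions.
    implicit val instanceRegistry: InstanceRegistry = new InstanceRegistry()

    // A partitioned Delta table written with merge semantics and
    // automatic schema evolution enabled.
    val deltaTable = DeltaLakeTableDataObject(
      id = DataObjectId("myDeltaTable"),
      path = Some("/data/my_delta_table"),
      partitions = Seq("dt"),
      table = Table(db = Some("default"), name = "my_table"),
      saveMode = SDLSaveMode.Merge,
      allowSchemaEvolution = true
    )
    ```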

Value Members

  1. object DeltaLakeTableDataObject extends FromConfigFactory[DataObject] with Serializable
