DataFrame#

A DataFrame is the core abstraction of the PyFlink DataFrame API. It wraps a PyFlink Table and provides a pandas-like interface for data transformations. Every transformation method returns a new DataFrame instead of modifying the original, so calls can be chained (an immutable, fluent API).

Example:

>>> import pyflink.dataframe as pf
>>> df = pf.from_dict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>> df.select("a", "b").filter(pf.col("a") > 1).collect()
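The "returns a new DataFrame" behaviour can be illustrated with a small pure-Python model. `ToyFrame` is a hypothetical stand-in written only for this sketch. It is not part of PyFlink and does not reflect how the library is implemented; it only shows that each chained call produces a new object and leaves the original untouched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not the PyFlink implementation."""

    rows: Tuple[Row, ...]

    def select(self, *columns: str) -> "ToyFrame":
        # Build a new frame that keeps only the requested columns.
        return ToyFrame(tuple({c: r[c] for c in columns} for r in self.rows))

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyFrame":
        # Build a new frame that keeps only rows matching the predicate.
        return ToyFrame(tuple(r for r in self.rows if predicate(r)))

    def collect(self) -> List[Row]:
        # Materialize the rows as a list of fresh dicts.
        return [dict(r) for r in self.rows]


base = ToyFrame(({"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"}))

# Each call returns a new ToyFrame, so the calls chain left to right.
result = base.select("a").filter(lambda r: r["a"] > 1).collect()
```

After the chain runs, `base` still holds all three original rows with both columns, which is the property that makes intermediate DataFrames safe to reuse.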

DataFrame.select(*columns, **projections)

Select columns from the DataFrame.

DataFrame.with_column(name, expr)

Add a new column or replace an existing column.

DataFrame.with_columns(*exprs, **named_exprs)

Add multiple columns or replace existing columns.

DataFrame.drop_columns(*columns[, strict])

Drop columns from the DataFrame.

DataFrame.filter(predicate, **constraints)

Filter rows based on a predicate.

DataFrame.rename_columns(*args[, mapping])

Rename columns.

DataFrame.group_by(*columns)

Group the DataFrame by columns.

DataFrame.agg(*aggs)

Apply aggregation expressions to the entire DataFrame.

DataFrame.join(other[, on, how, left_on, ...])

Join with another DataFrame.

DataFrame.map(func, *[, return_dtype, ...])

Apply a function to each row, producing a new DataFrame.

DataFrame.map_batches(func, *[, ...])

Apply a function to batches of rows, producing a new DataFrame.

DataFrame.pipe(func, *args, **kwargs)

Apply a chainable function to the DataFrame.
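A minimal sketch of what a `pipe`-style method usually does, assuming it follows the pandas convention of passing the frame as the first argument to `func` and forwarding the remaining arguments. `ToyFrame`, `keep_above`, and `scale` are hypothetical names invented for this illustration, not part of the PyFlink API.

```python
from typing import Any, Callable, List


class ToyFrame:
    """Hypothetical stand-in for a DataFrame holding a single column of ints."""

    def __init__(self, values: List[int]) -> None:
        self.values = list(values)

    def pipe(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Assumed semantics: call func with this frame first, then the extras.
        return func(self, *args, **kwargs)


def keep_above(frame: ToyFrame, threshold: int) -> ToyFrame:
    # Reusable step: keep values strictly greater than the threshold.
    return ToyFrame([v for v in frame.values if v > threshold])


def scale(frame: ToyFrame, *, factor: int) -> ToyFrame:
    # Reusable step: multiply every value by a factor.
    return ToyFrame([v * factor for v in frame.values])


# Positional and keyword arguments are both forwarded to the piped function.
piped = ToyFrame([1, 2, 3, 4]).pipe(keep_above, 2).pipe(scale, factor=10)
```

Written this way, user-defined steps read in the same left-to-right order as the built-in methods instead of as nested function calls.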

DataFrame.limit(n)

Limit the DataFrame to the first n rows.

DataFrame.offset(n)

Skip the first n rows of the DataFrame.
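Combining `offset` and `limit` is the usual way to take a page of rows. The sketch below models that idea on a plain Python iterable with `itertools.islice`; the helper functions are hypothetical and only mirror the one-line descriptions above, not the library's execution behaviour.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def offset(rows: Iterable[T], n: int) -> Iterator[T]:
    # Skip the first n rows and yield the rest.
    return islice(rows, n, None)


def limit(rows: Iterable[T], n: int) -> Iterator[T]:
    # Yield at most the first n rows.
    return islice(rows, n)


rows = range(1, 11)  # ten rows: 1, 2, ..., 10

# Skip four rows, then keep the next three.
page = list(limit(offset(rows, 4), 3))
```

Here `page` holds the fifth, sixth, and seventh rows.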

DataFrame.head(n)

Return the first n rows as a new DataFrame.

DataFrame.to_table()

Convert to a PyFlink Table.

DataFrame.to_pandas()

Convert to a pandas DataFrame.

DataFrame.collect()

Collect the DataFrame results into a list.

DataFrame.iter_rows()

Return an iterator over rows of the DataFrame.

DataFrame.iter_batches(*[, batch_size])

Return an iterator of pandas DataFrames, each containing up to batch_size rows.
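The "up to batch_size rows" wording implies that the last batch may be shorter than the others. A minimal sketch of that chunking behaviour, using plain lists in place of pandas DataFrames; `iter_chunks` is a hypothetical helper for illustration, not the library's implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        # Pull the next batch_size rows; fewer remain only at the end.
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


# Seven rows in batches of three: two full batches and one short one.
batches = list(iter_chunks(range(7), 3))
```

Because batches are produced lazily, a consumer only needs memory for one batch at a time.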

DataFrame.write_parquet(path, *[, mode, ...])

Write the DataFrame to Parquet file(s) at the given path.

DataFrame.write_kafka(bootstrap_servers, *)

Write the DataFrame to a Kafka topic.

DataFrame.write_odps(endpoint, *[, ...])

Write the DataFrame to a MaxCompute (ODPS) table.

DataFrame.write_paimon([path, primary_key, ...])

Write the DataFrame to a Paimon table.

DataFrame.write_sls(endpoint, *, project, ...)

Write the DataFrame to an SLS (Aliyun Log Service) logstore.

DataFrame.write_hologres(endpoint, *, ...[, ...])

Write the DataFrame to a Hologres table.

DataFrame.write_custom(connector, *[, ...])

Write the DataFrame using a custom connector.

DataFrame.explain(*[, show_estimated_cost, ...])

Print the abstract syntax tree (AST) and execution plan of this DataFrame.

DataFrame.schema

Get the schema of the DataFrame.

DataFrame.columns

Get the column names.

DataFrame.drop_null([subset])

Drop rows containing null values.

DataFrame.drop_nan([subset])

Drop rows containing NaN values in float columns.

DataFrame.fill_null(value[, subset])

Replace null values with a given value.

DataFrame.fill_nan(value[, subset])

Replace NaN values with a given value in float columns.
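The API treats null and NaN separately: a null is a missing value, while NaN is a valid floating-point value meaning "not a number". The pure-Python sketch below illustrates the distinction, with `None` standing in for null. The helper functions are hypothetical, and the "drop a row if any selected column is null" rule is an assumption based on the descriptions above.

```python
import math
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[float]]


def is_nan(value: object) -> bool:
    # NaN only exists for floats; None (null) is not NaN.
    return isinstance(value, float) and math.isnan(value)


def drop_null(rows: List[Row], subset: Optional[Sequence[str]] = None) -> List[Row]:
    # Assumed rule: drop a row if any selected column is null.
    return [r for r in rows if all(r[c] is not None for c in (subset or r.keys()))]


def fill_nan(rows: List[Row], value: float,
             subset: Optional[Sequence[str]] = None) -> List[Row]:
    # Replace NaN with value in the selected columns; nulls are left alone.
    return [
        {c: (value if (subset is None or c in subset) and is_nan(v) else v)
         for c, v in r.items()}
        for r in rows
    ]


rows = [
    {"x": 1.0, "y": None},          # contains a null
    {"x": float("nan"), "y": 2.0},  # contains a NaN, but no null
    {"x": 3.0, "y": 4.0},
]

without_nulls = drop_null(rows)       # the NaN row survives this step
filled = fill_nan(without_nulls, 0.0)
```

Dropping nulls does not remove the NaN row, so cleaning a float column generally needs both a null step and a NaN step.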

DataFrame.llm

Access LLM / AI functions.

GroupedDataFrame#

A DataFrame whose rows have been grouped by a set of keys.

GroupedDataFrame.agg(*aggs)

Apply aggregation expressions.
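The difference between aggregating a grouped DataFrame and aggregating the entire DataFrame can be sketched in plain Python. The helpers `group_by` and `agg_sum`, and the `v_sum` output column name, are hypothetical choices for this illustration and say nothing about how the library names or computes its results.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]

rows: List[Row] = [
    {"k": "a", "v": 1},
    {"k": "b", "v": 2},
    {"k": "a", "v": 3},
]


def group_by(rows: List[Row], key: str) -> Dict[object, List[Row]]:
    # Bucket rows by the value of the grouping key.
    groups: Dict[object, List[Row]] = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return dict(groups)


def agg_sum(groups: Dict[object, List[Row]], key: str, column: str) -> List[Row]:
    # Produce one output row per group with the sum of the chosen column.
    return [
        {key: k, f"{column}_sum": sum(r[column] for r in members)}
        for k, members in groups.items()
    ]


# Grouped aggregation: one result row per distinct key.
totals = agg_sum(group_by(rows, "k"), "k", "v")

# Whole-frame aggregation: a single value over all rows.
whole = sum(r["v"] for r in rows)
```

Grouped aggregation yields one row per key (`a` and `b` here), while whole-frame aggregation collapses all rows into a single result.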

Expressions#

Helper functions to create column references and literal expressions.

col(name)

Create a column reference expression.

lit(value[, data_type])

Create a literal expression.