DataFrame#

A DataFrame is the core abstraction of the PyFlink DataFrame API. It wraps a PyFlink Table and provides a pandas-like interface for data transformations. All transformation methods return a new DataFrame (immutable/fluent API).

Example:

>>> import pyflink.dataframe as pf
>>> df = pf.from_dict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>> df.select("a", "b").filter(pf.col("a") > 1).collect()

`DataFrame.select`(columns, *projections)	Select columns from the DataFrame.
`DataFrame.with_column`(name, expr)	Add a new column or replace an existing column.
`DataFrame.with_columns`(exprs, *named_exprs)	Add multiple columns or replace existing columns.
`DataFrame.drop_columns`(*columns[, strict])	Drop columns from the DataFrame.
`DataFrame.filter`(predicate, **constraints)	Filter rows based on a predicate.
`DataFrame.rename_columns`(*args[, mapping])	Rename columns.
`DataFrame.group_by`(*columns)	Group the DataFrame by columns.
`DataFrame.agg`(*aggs)	Apply aggregation expressions to the entire DataFrame.
`DataFrame.join`(other[, on, how, left_on, ...])	Join with another DataFrame.
`DataFrame.map`(func, *[, return_dtype, ...])	Apply a function to each row, producing a new DataFrame.
`DataFrame.map_batches`(func, *[, ...])	Apply a function to batches of rows, producing a new DataFrame.
`DataFrame.pipe`(func, args, *kwargs)	Apply a chainable function to the DataFrame.
`DataFrame.limit`(n)	Limit the DataFrame to the first n rows.
`DataFrame.offset`(n)	Skip the first n rows of the DataFrame.
`DataFrame.head`(n)	Return the first n rows as a new DataFrame.
`DataFrame.to_table`()	Convert to a PyFlink Table.
`DataFrame.to_pandas`()	Convert to a Pandas DataFrame.
`DataFrame.collect`()	Collect the DataFrame results to a list.
`DataFrame.iter_rows`()	Return an iterator over rows of the DataFrame.
`DataFrame.iter_batches`(*[, batch_size])	Return an iterator of pandas DataFrames, each containing up to batch_size rows.
`DataFrame.write_parquet`(path, *[, mode, ...])	Write the DataFrame to Parquet file(s) at the given path.
`DataFrame.write_kafka`(bootstrap_servers, *)	Write the DataFrame to a Kafka topic.
`DataFrame.write_odps`(endpoint, *[, ...])	Write the DataFrame to a MaxCompute (ODPS) table.
`DataFrame.write_paimon`([path, primary_key, ...])	Write the DataFrame to a Paimon table.
`DataFrame.write_sls`(endpoint, *, project, ...)	Write the DataFrame to an SLS (Aliyun Log Service) logstore.
`DataFrame.write_hologres`(endpoint, *, ...[, ...])	Write the DataFrame to a Hologres table.
`DataFrame.write_custom`(connector, *[, ...])	Write the DataFrame using a custom connector.
`DataFrame.explain`(*[, show_estimated_cost, ...])	Print the AST and execution plan of this DataFrame.
`DataFrame.schema`	Get the schema of the DataFrame.
`DataFrame.columns`	Get the column names.
`DataFrame.drop_null`([subset])	Drop rows containing null values.
`DataFrame.drop_nan`([subset])	Drop rows containing NaN values in float columns.
`DataFrame.fill_null`(value[, subset])	Replace null values with a given value.
`DataFrame.fill_nan`(value[, subset])	Replace NaN values with a given value in float columns.
`DataFrame.llm`	Access LLM / AI functions.

GroupedDataFrame#

A DataFrame that has been grouped on a set of grouping keys.

GroupedDataFrame.agg(*aggs)

Apply aggregation expressions.

Expressions#

Helper functions to create column references and literal expressions.

`col`(name)	Create a column reference expression.
`lit`(value[, data_type])	Create a literal expression.

DataFrame#

DataFrame#

GroupedDataFrame#

Expressions#

This Page