pyflink.dataframe.dataframe.DataFrame.map

DataFrame.map(func, *, return_dtype=None, concurrency: int | None = None, num_gpus: float | None = None, gpu_type: str | None = None) → DataFrame

Apply a function to each row, producing a new DataFrame.

The function receives each row as a dict[str, value] and returns a dict[str, value]. This is a 1-to-1 transformation: each input row produces exactly one output row.

return_dtype can be omitted if the function has a TypedDict return type hint.
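To illustrate why a TypedDict return hint carries enough information to stand in for return_dtype, the sketch below reads the field names and Python types from the hint using only the standard library. It shows what is available for inspection and makes no claim about how PyFlink performs the inference internally; how Python types are mapped to DataType values is not covered here.

```python
from typing import TypedDict, get_type_hints


class Output(TypedDict):
    a: int
    b: str


def process(row) -> Output:
    return {"a": row["a"] + 1, "b": row["b"].upper()}


# Look up the declared return type of the function, then the
# fields declared on that TypedDict. Standard library only; this
# is an illustration, not PyFlink's inference code.
return_type = get_type_hints(process)["return"]
fields = get_type_hints(return_type)

print(fields)
```

A function without a return annotation has no "return" entry in get_type_hints, which corresponds to the case where return_dtype must be passed explicitly.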

Parameters:

func – A function (dict -> dict), or a DataFrameUDFWrapper.
return_dtype – Output schema as DataType.struct(…). Can be omitted if func has a TypedDict return hint.
concurrency – Optional parallelism for the operator that runs this UDF. UDFs with different concurrency values are split into separate operators.
num_gpus – Optional number of GPUs requested for this UDF (e.g., 0.5, 1.0). Each GPU UDF will run in its own operator.
gpu_type – GPU type (e.g., 'A10', 'V100').

Returns:

A new DataFrame with the transformed rows.

Example::

>>> # With explicit return_dtype
>>> df.map(lambda row: {"a": row["a"] + 1, "b": row["b"].upper()},
...        return_dtype=DataType.struct({"a": DataType.int64(),
...                                      "b": DataType.string()}))
>>>
>>> # With TypedDict return hint (auto-inferred)
>>> from typing import TypedDict
>>> class Output(TypedDict):
...     a: int
...     b: str
>>> def process(row) -> Output:
...     return {"a": row["a"] + 1, "b": row["b"].upper()}
>>> df.map(process)
>>>
>>> # With concurrency
>>> df.map(process, concurrency=4)
>>>
>>> # With GPU resources
>>> df.map(process, num_gpus=0.5, gpu_type='A10')
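Because func is an ordinary dict -> dict callable, it can be exercised on a few sample rows in plain Python before it is passed to DataFrame.map. The following is a minimal sketch with no PyFlink dependency: the sample rows are hypothetical, and the list comprehension only mimics the 1-to-1 row semantics, not the actual distributed execution.

```python
from typing import TypedDict, get_type_hints


class Output(TypedDict):
    a: int
    b: str


def process(row) -> Output:
    return {"a": row["a"] + 1, "b": row["b"].upper()}


# Hypothetical sample rows, shaped like the dicts that map passes to func.
sample_rows = [{"a": 1, "b": "x"}, {"a": 41, "b": "flink"}]

# One output dict per input dict, mirroring the 1-to-1 contract.
results = [process(row) for row in sample_rows]

# Check that every output row has exactly the keys declared on Output.
expected_keys = set(get_type_hints(Output))
assert len(results) == len(sample_rows)
assert all(set(out) == expected_keys for out in results)
```

A check like this catches mismatches between the returned keys and the declared TypedDict early, before the function runs inside an operator.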
