pyflink.dataframe.io.read_parquet#

read_parquet(path: str, *, schema: Dict[str, DataType] | None = None, columns: List[str] | None = None, monitor_interval: str | None = None, path_regex_pattern: str | None = None) → DataFrame[source]#

Read a Parquet file or directory into a DataFrame.

Uses Flink’s FileSystem Connector with Parquet Format under the hood.

Parameters:

path – Path to a Parquet file, directory, or glob pattern. Supports local paths and any Flink-supported file system (HDFS, OSS, S3, etc.).
schema – Dict of {column_name: DataType} specifying the schema. This parameter is required.
columns – Optional list of column names to read (projection pushdown). If None, all columns are read.
monitor_interval – Optional interval for continuously monitoring the directory for new files (e.g., ’10s’, ‘1min’). If None, the path is scanned once (bounded source). If set, the source becomes unbounded (streaming).
path_regex_pattern – Optional regex pattern to filter files. The pattern is matched against each file’s absolute path. Only files whose path matches the pattern will be read. This is useful when reading from a directory that contains mixed file types or when you want to select a subset of files based on naming conventions.

Returns:

A new DataFrame.

Example::

>>> import pyflink.dataframe as pf
>>>
>>> # Read with explicit schema
>>> df = pf.read_parquet("/path/to/data.parquet", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... })
>>>
>>> # Read only specific columns
>>> df = pf.read_parquet("/path/to/data.parquet", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
...     "value": pf.DataType.float64(),
... }, columns=["id", "name"])
>>>
>>> # Continuously monitor directory for new files
>>> df = pf.read_parquet("/path/to/dir", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... }, monitor_interval="10s")
>>>
>>> # Filter files by regex pattern (matched against absolute path)
>>> df = pf.read_parquet("/path/to/dir", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... }, path_regex_pattern="/path/to/dir/part-.*\.parquet")
>>>
>>> # Only read files from specific date partitions
>>> df = pf.read_parquet("/path/to/dir", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... }, path_regex_pattern=".*/dt=2024-01-0[1-3]/.*")

pyflink.dataframe.io.read_parquet#

This Page