PyFlink 1.20+vvr.11.7.dev0 documentation

pyflink.dataframe.io.read_parquet

read_parquet(path: str, *, schema: Dict[str, DataType] | None = None, columns: List[str] | None = None, monitor_interval: str | None = None, path_regex_pattern: str | None = None) → DataFrame

Read a Parquet file or directory into a DataFrame.

Uses Flink’s FileSystem Connector with Parquet Format under the hood.
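As a rough illustration of what "under the hood" could look like, the sketch below builds a CREATE TABLE statement for the open-source Flink filesystem connector with the Parquet format. The helper name `filesystem_parquet_ddl`, the table name `parquet_source`, and the mapping of `monitor_interval` and `path_regex_pattern` onto the connector options `source.monitor-interval` and `source.path.regex-pattern` are assumptions made for illustration; this page does not document how read_parquet translates its arguments internally.

```python
from typing import Dict, Optional


def filesystem_parquet_ddl(
    table_name: str,
    path: str,
    columns: Dict[str, str],
    monitor_interval: Optional[str] = None,
    path_regex_pattern: Optional[str] = None,
) -> str:
    """Build a Flink SQL DDL string for a Parquet filesystem source.

    Hypothetical helper: it only formats text and does not call Flink.
    """
    column_defs = ",\n".join(
        f"  `{name}` {sql_type}" for name, sql_type in columns.items()
    )

    # Options every filesystem/Parquet table needs.
    options = {"connector": "filesystem", "path": path, "format": "parquet"}

    # Assumed mapping of the optional read_parquet arguments.
    if monitor_interval is not None:
        options["source.monitor-interval"] = monitor_interval
    if path_regex_pattern is not None:
        options["source.path.regex-pattern"] = path_regex_pattern

    # Single quotes inside SQL string literals are escaped by doubling them.
    option_defs = ",\n".join(
        "  '{}' = '{}'".format(key, value.replace("'", "''"))
        for key, value in options.items()
    )
    return (
        f"CREATE TABLE `{table_name}` (\n{column_defs}\n) "
        f"WITH (\n{option_defs}\n)"
    )


ddl = filesystem_parquet_ddl(
    "parquet_source",
    "/path/to/dir",
    {"id": "BIGINT", "name": "STRING"},
    monitor_interval="10s",
)
print(ddl)
```

The resulting string could be passed to a Table API `execute_sql` call in a plain PyFlink job. With read_parquet itself, none of this needs to be written by hand.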

Parameters:
  • path – Path to a Parquet file, directory, or glob pattern. Supports local paths and any Flink-supported file system (HDFS, OSS, S3, etc.).

  • schema – Dict of {column_name: DataType} specifying the schema. This parameter is required, even though the signature declares a default of None.

  • columns – Optional list of column names to read (projection pushdown). If None, all columns are read.

  • monitor_interval – Optional interval for continuously monitoring the directory for new files (e.g., '10s', '1min'). If None, the path is scanned once (bounded source). If set, the source becomes unbounded (streaming).

  • path_regex_pattern – Optional regex pattern to filter files. The pattern is matched against each file’s absolute path. Only files whose path matches the pattern will be read. This is useful when reading from a directory that contains mixed file types or when you want to select a subset of files based on naming conventions.

Returns:

A new DataFrame.
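Because path_regex_pattern is matched against each file's absolute path, it can help to try a pattern on a few sample paths before running a job. The sketch below does this with Python's `re` module. It assumes the pattern must match the entire path (hence `re.fullmatch`), and that the pattern avoids constructs on which Python and Java regular expressions differ. The helper name `preview_matching_files` and the sample paths are hypothetical.

```python
import re
from typing import Iterable, List


def preview_matching_files(paths: Iterable[str], pattern: str) -> List[str]:
    """Return the paths that the pattern matches in full.

    Hypothetical helper for local experimentation; it does not touch Flink.
    """
    compiled = re.compile(pattern)
    return [path for path in paths if compiled.fullmatch(path)]


# Sample absolute paths, invented for illustration.
candidate_paths = [
    "/path/to/dir/dt=2024-01-01/part-0.parquet",
    "/path/to/dir/dt=2024-01-03/part-1.parquet",
    "/path/to/dir/dt=2024-01-04/part-0.parquet",
    "/path/to/dir/_SUCCESS",
]

# A raw string keeps backslashes and other regex syntax intact.
selected = preview_matching_files(candidate_paths, r".*/dt=2024-01-0[1-3]/.*")
print(selected)
# ['/path/to/dir/dt=2024-01-01/part-0.parquet',
#  '/path/to/dir/dt=2024-01-03/part-1.parquet']
```

A pattern without the leading `.*` would select nothing under the full-match assumption, because the absolute path starts with `/path/to/dir`.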

Example:
>>> import pyflink.dataframe as pf
>>>
>>> # Read with explicit schema
>>> df = pf.read_parquet("/path/to/data.parquet", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... })
>>>
>>> # Read only specific columns
>>> df = pf.read_parquet("/path/to/data.parquet", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
...     "value": pf.DataType.float64(),
... }, columns=["id", "name"])
>>>
>>> # Continuously monitor directory for new files
>>> df = pf.read_parquet("/path/to/dir", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... }, monitor_interval="10s")
>>>
>>> # Filter files by regex pattern (matched against absolute path)
>>> df = pf.read_parquet("/path/to/dir", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... }, path_regex_pattern=r"/path/to/dir/part-.*\.parquet")
>>>
>>> # Only read files from specific date partitions
>>> df = pf.read_parquet("/path/to/dir", schema={
...     "id": pf.DataType.int64(),
...     "name": pf.DataType.string(),
... }, path_regex_pattern=".*/dt=2024-01-0[1-3]/.*")

