As you begin to handle Parquet data with tools in more than one framework and language you’ll probably wonder how all these related pieces fit together. Here is a summary of data formats, libraries and frameworks you will encounter when working with Parquet data and Spark.
First we’ll look at data formats. Then specific libraries and frameworks that let you use these formats and how they are used together.
Data Formats, Protocols and Schemas
There are two basic types of data formats: Record-oriented and column-oriented. More on column-oriented formats. Record-oriented formats and protocols may specialize in transporting large record batches, serializing to disk or streaming to and from multiple sources.
Some formats were designed with serialization to disk in mind (Parquet, Feather) others for efficient in-memory representation of data (Arrow) still others are meant for exchanging data across networks (Thrift, Avro, Flight.)
Here are formats and protocols you may encounter
|Parquet||columnar||On disk format, has schema with physical and logical data types. Designed for efficient storage and fast reads.|
|Arrow||columnar||In-memory format, has schema with physical and logical types, different data types from Parquet but closely compatible|
|Avro||records||Serialization format for records, in-memory, streaming. Designed for schema evolution. Part of the Hadoop ecosystem. Use in application code; requires schema conversion to/from Parquet types, for instance optional primitive types must be marked “nullable”|
|Feather||columnar||Serialization format for Arrow, allows subset of Arrow types for flat schema, store on disk|
|Pandas Dataframes||Collections of records||in memory; schema will change some Arrow data types converting to/from Arrow tables. Use types_mapper() to effect custom conversion to Pandas types, such as with nullable types|
|Thrift||records||Serialization format; use over network|
|Flight||collections of records||Streaming data protocol built on Protobuf; get and send Arrow record batches over network. Part of the Arrow project.|
|Orc||columnar||File format For Hadoop, has indexing, unlike Parquet|
- Parquet format stores data grouped by columns not records. Parquet is intended for efficient storage and fast read There are many implementations of Parquet in many languages.
- Apache Arrow is an in-memory columnar representation of data. This data could have originated from Parquet, Orc or other on-disk columnar data, or it may have been built programmatically. The Arrow project comes with its own Parquet libraries. Arrow has many language bindings. It is implemented in only one project.
- Avro is a language independent serialization format in the Hadoop ecosystem, can be used with multiple data sources. Designed to be robust to schema changes. Can be persisted to disk in addition to RPC. Schema described in JSON.Has language support for Java, Ruby, C++, C#.
- Flight is a batch data transfer system designed to move data stored in Arrow between systems without the need to serialize and deserialize data. Can move data in parallel to a cluster.
Languages and Frameworks
Record Data and Columnar Data Interchange
If possible, when performance matters, it’s best to minimize converting between column and record oriented representations of data.
For some applications you need to transform columnar data into more conventional records, or build columnar representation of data from conventional record-based data. So your source or target could be Parquet but you need to bring another format into the picture as well.
Difficulties may occur when transforming columnar data to record based and vice-versa when data types in the columnar and record serialization libraries aren’t a perfect match. You may for instance have generic Avro records created with Java or C++, using their schema to define a Parquet schema, then saving the data as Parquet for fast analytical work on that data later on. Then you might read that Parquet data with PyArrow or Spark.
Common Combinations of Libraries, Frameworks and Languages
Most libraries and frameworks here support both reading and writing Parquet.
I’ve tried to represent how various formats and libraries relate to one another. There’s a certain loss of pricision here when summarizing at such a high level. The idea is that the “Parquet” on the left side can be manipulated by the language or framework on the right side with the help of intermediate languages and formats between the two, and that, for the most part you can accept data from Parquet and create it as well. Hence the arrows going both directions.
parquet ⇐⇒ Arrow ⇐⇒ PyArrow ⇐⇒ Pandas
- Multithreaded read / decompress from disk by default, uses all available CPUs
- Can read multipart Parquet datasets,
- Arrow to Pandas data types conversion may require some attention See this recent PyArrow pull request
Parquet ⇐⇒ FastParquet
Parquet ⇐⇒ Arrow ⇐⇒ PySpark (Spark >= 2.3.x)
Spark (Java / Scala: / JVM)
Parquet ⇐⇒ RDDs ⇐rarr; Data Frames, Datasets
- Default conversion to/from Parquet makes multipart Parquet datasets, sets columns as optional on write (nullable) Writes multi part Parquet because each worker writes out its part of the dataset independently.
- Does not (yet) use Arrow in memory.
- More information Spark Data Sources (including Parquet)
Java (outside Spark, on the JVM)
parquet ⇐ ⇒ Hadoop libs ⇐ ⇒ Arrow ⇐ ⇒ Avro
- Reads multipart Parquet datasets
- Performs as well or better than C++ Parquet + Arrow when data > 1GB
- Provides generic record interface via Avro
Parquet ⇐ ⇒ custom records
- Only single file Parquet datasets
- Developer must build record based serializer /deserializers
- Performance at scale not likely to be as good as using Arrow libraries. When little data is in memory makes little difference.
Parquet ⇐ ⇒ Arrow ⇐ ⇒ custom record types (from record batches)
- Records can be serialized or deserialized fairly easily to / from the Arrow tables
- Multi-threaded read / decompression available
- Can define schema using Arrow logical types
- Good performance for tabulation type software
Parquet ⇐ ⇒ custom record types
- Directly read and write Parquet, any in-memory storage must be handled by developer
- Doesn’t perform as well as using Arrow for moving data in memory or tabulation type operations on data
Parquet ⇒ Red-arrow ⇒ Ruby data structures
- Red-arrow library reads Parquet and sets up Arrow tables in memory accessible with Ruby
- Similar to PyArrow but Not as mature
- Get with the “red-arrow” gem; to build the native extensions you must have installed the correct version of the C++ “libarrow” library from the Apache Arrow project
Parquet ⇒ Spark ⇐ ⇒ JRuby
- Use the Spark framework from JRuby and work with data from Parquet or any other Spark data source.
Parquet ⇐ ⇒ Arrow ⇐ ⇒ Arrow tables ⇒ Go data structures
for C# you can try an alternative Parquet library parquet.net
Also check the Arrow project for how to use the C++ libraries with other languages. There’s an included Lua example.