Parqify

Parqify provides a server application, packaged as an AMI, that customers can deploy within their AWS environment.

The server reads CSV and JSON files from a specified input S3 bucket, converts them to the Parquet format, and writes the converted files to a separate output S3 bucket.

AWS Marketplace - Parqify link

Key Features of Parqify

File Format Conversion: Supports conversion from CSV and JSON to Parquet.
S3 Integration: Reads input files from Amazon S3 and writes converted output back to Amazon S3.
AMI-based Deployment: Easy deployment via AWS Marketplace.
Scalability: Customers can scale the EC2 instance size based on their processing needs.
Configurability: Configuration options for S3 bucket names, file prefixes, and other conversion parameters.
Parallel Execution: Ability to process multiple files concurrently for improved performance.
Parquet File Optimizations: Techniques to optimize the generated Parquet files for efficient storage and querying.
Partitioning Support: Support for partitioning output Parquet files based on specified criteria.
Custom Schema Definition: Allows users to define custom schemas for Parquet conversion.
Compression Options: Provides various compression options for Parquet files to reduce storage size.
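
As a rough illustration of how the configuration and parallel-execution features could fit together, the Python sketch below defines a settings object and fans conversion work out over a thread pool. Everything named here is an assumption made for the example: the field names (input_bucket, output_prefix, compression, partition_columns, max_workers), the list of codecs, and the convert_one callback are hypothetical and are not Parqify's actual configuration keys or API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# Codecs commonly offered by Parquet writers; the set Parqify accepts is an assumption.
SUPPORTED_COMPRESSION = {"none", "snappy", "gzip", "zstd"}


@dataclass(frozen=True)
class ConversionConfig:
    """Hypothetical settings object; field names are illustrative only."""
    input_bucket: str
    output_bucket: str
    input_prefix: str = ""
    output_prefix: str = ""
    compression: str = "snappy"
    partition_columns: List[str] = field(default_factory=list)
    max_workers: int = 4

    def validate(self) -> None:
        """Reject settings that cannot work before any file is touched."""
        if not self.input_bucket or not self.output_bucket:
            raise ValueError("both bucket names are required")
        if self.compression not in SUPPORTED_COMPRESSION:
            raise ValueError(f"unsupported compression: {self.compression}")
        if self.max_workers < 1:
            raise ValueError("max_workers must be at least 1")


def convert_all(
    keys: Iterable[str],
    convert_one: Callable[[str], str],
    config: ConversionConfig,
) -> Dict[str, str]:
    """Run convert_one over every key concurrently; return {input key: result}."""
    config.validate()
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=config.max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        results = list(pool.map(convert_one, keys))
    return dict(zip(keys, results))
```

A thread pool is used in this sketch because downloading from and uploading to S3 is mostly I/O-bound; a process pool might suit CPU-heavy conversions better.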

Data Flow

Customer places files: CSV or JSON files are uploaded by the customer to a designated input S3 bucket.
Server monitors: The server application, running on an EC2 instance launched from the AMI, continuously monitors the input S3 bucket for new files.
File download: When a new file is detected, the server application downloads it from the input S3 bucket.
File conversion: The server application converts the downloaded file from CSV or JSON format to Parquet format. During this step, custom schema definitions, partitioning, and compression options can be applied.
Parquet file upload: The newly converted Parquet file is then uploaded to a designated output S3 bucket.
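
The steps above can be sketched in Python as a simple loop. This is a minimal illustration under stated assumptions, not Parqify's implementation: it assumes monitoring is done by periodically listing the bucket, uses boto3 for S3 access and pyarrow for conversion, expects JSON input to be newline-delimited, and tracks processed keys in memory. The function names, the 30-second interval, and the output key layout are all hypothetical.

```python
import os
import tempfile
import time
from pathlib import PurePosixPath
from typing import Set


def output_key_for(input_key: str, output_prefix: str = "") -> str:
    """Map e.g. 'raw/orders.csv' to '<output_prefix>orders.parquet'."""
    return output_prefix + PurePosixPath(input_key).stem + ".parquet"


def convert_file(src_path: str, dst_path: str, compression: str = "snappy") -> None:
    """Read a local CSV or newline-delimited JSON file and write it as Parquet."""
    # Imported lazily so the other helpers work without pyarrow installed.
    import pyarrow.csv as pacsv
    import pyarrow.json as pajson
    import pyarrow.parquet as pq

    lowered = src_path.lower()
    if lowered.endswith(".csv"):
        table = pacsv.read_csv(src_path)
    elif lowered.endswith(".json"):
        table = pajson.read_json(src_path)
    else:
        raise ValueError(f"unsupported input file: {src_path}")
    pq.write_table(table, dst_path, compression=compression)


def process_new_files(s3, input_bucket: str, output_bucket: str, seen: Set[str],
                      input_prefix: str = "", output_prefix: str = "") -> int:
    """One pass: download, convert, and upload every unseen CSV/JSON object."""
    converted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=input_bucket, Prefix=input_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key in seen or not key.lower().endswith((".csv", ".json")):
                continue
            with tempfile.TemporaryDirectory() as tmp:
                src = os.path.join(tmp, os.path.basename(key))
                dst = os.path.join(tmp, "out.parquet")
                s3.download_file(input_bucket, key, src)
                convert_file(src, dst)
                s3.upload_file(dst, output_bucket, output_key_for(key, output_prefix))
            seen.add(key)
            converted += 1
    return converted


def run_forever(input_bucket: str, output_bucket: str, poll_seconds: float = 30.0) -> None:
    """Poll the input bucket indefinitely (the interval is an arbitrary choice)."""
    import boto3

    s3 = boto3.client("s3")
    seen: Set[str] = set()
    while True:
        process_new_files(s3, input_bucket, output_bucket, seen)
        time.sleep(poll_seconds)
```

Two simplifications are worth noting. The in-memory set of processed keys is lost when the process restarts, and the output key keeps only the file name, so two inputs with the same name under different prefixes would overwrite each other in this sketch.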