Parqify
Parqify provides a server application, packaged as an AMI, that customers can deploy within their AWS environment.
The server consumes CSV and JSON files from a specified input S3 bucket, converts them to Parquet, and writes the converted files to a separate output S3 bucket.
Key Features of Parqify
- File Format Conversion: Supports conversion from CSV and JSON to Parquet.
- S3 Integration: Seamless integration with Amazon S3 for both input and output.
- AMI-based Deployment: Easy deployment via AWS Marketplace.
- Scalability: Customers can scale the EC2 instance size based on their processing needs.
- Configurability: Configuration options for S3 bucket names, file prefixes, and other conversion parameters.
- Parallel Execution: Ability to process multiple files concurrently for improved performance.
- Parquet File Optimizations: Techniques to optimize the generated Parquet files for efficient storage and querying.
- Partitioning Support: Support for partitioning output Parquet files based on specified criteria.
- Custom Schema Definition: Allows users to define custom schemas for Parquet conversion.
- Compression Options: Provides various compression options for Parquet files to reduce storage size.
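To make the configurable pieces above more concrete, here is a minimal sketch of what a settings object could look like. Everything in it is an assumption for illustration: `ConversionConfig`, its field names, the defaults, and the codec subset are hypothetical and do not describe Parqify's actual configuration format.

```python
from dataclasses import dataclass

# Illustrative subset of Parquet compression codecs (assumption, not Parqify's list).
SUPPORTED_COMPRESSION = {"none", "snappy", "gzip", "zstd"}


@dataclass(frozen=True)
class ConversionConfig:
    """Hypothetical settings object; all field names are illustrative only."""

    input_bucket: str
    output_bucket: str
    input_prefix: str = ""          # only keys under this prefix are considered
    output_prefix: str = ""         # prepended to every output key
    compression: str = "snappy"     # codec applied to the Parquet output
    partition_columns: tuple = ()   # columns used to partition the output
    max_workers: int = 4            # number of files converted concurrently

    def __post_init__(self):
        # Reject settings that would fail later, in the middle of a conversion.
        if self.compression not in SUPPORTED_COMPRESSION:
            raise ValueError(f"unsupported compression: {self.compression!r}")
        if self.max_workers < 1:
            raise ValueError("max_workers must be at least 1")
        same_location = (
            self.input_bucket == self.output_bucket
            and self.input_prefix == self.output_prefix
        )
        if same_location:
            raise ValueError("input and output locations must differ")
```

Validating at construction time means a misconfigured bucket or codec surfaces when the server starts rather than after files have already been downloaded.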
Data Flow
- Customer places files: CSV or JSON files are uploaded by the customer to a designated input S3 bucket.
- Server monitors: The server application, running on an EC2 instance launched from the AMI, continuously monitors the input S3 bucket for new files.
- File download: When a new file is detected, the server application downloads it from the input S3 bucket.
- File conversion: The server application converts the downloaded file from CSV or JSON format to Parquet format. During this step, custom schema definitions, partitioning, and compression options can be applied.
- Parquet file upload: The newly converted Parquet file is then uploaded to a designated output S3 bucket.
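The steps above can be sketched in Python roughly as follows. This is not Parqify's implementation: it assumes the monitoring is done by polling with a boto3 S3 client, that conversion uses pyarrow, and that JSON input is newline-delimited. All function names (`parquet_key_for`, `pending_keys`, `convert_to_parquet`, `process_key`, `poll_once`) are hypothetical, and custom schemas and partitioning are left out for brevity.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

SUPPORTED_EXTENSIONS = (".csv", ".json")


def parquet_key_for(input_key, output_prefix=""):
    """Build the output key: same path, '.parquet' extension, optional prefix."""
    stem, _ext = os.path.splitext(input_key)
    return f"{output_prefix}{stem}.parquet"


def pending_keys(listed_keys, seen):
    """Return CSV/JSON keys that have not been processed yet, in listing order."""
    return [
        key for key in listed_keys
        if key.lower().endswith(SUPPORTED_EXTENSIONS) and key not in seen
    ]


def convert_to_parquet(src_path, dst_path, compression="snappy"):
    """Read a CSV or newline-delimited JSON file and write it as Parquet."""
    # Imported lazily so the pure helpers above work without pyarrow installed.
    import pyarrow.csv as pa_csv
    import pyarrow.json as pa_json
    import pyarrow.parquet as pq

    if src_path.lower().endswith(".csv"):
        table = pa_csv.read_csv(src_path)
    else:
        table = pa_json.read_json(src_path)
    pq.write_table(table, dst_path, compression=compression)


def process_key(s3, key, input_bucket, output_bucket, output_prefix=""):
    """Download one object, convert it, and upload the Parquet result."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, os.path.basename(key))
        dst = os.path.join(workdir, "converted.parquet")
        s3.download_file(input_bucket, key, src)
        convert_to_parquet(src, dst)
        out_key = parquet_key_for(key, output_prefix)
        s3.upload_file(dst, output_bucket, out_key)
    return out_key


def poll_once(s3, input_bucket, output_bucket, seen, max_workers=4):
    """List the input bucket once and convert any new files concurrently."""
    listed = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=input_bucket):
        listed.extend(obj["Key"] for obj in page.get("Contents", []))

    todo = pending_keys(listed, seen)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(
            lambda key: process_key(s3, key, input_bucket, output_bucket), todo))
    # Keys are marked as seen only after the whole batch succeeded.
    seen.update(todo)
    return results
```

A caller would create the client with `boto3.client("s3")`, keep a `seen` set, and call `poll_once` in a loop with a sleep between iterations. The in-memory `seen` set is lost on restart, so a real deployment would need some durable record of processed keys; S3 event notifications would be an alternative to polling.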