streamfeed-parser - A Python library for memory-efficient processing of large CSV and XML files over HTTP and FTP protocols
In developing the Trendbook platform, we encountered challenges in efficiently processing vast amounts of data from diverse sources. Traditional methods of loading entire datasets into memory proved impractical and inefficient. To address this, we developed streamfeed-parser, a Python library designed to handle large CSV and XML feeds over HTTP, HTTPS, and FTP protocols with minimal memory usage.
Key Features of streamfeed-parser
- Memory-Efficient Streaming: Processes gigabytes of data without significant memory consumption by streaming it incrementally (a conceptual sketch follows this list).
- Multi-Format Support: Seamlessly handles both CSV and XML feeds, automatically detecting the file format and compression type.
- Protocol Versatility: Supports HTTP, HTTPS, and FTP, allowing flexibility in data sourcing.
- Compression Handling: Automatically detects and decompresses ZIP, GZIP, and BZ2 files, simplifying the processing pipeline.
- Data Transformation: Expands fields holding multiple values into separate records, enabling more granular data analysis.
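To illustrate the general technique behind the streaming and compression-detection features, here is a minimal, self-contained sketch; it is not streamfeed-parser's actual implementation, and the helper name stream_csv_rows is hypothetical:

    import csv
    import gzip
    import io
    import urllib.request

    GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of any GZIP stream

    def stream_csv_rows(url):
        """Yield CSV rows one at a time; the full file is never held in memory."""
        raw = urllib.request.urlopen(url)  # buffered, file-like HTTP response
        # Peek at the leading bytes to sniff GZIP compression without consuming them
        if raw.peek(2)[:2] == GZIP_MAGIC:
            raw = gzip.GzipFile(fileobj=raw)  # decompresses lazily as bytes are read
        text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
        yield from csv.DictReader(text)  # DictReader also reads line by line

    for row in stream_csv_rows('https://example.com/large-feed.csv'):
        print(row)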
Installation
To integrate streamfeed-parser into your project, install it using pip:
pip install streamfeed-parser
Quick Start Guide
Here's how you can get started with streamfeed-parser:
- Previewing Data: preview the first few rows of a feed without loading the entire dataset:

    from streamfeed import preview_feed

    # Preview the first 10 rows from a feed
    preview_data = preview_feed('https://example.com/large-feed.csv', limit_rows=10)
    print(preview_data)
- Streaming Data: stream and process a large feed efficiently:

    from streamfeed import stream_feed

    # Stream a large feed without loading it into memory
    for record in stream_feed('https://example.com/large-feed.csv'):
        # Process each record individually
        print(record)
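Because stream_feed yields one record at a time, you can compute aggregates over feeds far larger than available RAM. A small hedged example, assuming records come back as dicts (as the explode example below suggests); the 'brand' column is hypothetical:

    from collections import Counter

    from streamfeed import stream_feed

    # Tally records per brand while streaming; memory use stays flat
    brand_counts = Counter()
    for record in stream_feed('https://example.com/large-feed.csv'):
        brand_counts[record.get('brand')] += 1

    print(brand_counts.most_common(5))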
Advanced Usage
- Customizing XML Item Tags: specify the XML tag that marks each item for feeds with different structures:

    from streamfeed import stream_feed

    feed_logic = {
        'xml_item_tag': 'item'  # Default is 'product'
    }

    for record in stream_feed('https://example.com/feed.xml', feed_logic=feed_logic):
        print(record)
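For context, a feed matching this configuration might look like the following (the structure and field names are hypothetical); each item element would be emitted as one record:

    <feed>
      <item>
        <id>123</id>
        <title>Example product</title>
      </item>
      <item>
        <id>124</id>
        <title>Another product</title>
      </item>
    </feed>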
- Expanding Comma-Separated Values: transform fields that hold multiple values into separate records:

    from streamfeed import stream_feed

    feed_logic = {
        'explode_fields': ['size', 'color'],  # Fields to explode
        'divider': ','                        # Character separating values (default is ',')
    }

    # Input:  {'id': '123', 'size': 'S,M,L', 'color': 'red,blue,green'}
    # Output: multiple records, one per size-color combination
    for record in stream_feed('https://example.com/feed.csv', feed_logic=feed_logic):
        print(record)
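To make the transformation concrete, here is a conceptual sketch of what field explosion does, reading the Output comment above as a Cartesian expansion; it is illustrative only, not streamfeed-parser's actual code, and the library's exact pairing semantics may differ:

    import itertools

    def explode(record, fields, divider=','):
        """Expand multi-valued fields into one record per value combination."""
        value_lists = [record[field].split(divider) for field in fields]
        for combo in itertools.product(*value_lists):
            expanded = dict(record)            # copy the original record
            expanded.update(zip(fields, combo))
            yield expanded

    for r in explode({'id': '123', 'size': 'S,M', 'color': 'red,blue'}, ['size', 'color']):
        print(r)
    # {'id': '123', 'size': 'S', 'color': 'red'}
    # {'id': '123', 'size': 'S', 'color': 'blue'}
    # {'id': '123', 'size': 'M', 'color': 'red'}
    # {'id': '123', 'size': 'M', 'color': 'blue'}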
For more details and to access the library, visit the streamfeed-parser PyPI page.