# Data Preparation
Within any VEDA instance, the format and structure of your data can either hinder visualization or ensure smooth performance and user satisfaction. This section lays out the optimal formats, file sizes, and data structures for the best performance.
## Suggested File Formats

### Raster Data
1. Cloud Optimized GeoTIFF (COG)
   - Best for:
     - 2D gridded data (e.g., NO₂, AOD, precipitation snapshots)
   - Why use it:
     - Enables HTTP range requests for fast map visualization
     - Optimized for web-based tile access
   - Recommendations:
     - Use internal tiling
     - Generate overviews for multi-resolution access
Additional COG details that can be helpful.
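Overviews (progressively downsampled copies of the raster) can be generated by tools such as GDAL's COG driver. The number of overview levels needed follows directly from the raster dimensions; a minimal sketch of that arithmetic, using an invented grid size:

```python
import math

# Sketch: count the 2x-downsampled overview levels a COG needs so the
# coarsest overview fits a single 256x256 web tile.
def overview_levels(width: int, height: int, tile: int = 256) -> int:
    """Number of halvings until the larger dimension fits one tile."""
    levels = 0
    size = max(width, height)
    while size > tile:
        size = math.ceil(size / 2)
        levels += 1
    return levels

# A hypothetical global grid at ~1 km resolution (43200 x 21600 pixels)
print(overview_levels(43200, 21600))  # 8
```

Each level costs roughly a third of the base raster in extra storage but is what makes zoomed-out map views fast.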
2. Zarr (Preferred for Multidimensional Data)
   - Best for:
     - Time-series and multi-variable datasets (e.g., GEOS-CF, IMERG)
     - Large-scale analytics and cloud-native workflows
   - Why use it:
     - Designed for cloud object storage with parallel, chunk-based access
     - Enables efficient subsetting across time and space
     - Works seamlessly with tools like `xarray`, `dask`, and the Python analytics ecosystem
   - Recommendations:
     - Align chunking with access patterns (e.g., time vs. spatial queries)
     - Store data in cloud object storage (e.g., S3) with a clear directory structure
     - Consolidate metadata for faster access
Additional Zarr details that can be helpful.
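Whether a chunk layout is aligned with an access pattern is largely back-of-the-envelope arithmetic. A standard-library sketch comparing a hypothetical time-series-oriented layout with a map-oriented one (both shapes are invented examples):

```python
from math import prod

# Sketch: uncompressed size of one chunk, in MB.
# itemsize=4 corresponds to float32 data.
def chunk_mb(chunk_shape, itemsize=4):
    return prod(chunk_shape) * itemsize / 1e6

# Time-series-oriented: many timesteps, small spatial tile per chunk
print(chunk_mb((24, 128, 128)))   # ~1.6 MB
# Map-oriented: one timestep, large spatial tile per chunk
print(chunk_mb((1, 1024, 1024)))  # ~4.2 MB
```

Both layouts land in the ~1–10 MB sweet spot, but the first serves time-series queries with few reads while the second serves map tiles with few reads; pick the one that matches the dominant access pattern.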
3. NetCDF4 (CF-compliant)
   - Best for:
     - Widely distributed scientific datasets and legacy data formats
     - Interoperability with existing Earth science tools and workflows
   - Why use it:
     - Mature, widely supported format across the scientific community
     - Supports chunking and compression (via HDF5)
     - Compatible with most analysis tools and libraries
   - Recommendations:
     - Ensure chunking is properly configured (avoid default chunking)
     - Use compression (e.g., gzip with the shuffle filter)
     - Consider complementing with a Virtual Zarr layer (e.g., Kerchunk) for improved cloud performance and scalability
Additional NetCDF details that can be helpful.

### Vector Data
1. GeoJSON (Preferred)
   - Best for:
     - Small to medium-sized vector datasets
     - Web-based visualization and APIs
   - Why use it:
     - Widely supported across web mapping tools
     - Human-readable and easy to debug
     - Native support in most visualization libraries
   - Recommendations:
     - Avoid for large datasets due to performance limitations
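For reference, a minimal FeatureCollection built with only the standard library; the coordinates and property names below are invented examples:

```python
import json

# Sketch: the smallest useful GeoJSON document a web map can consume.
# Note GeoJSON coordinate order is [longitude, latitude].
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-77.03, 38.89]},
            "properties": {"site": "example-station", "no2_ppb": 12.4},
        }
    ],
}

geojson_text = json.dumps(feature_collection)
# Human-readable and easy to debug: round-trips through plain json
print(json.loads(geojson_text)["features"][0]["geometry"]["type"])  # Point
```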
2. Shapefiles
   - Best for:
     - Legacy GIS workflows
     - Interoperability with traditional GIS tools
   - Why use it:
     - Broad compatibility across GIS software
     - Common exchange format in many datasets
   - Recommendations:
     - Avoid for cloud-native workflows
     - Consider converting to GeoJSON or GeoParquet for better performance
3. GeoParquet (Emerging / Future Recommended)
   - Best for:
     - Large-scale vector datasets
     - Analytical and cloud-native workflows
   - Why use it:
     - Columnar format optimized for performance
     - Efficient storage and query capabilities
     - Well-suited for big data processing
   - Recommendations:
     - Use for large datasets where performance is critical
     - Partition data appropriately for scalable access
### Tabular Data
1. CSV
   - Best for:
     - Simple, small tabular datasets
     - Data exchange and quick inspection
   - Why use it:
     - Universally supported
     - Easy to read and use
   - Recommendations:
     - Avoid for large datasets due to size and performance limitations
2. JSON (Structured as Tabular)
   - Best for:
     - Lightweight structured data
     - API responses and metadata
   - Why use it:
     - Flexible schema
     - Easy integration with web applications
   - Recommendations:
     - Ensure data is structured as records (row-based)
     - Avoid deeply nested structures for tabular use cases
     - Not recommended for large datasets
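A sketch of the row-based "records" shape, using invented field names; each element is one row with a flat, consistent schema that maps cleanly onto a table:

```python
import json

# Sketch: tabular data serialized as a flat list of records (rows).
records = [
    {"date": "2024-01-01", "station": "A", "aod": 0.12},
    {"date": "2024-01-02", "station": "A", "aod": 0.15},
]
payload = json.dumps(records)

# Every row shares the same flat keys -- no deep nesting.
rows = json.loads(payload)
print(all(row.keys() == rows[0].keys() for row in rows))  # True
```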
3. Parquet (Preferred for Large Tabular Data)
   - Best for:
     - Large-scale tabular datasets
     - Analytical and cloud-native workflows
   - Why use it:
     - Columnar storage enables efficient queries
     - Compressed and optimized for performance
     - Works well with analytics tools (e.g., Spark, Pandas)
   - Recommendations:
     - Use for large datasets instead of CSV or JSON
     - Partition data for scalable access in cloud environments
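In practice Parquet is written with tools such as Pandas' `DataFrame.to_parquet`. The standard-library sketch below only illustrates the row-versus-columnar distinction that makes the format fast (the data values are invented):

```python
# Sketch: why columnar layouts speed up analytics. Row-based storage
# interleaves fields record by record; columnar storage keeps each
# field contiguous, so a query touching one column reads only that
# column from disk or object storage.
rows = [
    {"date": "2024-01-01", "aod": 0.12},
    {"date": "2024-01-02", "aod": 0.15},
]
# Row-based -> columnar
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns["aod"])  # [0.12, 0.15]
```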
## Optimal File Sizes
### Cloud Optimized GeoTIFF (COG)
- Recommended:
  - 10 MB – 1 GB per file
- Avoid:
  - Very small files (<10 MB): high request overhead
  - Very large files (>2–5 GB): reduced interactivity and slower partial reads
### Zarr Stores
- Recommended:
  - Total dataset size can scale to TBs or more
  - Individual chunk sizes: ~1–10 MB
- Notes:
  - Performance depends on chunking strategy rather than total size
  - Designed for scalable, cloud-native access
### NetCDF4 (HDF5-based)
- Recommended:
  - File size should align with dataset scope and access patterns
  - Moderate file sizes (e.g., ~x MB – x GB) are often practical
- Notes:
  - Performance is driven primarily by chunking and metadata layout
  - Large files are acceptable if properly chunked and cloud-optimized (e.g., via Kerchunk)
## Optimal Chunks for Visualization

### Chunking Guidelines
- Chunk size target:
  - ~1–10 MB per chunk
- Avoid:
  - Very small chunks (<256 KB): request overhead dominates
  - Very large chunks (>50–100 MB): slow reads and poor interactivity
### Key Considerations
- Chunking should reflect expected access patterns:
  - Spatial access (map visualization)
  - Temporal access (time-series analysis)
- Balanced chunking across dimensions is recommended for mixed-use workloads (e.g., dashboards)
- Many legacy datasets use very small chunks and inefficient file layouts, which leads to severely degraded cloud performance
- Cloud-native best practices emphasize minimizing the number of network requests while still enabling efficient subsetting of data
### Balanced Strategy
The chunking strategy should ultimately be determined by the dataset’s expected use cases. Below is a balanced approach that supports both map visualization and time-series access:
| Dimension | Recommended |
|---|---|
| Time | 1–24 timesteps per chunk |
| Spatial | 256 × 256 or 512 × 512 pixels |
| Chunk size | ~1–10 MB per chunk |
Why this matters:
- Enables efficient subsetting across both time and space
- Reduces unnecessary data reads (minimizes I/O overhead)
- Improves responsiveness in interactive dashboards (e.g., VEDA)
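The table's recommendations can be sanity-checked with simple arithmetic; assuming 4-byte float32 values, representative combinations of the suggested time and spatial chunk dimensions land in the ~1–10 MB target:

```python
from math import prod

# Sketch: uncompressed size in MB of one chunk of float32 data
# (4 bytes per value) for a few (time, y, x) combinations drawn
# from the recommended ranges above.
def chunk_mb(shape, itemsize=4):
    return prod(shape) * itemsize / 1e6

for shape in [(24, 256, 256), (4, 512, 512), (1, 512, 512)]:
    print(shape, round(chunk_mb(shape), 2), "MB")
# (24, 256, 256) -> ~6.29 MB; (4, 512, 512) -> ~4.19 MB;
# (1, 512, 512) -> ~1.05 MB -- all within the 1-10 MB target
```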
## Compression

### Best Practices
- Use DEFLATE (gzip) compression
- Use moderate compression levels (e.g., ~x–y) to balance size and performance
- Apply a shuffle filter to improve compression efficiency
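With the `xarray`/`netCDF4` stack, these settings are expressed as a per-variable encoding dictionary passed to `Dataset.to_netcdf`. A sketch in which the variable name, chunk sizes, and compression level are illustrative assumptions, not prescribed values:

```python
# Sketch: per-variable encoding for xarray's Dataset.to_netcdf.
# The variable name "no2" and the chunk sizes are invented examples;
# complevel 4 is shown only as a common middle-ground choice.
encoding = {
    "no2": {
        "zlib": True,                 # DEFLATE (gzip) compression
        "complevel": 4,               # moderate level: size vs. speed
        "shuffle": True,              # shuffle filter aids compression
        "chunksizes": (1, 512, 512),  # (time, y, x): ~1 MB for float32
    }
}
# ds.to_netcdf("no2.nc", encoding=encoding)  # requires xarray + netCDF4
```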
### HDF/NetCDF-Specific Considerations
When preparing HDF/NetCDF data for cloud environments:
- Use formats that support chunking, compression, and consolidated metadata (HDF5 / NetCDF4; these features are not supported in HDF4 or NetCDF3)
- Consolidate metadata to enable efficient access in a single request
- Tune chunk sizes appropriately:
  - Preferred: ~1–10 MB per chunk
  - Acceptable range: 100 KB – 16 MB
- Design chunk shapes based on expected access patterns:
  - Spatial visualization (map-based access)
  - Time-series analysis
- Apply gzip (deflate) compression with the shuffle filter
- Include in data product documentation:
  - Instructions for direct cloud access
  - Guidance on using client libraries (e.g., `xarray`, `kerchunk`) for efficient data access