Data Preparation

The format and structure of your data can either hinder visualization or ensure fast, reliable performance within any VEDA instance. This section lays out the optimal formats, file sizes, and data structures for the best performance.

Suggested File Formats

Raster Data

1. Cloud Optimized GeoTIFF (COG)

  • Best for:
    • 2D gridded data (e.g., NO₂, AOD, precipitation snapshots)
  • Why use it:
    • Enables HTTP range requests for fast map visualization
    • Optimized for web-based tile access
  • Recommendations:
    • Use internal tiling
    • Generate overviews for multi-resolution access

Additional COG details that can be helpful.
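
To produce a COG with internal tiling and overviews, GDAL's COG driver is one common approach; a minimal sketch (file names are placeholders, and GDAL >= 3.1 is assumed):

```shell
# Convert a plain GeoTIFF to a Cloud Optimized GeoTIFF.
# BLOCKSIZE sets the internal tile size; OVERVIEWS=AUTO builds
# reduced-resolution overviews for multi-resolution access.
gdal_translate input.tif output_cog.tif -of COG \
  -co COMPRESS=DEFLATE \
  -co BLOCKSIZE=512 \
  -co OVERVIEWS=AUTO
```

Tools such as rio-cogeo (`rio cogeo validate output_cog.tif`) can then confirm the file is a valid COG.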

2. Zarr (Preferred for Multidimensional Data)

  • Best for:
    • Time-series and multi-variable datasets (e.g., GEOS-CF, IMERG)
    • Large-scale analytics and cloud-native workflows
  • Why use it:
    • Designed for cloud object storage with parallel, chunk-based access
    • Enables efficient subsetting across time and space
    • Works seamlessly with tools like xarray, dask, and Python analytics ecosystems
  • Recommendations:
    • Align chunking with access patterns (e.g., time vs spatial queries)
    • Store data in cloud object storage (e.g., S3) with clear directory structure
    • Consolidate metadata for faster access

Additional Zarr details that can be helpful.

3. NetCDF4 (CF-compliant)

  • Best for:
    • Widely distributed scientific datasets and legacy data formats
    • Interoperability with existing Earth science tools and workflows
  • Why use it:
    • Mature, widely supported format across the scientific community
    • Supports chunking and compression (via HDF5)
    • Compatible with most analysis tools and libraries
  • Recommendations:
    • Ensure chunking is properly configured (avoid default chunking)
    • Use compression (e.g., gzip with shuffle filter)
    • Consider complementing with a Virtual Zarr layer (e.g., Kerchunk) for improved cloud performance and scalability

Additional NetCDF details that can be helpful.

Vector Data

1. GeoJSON (Preferred)

  • Best for:
    • Small to medium-sized vector datasets
    • Web-based visualization and APIs
  • Why use it:
    • Widely supported across web mapping tools
    • Human-readable and easy to debug
    • Native support in most visualization libraries
  • Recommendations:
    • Avoid for large datasets due to performance limitations
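
A minimal GeoJSON FeatureCollection can be built with the standard library alone (coordinates and properties below are illustrative):

```python
import json

# One point feature with flat, descriptive properties.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-77.03, 38.89]},
            "properties": {"site": "station-01", "no2_ppb": 12.4},
        }
    ],
}

# Write it out; most web mapping tools can load this file directly.
with open("sites.geojson", "w") as f:
    json.dump(feature_collection, f)
```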

2. Shapefiles

  • Best for:
    • Legacy GIS workflows
    • Interoperability with traditional GIS tools
  • Why use it:
    • Broad compatibility across GIS software
    • Common exchange format in many datasets
  • Recommendations:
    • Avoid for cloud-native workflows
    • Consider converting to GeoJSON or GeoParquet for better performance
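
Conversion out of Shapefile is straightforward with GDAL's ogr2ogr; a sketch (file names are placeholders):

```shell
# Shapefile -> GeoJSON
ogr2ogr -f GeoJSON output.geojson input.shp

# Shapefile -> GeoParquet (requires GDAL >= 3.5 built with Parquet support)
ogr2ogr -f Parquet output.parquet input.shp
```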

3. GeoParquet (Emerging / Future Recommended)

  • Best for:
    • Large-scale vector datasets
    • Analytical and cloud-native workflows
  • Why use it:
    • Columnar format optimized for performance
    • Efficient storage and query capabilities
    • Well-suited for big data processing
  • Recommendations:
    • Use for large datasets where performance is critical
    • Partition data appropriately for scalable access

Tabular Data

1. CSV

  • Best for:
    • Simple, small tabular datasets
    • Data exchange and quick inspection
  • Why use it:
    • Universally supported
    • Easy to read and use
  • Recommendations:
    • Avoid for large datasets due to size and performance limitations

2. JSON (Structured as Tabular)

  • Best for:
    • Lightweight structured data
    • API responses and metadata
  • Why use it:
    • Flexible schema
    • Easy integration with web applications
  • Recommendations:
    • Ensure data is structured as records (row-based)
    • Avoid deeply nested structures for tabular use cases
    • Not recommended for large datasets
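
The "records" shape recommended above looks like a list of flat objects, one per row; a quick stdlib sketch (field names and values are illustrative):

```python
import json

# Row-based records: one object per row, flat keys - easy to load into a table.
records = [
    {"time": "2023-01-01T00:00Z", "site": "A", "no2_ppb": 11.2},
    {"time": "2023-01-01T01:00Z", "site": "A", "no2_ppb": 10.8},
]

# By contrast, a deeply nested layout such as
# {"site": {"A": {"no2_ppb": [11.2, 10.8]}}} is much harder for tabular
# tools to consume.
text = json.dumps(records)
rows = json.loads(text)
```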

3. Parquet (Preferred for Large Tabular Data)

  • Best for:
    • Large-scale tabular datasets
    • Analytical and cloud-native workflows
  • Why use it:
    • Columnar storage enables efficient queries
    • Compressed and optimized for performance
    • Works well with analytics tools (e.g., Spark, Pandas)
  • Recommendations:
    • Use for large datasets instead of CSV or JSON
    • Partition data for scalable access in cloud environments

Optimal File Sizes

Cloud Optimized GeoTIFF (COG)

  • Recommended:
    • 10 MB – 1 GB per file
  • Avoid:
    • Very small files (<10 MB) — high request overhead
    • Very large files (>2–5 GB) — reduced interactivity and slower partial reads

Zarr Stores

  • Recommended:
    • Total dataset size can scale to TBs or more
    • Individual chunk sizes: ~1–10 MB
  • Notes:
    • Performance depends on chunking strategy rather than total size
    • Designed for scalable, cloud-native access

NetCDF4 (HDF5-based)

  • Recommended:
    • File size should align with dataset scope and access patterns
    • Moderate file sizes (e.g., ~x MB – x GB) are often practical
  • Notes:
    • Performance is driven primarily by chunking and metadata layout
    • Large files are acceptable if properly chunked and cloud-optimized (e.g., via Kerchunk)

Optimal Chunks for Visualization

Chunking Guidelines

  • Chunk Size Target:
    • ~1–10 MB per chunk
  • Avoid:
    • Very small chunks (<256 KB) — request overhead dominates
    • Very large chunks (>50–100 MB) — slow reads and poor interactivity
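
As a sanity check, the uncompressed size of a chunk is just the product of its dimensions times the bytes per element; a stdlib sketch (the example shape and float32 dtype are illustrative):

```python
def chunk_size_bytes(shape, dtype_bytes=4):
    """Uncompressed size of one chunk: product of chunk dims x bytes per element."""
    size = dtype_bytes
    for dim in shape:
        size *= dim
    return size

# 24 timesteps x 256 x 256 pixels of float32:
size_mb = chunk_size_bytes((24, 256, 256)) / 1e6  # ~6.3 MB, within the 1-10 MB target
```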

Key Considerations

  • Chunking should reflect expected access patterns:

    • Spatial access (map visualization)
    • Temporal access (time-series analysis)
  • Balanced chunking across dimensions is recommended for mixed-use workloads (e.g., dashboards)

  • Many legacy datasets use:

    • Very small chunks
    • Inefficient file layouts

    → This leads to severely degraded cloud performance

  • Cloud-native best practices emphasize:

    • Minimizing the number of network requests while still enabling efficient subsetting of data

Balanced Strategy

The chunking strategy should ultimately be determined by the dataset’s expected use cases. Below is a balanced approach that supports both map visualization and time-series access:

| Dimension | Recommended |
| --- | --- |
| Time | 1–24 timesteps per chunk |
| Spatial | 256 × 256 or 512 × 512 pixels |
| Chunk size | ~1–10 MB per chunk |

Why this matters:

  • Enables efficient subsetting across both time and space
  • Reduces unnecessary data reads (minimizes I/O overhead)
  • Improves responsiveness in interactive dashboards (e.g., VEDA)

Compression

Best Practices

  • Use DEFLATE (gzip) compression
  • Use moderate compression levels (e.g., ~x–y) to balance size and performance
  • Apply a shuffle filter to improve compression efficiency
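
The shuffle filter helps because it groups the i-th byte of every value together, turning smoothly varying float data into long, highly compressible runs. A stdlib demonstration (the synthetic values are illustrative):

```python
import struct
import zlib

# Smoothly increasing float32 values, packed little-endian.
values = [i * 0.001 for i in range(100_000)]
raw = struct.pack("<%df" % len(values), *values)

# Byte shuffle: all first bytes, then all second bytes, and so on.
# The high (exponent) bytes become a nearly constant run.
shuffled = b"".join(raw[i::4] for i in range(4))

plain_size = len(zlib.compress(raw, 6))
shuffled_size = len(zlib.compress(shuffled, 6))
# shuffled_size comes out substantially smaller than plain_size for this data
```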

HDF/NetCDF-Specific Considerations

When preparing HDF/NetCDF data for cloud environments:

  • Use formats that support chunking, compression, and consolidated metadata
    (HDF5 / NetCDF4; not supported in HDF4 or NetCDF3)

  • Consolidate metadata to enable efficient access in a single request

  • Tune chunk sizes appropriately:

    • Preferred: ~1–10 MB per chunk
    • Acceptable range: 100 KB – 16 MB
  • Design chunk shapes based on expected access patterns:

    • Spatial visualization (map-based access)
    • Time-series analysis
  • Apply gzip (deflate) compression with shuffle filter

  • Include in data product documentation:

    • Instructions for direct cloud access
    • Guidance on using client libraries (e.g., xarray, kerchunk) for efficient data access