Overview#

Use intake-axds to create intake catalogs whose sources represent datasets in Axiom databases. You can search in time and space, as well as by variable and text, to narrow the results to the datasets relevant to your project, then easily read in the data.

import intake

Datatypes#

The default page size is 10, so requesting a datatype without any other input arguments will return the first 10 datasets of that datatype. The input argument page_size controls the maximum number of entries in the catalog.

Sensors (fixed location dataset like buoys)#

Access sensor datasets by creating an AXDS catalog with datatype="sensor_station". Note that webcam data is ignored.

cat = intake.open_axds_cat(datatype="sensor_station", page_size=10)
len(cat)
6

See what search was performed with .get_search_urls().

cat.get_search_urls()
['https://search.axds.co/v2/search?portalId=-1&page=1&pageSize=10&verbose=true&type=sensor_station']
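
To check how the catalog arguments ended up in the request, the search URL can be split into its query parameters. The following is a minimal sketch using only the Python standard library; the URL is copied from the output above, and reading `pageSize` as the value of `page_size` is an inference from the matching numbers rather than something taken from the intake-axds source.

```python
from urllib.parse import parse_qs, urlsplit

# Search URL copied from the get_search_urls() output above.
search_url = (
    "https://search.axds.co/v2/search?portalId=-1&page=1&pageSize=10"
    "&verbose=true&type=sensor_station"
)

# parse_qs maps each query parameter to a list of values; keep the first of each.
query = parse_qs(urlsplit(search_url).query)
params = {key: values[0] for key, values in query.items()}

print(params["type"])      # the datatype that was requested
print(params["pageSize"])  # assumed to correspond to page_size
```

Values come back as strings, so convert them (for example with `int`) before comparing against numeric inputs.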

See catalog-level metadata:

cat
catalog:
  args:
    datatype: sensor_station
    page_size: 10
  description: Catalog of Axiom assets.
  driver: intake_axds.axds_cat.AXDSCatalog
  metadata:
    kwargs_search:
      search_for:
      - null
    pgids:
    - null
    pglabels:
    - null
    query_type: union

What sources make up the catalog?

list(cat)
['org_mxak_naked_island',
 'ward-cove',
 'urn:ioos:station:gov.usgs.waterdata:02312600',
 'grave-point',
 'org_mxak_mary_island',
 'org_mxak_portland_island']

See source-level metadata for first source in catalog:

cat[list(cat)[0]]
org_mxak_naked_island:
  args:
    bin_interval: null
    binned: false
    end_time: null
    internal_id: 5
    only_pgids: null
    qartod: false
    start_time: null
    use_units: true
    uuid: org_mxak_naked_island
  description: AXDS dataset_id org_mxak_naked_island of datatype sensor_station
  driver: intake_axds.axds.AXDSSensorSource
  metadata:
    catalog_dir: ''
    datumConversions: []
    foreignNames:
    - NKXA2
    - NAKED_ISLAND
    - null
    internal_id: 5
    maxLatitude: 58.255308
    maxLongitude: -134.945049
    maxTime: '2023-09-12T19:20:00Z'
    metadata_url: https://sensors.axds.co/api/metadata/filter/custom?filter=%7B%22stations%22:%5B%225%22%5D%7D
    minLatitude: 58.255308
    minLongitude: -134.945049
    minTime: '2015-05-05T14:10:00.000Z'
    summary: Check that values are within reasonable bounds.
    title: 'Atmospheric Pressure: Barometric Pressure'
    uuid: org_mxak_naked_island
    variables:
    - wind_speed_of_gust
    - relative_humidity
    - wind_gust_from_direction
    - wind_from_direction
    - dew_point_temperature
    - air_pressure
    - air_temperature
    - wind_speed
    variables_details:
    - annotations: []
      label: 'Atmospheric Pressure: Barometric Pressure'
      parameterGroupId: 9
      plots:
      - label: '[default]'
        subPlots:
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: air_pressure
          deviceId: 1014658
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014760
          hasQc: true
          instrument: {}
          label: Barometric Pressure
          maxVal: 1043.4
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: 965.8
          minZ: 0.0
          numObservations: 414369
          parameterGroupId: 9
          parameterId: 14
          plotLabel: '[default]'
          qcConfigId: 28
          sensorParameterId: 14
          startDate: '2015-05-05T14:10:00Z'
          unitId: 24
          units: millibars
    - annotations: []
      label: Dew Point
      parameterGroupId: 29
      plots:
      - label: '[default]'
        subPlots:
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: dew_point_temperature
          deviceId: 1014654
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014756
          hasQc: true
          instrument: {}
          label: Dew Point
          maxVal: 17.22
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: -22.89
          minZ: 0.0
          numObservations: 414018
          parameterGroupId: 29
          parameterId: 16
          plotLabel: '[default]'
          qcConfigId: 72
          sensorParameterId: 16
          startDate: '2015-05-05T14:10:00Z'
          unitId: 8
          units: degree_Celsius
    - annotations: []
      label: 'Humidity: Relative Humidity'
      parameterGroupId: 22
      plots:
      - label: '[default]'
        subPlots:
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: relative_humidity
          deviceId: 1014652
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014754
          hasQc: true
          instrument: {}
          label: Relative Humidity
          maxVal: 99.9
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: 21.1
          minZ: 0.0
          numObservations: 413936
          parameterGroupId: 22
          parameterId: 4
          plotLabel: '[default]'
          qcConfigId: 24
          sensorParameterId: 4
          startDate: '2015-05-05T14:10:00Z'
          unitId: 1
          units: '%'
    - annotations: []
      label: 'Temperature: Air Temperature'
      parameterGroupId: 6
      plots:
      - label: '[default]'
        subPlots:
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: air_temperature
          deviceId: 1014651
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014753
          hasQc: true
          instrument: {}
          label: Air Temperature
          maxVal: 26.5
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: -14.0
          minZ: 0.0
          numObservations: 414491
          parameterGroupId: 6
          parameterId: 3
          plotLabel: '[default]'
          qcConfigId: 5
          sensorParameterId: 3
          startDate: '2015-05-05T14:10:00Z'
          unitId: 8
          units: degree_Celsius
    - annotations: []
      label: 'Winds: Gusts'
      parameterGroupId: 186
      plots:
      - label: '[default]'
        subPlots:
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: wind_speed_of_gust
          deviceId: 1014656
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014758
          hasQc: true
          instrument: {}
          label: Wind Gust
          maxVal: 114.96
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: 0.58
          minZ: 0.0
          numObservations: 415753
          parameterGroupId: 186
          parameterId: 7
          plotLabel: '[default]'
          qcConfigId: null
          sensorParameterId: 7
          startDate: '2015-05-05T14:10:00Z'
          unitId: 23
          units: mile.hour-1
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: wind_gust_from_direction
          deviceId: 1014657
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014759
          hasQc: true
          instrument: {}
          label: Wind Gust From Direction
          maxVal: 359.9
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: 0.0
          minZ: 0.0
          numObservations: 414353
          parameterGroupId: 186
          parameterId: 75
          plotLabel: '[default]'
          qcConfigId: 11
          sensorParameterId: 75
          startDate: '2015-05-05T14:10:00Z'
          unitId: 10
          units: degrees
    - annotations: []
      label: 'Winds: Speed and Direction'
      parameterGroupId: 8
      plots:
      - label: '[default]'
        subPlots:
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: wind_speed
          deviceId: 1014653
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014755
          hasQc: true
          instrument: {}
          label: Wind Speed
          maxVal: 48.92
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: 0.1
          minZ: 0.0
          numObservations: 413810
          parameterGroupId: 8
          parameterId: 5
          plotLabel: '[default]'
          qcConfigId: 29
          sensorParameterId: 5
          startDate: '2015-05-05T14:10:00Z'
          unitId: 27
          units: m.s-1
        - availableZ:
          - 0.0
          availableZBins: []
          datasetVariableId: wind_from_direction
          deviceId: 1014655
          discriminant: null
          endDate: '2023-09-12T19:20:00Z'
          feeds:
          - 1014757
          hasQc: true
          instrument: {}
          label: Wind From Direction
          maxVal: 359.9
          maxZ: 0.0
          medianTimeIntervalSecs: 600
          minVal: 0.0
          minZ: 0.0
          numObservations: 417118
          parameterGroupId: 8
          parameterId: 6
          plotLabel: '[default]'
          qcConfigId: 6
          sensorParameterId: 6
          startDate: '2015-05-05T14:10:00Z'
          unitId: 10
          units: degrees
    version: 2
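
The `metadata_url` entry in the listing above carries a percent-encoded filter. As a small sketch using only the standard library, it can be decoded to see which station the request targets; the URL is copied from the metadata above, and nothing here calls the Axiom service.

```python
import json
from urllib.parse import parse_qs, urlsplit

# metadata_url copied from the source-level metadata above.
metadata_url = (
    "https://sensors.axds.co/api/metadata/filter/custom"
    "?filter=%7B%22stations%22:%5B%225%22%5D%7D"
)

# parse_qs percent-decodes the parameter value, which leaves a JSON string.
raw_filter = parse_qs(urlsplit(metadata_url).query)["filter"][0]
station_filter = json.loads(raw_filter)

print(station_filter)
```

The decoded station id matches the `internal_id` of 5 listed for this source.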

Read data from first source in catalog. Note that since no start time or stop time was entered, the full data range will be read in, along with all available variables. The output is a DataFrame.

cat[list(cat)[0]].read()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 cat[list(cat)[0]].read()

File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:404, in AXDSSensorSource.read(self)
    402 def read(self):
    403     """read data in"""
--> 404     return self._get_partition(None)

File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:399, in AXDSSensorSource._get_partition(self, _)
    397 """get partition"""
    398 if self._dataframe is None:
--> 399     self._load_metadata()
    400 return self._dataframe

File ~/checkouts/readthedocs.org/user_builds/intake-axds/conda/latest/lib/python3.9/site-packages/intake/source/base.py:283, in DataSourceBase._load_metadata(self)
    281 """load metadata only if needed"""
    282 if self._schema is None:
--> 283     self._schema = self._get_schema()
    284     self.dtype = self._schema.dtype
    285     self.shape = self._schema.shape

File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:387, in AXDSSensorSource._get_schema(self)
    383 """get schema"""
    384 if self._dataframe is None:
    385     # TODO: could do partial read with chunksize to get likely schema from
    386     # first few records, rather than loading the whole thing
--> 387     self._load()
    388 return base.Schema(
    389     datashape=None,
    390     dtype=self._dataframe.dtypes,
   (...)
    393     extra_metadata={},
    394 )

File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:372, in AXDSSensorSource._load(self)
    369 def _load(self):
    370     """How to load in a specific station once you know it by uuid"""
--> 372     dfs = [self._load_to_dataframe(url) for url in self.data_urls]
    374     df = dfs[0]
    375     # this gets different and I think better results than dfs[0].join(dfs[1:], how="outer", sort=True)
    376     # even though they should probably return the same thing.

File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:372, in <listcomp>(.0)
    369 def _load(self):
    370     """How to load in a specific station once you know it by uuid"""
--> 372     dfs = [self._load_to_dataframe(url) for url in self.data_urls]
    374     df = dfs[0]
    375     # this gets different and I think better results than dfs[0].join(dfs[1:], how="outer", sort=True)
    376     # even though they should probably return the same thing.

File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:204, in AXDSSensorSource._load_to_dataframe(self, url)
    202 if len(data_raw["data"]["groupedFeeds"]) == 0:
    203     self._dataframe = None
--> 204     raise ValueError(f"No data found for url {url}.")
    206 # loop over the data feeds and read the data into DataFrames
    207 # link to other metadata as needed
    208 dfs = []

ValueError: No data found for url https://sensors.axds.co/api/observations/filter/custom?filter=%7B%22stations%22%3A%5B%225%22%5D%7D&start=2015-05-05T14:10:00Z&end=2023-09-12T19:20:00Z.

Sensor-specific options#

Options that are specific to sensors are QARTOD, units, and binning.

QARTOD#

All time series available for sensors optionally come with an aggregate QARTOD flag time series.

By default, QARTOD flags are not returned; pass qartod=True when creating the catalog to return them. Alternatively, pass a list of flags to return only the values that correspond to those flags, with all other values set to NaN. For example, qartod=[1,2] returns only the values that either pass the QARTOD tests or were not evaluated. This option is not available if binned==True.

Flags are:

  • 1: Pass

  • 2: Not Evaluated

  • 3: Suspect

  • 4: Fail

  • 9: Missing Data

More information on QARTOD is available here.
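
The effect of selecting specific flags can be illustrated with a small standalone sketch. This is not the intake-axds implementation: the function `mask_by_flags` and the sample readings are hypothetical, and the sketch only mirrors the behavior described above, where values whose flag is not selected become NaN.

```python
import math

# Flag meanings as listed above.
QARTOD_FLAGS = {1: "Pass", 2: "Not Evaluated", 3: "Suspect", 4: "Fail", 9: "Missing Data"}


def mask_by_flags(values, flags, keep):
    """Return values with NaN wherever the matching flag is not in `keep`."""
    if len(values) != len(flags):
        raise ValueError("values and flags must have the same length")
    keep = set(keep)
    return [v if f in keep else math.nan for v, f in zip(values, flags)]


# Made-up air pressure readings and aggregate flags, for illustration only.
pressure = [1012.3, 1011.8, 2000.0, 1010.9]
flags = [1, 2, 4, 3]

# Keep values that passed or were not evaluated, like qartod=[1, 2].
masked = mask_by_flags(pressure, flags, keep=[1, 2])
print(masked)
```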

Units#

By default, units are included in the column names, with the syntax “standard_name [units]”. If use_units=False, units are omitted and the column names are just “standard_name”.
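
If you need the variable name and the units separately, a column label in the “standard_name [units]” form can be split with a regular expression. The helper `split_column` below is a hypothetical sketch, not part of intake-axds, and it assumes the units are the final bracketed part of the label.

```python
import re

# Assumed layout: "standard_name [units]"; a bare "standard_name" has no units.
COLUMN_PATTERN = re.compile(r"^(?P<name>.+?)(?:\s*\[(?P<units>[^\]]*)\])?$")


def split_column(column):
    """Split a column label into (standard_name, units), with units None if absent."""
    match = COLUMN_PATTERN.match(column)
    if match is None:
        raise ValueError(f"cannot parse column label: {column!r}")
    return match.group("name"), match.group("units")


print(split_column("air_temperature [degree_Celsius]"))
print(split_column("air_temperature"))
```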

Binning#

By default, raw data for sensors is returned. However, binned data can instead be returned by entering binned=True and a bin_interval of hourly, daily, weekly, monthly, or yearly. If bin_interval is input, binned is set to True.
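
To make the idea of binning concrete, here is a standalone sketch that groups raw readings by calendar month. The binning itself happens when the data is requested, and this page does not say which statistic is used, so the mean below is an assumption; `monthly_means` and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_means(records):
    """Average (timestamp, value) pairs per (year, month) bin."""
    bins = defaultdict(list)
    for timestamp, value in records:
        bins[(timestamp.year, timestamp.month)].append(value)
    # Sort the bins so the result is in chronological order.
    return {month: mean(values) for month, values in sorted(bins.items())}


# Made-up readings at a 10-minute interval, for illustration only.
records = [
    (datetime(2015, 5, 5, 14, 10), 10.0),
    (datetime(2015, 5, 5, 14, 20), 12.0),
    (datetime(2015, 6, 1, 0, 0), 8.0),
]

binned = monthly_means(records)
print(binned)
```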

Examples#

For example, the following would return data columns as well as associated QARTOD columns, without units in the column names:

cat = intake.open_axds_cat(datatype="sensor_station", qartod=True, use_units=False)

This example would return data columns binned monthly:

cat = intake.open_axds_cat(datatype="sensor_station", bin_interval="monthly")

Platforms (traveling sensor, like gliders)#

Access platform datasets by creating an AXDS catalog with datatype="platform2". Everything should work the same as demonstrated for sensors.

Data is output into a DataFrame for platforms. It is accessed by parquet file if available and otherwise by csv.

cat = intake.open_axds_cat(datatype="platform2")
list(cat)

See source-level metadata for first source in catalog:

cat[list(cat)[0]]

Filter in time and space#

When setting up an AXDS intake catalog, you can narrow your search in time and space. The longitude values min_lon and max_lon should be in the range -180 to 180. You can search through the kwargs_search keyword or you can search explicitly using bbox (min_lon, min_lat, max_lon, max_lat) and start_time and end_time.

kw = {
    "min_lon": -180,
    "max_lon": -158,
    "min_lat": 50,
    "max_lat": 66,
    "min_time": '2015-1-1',
    "max_time": '2015-1-2',
}

cat = intake.open_axds_cat(datatype='sensor_station', kwargs_search=kw, page_size=5)
len(cat)
cat[list(cat)[0]]
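
If you start from a bounding box, a small helper can build the same dictionary and catch longitudes outside the -180 to 180 range before any search is made. This is a sketch only: `bbox_to_kwargs_search` is a hypothetical helper, not part of intake-axds, and the key names are taken from the example above.

```python
def bbox_to_kwargs_search(bbox, min_time, max_time):
    """Build a kwargs_search-style dict from (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    for lon in (min_lon, max_lon):
        if not -180 <= lon <= 180:
            raise ValueError(f"longitude {lon} is outside the range -180 to 180")
    if min_lat > max_lat:
        raise ValueError("min_lat must not exceed max_lat")
    return {
        "min_lon": min_lon,
        "max_lon": max_lon,
        "min_lat": min_lat,
        "max_lat": max_lat,
        "min_time": min_time,
        "max_time": max_time,
    }


# Same region and time range as the example above.
kw = bbox_to_kwargs_search((-180, 50, -158, 66), "2015-1-1", "2015-1-2")
print(kw)
```

The resulting `kw` can then be passed as `kwargs_search=kw`, as in the example above.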

Filter with keyword(s)#

You can also narrow your search by one or more keywords, by passing a string or list of strings with kwargs_search["search_for"] or explicitly using search_for. If you input more than one string, be aware that the multiple searches required will be combined according to query_type, either as a logical OR if query_type=="union" or as a logical AND if query_type=="intersection".

cat = intake.open_axds_cat(datatype='platform2', search_for=["whale", "bering"],
                           query_type="intersection", page_size=1000)
len(cat)
cat = intake.open_axds_cat(datatype='platform2', search_for=["whale", "bering"],
                           query_type="union", page_size=1000)
len(cat)

Filter by variable#

This section describes two approaches for searching by variable. As with search_for, how multiple variable requests are combined depends on the input choice of query_type. However, in the case of variables there are three options for query_type:

  • query_type=="union" as a logical OR

  • query_type=="intersection" as a logical AND

  • query_type=="intersection_constrained" as a logical AND but also only the requested variables are returned.
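
The three options can be illustrated with plain Python sets. This is a conceptual sketch, not how intake-axds performs the search: the station names, their variables, and the `select` function are all hypothetical.

```python
# Hypothetical datasets and the variables each one contains.
datasets = {
    "station_a": {"sea_water_temperature", "sea_water_practical_salinity", "air_temperature"},
    "station_b": {"sea_water_temperature"},
    "station_c": {"air_temperature"},
}
requested = {"sea_water_temperature", "sea_water_practical_salinity"}


def select(datasets, requested, query_type):
    """Return {dataset name: variables returned} for the given query_type."""
    if query_type == "union":
        # Logical OR: the dataset has at least one requested variable.
        return {name: v for name, v in datasets.items() if v & requested}
    if query_type == "intersection":
        # Logical AND: the dataset has every requested variable.
        return {name: v for name, v in datasets.items() if requested <= v}
    if query_type == "intersection_constrained":
        # Logical AND, and only the requested variables are returned.
        return {name: v & requested for name, v in datasets.items() if requested <= v}
    raise ValueError(f"unknown query_type: {query_type!r}")


for query_type in ("union", "intersection", "intersection_constrained"):
    print(query_type, select(datasets, requested, query_type))
```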

Select variable(s) to search for by standard_name#

Check available standard names with:

import intake_axds

standard_names = intake_axds.utils.available_names()
len(standard_names), standard_names[:5]

Make a catalog of sensors that contain either of the standard_names input.

std_names = ["sea_water_practical_salinity", "sea_water_temperature"]
cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names,
                           query_type="union")
cat[list(cat)[0]].metadata["variables"]

Make a catalog of sensors that contain both of the standard_names input.

std_names = ["sea_water_practical_salinity", "sea_water_temperature"]
cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names,
                           query_type="intersection", page_size=100)
cat[list(cat)[0]].metadata["variables"]

Make a catalog of sensors that contain both of the standard_names input but then also only return those two variable types. All variables available in the dataset will still be present in the metadata, but only values for those requested will be returned in the DataFrame. We can look at the catalog metadata to see the parameterGroupIds and parameterGroupLabels that will be used in data collection.

std_names = ["sea_water_practical_salinity", "sea_water_temperature"]
cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names,
                           query_type="intersection_constrained", page_size=100)
cat

If you request standard_names that aren’t present in the system, an exception is raised that says they aren’t present. The cell below is commented out because it would raise that exception.

# std_names = "sea_water_surface_salinity"
# cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names)

Select variable(s) to search for by custom vocabulary#

Instead of selecting the exact standard_names to search on, you can set up a collection of regular expressions to match on the variables you want. This is particularly useful if you are running several different searches and ultimately will need to select data variables from datasets using a generic name.

Set up vocabulary#

One way to set up a custom vocabulary is with a helper class from cf-pandas (see more information in the docs). Choose a nickname for each variable you want to be able to match on, like “temp” for matching sea water temperature variables, then set up the regular expressions you want to “count” as your variable “temp” — you can use the “Reg” class from cf-pandas to write these expressions easily. The following example shows setting up a custom vocabulary for identifying variables of “temp”, “salt”, and “ssh”.

import cf_pandas as cfp

nickname = "temp"
vocab = cfp.Vocab()

# define a regular expression to represent your variable
reg = cfp.Reg(include="temp", exclude=["air","qc","status","atmospheric"])

# Make an entry to add to your vocabulary
vocab.make_entry(nickname, reg.pattern(), attr="name")

vocab.make_entry("salt", cfp.Reg(include="sal", exclude=["soil","qc","status"]).pattern(), attr="name")
vocab.make_entry("ssh", cfp.Reg(include=["sea_surface_height","surface_elevation"], exclude=["qc","status"]).pattern(), attr="name")

# what does the vocabulary look like?
vocab.vocab
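
To see what an include/exclude expression does, here is a standalone sketch built with the standard `re` module. It is not the pattern that `cfp.Reg` generates, which may differ in its details; `include_exclude_pattern` and the sample variable names are hypothetical.

```python
import re


def include_exclude_pattern(include, exclude):
    """Regex matching names that contain `include` and none of the `exclude` words."""
    # One negative lookahead per excluded word, each checked from the start of the name.
    negative = "".join(f"(?!.*{re.escape(word)})" for word in exclude)
    return f"^{negative}.*{re.escape(include)}"


pattern = include_exclude_pattern("temp", ["air", "qc", "status", "atmospheric"])

# Made-up variable names, for illustration only.
names = [
    "sea_water_temperature",
    "air_temperature",
    "sea_water_temperature_qc",
    "wind_speed",
]
matches = [name for name in names if re.search(pattern, name)]
print(matches)
```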

You can set the vocabulary up so all commands will know about it:

import cf_xarray

cf_xarray.set_options(custom_criteria=vocab.vocab)  # for cf-xarray
cfp.set_options(custom_criteria=vocab.vocab)  # for cf-pandas

Alternatively, you can use your custom vocabulary with a context manager, as in the following example:

with cfp.set_options(custom_criteria=vocab.vocab):
    cat = intake.open_axds_cat(datatype="platform2", keys_to_match=["temp","salt"])
cat[list(cat)[0]].metadata["variables"]

Catalog metadata and options#

You can provide metadata at the catalog level with the input arguments name, description, and metadata to override the defaults.

cat = intake.open_axds_cat(datatype="platform2", name="Catalog name", description="This is the catalog.", page_size=1,
                           metadata={"special entry": "platforms"})
cat

ttl#

The default ttl argument, or time before force-reloading the catalog, is None, but can be overridden by inputting a value:

cat.ttl is None
cat = intake.open_axds_cat(datatype="platform2", page_size=1, ttl=60)
cat.ttl

Verbose#

Get information as the catalog function runs.

cat = intake.open_axds_cat(datatype="sensor_station", verbose=True, page_size=1)

Sensor Source#

You can use the intake AXDSSensorSource directly with intake.open_axds_sensor if you know the dataset_id (UUID) or the internal_id (Axiom station id). Alternatively, if you know the dataset_id, you can use intake.open_axds_cat and search for the sensor with search_for.

Note that only some metadata will be available until the dataset is read in, at which point the full metadata is also read in.

source = intake.open_axds_sensor(internal_id=110532, bin_interval="monthly")
source
source.read()

If you prefer a catalog approach for a known dataset_id, you can do that like this:

cat = intake.open_axds_cat(datatype="sensor_station", search_for="ism-aoos-noaa_nos_co_ops_9469439",
                           verbose=True)
cat[list(cat)[0]]

You can request that only specific data variable(s) be returned directly in the Sensor Source, though you need to know the parameterGroupId. You could find this by running the desired source once and looking at the metadata to select the IDs you want to use. For example, using the information from the catalog immediately above, we could set up the following:

source = intake.open_axds_sensor(internal_id=110532, bin_interval="monthly", only_pgids=[47])
source
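
One way to look up a parameterGroupId is to build a label-to-ID mapping from the `variables_details` entries in a source's metadata. The sketch below uses a trimmed, hand-copied subset of the metadata shown earlier for org_mxak_naked_island, not the station in the example above, and it makes no call to intake-axds.

```python
# Trimmed copy of the variables_details layout shown in the source-level metadata
# for org_mxak_naked_island; only the label and parameterGroupId keys are kept.
metadata = {
    "variables_details": [
        {"label": "Atmospheric Pressure: Barometric Pressure", "parameterGroupId": 9},
        {"label": "Dew Point", "parameterGroupId": 29},
        {"label": "Winds: Gusts", "parameterGroupId": 186},
    ]
}

# Map each human-readable label to its parameterGroupId.
pgid_by_label = {
    entry["label"]: entry["parameterGroupId"]
    for entry in metadata["variables_details"]
}

# List of IDs in the form that only_pgids takes in the example above.
only_pgids = [pgid_by_label["Dew Point"]]
print(only_pgids)
```

With a real source, the same comprehension could be applied to its `metadata["variables_details"]` once the full metadata is available.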