Overview#
Use intake-axds to create intake catalogs whose sources represent datasets in Axiom databases. You can search in time and space, as well as by variable and text, to narrow down to the datasets for your project, then easily read in the data.
import intake
Datatypes#
The default page size is 10, so requesting a datatype without any other input arguments returns at most the first 10 datasets of that datatype. The input argument page_size controls the maximum number of entries in the catalog.
Sensors (fixed location dataset like buoys)#
Access sensor datasets by creating an AXDS catalog with datatype="sensor_station". Note that webcam data is ignored.
cat = intake.open_axds_cat(datatype="sensor_station", page_size=10)
len(cat)
6
See what search was performed with .get_search_urls().
cat.get_search_urls()
['https://search.axds.co/v2/search?portalId=-1&page=1&pageSize=10&verbose=true&type=sensor_station']
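The query parameters of a search URL can be inspected with the Python standard library. The following is a minimal sketch that only decodes the URL shown above; it makes no assumption about how intake-axds builds the URL internally.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the get_search_urls() output above
url = (
    "https://search.axds.co/v2/search"
    "?portalId=-1&page=1&pageSize=10&verbose=true&type=sensor_station"
)

# parse_qs maps each parameter name to a list of values; keep the single value
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(f"{key} = {value}")
```

Here pageSize reflects the page_size argument and type reflects the datatype argument passed when the catalog was created.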
See catalog-level metadata:
cat
catalog:
args:
datatype: sensor_station
page_size: 10
description: Catalog of Axiom assets.
driver: intake_axds.axds_cat.AXDSCatalog
metadata:
kwargs_search:
search_for:
- null
pgids:
- null
pglabels:
- null
query_type: union
What sources make up the catalog?
list(cat)
['org_mxak_naked_island',
'ward-cove',
'urn:ioos:station:gov.usgs.waterdata:02312600',
'grave-point',
'org_mxak_mary_island',
'org_mxak_portland_island']
See source-level metadata for first source in catalog:
cat[list(cat)[0]]
org_mxak_naked_island:
args:
bin_interval: null
binned: false
end_time: null
internal_id: 5
only_pgids: null
qartod: false
start_time: null
use_units: true
uuid: org_mxak_naked_island
description: AXDS dataset_id org_mxak_naked_island of datatype sensor_station
driver: intake_axds.axds.AXDSSensorSource
metadata:
catalog_dir: ''
datumConversions: []
foreignNames:
- NKXA2
- NAKED_ISLAND
- null
internal_id: 5
maxLatitude: 58.255308
maxLongitude: -134.945049
maxTime: '2023-09-12T19:20:00Z'
metadata_url: https://sensors.axds.co/api/metadata/filter/custom?filter=%7B%22stations%22:%5B%225%22%5D%7D
minLatitude: 58.255308
minLongitude: -134.945049
minTime: '2015-05-05T14:10:00.000Z'
summary: Check that values are within reasonable bounds.
title: 'Atmospheric Pressure: Barometric Pressure'
uuid: org_mxak_naked_island
variables:
- wind_speed_of_gust
- relative_humidity
- wind_gust_from_direction
- wind_from_direction
- dew_point_temperature
- air_pressure
- air_temperature
- wind_speed
variables_details:
- annotations: []
label: 'Atmospheric Pressure: Barometric Pressure'
parameterGroupId: 9
plots:
- label: '[default]'
subPlots:
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: air_pressure
deviceId: 1014658
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014760
hasQc: true
instrument: {}
label: Barometric Pressure
maxVal: 1043.4
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: 965.8
minZ: 0.0
numObservations: 414369
parameterGroupId: 9
parameterId: 14
plotLabel: '[default]'
qcConfigId: 28
sensorParameterId: 14
startDate: '2015-05-05T14:10:00Z'
unitId: 24
units: millibars
- annotations: []
label: Dew Point
parameterGroupId: 29
plots:
- label: '[default]'
subPlots:
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: dew_point_temperature
deviceId: 1014654
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014756
hasQc: true
instrument: {}
label: Dew Point
maxVal: 17.22
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: -22.89
minZ: 0.0
numObservations: 414018
parameterGroupId: 29
parameterId: 16
plotLabel: '[default]'
qcConfigId: 72
sensorParameterId: 16
startDate: '2015-05-05T14:10:00Z'
unitId: 8
units: degree_Celsius
- annotations: []
label: 'Humidity: Relative Humidity'
parameterGroupId: 22
plots:
- label: '[default]'
subPlots:
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: relative_humidity
deviceId: 1014652
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014754
hasQc: true
instrument: {}
label: Relative Humidity
maxVal: 99.9
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: 21.1
minZ: 0.0
numObservations: 413936
parameterGroupId: 22
parameterId: 4
plotLabel: '[default]'
qcConfigId: 24
sensorParameterId: 4
startDate: '2015-05-05T14:10:00Z'
unitId: 1
units: '%'
- annotations: []
label: 'Temperature: Air Temperature'
parameterGroupId: 6
plots:
- label: '[default]'
subPlots:
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: air_temperature
deviceId: 1014651
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014753
hasQc: true
instrument: {}
label: Air Temperature
maxVal: 26.5
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: -14.0
minZ: 0.0
numObservations: 414491
parameterGroupId: 6
parameterId: 3
plotLabel: '[default]'
qcConfigId: 5
sensorParameterId: 3
startDate: '2015-05-05T14:10:00Z'
unitId: 8
units: degree_Celsius
- annotations: []
label: 'Winds: Gusts'
parameterGroupId: 186
plots:
- label: '[default]'
subPlots:
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: wind_speed_of_gust
deviceId: 1014656
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014758
hasQc: true
instrument: {}
label: Wind Gust
maxVal: 114.96
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: 0.58
minZ: 0.0
numObservations: 415753
parameterGroupId: 186
parameterId: 7
plotLabel: '[default]'
qcConfigId: null
sensorParameterId: 7
startDate: '2015-05-05T14:10:00Z'
unitId: 23
units: mile.hour-1
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: wind_gust_from_direction
deviceId: 1014657
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014759
hasQc: true
instrument: {}
label: Wind Gust From Direction
maxVal: 359.9
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: 0.0
minZ: 0.0
numObservations: 414353
parameterGroupId: 186
parameterId: 75
plotLabel: '[default]'
qcConfigId: 11
sensorParameterId: 75
startDate: '2015-05-05T14:10:00Z'
unitId: 10
units: degrees
- annotations: []
label: 'Winds: Speed and Direction'
parameterGroupId: 8
plots:
- label: '[default]'
subPlots:
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: wind_speed
deviceId: 1014653
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014755
hasQc: true
instrument: {}
label: Wind Speed
maxVal: 48.92
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: 0.1
minZ: 0.0
numObservations: 413810
parameterGroupId: 8
parameterId: 5
plotLabel: '[default]'
qcConfigId: 29
sensorParameterId: 5
startDate: '2015-05-05T14:10:00Z'
unitId: 27
units: m.s-1
- availableZ:
- 0.0
availableZBins: []
datasetVariableId: wind_from_direction
deviceId: 1014655
discriminant: null
endDate: '2023-09-12T19:20:00Z'
feeds:
- 1014757
hasQc: true
instrument: {}
label: Wind From Direction
maxVal: 359.9
maxZ: 0.0
medianTimeIntervalSecs: 600
minVal: 0.0
minZ: 0.0
numObservations: 417118
parameterGroupId: 8
parameterId: 6
plotLabel: '[default]'
qcConfigId: 6
sensorParameterId: 6
startDate: '2015-05-05T14:10:00Z'
unitId: 10
units: degrees
version: 2
Read data from first source in catalog. Note that since no start time or stop time was entered, the full data range will be read in, along with all available variables. The output is a DataFrame.
cat[list(cat)[0]].read()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 1
----> 1 cat[list(cat)[0]].read()
File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:404, in AXDSSensorSource.read(self)
402 def read(self):
403 """read data in"""
--> 404 return self._get_partition(None)
File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:399, in AXDSSensorSource._get_partition(self, _)
397 """get partition"""
398 if self._dataframe is None:
--> 399 self._load_metadata()
400 return self._dataframe
File ~/checkouts/readthedocs.org/user_builds/intake-axds/conda/latest/lib/python3.9/site-packages/intake/source/base.py:283, in DataSourceBase._load_metadata(self)
281 """load metadata only if needed"""
282 if self._schema is None:
--> 283 self._schema = self._get_schema()
284 self.dtype = self._schema.dtype
285 self.shape = self._schema.shape
File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:387, in AXDSSensorSource._get_schema(self)
383 """get schema"""
384 if self._dataframe is None:
385 # TODO: could do partial read with chunksize to get likely schema from
386 # first few records, rather than loading the whole thing
--> 387 self._load()
388 return base.Schema(
389 datashape=None,
390 dtype=self._dataframe.dtypes,
(...)
393 extra_metadata={},
394 )
File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:372, in AXDSSensorSource._load(self)
369 def _load(self):
370 """How to load in a specific station once you know it by uuid"""
--> 372 dfs = [self._load_to_dataframe(url) for url in self.data_urls]
374 df = dfs[0]
375 # this gets different and I think better results than dfs[0].join(dfs[1:], how="outer", sort=True)
376 # even though they should probably return the same thing.
File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:372, in <listcomp>(.0)
369 def _load(self):
370 """How to load in a specific station once you know it by uuid"""
--> 372 dfs = [self._load_to_dataframe(url) for url in self.data_urls]
374 df = dfs[0]
375 # this gets different and I think better results than dfs[0].join(dfs[1:], how="outer", sort=True)
376 # even though they should probably return the same thing.
File ~/checkouts/readthedocs.org/user_builds/intake-axds/checkouts/latest/intake_axds/axds.py:204, in AXDSSensorSource._load_to_dataframe(self, url)
202 if len(data_raw["data"]["groupedFeeds"]) == 0:
203 self._dataframe = None
--> 204 raise ValueError(f"No data found for url {url}.")
206 # loop over the data feeds and read the data into DataFrames
207 # link to other metadata as needed
208 dfs = []
ValueError: No data found for url https://sensors.axds.co/api/observations/filter/custom?filter=%7B%22stations%22%3A%5B%225%22%5D%7D&start=2015-05-05T14:10:00Z&end=2023-09-12T19:20:00Z.
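A read can fail for an individual source, as in the traceback above, where a ValueError reports that no data was found. The following is a minimal sketch of one way to skip such sources and move on to the next one. The helper name read_first_available is hypothetical and not part of intake-axds, and the stand-in class only imitates a source whose read() raises ValueError.

```python
def read_first_available(cat):
    """Return (name, data) for the first source in `cat` whose read() succeeds.

    `cat` is anything that iterates over source names and supports cat[name].
    Hypothetical helper, not part of intake-axds.
    """
    for name in list(cat):
        try:
            return name, cat[name].read()
        except ValueError as err:
            # e.g. "No data found for url ..." as in the traceback above
            print(f"skipping {name}: {err}")
    raise ValueError("no source in the catalog returned data")


class _FakeSource:
    """Stand-in for a catalog source, used only to exercise the helper."""

    def __init__(self, data=None):
        self._data = data

    def read(self):
        if self._data is None:
            raise ValueError("No data found")
        return self._data


# A plain dict stands in for a catalog here
fake_cat = {"empty_station": _FakeSource(), "good_station": _FakeSource([1.0, 2.0])}
name, data = read_first_available(fake_cat)
print(name, data)
```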
Sensor-specific options#
Options that are specific to sensors are QARTOD, units, and binning.
QARTOD#
All time series available for sensors optionally come with an aggregate QARTOD flag time series.
By default, QARTOD flags are not returned; they are returned if qartod=True is passed when creating the catalog. Alternatively, you can request only the values that correspond to specific flags (other values are set to NaN) with an input like qartod=[1,2], which returns only the values that either passed the QARTOD tests or were not evaluated. This option is not available if binned==True.
Flags are:
1: Pass
2: Not Evaluated
3: Suspect
4: Fail
9: Missing Data
More information on QARTOD is available here.
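To illustrate what selecting values by flag means, here is a minimal sketch that keeps values whose flag is in an allowed set and replaces the rest with NaN. The function mask_by_flags and the sample numbers are hypothetical; this is not the intake-axds implementation, only the idea behind an input like qartod=[1,2].

```python
import math

# Flag meanings as listed above
QARTOD_FLAGS = {1: "Pass", 2: "Not Evaluated", 3: "Suspect", 4: "Fail", 9: "Missing Data"}


def mask_by_flags(values, flags, keep=(1, 2)):
    """Replace each value whose flag is not in `keep` with NaN."""
    if len(values) != len(flags):
        raise ValueError("values and flags must have the same length")
    return [v if f in keep else math.nan for v, f in zip(values, flags)]


values = [10.1, 10.3, 55.0, 10.2]  # made-up observations
flags = [1, 2, 4, 3]               # made-up aggregate QARTOD flags
masked = mask_by_flags(values, flags)
print(masked)
```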
Units#
By default, units are included in the column names, with the syntax “standard_name [units]”. If use_units is False, no units are included and the syntax for column names is “standard_name”.
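If you need the variable name and the units separately, a column name in the “standard_name [units]” form can be split with plain string handling. This is a minimal sketch; split_column_name is a hypothetical helper, and the example column names are illustrative rather than copied from actual output.

```python
def split_column_name(column):
    """Split "standard_name [units]" into (standard_name, units).

    Returns units=None when the column name carries no bracketed units.
    """
    if column.endswith("]") and " [" in column:
        name, _, units = column[:-1].rpartition(" [")
        return name, units
    return column, None


print(split_column_name("air_temperature [degree_Celsius]"))
print(split_column_name("air_temperature"))
```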
Binning#
By default, raw data for sensors is returned. However, binned data can instead be returned by entering binned=True and a bin_interval option of hourly, daily, weekly, monthly, or yearly. If bin_interval is input, binned is set to True.
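The relationship between the two binning options can be sketched as follows. The function resolve_binning is hypothetical and only restates the rule described above (a given bin_interval implies binned=True); how intake-axds validates these inputs internally is not shown here.

```python
VALID_BIN_INTERVALS = ("hourly", "daily", "weekly", "monthly", "yearly")


def resolve_binning(binned=False, bin_interval=None):
    """Return the effective (binned, bin_interval) pair."""
    if bin_interval is not None:
        if bin_interval not in VALID_BIN_INTERVALS:
            raise ValueError(f"bin_interval must be one of {VALID_BIN_INTERVALS}")
        # Supplying an interval implies that binned data is wanted
        binned = True
    return binned, bin_interval


print(resolve_binning())                        # raw data
print(resolve_binning(bin_interval="monthly"))  # binned monthly
```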
Examples#
For example, the following would return data columns as well as associated QARTOD columns, without units in the column names:
cat = intake.open_axds_cat(datatype="sensor_station", qartod=True, use_units=False)
This example would return data columns binned monthly:
cat = intake.open_axds_cat(datatype="sensor_station", bin_interval="monthly")
Platforms (traveling sensor, like gliders)#
Access platform datasets by creating an AXDS catalog with datatype="platform2". Everything should work the same as demonstrated for sensors.
Data is output into a DataFrame for platforms. It is accessed by parquet file if available and otherwise by csv.
cat = intake.open_axds_cat(datatype="platform2")
list(cat)
See source-level metadata for first source in catalog:
cat[list(cat)[0]]
Filter in time and space#
When setting up an AXDS intake catalog, you can narrow your search in time and space. The longitude values min_lon and max_lon should be in the range -180 to 180. You can search through the kwargs_search keyword, or you can search explicitly using bbox (min_lon, min_lat, max_lon, max_lat) together with start_time and end_time.
kw = {
"min_lon": -180,
"max_lon": -158,
"min_lat": 50,
"max_lat": 66,
"min_time": '2015-1-1',
"max_time": '2015-1-2',
}
cat = intake.open_axds_cat(datatype='sensor_station', kwargs_search=kw, page_size=5)
len(cat)
cat[list(cat)[0]]
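The same search can be expressed with the explicit bbox, start_time, and end_time arguments. Below is a minimal sketch that converts the kwargs_search dictionary into that form and checks the longitude range. The helper to_explicit_args is hypothetical, and it assumes that min_time and max_time correspond to start_time and end_time.

```python
kw = {
    "min_lon": -180,
    "max_lon": -158,
    "min_lat": 50,
    "max_lat": 66,
    "min_time": "2015-1-1",
    "max_time": "2015-1-2",
}


def to_explicit_args(kw):
    """Build bbox/start_time/end_time arguments from a kwargs_search dict."""
    for key in ("min_lon", "max_lon"):
        if not -180 <= kw[key] <= 180:
            raise ValueError(f"{key} must be in the range -180 to 180")
    # bbox order as described above: (min_lon, min_lat, max_lon, max_lat)
    bbox = (kw["min_lon"], kw["min_lat"], kw["max_lon"], kw["max_lat"])
    return {"bbox": bbox, "start_time": kw["min_time"], "end_time": kw["max_time"]}


explicit = to_explicit_args(kw)
print(explicit)

# With intake-axds installed, these could then be passed on, for example:
# cat = intake.open_axds_cat(datatype="sensor_station", page_size=5, **explicit)
```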
Filter with keyword(s)#
You can also narrow your search by one or more keywords, by passing a string or list of strings with kwargs_search["search_for"] or explicitly using search_for. If you input more than one string, be aware that the multiple searches required will be combined according to query_type, either as a logical OR if query_type=="union" or as a logical AND if query_type=="intersection".
cat = intake.open_axds_cat(datatype='platform2', search_for=["whale", "bering"],
query_type="intersection", page_size=1000)
len(cat)
cat = intake.open_axds_cat(datatype='platform2', search_for=["whale", "bering"],
query_type="union", page_size=1000)
len(cat)
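The difference between the two query types can be pictured with sets of dataset ids, one set per keyword search. This is a conceptual sketch only; the function combine and the dataset ids are made up, and the actual combination is performed inside intake-axds.

```python
def combine(results_per_keyword, query_type="union"):
    """Combine per-keyword result lists as a logical OR or a logical AND."""
    sets = [set(results) for results in results_per_keyword]
    if not sets:
        return set()
    if query_type == "union":
        return set.union(*sets)
    if query_type == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown query_type: {query_type!r}")


whale = ["ds_a", "ds_b", "ds_c"]   # hypothetical hits for "whale"
bering = ["ds_b", "ds_c", "ds_d"]  # hypothetical hits for "bering"

print(sorted(combine([whale, bering], "union")))
print(sorted(combine([whale, bering], "intersection")))
```

A union can never be smaller than an intersection of the same searches, which is why the two len(cat) calls above are expected to differ.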
Filter by variable#
This section describes two approaches for searching by variable. As with search_for, how multiple variable requests are combined depends on the input choice of query_type. However, in the case of variables there are three options for query_type:
query_type=="union": logical OR
query_type=="intersection": logical AND
query_type=="intersection_constrained": logical AND, but also only the requested variables are returned.
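The three options can be sketched with a small lookup of datasets and their variables. The function select and the station names are hypothetical; the sketch only illustrates which datasets match and which variables would be returned under each query_type.

```python
def select(datasets, requested, query_type="union"):
    """Return {dataset_id: variables returned} for the matching datasets."""
    requested = set(requested)
    selected = {}
    for dataset_id, variables in datasets.items():
        variables = set(variables)
        if query_type == "union":
            match = bool(variables & requested)   # at least one requested variable
        elif query_type in ("intersection", "intersection_constrained"):
            match = requested <= variables        # all requested variables
        else:
            raise ValueError(f"unknown query_type: {query_type!r}")
        if match:
            if query_type == "intersection_constrained":
                variables = variables & requested  # drop everything not requested
            selected[dataset_id] = sorted(variables)
    return selected


datasets = {
    "station_a": ["sea_water_temperature", "air_temperature"],
    "station_b": ["sea_water_practical_salinity", "sea_water_temperature", "wind_speed"],
}
requested = ["sea_water_practical_salinity", "sea_water_temperature"]

for query_type in ("union", "intersection", "intersection_constrained"):
    print(query_type, select(datasets, requested, query_type))
```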
Select variable(s) to search for by standard_name#
Check available standard names with:
import intake_axds
standard_names = intake_axds.utils.available_names()
len(standard_names), standard_names[:5]
Make a catalog of sensors that contain either of the standard_names input.
std_names = ["sea_water_practical_salinity", "sea_water_temperature"]
cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names,
query_type="union")
cat[list(cat)[0]].metadata["variables"]
Make a catalog of sensors that contain both of the standard_names input.
std_names = ["sea_water_practical_salinity", "sea_water_temperature"]
cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names,
query_type="intersection", page_size=100)
cat[list(cat)[0]].metadata["variables"]
Make a catalog of sensors that contain both of the standard_names input and that also return only those two variable types. All variables available in the dataset will still be present in the metadata, but only values for the requested variables will be returned in the DataFrame. We can look at the catalog metadata to see the parameterGroupIds and parameterGroupLabels that will be used in data collection.
std_names = ["sea_water_practical_salinity", "sea_water_temperature"]
cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names,
query_type="intersection_constrained", page_size=100)
cat
If you request standard_names that aren’t present in the system, an exception is raised telling you that they aren’t present (the cell below is commented out for that reason).
# std_names = "sea_water_surface_salinity"
# cat = intake.open_axds_cat(datatype="sensor_station", standard_names=std_names)
Select variable(s) to search for by custom vocabulary#
Instead of selecting the exact standard_names to search on, you can set up a collection of regular expressions to match on the variables you want. This is particularly useful if you are running several different searches and will ultimately need to select data variables from datasets using a generic name.
Set up vocabulary#
One way to set up a custom vocabulary is with a helper class from cf-pandas (see more information in the docs). Choose a nickname for each variable you want to be able to match on, like “temp” for matching sea water temperature variables, then set up the regular expressions you want to “count” as your variable “temp”. You can use the “Reg” class from cf-pandas to write these expressions easily. The following example shows setting up a custom vocabulary for identifying variables of “temp”, “salt”, and “ssh”.
import cf_pandas as cfp
nickname = "temp"
vocab = cfp.Vocab()
# define a regular expression to represent your variable
reg = cfp.Reg(include="temp", exclude=["air","qc","status","atmospheric"])
# Make an entry to add to your vocabulary
vocab.make_entry(nickname, reg.pattern(), attr="name")
vocab.make_entry("salt", cfp.Reg(include="sal", exclude=["soil","qc","status"]).pattern(), attr="name")
vocab.make_entry("ssh", cfp.Reg(include=["sea_surface_height","surface_elevation"], exclude=["qc","status"]).pattern(), attr="name")
# what does the vocabulary look like?
vocab.vocab
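The include/exclude idea behind such a vocabulary entry can be sketched with the standard re module. This is not the pattern that cf-pandas generates; matches is a hypothetical helper that shows which names an include term plus a list of exclusions would accept.

```python
import re


def matches(name, include, exclude=()):
    """True if `name` contains `include` and none of the `exclude` terms."""
    if not re.search(include, name):
        return False
    return not any(re.search(term, name) for term in exclude)


names = [
    "sea_water_temperature",
    "air_temperature",
    "sea_water_temperature_qc",
    "sea_water_practical_salinity",
]

# Same include/exclude terms as the "temp" entry above
temp_like = [n for n in names if matches(n, "temp", exclude=["air", "qc", "status", "atmospheric"])]
print(temp_like)
```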
You can use your custom vocab with a context manager, as in the following example. Alternatively, you can set the vocabulary up so all commands will know about it:
cf_xarray.set_options(custom_criteria=vocab.vocab) # for cf-xarray
cfp.set_options(custom_criteria=vocab.vocab) # for cf-pandas
with cfp.set_options(custom_criteria=vocab.vocab):
    cat = intake.open_axds_cat(datatype="platform2", keys_to_match=["temp","salt"])
cat[list(cat)[0]].metadata["variables"]
Catalog metadata and options#
You can provide metadata at the catalog level with the input arguments name, description, and metadata to override the defaults.
cat = intake.open_axds_cat(datatype="platform2", name="Catalog name", description="This is the catalog.", page_size=1,
metadata={"special entry": "platforms"})
cat
ttl#
The default ttl argument, or time before force-reloading the catalog, is None, but can be overridden by inputting a value:
cat.ttl is None
cat = intake.open_axds_cat(datatype="platform2", page_size=1, ttl=60)
cat.ttl
Verbose#
Get information as the catalog function runs.
cat = intake.open_axds_cat(datatype="sensor_station", verbose=True, page_size=1)
Sensor Source#
You can use the intake AXDSSensorSource directly with intake.open_axds_sensor if you know the dataset_id (UUID) or the internal_id (Axiom station id). Alternatively, if you know the dataset_id, you can find the sensor with intake.open_axds_cat by passing the dataset_id as “search_for”.
Note that only some metadata will be available until the dataset is read in, at which point the full metadata is also read in.
source = intake.open_axds_sensor(internal_id=110532, bin_interval="monthly")
source
source.read()
If you prefer a catalog approach for a known dataset_id, you can do that like this:
cat = intake.open_axds_cat(datatype="sensor_station", search_for="ism-aoos-noaa_nos_co_ops_9469439",
verbose=True)
cat[list(cat)[0]]
You can request that only specific data variable(s) be returned directly in the Sensor Source, though you need to know the parameterGroupId. You could find it by running the desired source once and looking at the metadata to select the IDs you want to use. For example, using the information from the previous catalog listed immediately above, we could set up the following:
source = intake.open_axds_sensor(internal_id=110532, bin_interval="monthly", only_pgids=[47])
source