diff -Nru python-airr-1.3.1/NEWS.rst python-airr-1.5.0/NEWS.rst --- python-airr-1.3.1/NEWS.rst 2020-10-13 21:38:13.000000000 +0000 +++ python-airr-1.5.0/NEWS.rst 2023-08-29 20:02:43.000000000 +0000 @@ -1,3 +1,51 @@ +Version 1.5.0: August 29, 2023 +-------------------------------------------------------------------------------- + +1. Updated schema set and examples to v1.5. +2. Officially dropped support for Python 2. +3. Added check for valid enum values to schema validation routines. +4. Set enum values to first defined value during template generation routines. +5. Removed mock dependency installation in ReadTheDocs environments from setup. +6. Improved package import time. + + +Version 1.4.1: August 27, 2022 +-------------------------------------------------------------------------------- + +General: + +1. Updated pandas requirement to 0.24.0 or higher. +2. Added support for missing integer values (``NaN``) in ``load_rearrangement`` + by casting to the pandas ``Int64`` data type. +3. Added gzip support to ``read_rearrangement``. +4. Significant internal refactoring to improve schema generalizability, + harmonize behavior between the Python and R libraries, and prepare for + AIRR Standards v2.0. +5. Fixed a bug in the ``validate`` subcommand of ``airr-tools`` causing + validation errors to be reported only for the first invalid file when + multiple files were specified on the command line. + +Data Model and Schema: + +1. Added support for arrays of objects in a single JSON or YAML file. +2. Added support for the AIRR Data File and associated schema + (DataFile, Info). The Data File data format holds AIRR objects of + multiple types and is backwards compatible with Repertoire metadata. +3. Added support for the new germline and genotyping schema + (GermlineSet, GenotypeSet) and associated schema. +4. Renamed ``schema.CachedSchema`` to ``schema.AIRRSchema``. +5. Removed ``specs/blank.airr.yaml``. + +Deprecations: + +1. Deprecated ``load_repertoire``.
Use ``read_airr`` instead. +2. Deprecated ``write_repertoire``. Use ``write_airr`` instead. +3. Deprecated ``validate_repertoire``. Use ``validate_airr`` instead. +4. Deprecated ``repertoire_template``. Use ``schema.RepertoireSchema.template`` instead. +5. Deprecated the commandline tool ``airr-tools validate repertoire``. + Use ``airr-tools validate airr`` instead. + + Version 1.3.1: October 13, 2020 -------------------------------------------------------------------------------- diff -Nru python-airr-1.3.1/PKG-INFO python-airr-1.5.0/PKG-INFO --- python-airr-1.3.1/PKG-INFO 2020-10-13 21:51:22.000000000 +0000 +++ python-airr-1.5.0/PKG-INFO 2023-08-31 18:00:52.852684500 +0000 @@ -1,184 +1,205 @@ -Metadata-Version: 1.1 +Metadata-Version: 2.1 Name: airr -Version: 1.3.1 +Version: 1.5.0 Summary: AIRR Community Data Representation Standard reference library for antibody and TCR sequencing data. Home-page: http://docs.airr-community.org Author: AIRR Community -Author-email: UNKNOWN +Author-email: License: CC BY 4.0 -Description: Installation - ------------------------------------------------------------------------------ - - Install in the usual manner from PyPI:: - - > pip3 install airr --user - - Or from the `downloaded `__ - source code directory:: - - > python3 setup.py install --user - - - Quick Start - ------------------------------------------------------------------------------ - - Reading AIRR Repertoire metadata files - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - The ``airr`` package contains functions to read and write AIRR repertoire metadata - files. The file format is either YAML or JSON, and the package provides a - light wrapper over the standard parsers. The file needs a ``json``, ``yaml``, or ``yml`` - file extension so that the proper parser is utilized. 
All of the repertoires are loaded - into memory at once and no streaming interface is provided:: - - import airr - - # Load the repertoires - data = airr.load_repertoire('input.airr.json') - for rep in data['Repertoire']: - print(rep) - - Why are the repertoires in a list versus in a dictionary keyed by the ``repertoire_id``? - There are two primary reasons for this. First, the ``repertoire_id`` might not have been - assigned yet. Some systems might allow MiAIRR metadata to be entered but the - ``repertoire_id`` is assigned to that data later by another process. Without the - ``repertoire_id``, the data could not be stored in a dictionary. Secondly, the list allows - the repertoire data to have a default ordering. If you know that the repertoires all have - a unique ``repertoire_id`` then you can quickly create a dictionary object using a - comprehension:: - - rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] } - - Writing AIRR Repertoire metadata files - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - Writing AIRR repertoire metadata is also a light wrapper over standard YAML or JSON - parsers. The ``airr`` library provides a function to create a blank repertoire object - in the appropriate format with all of the required fields. As with the load function, - the complete list of repertoires are written at once, there is no streaming interface:: - - import airr - - # Create some blank repertoire objects in a list - reps = [] - for i in range(5): - reps.append(airr.repertoire_template()) - - # Write the repertoires - airr.write_repertoire('output.airr.json', reps) - - Reading AIRR Rearrangement TSV files - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - The ``airr`` package contains functions to read and write AIRR rearrangement files - as either iterables or pandas data frames. 
The usage is straightforward, - as the file format is a typical tab delimited file, but the package - performs some additional validation and type conversion beyond using a - standard CSV reader:: - - import airr - - # Create an iteratable that returns a dictionary for each row - reader = airr.read_rearrangement('input.tsv') - for row in reader: print(row) - - # Load the entire file into a pandas data frame - df = airr.load_rearrangement('input.tsv') - - Writing AIRR formatted files - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - Similar to the read operations, write functions are provided for either creating - a writer class to perform row-wise output or writing the entire contents of - a pandas data frame to a file. Again, usage is straightforward with the ``airr`` - output functions simply performing some type conversion and field ordering - operations:: - - import airr - - # Create a writer class for iterative row output - writer = airr.create_rearrangement('output.tsv') - for row in reader: writer.write(row) - - # Write an entire pandas data frame to a file - airr.dump_rearrangement(df, 'file.tsv') - - By default, ``create_rearrangement`` will only write the ``required`` fields - in the output file. Additional fields can be included in the output file by - providing the ``fields`` parameter with an array of additional field names:: - - # Specify additional fields in the output - fields = ['new_calc', 'another_field'] - writer = airr.create_rearrangement('output.tsv', fields=fields) - - A common operation is to read an AIRR rearrangement file, and then - write an AIRR rearrangement file with additional fields in it while - keeping all of the existing fields from the original file. 
The - ``derive_rearrangement`` function provides this capability:: - - import airr - - # Read rearrangement data and write new file with additional fields - reader = airr.read_rearrangement('input.tsv') - fields = ['new_calc'] - writer = airr.derive_rearrangement('output.tsv', 'input.tsv', fields=fields) - for row in reader: - row['new_calc'] = 'a value' - writer.write(row) - - - Validating AIRR data files - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - The ``airr`` package can validate repertoire and rearrangement data files - to insure that they contain all required fields and that the fields types - match the AIRR Schema. This can be done using the ``airr-tools`` command - line program or the validate functions in the library can be called:: - - # Validate a rearrangement file - airr-tools validate rearrangement -a input.tsv - - # Validate a repertoire metadata file - airr-tools validate repertoire -a input.airr.json - - Combining Repertoire metadata and Rearrangement files - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - The ``airr`` package does not keep track of which repertoire metadata files - are associated with rearrangement files, so users will need to handle those - associations themselves. However, in the data, the ``repertoire_id`` field forms - the link. The typical usage is that a program is going to perform some - computation on the rearrangements, and it needs access to the repertoire metadata - as part of the computation logic. 
This example code shows the basic framework - for doing that, in this case doing gender specific computation:: - - import airr - - # Load the repertoires - data = airr.load_repertoire('input.airr.json') - - # Put repertoires in dictionary keyed by repertoire_id - rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] } - - # Create an iteratable for rearrangement data - reader = airr.read_rearrangement('input.tsv') - for row in reader: - # get repertoire metadata with this rearrangement - rep = rep_dict[row['repertoire_id']] - - # check the gender - if rep['subject']['sex'] == 'male': - # do male specific computation - elif rep['subject']['sex'] == 'female': - # do female specific computation - else: - # do other specific computation - Keywords: AIRR,bioinformatics,sequencing,immunoglobulin,antibody,adaptive immunity,T cell,B cell,BCR,TCR -Platform: UNKNOWN Classifier: Intended Audience :: Science/Research Classifier: Natural Language :: English Classifier: Operating System :: OS Independent -Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics + +Installation +------------------------------------------------------------------------------ + +Install in the usual manner from PyPI:: + + > pip3 install airr --user + +Or from the `downloaded `__ +source code directory:: + + > python3 setup.py install --user + + +Quick Start +------------------------------------------------------------------------------ + +Deprecation Notice +^^^^^^^^^^^^^^^^^^^^ + +The ``load_repertoire``, ``write_repertoire``, and ``validate_repertoire`` functions +have been deprecated in favor of the new generic ``read_airr``, ``write_airr``, and +``validate_airr`` functions. These new functions are backwards compatible with +the Repertoire metadata format but also support the new AIRR objects such as GermlineSet, +RepertoireGroup, GenotypeSet, Cell and Clone.
This new format is defined by the DataFile +Schema, which describes a standard set of objects included in a file containing +AIRR Data Model representations. Currently, the AIRR DataFile does not completely support +Rearrangement, so users should continue using AIRR TSV files and their specific functions. +Also, the ``repertoire_template`` function has been deprecated in favor of the ``Schema.template`` +method, which can now be called on any AIRR Schema to create a blank object. + +Reading AIRR Data Files +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``airr`` package contains functions to read and write AIRR Data +Model files. The file format is either YAML or JSON, and the package provides a +light wrapper over the standard parsers. The file needs a ``json``, ``yaml``, or ``yml`` +file extension so that the proper parser is utilized. All of the AIRR objects +are loaded into memory at once and no streaming interface is provided:: + + import airr + + # Load the AIRR data + data = airr.read_airr('input.airr.json') + # loop through the repertoires + for rep in data['Repertoire']: + print(rep) + +Why are the AIRR objects, such as Repertoire, GermlineSet, etc., in a list versus in a +dictionary keyed by their identifier (e.g., ``repertoire_id``)? There are two primary reasons for +this. First, the identifier might not have been assigned yet. Some systems might allow MiAIRR +metadata to be entered but the identifier is assigned to that data later by another process. Without +the identifier, the data could not be stored in a dictionary. Secondly, the list allows the data to +have a default ordering. If you know that the data has a unique identifier then you can quickly +create a dictionary object using a comprehension.
For example, with repertoires:: + + rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] } + +another example with germline sets:: + + germline_dict = { obj['germline_set_id'] : obj for obj in data['GermlineSet'] } + +Writing AIRR Data Files +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Writing an AIRR Data File is also a light wrapper over standard YAML or JSON +parsers. Multiple AIRR objects, such as Repertoire, GermlineSet, and etc., can be +written together into the same file. In this example, we use the ``airr`` library ``template`` +method to create some blank Repertoire objects, and write them to a file. +As with the read function, the complete list of repertoires are written at once, +there is no streaming interface:: + + import airr + + # Create some blank repertoire objects in a list + data = { 'Repertoire': [] } + for i in range(5): + data['Repertoire'].append(airr.schema.RepertoireSchema.template()) + + # Write the AIRR Data + airr.write_airr('output.airr.json', data) + +Reading AIRR Rearrangement TSV files +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``airr`` package contains functions to read and write AIRR Rearrangement +TSV files as either iterables or pandas data frames. 
The usage is straightforward, +as the file format is a typical tab delimited file, but the package +performs some additional validation and type conversion beyond using a +standard CSV reader:: + + import airr + + # Create an iteratable that returns a dictionary for each row + reader = airr.read_rearrangement('input.tsv') + for row in reader: print(row) + + # Load the entire file into a pandas data frame + df = airr.load_rearrangement('input.tsv') + +Writing AIRR Rearrangement TSV files +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Similar to the read operations, write functions are provided for either creating +a writer class to perform row-wise output or writing the entire contents of +a pandas data frame to a file. Again, usage is straightforward with the ``airr`` +output functions simply performing some type conversion and field ordering +operations:: + + import airr + + # Create a writer class for iterative row output + writer = airr.create_rearrangement('output.tsv') + for row in reader: writer.write(row) + + # Write an entire pandas data frame to a file + airr.dump_rearrangement(df, 'file.tsv') + +By default, ``create_rearrangement`` will only write the ``required`` fields +in the output file. Additional fields can be included in the output file by +providing the ``fields`` parameter with an array of additional field names:: + + # Specify additional fields in the output + fields = ['new_calc', 'another_field'] + writer = airr.create_rearrangement('output.tsv', fields=fields) + +A common operation is to read an AIRR rearrangement file, and then +write an AIRR rearrangement file with additional fields in it while +keeping all of the existing fields from the original file. 
The +``derive_rearrangement`` function provides this capability:: + + import airr + + # Read rearrangement data and write new file with additional fields + reader = airr.read_rearrangement('input.tsv') + fields = ['new_calc'] + writer = airr.derive_rearrangement('output.tsv', 'input.tsv', fields=fields) + for row in reader: + row['new_calc'] = 'a value' + writer.write(row) + + +Validating AIRR data files +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``airr`` package can validate AIRR Data Model JSON/YAML files and Rearrangement +TSV files to ensure that they contain all required fields and that the fields types +match the AIRR Schema. This can be done using the ``airr-tools`` command +line program or the validate functions in the library can be called:: + + # Validate a rearrangement TSV file + airr-tools validate rearrangement -a input.tsv + + # Validate an AIRR DataFile + airr-tools validate airr -a input.airr.json + +Combining Repertoire metadata and Rearrangement files +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``airr`` package does not currently keep track of which AIRR Data Model files +are associated with which Rearrangement TSV files, though there is ongoing work to define +a standardized manifest, so users will need to handle those +associations themselves. However, in the data, AIRR identifier fields, such as ``repertoire_id``, +form the link between objects in the AIRR Data Model. +The typical usage is that a program is going to perform some +computation on the Rearrangements, and it needs access to the Repertoire metadata +as part of the computation logic. 
This example code shows the basic framework +for doing that, in this case doing gender specific computation:: + + import airr + + # Load AIRR data containing repertoires + data = airr.read_airr('input.airr.json') + + # Put repertoires in dictionary keyed by repertoire_id + rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] } + + # Create an iterable for rearrangement data + reader = airr.read_rearrangement('input.tsv') + for row in reader: + # get repertoire metadata with this rearrangement + rep = rep_dict[row['repertoire_id']] + + # check the gender + if rep['subject']['sex'] == 'male': + pass # do male specific computation + elif rep['subject']['sex'] == 'female': + pass # do female specific computation + else: + pass # do other specific computation + diff -Nru python-airr-1.3.1/README.rst python-airr-1.5.0/README.rst --- python-airr-1.3.1/README.rst 2020-10-13 21:31:46.000000000 +0000 +++ python-airr-1.5.0/README.rst 2022-08-28 17:34:43.000000000 +0000 @@ -14,56 +14,76 @@ Quick Start ------------------------------------------------------------------------------ -Reading AIRR Repertoire metadata files +Deprecation Notice +^^^^^^^^^^^^^^^^^^^^ + +The ``load_repertoire``, ``write_repertoire``, and ``validate_repertoire`` functions +have been deprecated in favor of the new generic ``read_airr``, ``write_airr``, and +``validate_airr`` functions. These new functions are backwards compatible with +the Repertoire metadata format but also support the new AIRR objects such as GermlineSet, +RepertoireGroup, GenotypeSet, Cell and Clone. This new format is defined by the DataFile +Schema, which describes a standard set of objects included in a file containing +AIRR Data Model representations. Currently, the AIRR DataFile does not completely support +Rearrangement, so users should continue using AIRR TSV files and their specific functions.
+Also, the ``repertoire_template`` function has been deprecated in favor of the ``Schema.template`` +method, which can now be called on any AIRR Schema to create a blank object. + +Reading AIRR Data Files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``airr`` package contains functions to read and write AIRR repertoire metadata -files. The file format is either YAML or JSON, and the package provides a +The ``airr`` package contains functions to read and write AIRR Data +Model files. The file format is either YAML or JSON, and the package provides a light wrapper over the standard parsers. The file needs a ``json``, ``yaml``, or ``yml`` -file extension so that the proper parser is utilized. All of the repertoires are loaded -into memory at once and no streaming interface is provided:: +file extension so that the proper parser is utilized. All of the AIRR objects +are loaded into memory at once and no streaming interface is provided:: import airr - # Load the repertoires - data = airr.load_repertoire('input.airr.json') + # Load the AIRR data + data = airr.read_airr('input.airr.json') + # loop through the repertoires for rep in data['Repertoire']: print(rep) -Why are the repertoires in a list versus in a dictionary keyed by the ``repertoire_id``? -There are two primary reasons for this. First, the ``repertoire_id`` might not have been -assigned yet. Some systems might allow MiAIRR metadata to be entered but the -``repertoire_id`` is assigned to that data later by another process. Without the -``repertoire_id``, the data could not be stored in a dictionary. Secondly, the list allows -the repertoire data to have a default ordering. If you know that the repertoires all have -a unique ``repertoire_id`` then you can quickly create a dictionary object using a -comprehension:: +Why are the AIRR objects, such as Repertoire, GermlineSet, etc., in a list versus in a +dictionary keyed by their identifier (e.g., ``repertoire_id``)?
There are two primary reasons for +this. First, the identifier might not have been assigned yet. Some systems might allow MiAIRR +metadata to be entered but the identifier is assigned to that data later by another process. Without +the identifier, the data could not be stored in a dictionary. Secondly, the list allows the data to +have a default ordering. If you know that the data has a unique identifier then you can quickly +create a dictionary object using a comprehension. For example, with repertoires:: rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] } -Writing AIRR Repertoire metadata files +another example with germline sets:: + + germline_dict = { obj['germline_set_id'] : obj for obj in data['GermlineSet'] } + +Writing AIRR Data Files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Writing AIRR repertoire metadata is also a light wrapper over standard YAML or JSON -parsers. The ``airr`` library provides a function to create a blank repertoire object -in the appropriate format with all of the required fields. As with the load function, -the complete list of repertoires are written at once, there is no streaming interface:: +Writing an AIRR Data File is also a light wrapper over standard YAML or JSON +parsers. Multiple AIRR objects, such as Repertoire, GermlineSet, and etc., can be +written together into the same file. In this example, we use the ``airr`` library ``template`` +method to create some blank Repertoire objects, and write them to a file. 
+As with the read function, the complete list of repertoires are written at once, +there is no streaming interface:: import airr # Create some blank repertoire objects in a list - reps = [] + data = { 'Repertoire': [] } for i in range(5): - reps.append(airr.repertoire_template()) + data['Repertoire'].append(airr.schema.RepertoireSchema.template()) - # Write the repertoires - airr.write_repertoire('output.airr.json', reps) + # Write the AIRR Data + airr.write_airr('output.airr.json', data) Reading AIRR Rearrangement TSV files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``airr`` package contains functions to read and write AIRR rearrangement files -as either iterables or pandas data frames. The usage is straightforward, +The ``airr`` package contains functions to read and write AIRR Rearrangement +TSV files as either iterables or pandas data frames. The usage is straightforward, as the file format is a typical tab delimited file, but the package performs some additional validation and type conversion beyond using a standard CSV reader:: @@ -77,7 +97,7 @@ # Load the entire file into a pandas data frame df = airr.load_rearrangement('input.tsv') -Writing AIRR formatted files +Writing AIRR Rearrangement TSV files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Similar to the read operations, write functions are provided for either creating @@ -122,32 +142,34 @@ Validating AIRR data files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``airr`` package can validate repertoire and rearrangement data files -to insure that they contain all required fields and that the fields types +The ``airr`` package can validate AIRR Data Model JSON/YAML files and Rearrangement +TSV files to ensure that they contain all required fields and that the fields types match the AIRR Schema. 
This can be done using the ``airr-tools`` command line program or the validate functions in the library can be called:: - # Validate a rearrangement file + # Validate a rearrangement TSV file airr-tools validate rearrangement -a input.tsv - # Validate a repertoire metadata file - airr-tools validate repertoire -a input.airr.json + # Validate an AIRR DataFile + airr-tools validate airr -a input.airr.json Combining Repertoire metadata and Rearrangement files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``airr`` package does not keep track of which repertoire metadata files -are associated with rearrangement files, so users will need to handle those -associations themselves. However, in the data, the ``repertoire_id`` field forms -the link. The typical usage is that a program is going to perform some -computation on the rearrangements, and it needs access to the repertoire metadata +The ``airr`` package does not currently keep track of which AIRR Data Model files +are associated with which Rearrangement TSV files, though there is ongoing work to define +a standardized manifest, so users will need to handle those +associations themselves. However, in the data, AIRR identifier fields, such as ``repertoire_id``, +form the link between objects in the AIRR Data Model. +The typical usage is that a program is going to perform some +computation on the Rearrangements, and it needs access to the Repertoire metadata as part of the computation logic. 
This example code shows the basic framework for doing that, in this case doing gender specific computation:: import airr - # Load the repertoires - data = airr.load_repertoire('input.airr.json') + # Load AIRR data containing repertoires + data = airr.read_airr('input.airr.json') # Put repertoires in dictionary keyed by repertoire_id rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] } @@ -165,3 +187,4 @@ # do female specific computation else: # do other specific computation + diff -Nru python-airr-1.3.1/airr/_version.py python-airr-1.5.0/airr/_version.py --- python-airr-1.3.1/airr/_version.py 2020-10-13 21:51:22.000000000 +0000 +++ python-airr-1.5.0/airr/_version.py 2023-08-31 18:00:52.853881000 +0000 @@ -8,11 +8,11 @@ version_json = ''' { - "date": "2020-10-13T14:38:25-0700", + "date": "2023-08-29T13:03:47-0700", "dirty": false, "error": null, - "full-revisionid": "725baa9cacf0db009e1bfbb276837c9a05d1f965", - "version": "1.3.1" + "full-revisionid": "a98d307a190fc03143fbf2d3d20966d647da28f8", + "version": "1.5.0" } ''' # END VERSION_JSON diff -Nru python-airr-1.3.1/airr/interface.py python-airr-1.5.0/airr/interface.py --- python-airr-1.3.1/airr/interface.py 2020-10-13 21:31:46.000000000 +0000 +++ python-airr-1.5.0/airr/interface.py 2022-08-28 17:34:43.000000000 +0000 @@ -4,20 +4,28 @@ from __future__ import absolute_import # System imports +import gzip +import json import sys import pandas as pd -from collections import OrderedDict -from itertools import chain -from pkg_resources import resource_filename -import json import yaml import yamlordereddictloader +from collections import OrderedDict +from itertools import chain from io import open +from warnings import warn + +if (sys.version_info > (3, 0)): + from io import StringIO +else: + # Python 2 code in this block + from io import BytesIO as StringIO # Load imports from airr.io import RearrangementReader, RearrangementWriter -from airr.schema import ValidationError, RearrangementSchema, 
RepertoireSchema +from airr.schema import Schema, RearrangementSchema, RepertoireSchema, AIRRSchema, DataFileSchema, ValidationError +#### Rearrangement #### def read_rearrangement(filename, validate=False, debug=False): """ @@ -32,8 +40,12 @@ Returns: airr.io.RearrangementReader: iterable reader class. """ - - return RearrangementReader(open(filename, 'r'), validate=validate, debug=debug) + if filename.endswith(".gz"): + handle = gzip.open(filename, 'rt') + else: + handle = open(filename, 'r') + + return RearrangementReader(handle, validate=validate, debug=debug) def create_rearrangement(filename, fields=None, debug=False): @@ -86,14 +98,19 @@ pandas.DataFrame: Rearrangement records as rows of a data frame. """ # TODO: test pandas.DataFrame.read_csv with converters argument as an alterative - # schema = RearrangementSchema - # df = pd.read_csv(handle, sep='\t', header=0, index_col=None, - # dtype=schema.numpy_types(), true_values=schema.true_values, - # false_values=schema.true_values) - # return df - with open(filename, 'r') as handle: - reader = RearrangementReader(handle, validate=validate, debug=debug) - df = pd.DataFrame(list(reader)) + schema = RearrangementSchema + + df = pd.read_csv(filename, sep='\t', header=0, index_col=None, + dtype=schema.pandas_types(), true_values=schema.true_values, + false_values=schema.false_values) + # added to use RearrangementReader without modifying it: + buffer = StringIO() # create an empty buffer + df.to_csv(buffer, sep='\t', index=False) # fill buffer + buffer.seek(0) # set to the start of the stream + + reader = RearrangementReader(buffer, validate=validate, debug=debug) + + df = pd.DataFrame(list(reader)) return df @@ -205,36 +222,224 @@ return valid +#### AIRR Data Model #### -def load_repertoire(filename, validate=False, debug=False): +def read_airr(filename, format=None, validate=False, model=True, debug=False): """ - Load an AIRR repertoire metadata file + Load an AIRR Data file Arguments: filename (str): path to the
input file. + format (str): input file format valid strings are "yaml" or "json". If set to None, + the file format will be automatically detected from the file extension. validate (bool): whether to validate data as it is read, raising a ValidationError - exception in the event of an error. + exception in the event of a validation failure. + model (bool): If True only validate objects defined in the AIRR DataFile schema. + If False, attempt validation of all top-level objects. + Ignored if validate=False. debug (bool): debug flag. If True print debugging information to standard error. Returns: - list: list of Repertoire dictionaries. + dict: dictionary of AIRR Data objects. """ - # Because the repertoires are read in completely, we do not bother - # with a reader class. - md = None - - # determine file type from extension and use appropriate loader - ext = filename.split('.')[-1] + # Because the AIRR Data File is read in completely, we do not bother with a reader class. + # Determine file type from extension and use appropriate loader + ext = str.lower(filename.split('.')[-1]) if not format else format if ext in ('yaml', 'yml'): with open(filename, 'r', encoding='utf-8') as handle: - md = yaml.load(handle, Loader=yamlordereddictloader.Loader) + data = yaml.load(handle, Loader=yamlordereddictloader.Loader) elif ext == 'json': with open(filename, 'r', encoding='utf-8') as handle: - md = json.load(handle) + data = json.load(handle) + else: + if debug: sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % ext) + raise TypeError('Unknown file type: %s. 
Supported file extensions are "yaml", "yml" or "json"\n' % ext) + data = None + + # Validate if requested + if validate: + if debug: sys.stderr.write('Validating: %s\n' % filename) + try: + valid = validate_airr(data, model=model, debug=debug) + except ValidationError as e: + if debug: sys.stderr.write('%s failed validation\n' % filename) + raise ValidationError(e) + + # We do not perform any additional processing + return data + + +def validate_airr(data, model=True, debug=False): + """ + Validates a dictionary of AIRR Data Model objects + + Arguments: + data (dict): dictionary containing AIRR Data Model objects + model (bool): If True only validate objects defined in the AIRR DataFile schema. + If False, attempt validation of all top-level objects. + debug (bool): debug flag. If True print debugging information to standard error. + + Returns: + bool: True if the data passed validation; a ValidationError is raised otherwise. + """ + # Type check that input type is either dict or OrderedDict + if not hasattr(data, 'items'): + if debug: sys.stderr.write('Data parameter is not a dictionary\n') + raise TypeError('Data parameter is not a dictionary') + + # Loop through each AIRR object and validate + valid = True + for k, object in data.items(): + if k in ('Info', 'DataFile'): continue + if not object: continue + + # Check for DataFile schema + if model and k not in DataFileSchema.properties: + if debug: sys.stderr.write('Skipping non-DataFile object: %s\n' % k) + continue + + # Get Schema + schema = AIRRSchema.get(k, Schema(k)) + + # Determine input type and set appropriate iterator + if hasattr(object, 'items'): + # Validate named array (dict) + obj_iter = object.items() + # Validate named array (dict) or a single object (dict) + # obj_iter = object.items() if 'definition' not in object.keys() else [0, object] + elif isinstance(object, list): + # Validate array + obj_iter = enumerate(object) + else: + # Unrecognized data structure + valid = False + if debug: sys.stderr.write('%s is an unrecognized data structure:
%s\n' % k) + continue + + # Validate each record in array + for i, record in obj_iter: + try: + schema.validate_object(record) + except ValidationError as e: + valid = False + if debug: sys.stderr.write('%s at array position %s with validation error: %s\n' % (k, i, e)) + + if not valid: + raise ValidationError('AIRR Data Model has validation failures') + + return valid + + +def write_airr(filename, data, format=None, info=None, validate=False, model=True, debug=False): + """ + Write an AIRR Data file + + Arguments: + filename (str): path to the output file. + data (dict): dictionary of AIRR Data Model objects. + format (str): output file format valid strings are "yaml" or "json". If set to None, + the file format will be automatically detected from the file extension. + info (object): info object to write. Will write current AIRR Schema info if not specified. + validate (bool): whether to validate data before it is written, raising a ValidationError + exception in the event of a validation failure. + model (bool): If True only validate and write objects defined in the AIRR DataFile schema. + If False, attempt validation and write of all top-level objects + debug (bool): debug flag. If True print debugging information to standard error. + + Returns: + bool: True if the file is written without error. 
+ """ + # Type check that input type is either dict or OrderedDict + if not hasattr(data, 'items'): + if debug: sys.stderr.write('Data parameter is not a dictionary\n') + raise TypeError('Data parameter is not a dictionary') + + # Validate if requested + if validate: + if debug: sys.stderr.write('Validating: %s\n' % filename) + try: + valid = validate_airr(data, model=model, debug=debug) + except ValidationError as e: + if debug: sys.stderr.write(e) + raise ValidationError(e) + + md = OrderedDict() + if info is None: + info = RearrangementSchema.info.copy() + info['title'] = 'AIRR Data File' + info['description'] = 'AIRR Data File written by AIRR Standards Python Library' + md['Info'] = info + + # Loop through each object and add them to the output dict + for k, obj in data.items(): + if k in ('Info', 'DataFile'): continue + if not obj: continue + if model and k not in DataFileSchema.properties: + if debug: sys.stderr.write('Skipping non-DataFile object: %s\n' % k) + continue + md[k] = obj + + # Determine file type from extension and use appropriate loader + ext = str.lower(filename.split('.')[-1]) if not format else format + if ext in ('yaml', 'yml'): + with open(filename, 'w') as handle: + yaml.dump(md, handle, default_flow_style=False) + elif ext == 'json': + with open(filename, 'w') as handle: + json.dump(md, handle, sort_keys=False, indent=2) else: if debug: - sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext)) - raise TypeError('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext)) + sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % ext) + raise TypeError('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % ext) + + return True + + +#### Deprecated #### + +def repertoire_template(): + """ + Return a blank repertoire object from the template. 
This object has the complete + structure with all of the fields and all values set to None or empty string. + + Returns: + object: empty repertoire object. + + .. deprecated:: 1.4 + Use :meth:`schema.Schema.template` instead. + """ + # Deprecation + warn('repertoire_template is deprecated and will be removed in a future release.\nUse schema.Schema.template instead.\n', + DeprecationWarning, stacklevel=2) + + # Build template + object = RepertoireSchema.template() + + return object + + +def load_repertoire(filename, validate=False, debug=False): + """ + Load an AIRR repertoire metadata file + + Arguments: + filename (str): path to the input file. + validate (bool): whether to validate data as it is read, raising a ValidationError + exception in the event of an error. + debug (bool): debug flag. If True print debugging information to standard error. + + Returns: + dict: dictionary of AIRR Data objects. + + .. deprecated:: 1.4 + Use :func:`read_airr` instead. + """ + # Deprecation + warn('load_repertoire is deprecated and will be removed in a future release.\nUse read_airr instead.\n', + DeprecationWarning, stacklevel=2) + + # use standard load function, we only validate Repertoire if requested + md = read_airr(filename, validate=validate, debug=debug) if md.get('Repertoire') is None: if debug: @@ -271,7 +476,14 @@ Returns: bool: True if files passed validation, otherwise False. + + .. deprecated:: 1.4 + Use :func:`validate_airr` instead. """ + # Deprecation + warn('validate_repertoire is deprecated and will be removed in a future release.\nUse validate_airr instead.\n', + DeprecationWarning, stacklevel=2) + valid = True if debug: sys.stderr.write('Validating: %s\n' % filename) @@ -302,7 +514,14 @@ Returns: bool: True if the file is written without error. + + .. deprecated:: 1.4 + Use :func:`write_airr` instead. 
""" + # Deprecation + warn('write_repertoire is deprecated and will be removed in a future release.\nUse write_airr instead.\n', + DeprecationWarning, stacklevel=2) + if not isinstance(repertoires, list): if debug: sys.stderr.write('Repertoires parameter is not a list\n') @@ -316,37 +535,4 @@ md['Info'] = info md['Repertoire'] = repertoires - # determine file type from extension and use appropriate loader - ext = filename.split('.')[-1] - if ext == 'yaml' or ext == 'yml': - with open(filename, 'w') as handle: - md = yaml.dump(md, handle, default_flow_style=False) - elif ext == 'json': - with open(filename, 'w') as handle: - md = json.dump(md, handle, sort_keys=False, indent=2) - else: - if debug: - sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext)) - raise TypeError('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext)) - - return True - - -def repertoire_template(): - """ - Return a blank repertoire object from the template. This object has the complete - structure with all of the fields and all values set to None or empty string. - - Returns: - object: empty repertoire object. - """ - - # TODO: I suppose we should dynamically create this from the schema - # versus loading a template. 
- - # Load blank template - f = resource_filename(__name__, 'specs/blank.airr.yaml') - object = load_repertoire(f) - - return object['Repertoire'][0] - + return write_airr(filename, md, info=info, debug=debug) diff -Nru python-airr-1.3.1/airr/schema.py python-airr-1.5.0/airr/schema.py --- python-airr-1.3.1/airr/schema.py 2020-05-27 18:18:47.000000000 +0000 +++ python-airr-1.5.0/airr/schema.py 2023-06-12 17:40:34.000000000 +0000 @@ -6,8 +6,12 @@ import sys import yaml import yamlordereddictloader +from collections import OrderedDict from pkg_resources import resource_stream +with resource_stream(__name__, 'specs/airr-schema.yaml') as f: + DEFAULT_SPEC = yaml.load(f, Loader=yamlordereddictloader.Loader) + class ValidationError(Exception): """ @@ -21,22 +25,23 @@ AIRR schema definitions Attributes: - properties (collections.OrderedDict): field definitions. + definition: name of the schema definition. info (collections.OrderedDict): schema info. + properties (collections.OrderedDict): field definitions. required (list): list of mandatory fields. optional (list): list of non-required fields. false_values (list): accepted string values for False. true_values (list): accepted values for True. 
""" # Boolean list for pandas - true_values = ['True', 'true', 'TRUE', 'T', 't', '1', 1, True] - false_values = ['False', 'false', 'FALSE', 'F', 'f', '0', 0, False] + true_values = ['True', 'true', 'TRUE', 'T', 't', '1'] + false_values = ['False', 'false', 'FALSE', 'F', 'f', '0'] # Generate dicts for booleans - _to_bool_map = {x: True for x in true_values} - _to_bool_map.update({x: False for x in false_values}) + _to_bool_map = {x: True for x in true_values + [1, True]} + _to_bool_map.update({x: False for x in false_values + [0, False]}) _from_bool_map = {k: 'T' if v else 'F' for k, v in _to_bool_map.items()} - + def __init__(self, definition): """ Initialization @@ -52,15 +57,18 @@ raise KeyError('Info is an invalid schema definition name') # Load object definition - with resource_stream(__name__, 'specs/airr-schema.yaml') as f: - spec = yaml.load(f, Loader=yamlordereddictloader.Loader) + if isinstance(definition, dict): # on-the-fly definition of a nested object + self.definition = definition + spec = {'Info': []} + else: + spec = DEFAULT_SPEC - try: - self.definition = spec[definition] - except KeyError: - raise KeyError('Schema definition %s cannot be found in the specifications' % definition) - except: - raise + try: + self.definition = spec[definition] + except KeyError: + raise KeyError('Schema definition %s cannot be found in the specifications' % definition) + except: + raise try: self.info = spec['Info'] @@ -69,14 +77,31 @@ except: raise - self.properties = self.definition['properties'] - - try: - self.required = self.definition['required'] - except KeyError: + if self.definition.get('properties') is not None: + self.properties = self.definition['properties'] + try: + self.required = self.definition['required'] + except KeyError: + self.required = [] + except: + raise + elif self.definition.get('allOf') is not None: + self.properties = {} self.required = [] - except: - raise + for s in self.definition['allOf']: + if s.get('$ref') is not None: + 
schema_name = s['$ref'].split('/')[-1] + # cannot use cache here + schema = Schema(schema_name) + # no nested allOf ... + self.properties.update(schema.properties) + self.required.extend(schema.required) + elif s.get('properties') is not None: + self.properties.update(s.get('properties')) + if s.get('required') is not None: + self.required.extend(s.get('required')) + else: + raise KeyError('Cannot find properties for schema definition %s' % definition) self.optional = [f for f in self.properties if f not in self.required] @@ -106,20 +131,25 @@ field_type = field_spec.get('type', None) if field_spec else None return field_type - # import numpy as np - # def numpy_types(self): - # type_mapping = {} - # for property in self.properties: - # if self.type(property) == 'boolean': - # type_mapping[property] = np.bool - # elif self.type(property) == 'integer': - # type_mapping[property] = np.int64 - # elif self.type(property) == 'number': - # type_mapping[property] = np.float64 - # elif self.type(property) == 'string': - # type_mapping[property] = np.unicode_ - # - # return type_mapping + def pandas_types(self): + """ + Map of schema types to pandas types + + Returns: + dict: mapping dictionary for pandas types + """ + type_mapping = {} + for property in self.properties: + if self.type(property) == 'boolean': + type_mapping[property] = bool + elif self.type(property) == 'integer': + type_mapping[property] = 'Int64' + elif self.type(property) == 'number': + type_mapping[property] = 'float64' + elif self.type(property) == 'string': + type_mapping[property] = str + + return type_mapping def to_bool(self, value, validate=False): """ @@ -193,7 +223,7 @@ return int(value) except ValueError: if validate: - raise ValidationError('invalid int %s'% value) + raise ValidationError('invalid int %s' % value) else: return None @@ -271,15 +301,15 @@ # Check types spec = self.type(f) try: - if spec == 'boolean': self.to_bool(row[f], validate=True) - if spec == 'integer': self.to_int(row[f], 
validate=True) - if spec == 'number': self.to_float(row[f], validate=True) + if spec == 'boolean': self.to_bool(row[f], validate=True) + if spec == 'integer': self.to_int(row[f], validate=True) + if spec == 'number': self.to_float(row[f], validate=True) except ValidationError as e: - raise ValidationError('field %s has %s' %(f, e)) + raise ValidationError('field %s has %s' % (f, e)) return True - def validate_object(self, obj, missing=True, nonairr = True, context=None): + def validate_object(self, obj, missing=True, nonairr=True, context=None): """ Validate Repertoire object data against schema @@ -295,13 +325,12 @@ Raises: airr.ValidationError: raised if object fails validation. """ - # object has to be a dictionary - if not isinstance(obj, dict): + if not hasattr(obj, 'items'): if context is None: raise ValidationError('object is not a dictionary') else: - raise ValidationError('field %s is not a dictionary object' %(context)) + raise ValidationError('field "%s" is not a dictionary object' % context) # first warn about non-AIRR fields if nonairr: @@ -333,16 +362,20 @@ # check MiAIRR keys exist if xairr and xairr.get('miairr'): if is_missing_key: - raise ValidationError('MiAIRR field %s is missing' %(full_field)) + raise ValidationError('MiAIRR field "%s" is missing' % full_field) # check if required field if f in self.required and is_missing_key: - raise ValidationError('Required field %s is missing' %(full_field)) + raise ValidationError('Required field "%s" is missing' % full_field) # check if identifier field if xairr and xairr.get('identifier'): if is_missing_key: - raise ValidationError('Identifier field %s is missing' %(full_field)) + if xairr.get('nullable'): + sys.stderr.write( + 'Warning: Nullable identifier field "%s" is missing.\n' % full_field) + else: + raise ValidationError('Not-nullable identifier field "%s" is missing' % full_field) # check nullable requirements if is_null: @@ -354,7 +387,7 @@ continue else: # nullable not allowed - raise 
ValidationError('Non-nullable field %s is null or missing' %(full_field)) + raise ValidationError('Non-nullable field "%s" is null or missing' % full_field) # if get to here, field should exist with non null value @@ -364,16 +397,16 @@ # for referenced object, recursively call validate with object and schema if spec.get('$ref') is not None: schema_name = spec['$ref'].split('/')[-1] - if CachedSchema.get(schema_name): - schema = CachedSchema[schema_name] + if AIRRSchema.get(schema_name): + schema = AIRRSchema[schema_name] else: schema = Schema(schema_name) schema.validate_object(obj[f], missing, nonairr, full_field) else: - raise ValidationError('Internal error: field %s in schema not handled by validation. File a bug report.' %(full_field)) + raise ValidationError('Internal error: field "%s" in schema not handled by validation. File a bug report.' % full_field) elif field_type == 'array': if not isinstance(obj[f], list): - raise ValidationError('field %s is not an array' %(full_field)) + raise ValidationError('field "%s" is not an array' % full_field) # for array, check each object in it for row in obj[f]: @@ -385,70 +418,147 @@ for s in spec['items']['allOf']: if s.get('$ref') is not None: schema_name = s['$ref'].split('/')[-1] - if CachedSchema.get(schema_name): - schema = CachedSchema[schema_name] + if AIRRSchema.get(schema_name): + schema = AIRRSchema[schema_name] else: schema = Schema(schema_name) schema.validate_object(row, missing, False, full_field) elif spec['items'].get('enum') is not None: if row not in spec['items']['enum']: - raise ValidationError('field %s has value "%s" not among possible enumeration values' %(full_field, row)) + raise ValidationError('field "%s" has value "%s" not among possible enumeration values' % (full_field, row)) elif spec['items'].get('type') == 'string': if not isinstance(row, str): - raise ValidationError('array field %s does not have string type: %s' %(full_field, row)) + raise ValidationError('array field "%s" does not 
have string type: %s' % (full_field, row)) elif spec['items'].get('type') == 'boolean': if not isinstance(row, bool): - raise ValidationError('array field %s does not have boolean type: %s' %(full_field, row)) + raise ValidationError('array field "%s" does not have boolean type: %s' % (full_field, row)) elif spec['items'].get('type') == 'integer': if not isinstance(row, int): - raise ValidationError('array field %s does not have integer type: %s' %(full_field, row)) + raise ValidationError('array field "%s" does not have integer type: %s' % (full_field, row)) elif spec['items'].get('type') == 'number': if not isinstance(row, float) and not isinstance(row, int): - raise ValidationError('array field %s does not have number type: %s' %(full_field, row)) + raise ValidationError('array field "%s" does not have number type: %s' % (full_field, row)) + elif spec['items'].get('type') == 'object': + sub_schema = Schema({'properties': spec['items'].get('properties')}) + sub_schema.validate_object(row, missing, nonairr, context) else: - raise ValidationError('Internal error: array field %s in schema not handled by validation. File a bug report.' %(full_field)) + raise ValidationError('Internal error: array field "%s" in schema not handled by validation. File a bug report.' % full_field) elif field_type == 'object': # right now all arrays of objects use $ref - raise ValidationError('Internal error: field %s in schema not handled by validation. File a bug report.' %(full_field)) + raise ValidationError('Internal error: field "%s" in schema not handled by validation. File a bug report.' 
% full_field) else: # check basic types if field_type == 'string': if not isinstance(obj[f], str): - raise ValidationError('Field %s does not have string type: %s' %(full_field, obj[f])) + raise ValidationError('Field "%s" does not have string type: %s' % (full_field, obj[f])) elif field_type == 'boolean': if not isinstance(obj[f], bool): - raise ValidationError('Field %s does not have boolean type: %s' %(full_field, obj[f])) + raise ValidationError('Field "%s" does not have boolean type: %s' % (full_field, obj[f])) elif field_type == 'integer': if not isinstance(obj[f], int): - raise ValidationError('Field %s does not have integer type: %s' %(full_field, obj[f])) + raise ValidationError('Field "%s" does not have integer type: %s' % (full_field, obj[f])) elif field_type == 'number': if not isinstance(obj[f], float) and not isinstance(obj[f], int): - raise ValidationError('Field %s does not have number type: %s' %(full_field, obj[f])) + raise ValidationError('Field "%s" does not have number type: %s' % (full_field, obj[f])) else: - raise ValidationError('Internal error: Field %s with type %s in schema not handled by validation. File a bug report.' %(full_field, field_type)) + raise ValidationError('Internal error: Field "%s" with type %s in schema not handled by validation. File a bug report.' % (full_field, field_type)) + + # check basic types enums + enums = spec.get('enum') + + if enums is not None: + field_value = obj[f] + if field_value not in enums: + raise ValidationError( + 'field "%s" has value "%s" not among possible enumeration values %s' % (full_field, field_value, enums) + ) return True + def template(self): + """ + Create an empty template object + + Returns: + collections.OrderedDict: dictionary with all schema properties set as None or an empty list. 
+ """ + # Set defaults for each data type + type_default = {'boolean': False, 'integer': 0, 'number': 0.0, 'string': '', 'array':[]} + + # Fetch schema template definition for a $ref string + def _reference(ref): + x = ref.split('/')[-1] + schema = AIRRSchema.get(x, Schema(x)) + return(schema.template()) + + # Get default value + def _default(spec): + if 'nullable' in spec['x-airr'] and not spec['x-airr']['nullable']: + if 'enum' in spec: + return spec['enum'][0] + else: + return type_default.get(spec['type'], None) + else: + return None + + # Populate empty object + object = OrderedDict() + for k, spec in self.properties.items(): + # Skip deprecated + if 'x-airr' in spec and spec['x-airr'].get('deprecated', False): + continue + + # Population values + if '$ref' in spec: + object[k] = _reference(spec['$ref']) + elif spec['type'] == 'array': + if '$ref' in spec['items']: + object[k] = [_reference(spec['items']['$ref'])] + else: + object[k] = [] + elif 'x-airr' in spec: + object[k] = _default(spec) + else: + object[k] = None + + return(object) + # Preloaded schema -CachedSchema = { +AIRRSchema = { + 'Info': Schema('InfoObject'), + 'DataFile': Schema('DataFile'), 'Alignment': Schema('Alignment'), 'Rearrangement': Schema('Rearrangement'), 'Repertoire': Schema('Repertoire'), + 'RepertoireGroup': Schema('RepertoireGroup'), 'Ontology': Schema('Ontology'), 'Study': Schema('Study'), 'Subject': Schema('Subject'), 'Diagnosis': Schema('Diagnosis'), + 'SampleProcessing': Schema('SampleProcessing'), 'CellProcessing': Schema('CellProcessing'), 'PCRTarget': Schema('PCRTarget'), 'NucleicAcidProcessing': Schema('NucleicAcidProcessing'), 'SequencingRun': Schema('SequencingRun'), - 'RawSequenceData': Schema('RawSequenceData'), + 'SequencingData': Schema('SequencingData'), 'DataProcessing': Schema('DataProcessing'), - 'SampleProcessing': Schema('SampleProcessing') + 'GermlineSet': Schema('GermlineSet'), + 'Acknowledgement': Schema('Acknowledgement'), + 'RearrangedSequence': 
Schema('RearrangedSequence'), + 'UnrearrangedSequence': Schema('UnrearrangedSequence'), + 'SequenceDelineationV': Schema('SequenceDelineationV'), + 'AlleleDescription': Schema('AlleleDescription'), + 'GenotypeSet': Schema('GenotypeSet'), + 'Genotype': Schema('Genotype'), + 'Cell': Schema('Cell'), + 'Clone': Schema('Clone') } -AlignmentSchema = CachedSchema['Alignment'] -RearrangementSchema = CachedSchema['Rearrangement'] -RepertoireSchema = CachedSchema['Repertoire'] - +InfoSchema = AIRRSchema['Info'] +DataFileSchema = AIRRSchema['DataFile'] +AlignmentSchema = AIRRSchema['Alignment'] +RearrangementSchema = AIRRSchema['Rearrangement'] +RepertoireSchema = AIRRSchema['Repertoire'] +GermlineSetSchema = AIRRSchema['GermlineSet'] +GenotypeSetSchema = AIRRSchema['GenotypeSet'] diff -Nru python-airr-1.3.1/airr/specs/airr-schema.yaml python-airr-1.5.0/airr/specs/airr-schema.yaml --- python-airr-1.3.1/airr/specs/airr-schema.yaml 2020-09-14 16:38:36.000000000 +0000 +++ python-airr-1.5.0/airr/specs/airr-schema.yaml 2023-08-28 16:50:49.000000000 +0000 @@ -4,7 +4,7 @@ Info: title: AIRR Schema description: Schema definitions for AIRR standards objects - version: "1.3" + version: 1.4 contact: name: AIRR Community url: https://github.com/airr-community @@ -16,7 +16,6 @@ # Properties that are based upon an ontology use this # standard schema definition Ontology: - discriminator: AIRR type: object properties: id: @@ -26,35 +25,216 @@ type: string description: Label of the concept in the respective ontology +# Map to expand CURIE prefixes to full IRIs +CURIEMap: + ABREG: + type: identifier + default: + map: ABREG + map: + ABREG: + iri_prefix: "http://antibodyregistry.org/AB_" + CHEBI: + type: ontology + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/CHEBI_" + CL: + type: ontology + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/CL_" + DOI: + type: identifier + default: + map: DOI + map: + 
DOI: + iri_prefix: "https://doi.org/" + DOID: + type: ontology + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/DOID_" + ENA: + type: identifier + default: + map: ENA + map: + ENA: + iri_prefix: "https://www.ebi.ac.uk/ena/browser/view/" + ENSG: + type: identifier + default: + map: ENSG + map: + ENSG: + iri_prefix: "https://www.ensembl.org/Multi/Search/Results?q=" + IEDB_RECEPTOR: + type: identifier + default: + map: IEDB + provider: IEDB + map: + IEDB: + iri_prefix: "https://www.iedb.org/receptor/" + MRO: + type: ontology + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/MRO_" + NCBITAXON: + type: taxonomy + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/NCBITaxon_" + BioPortal: + iri_prefix: "http://purl.bioontology.org/ontology/NCBITAXON/" + NCIT: + type: ontology + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/NCIT_" + ORCID: + type: catalog + default: + map: ORCID + provider: ORCID + map: + ORCID: + iri_prefix: "https://orcid.org/" + ROR: + type: catalog + default: + map: ROR + provider: ROR + map: + ROR: + iri_prefix: "https://ror.org/" + SRA: + type: identifier + default: + map: SRA + map: + SRA: + iri_prefix: "https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=" + UBERON: + type: ontology + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/UBERON_" + UNIPROT: + type: identifier + default: + map: UNIPROT + map: + UniProt: + iri_prefix: "http://purl.uniprot.org/uniprot/" + UO: + type: ontology + default: + map: OBO + provider: OLS + map: + OBO: + iri_prefix: "http://purl.obolibrary.org/obo/UO_" -CURIEResolution: - - - curie_prefix: NCBITAXON - iri_prefix: - - "http://purl.obolibrary.org/obo/NCBITaxon_" - - "http://purl.bioontology.org/ontology/NCBITAXON/" - - - curie_prefix: NCIT - iri_prefix: - - 
"http://purl.obolibrary.org/obo/NCIT_" - - "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#" - - - curie_prefix: UO - iri_prefix: - - "http://purl.obolibrary.org/obo/UO_" - - - curie_prefix: DOID - iri_prefix: - - "http://purl.obolibrary.org/obo/DOID_" - - - curie_prefix: UBERON - iri_prefix: - - "http://purl.obolibrary.org/obo/UBERON_" - - - curie_prefix: CL - iri_prefix: - - "http://purl.obolibrary.org/obo/CL_" - +InformationProvider: + provider: + ENA: + request: + url: "{iri}" + response: text/html + IEDB: + request: + url: "https://query-api.iedb.org/tcr_search?receptor_group_id=eq.{local_id}" + response: application/json + OLS: + request: + url: "https://www.ebi.ac.uk/ols/api/ontologies/{ontology_id}/terms?iri={iri}" + response: application/json + Ontobee: + request: + url: "http://www.ontobee.org/ontology/rdf/{ontology_id}?iri={iri}" + response: application/rdf+xml + ORCID: + request: + url: "https://pub.orcid.org/v2.1/{local_id}" + header: + Accept: application/json + response: application/json + ROR: + request: + url: "https://api.ror.org/organizations/{iri}" + response: application/json + SRA: + request: + url: "{iri}" + response: text/html + parameter: + CHEBI: + Ontobee: + ontology_id: CHEBI + OLS: + ontology_id: chebi + CL: + Ontobee: + ontology_id: CL + OLS: + ontology_id: cl + DOID: + Ontobee: + ontology_id: DOID + OLS: + ontology_id: doid + MRO: + Ontobee: + ontology_id: MRO + OLS: + ontology_id: mro + NCBITAXON: + Ontobee: + ontology_id: NCBITaxon + OLS: + ontology_id: ncbitaxon + BioPortal: + ontology_id: NCBITAXON + NCIT: + Ontobee: + ontology_id: NCIT + OLS: + ontology_id: ncit + UBERON: + Ontobee: + ontology_id: UBERON + OLS: + ontology_id: uberon + UO: + Ontobee: + ontology_id: UO + OLS: + ontology_id: uo # AIRR specification extensions # @@ -64,7 +244,6 @@ # attributes are attached to an AIRR field with the x-airr property. 
Attributes: - discriminator: AIRR type: object properties: miairr: @@ -74,7 +253,7 @@ - essential - important - defined - default: useful + default: defined identifier: type: boolean description: > @@ -117,8 +296,9 @@ description: Field format. If null then assume the full range of the field data type enum: - ontology - - controlled vocabulary - - physical quantity + - controlled_vocabulary + - physical_quantity + - CURIE ontology: type: object description: Ontology definition for field @@ -139,10 +319,1162 @@ type: string description: Ontology name for the top node term +# AIRR Data File +# +# A JSON data file that holds Repertoire metadata, data processing +# analysis objects, or any object in the AIRR Data Model. +# +# It is presumed that the objects gathered together in an AIRR Data File are related +# or relevant to each other, e.g. part of the same study; thus, the ID fields can be +# internally resolved unless the ID contains an external PID. This implies that AIRR +# Data Files cannot be merged simply by concatenating arrays; any merge program +# would need to manage duplicate or conflicting ID values. +# +# While the properties in an AIRR Data File are not required, if one is provided then +# the value should not be null. 
+ +DataFile: + type: object + properties: + Info: + $ref: '#/InfoObject' + x-airr: + nullable: false + Repertoire: + type: array + description: List of repertoires + items: + $ref: '#/Repertoire' + x-airr: + nullable: false + RepertoireGroup: + type: array + description: List of repertoire collections + items: + $ref: '#/RepertoireGroup' + x-airr: + nullable: false + Rearrangement: + type: array + description: List of rearrangement records + items: + $ref: '#/Rearrangement' + x-airr: + nullable: false + Cell: + type: array + description: List of cells + items: + $ref: '#/Cell' + x-airr: + nullable: false + Clone: + type: array + description: List of clones + items: + $ref: '#/Clone' + x-airr: + nullable: false + GermlineSet: + type: array + description: List of germline sets + items: + $ref: '#/GermlineSet' + x-airr: + nullable: false + GenotypeSet: + type: array + description: List of genotype sets + items: + $ref: '#/GenotypeSet' + x-airr: + nullable: false + +# AIRR Info object, should be similar to openapi +# should we point to an openapi schema? +InfoObject: + type: object + description: Provides information about data and API responses. + required: + - title + - version + properties: + title: + type: string + x-airr: + nullable: false + version: + type: string + x-airr: + nullable: false + description: + type: string + contact: + type: object + properties: + name: + type: string + url: + type: string + email: + type: string + license: + type: object + required: + - name + properties: + name: + type: string + x-airr: + nullable: false + url: + type: string + +# A time point +TimePoint: + description: Time point at which an observation or other action was performed. 
+ type: object + properties: + label: + type: string + description: Informative label for the time point + example: Pre-operative sampling of cancer tissue + x-airr: + nullable: true + adc-query-support: true + value: + type: number + description: Value of the time point + example: -5.0 + x-airr: + nullable: true + adc-query-support: true + unit: + $ref: '#/Ontology' + description: Unit of the time point + title: Unit of immunization schedule + example: + id: UO:0000033 + label: day + x-airr: + nullable: true + adc-query-support: true + format: ontology + ontology: + draft: false + top_node: + id: UO:0000003 + label: time unit + +# +# General objects +# TODO: link to global schema with JSON-LD? +# + +# An individual +Acknowledgement: + description: Individual whose contribution to this work should be acknowledged + type: object + required: + - acknowledgement_id + - name + - institution_name + properties: + acknowledgement_id: + type: string + description: unique identifier of this Acknowledgement within the file + x-airr: + identifier: true + miairr: important + name: + type: string + description: Full name of individual + institution_name: + type: string + description: Individual's department and institution name + orcid_id: + type: string + description: Individual's ORCID identifier + +# +# Germline gene schema +# + +# Rearranged and genomic germline sequences +RearrangedSequence: + type: object + description: > + Details of a directly observed rearranged sequence or an inference from rearranged sequences + contributing support for a gene or allele. + required: + - sequence_id + - sequence + - derivation + - observation_type + - repository_name + - repository_id + - deposited_version + - seq_start + - seq_end + properties: + sequence_id: + type: string + description: > + Unique identifier of this RearrangedSequence within the file, typically generated by the repository + hosting the schema, for example from the underlying ID of the database record. 
+ x-airr: + identifier: true + miairr: important + sequence: + type: string + description: nucleotide sequence + x-airr: + miairr: essential + nullable: false + derivation: + type: string + enum: + - DNA + - RNA + - null + description: The class of nucleic acid that was used as primary starting material + x-airr: + miairr: important + nullable: true + observation_type: + type: string + enum: + - direct_sequencing + - inference_from_repertoire + description: > + The type of observation from which this sequence was drawn, such as direct sequencing or + inference from repertoire sequencing data. + x-airr: + miairr: essential + nullable: false + curation: + type: string + description: Curational notes on the sequence + repository_name: + type: string + description: Name of the repository in which the sequence has been deposited + x-airr: + miairr: defined + repository_ref: + type: string + description: Queryable id or accession number of the sequence published by the repository + x-airr: + miairr: defined + deposited_version: + type: string + description: Version number of the sequence within the repository + x-airr: + miairr: defined + sequence_start: + type: integer + description: Start co-ordinate of the sequence detailed in this record, within the sequence deposited + x-airr: + miairr: essential + nullable: false + sequence_end: + type: integer + description: End co-ordinate of the sequence detailed in this record, within the sequence deposited + x-airr: + miairr: essential + nullable: false + +UnrearrangedSequence: + description: Details of an unrearranged sequence contributing support for a gene or allele + type: object + required: + - sequence_id + - sequence + - repository_name + - assembly_id + - gff_seqid + - gff_start + - gff_end + - strand + properties: + sequence_id: + type: string + description: unique identifier of this UnrearrangedSequence within the file + x-airr: + identifier: true + miairr: important + sequence: + type: string + description: > + 
+ Sequence of interest described in this record. Typically, this will include gene and promoter region. + x-airr: + miairr: essential + nullable: false + curation: + type: string + description: Curational notes on the sequence + repository_name: + type: string + description: Name of the repository in which the assembly or contig is deposited + x-airr: + miairr: defined + repository_ref: + type: string + description: Queryable id or accession number of the sequence published by the repository + x-airr: + miairr: defined + patch_no: + type: string + description: Genome assembly patch number in which this gene was determined + gff_seqid: + type: string + description: > + Sequence (from the assembly) of a window including the gene and preferably also the promoter region. + gff_start: + type: integer + description: > + Genomic co-ordinates of the start of the sequence of interest described in this record in + Ensembl GFF version 3. + gff_end: + type: integer + description: > + Genomic co-ordinates of the end of the sequence of interest described in this record in + Ensembl GFF version 3. + strand: + type: string + enum: + - "+" + - "-" + - null + description: sense (+ or -) + x-airr: + nullable: true + +# V gene delineation +SequenceDelineationV: + description: Delineation of a V-gene in a particular system + type: object + required: + - sequence_delineation_id + - delineation_scheme + - fwr1_start + - fwr1_end + - cdr1_start + - cdr1_end + - fwr2_start + - fwr2_end + - cdr2_start + - cdr2_end + - fwr3_start + - fwr3_end + - cdr3_start + properties: + sequence_delineation_id: + type: string + description: > + Unique identifier of this SequenceDelineationV within the file. Typically, generated by the + repository hosting the record. 
+ x-airr: + identifier: true + miairr: important + delineation_scheme: + type: string + description: Name of the delineation scheme + example: Chothia + x-airr: + miairr: important + unaligned_sequence: + type: string + x-airr: + miairr: important + description: entire V-sequence covered by this delineation + aligned_sequence: + type: string + description: > + Aligned sequence if this delineation provides an alignment. An aligned sequence should always be + provided for IMGT delineations. + fwr1_start: + type: integer + description: FWR1 start co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + fwr1_end: + type: integer + description: FWR1 end co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + cdr1_start: + type: integer + description: CDR1 start co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + cdr1_end: + type: integer + description: CDR1 end co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + fwr2_start: + type: integer + description: FWR2 start co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + fwr2_end: + type: integer + description: FWR2 end co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + cdr2_start: + type: integer + description: CDR2 start co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + cdr2_end: + type: integer + description: CDR2 end co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + fwr3_start: + type: integer + description: FWR3 start co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + fwr3_end: + type: integer + description: FWR3 end co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + cdr3_start: + type: integer + description: CDR3 start co-ordinate in the 'unaligned sequence' field + x-airr: + miairr: important + alignment_labels: + type: array + items: + type: string + description: 
> + One string for each codon in the aligned_sequence indicating the label of that codon according to + the numbering of the delineation scheme if it provides one. + +# Description of a putative or confirmed Ig receptor gene/allele +AlleleDescription: + description: Details of a putative or confirmed Ig receptor gene/allele inferred from one or more observations + type: object + required: + - allele_description_id + - maintainer + - lab_address + - release_version + - release_date + - release_description + - sequence + - coding_sequence + - locus + - sequence_type + - functional + - inference_type + - species + properties: + allele_description_id: + type: string + description: > + Unique identifier of this AlleleDescription within the file. Typically, generated by the + repository hosting the record. + x-airr: + identifier: true + miairr: important + allele_description_ref: + type: string + description: Unique reference to the allele description, in standardized form (Repo:Label:Version) + example: OGRDB:Human_IGH:IGHV1-69*01.001 + x-airr: + miairr: important + maintainer: + type: string + description: Maintainer of this sequence record + x-airr: + miairr: defined + acknowledgements: + type: array + description: List of individuals whose contribution to the gene description should be acknowledged + items: + $ref: '#/Acknowledgement' + lab_address: + type: string + description: Institution and full address of corresponding author + x-airr: + miairr: defined + release_version: + type: integer + description: Version number of this record, updated whenever a revised version is published or released + x-airr: + miairr: important + release_date: + type: string + format: date-time + description: Date of this release + title: Release Date + example: "2021-02-02" + x-airr: + miairr: important + release_description: + type: string + description: Brief descriptive notes of the reason for this release and the changes embodied + x-airr: + miairr: important + label: + type: 
string + description: > + The accepted name for this gene or allele following the relevant nomenclature. + The value in this field should correspond to values in acceptable name fields of other schemas, + such as v_call, d_call, and j_call fields. + example: IGHV1-69*01 + x-airr: + miairr: important + sequence: + type: string + description: > + Nucleotide sequence of the gene. This should cover the full length that is available, + including where possible RSS, and 5' UTR and lead-in for V-gene sequences. + x-airr: + miairr: essential + nullable: false + coding_sequence: + type: string + description: > + Nucleotide sequence of the core coding region, such as the coding region of a D-, J- or C- gene + or the coding region of a V-gene excluding the leader. + x-airr: + miairr: important + aliases: + type: array + items: + type: string + description: Alternative names for this sequence + locus: + type: string + enum: + - IGH + - IGI + - IGK + - IGL + - TRA + - TRB + - TRG + - TRD + description: Gene locus + x-airr: + miairr: essential + nullable: false + chromosome: + type: integer + description: chromosome on which the gene is located + sequence_type: + type: string + enum: + - V + - D + - J + - C + description: Sequence type (V, D, J, C) + x-airr: + miairr: essential + nullable: false + functional: + type: boolean + description: True if the gene is functional, false if it is a pseudogene + x-airr: + miairr: important + inference_type: + type: string + enum: + - genomic_and_rearranged + - genomic_only + - rearranged_only + - null + description: Type of inference(s) from which this gene sequence was inferred + x-airr: + miairr: important + nullable: true + species: + $ref: '#/Ontology' + description: Binomial designation of subject's species + title: Organism + example: + id: NCBITAXON:9606 + label: Homo sapiens + x-airr: + miairr: essential + nullable: false + species_subgroup: + type: string + description: Race, strain or other species subgroup to which this subject 
belongs + example: BALB/c + species_subgroup_type: + type: string + enum: + - breed + - strain + - inbred + - outbred + - locational + - null + x-airr: + nullable: true + status: + type: string + enum: + - active + - draft + - retired + - withdrawn + - null + description: Status of record, assumed active if the field is not present + x-airr: + nullable: true + subgroup_designation: + type: string + description: Identifier of the gene subgroup or clade, as (and if) defined + gene_designation: + type: string + description: Gene number or other identifier, as (and if) defined + allele_designation: + type: string + description: Allele number or other identifier, as (and if) defined + allele_similarity_cluster_designation: + type: string + description: ID of the similarity cluster used in this germline set, if designated + allele_similarity_cluster_member_id: + type: string + description: Membership ID of the allele within the similarity cluster, if a cluster is designated + j_codon_frame: + type: integer + enum: + - 1 + - 2 + - 3 + - null + description: > + Codon position of the first nucleotide in the 'coding_sequence' field. Mandatory for J genes. + Not used for V or D genes. '1' means the sequence is in-frame, '2' means that the first bp is + missing from the first codon, and '3' means that the first 2 bp are missing. + x-airr: + nullable: true + gene_start: + type: integer + description: > + Co-ordinate in the sequence field of the first nucleotide in the coding_sequence field. + x-airr: + miairr: important + gene_end: + type: integer + description: > + Co-ordinate in the sequence field of the last gene-coding nucleotide in the coding_sequence field. + x-airr: + miairr: important + utr_5_prime_start: + type: integer + description: Start co-ordinate in the sequence field of the 5 prime UTR (V-genes only). + utr_5_prime_end: + type: integer + description: End co-ordinate in the sequence field of the 5 prime UTR (V-genes only). 
+ leader_1_start: + type: integer + description: Start co-ordinate in the sequence field of L-PART1 (V-genes only). + leader_1_end: + type: integer + description: End co-ordinate in the sequence field of L-PART1 (V-genes only). + leader_2_start: + type: integer + description: Start co-ordinate in the sequence field of L-PART2 (V-genes only). + leader_2_end: + type: integer + description: End co-ordinate in the sequence field of L-PART2 (V-genes only). + v_rs_start: + type: integer + description: Start co-ordinate in the sequence field of the V recombination site (V-genes only). + v_rs_end: + type: integer + description: End co-ordinate in the sequence field of the V recombination site (V-genes only). + d_rs_3_prime_start: + type: integer + description: Start co-ordinate in the sequence field of the 3 prime D recombination site (D-genes only). + d_rs_3_prime_end: + type: integer + description: End co-ordinate in the sequence field of the 3 prime D recombination site (D-genes only). + d_rs_5_prime_start: + type: integer + description: Start co-ordinate in the sequence field of the 5 prime D recombination site (D-genes only). + d_rs_5_prime_end: + type: integer + description: End co-ordinate in the sequence field of the 5 prime D recombination site (D-genes only). + j_cdr3_end: + type: integer + description: > + In the case of a J-gene, the co-ordinate in the sequence field of the first nucleotide of the + conserved PHE or TRP (IMGT codon position 118). + j_rs_start: + type: integer + description: Start co-ordinate in the sequence field of J recombination site (J-genes only). + j_rs_end: + type: integer + description: End co-ordinate in the sequence field of J recombination site (J-genes only). + j_donor_splice: + type: integer + description: Co-ordinate in the sequence field of the final 3' nucleotide of the J-REGION (J-genes only). 
+ v_gene_delineations: + type: array + items: + $ref: '#/SequenceDelineationV' + unrearranged_support: + type: array + items: + $ref: '#/UnrearrangedSequence' + rearranged_support: + type: array + items: + $ref: '#/RearrangedSequence' + paralogs: + type: array + items: + type: string + description: Gene symbols of any paralogs + curation: + type: string + description: > + Curational notes on the AlleleDescription. This can be used to give more extensive notes on the + decisions taken than are provided in the release_description. + curational_tags: + type: array + items: + type: string + enum: + - likely_truncated + - likely_full_length + description: Controlled-vocabulary tags applied to this description + x-airr: + nullable: true + +# Collection of gene descriptions into a germline set +GermlineSet: + type: object + description: > + A germline object set bringing together multiple AlleleDescriptions from the same strain or species. + All genes in a GermlineSet should be from a single locus. + required: + - germline_set_id + - author + - lab_name + - lab_address + - release_version + - release_description + - release_date + - germline_set_name + - germline_set_ref + - species + - locus + - allele_descriptions + properties: + germline_set_id: + type: string + description: > + Unique identifier of the GermlineSet within this file. Typically, generated by the + repository hosting the record. 
+ x-airr: + identifier: true + miairr: important + author: + type: string + description: Corresponding author + x-airr: + miairr: important + lab_name: + type: string + description: Department of corresponding author + x-airr: + miairr: important + lab_address: + type: string + description: Institutional address of corresponding author + x-airr: + miairr: important + acknowledgements: + type: array + description: List of individuals whose contribution to the germline set should be acknowledged + items: + $ref: '#/Acknowledgement' + release_version: + type: number + description: Version number of this record, allocated automatically + x-airr: + miairr: important + release_description: + type: string + description: Brief descriptive notes of the reason for this release and the changes embodied + x-airr: + miairr: important + release_date: + type: string + format: date-time + description: Date of this release + title: Release Date + example: "2021-02-02" + x-airr: + miairr: important + germline_set_name: + type: string + description: descriptive name of this germline set + x-airr: + miairr: important + germline_set_ref: + type: string + description: Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) + example: OGRDB:Human_IGH:2021.11 + x-airr: + miairr: important + pub_ids: + type: string + description: Publications describing the germline set + example: "PMID:85642,PMID:12345" + species: + $ref: '#/Ontology' + description: Binomial designation of subject's species + title: Organism + example: + id: NCBITAXON:9606 + label: Homo sapiens + x-airr: + miairr: essential + nullable: false + species_subgroup: + type: string + description: Race, strain or other species subgroup to which this subject belongs + example: BALB/c + species_subgroup_type: + type: string + enum: + - breed + - strain + - inbred + - outbred + - locational + - null + x-airr: + nullable: true + locus: + type: string + enum: + - IGH + - IGI + - IGK + - IGL + - 
TRA + - TRB + - TRG + - TRD + description: Gene locus + x-airr: + miairr: essential + nullable: false + allele_descriptions: + type: array + items: + $ref: '#/AlleleDescription' + description: list of allele_descriptions in the germline set + x-airr: + miairr: important + curation: + type: string + description: > + Curational notes on the GermlineSet. This can be used to give more extensive notes on the + decisions taken than are provided in the release_description. + +# +# Genotype schema +# + +# GenotypeSet lists the Genotypes (describing different loci) inferred for this subject + +GenotypeSet: + type: object + required: + - receptor_genotype_set_id + properties: + receptor_genotype_set_id: + type: string + description: > + A unique identifier for this Receptor Genotype Set, typically generated by the repository + hosting the schema, for example from the underlying ID of the database record. + x-airr: + identifier: true + miairr: important + genotype_class_list: + description: List of Genotypes included in this Receptor Genotype Set. + type: array + items: + $ref: '#/Genotype' + +# This enumerates the alleles and gene deletions inferred in a single subject. Included alleles may either be listed by reference to a GermlineSet, or +# listed as 'undocumented', in which case the inferred sequence is provided + +# Genotype of adaptive immune receptors +Genotype: + type: object + required: + - receptor_genotype_id + - locus + properties: + receptor_genotype_id: + type: string + description: > + A unique identifier within the file for this Receptor Genotype, typically generated by the + repository hosting the schema, for example from the underlying ID of the database record. 
+ x-airr: + identifier: true + miairr: important + locus: + type: string + enum: + - IGH + - IGI + - IGK + - IGL + - TRA + - TRB + - TRD + - TRG + description: Gene locus + example: IGH + x-airr: + miairr: essential + nullable: false + adc-query-support: true + format: controlled_vocabulary + documented_alleles: + type: array + description: List of alleles documented in reference set(s) + items: + $ref: '#/DocumentedAllele' + x-airr: + miairr: important + undocumented_alleles: + type: array + description: List of alleles inferred to be present and not documented in an identified GermlineSet + items: + $ref: '#/UndocumentedAllele' + x-airr: + adc-query-support: true + deleted_genes: + type: array + description: Array of genes identified as being deleted in this genotype + items: + $ref: '#/DeletedGene' + x-airr: + adc-query-support: true + inference_process: + type: string + enum: + - genomic_sequencing + - repertoire_sequencing + - null + description: Information on how the genotype was acquired. Controlled vocabulary. + title: Genotype acquisition process + example: repertoire_sequencing + x-airr: + adc-query-support: true + format: controlled_vocabulary + nullable: true + +# Documented Allele +# This describes a 'known' allele found in a genotype +# It is 'known' in the sense that it is documented in a reference set + +DocumentedAllele: + type: object + required: + - label + - germline_set_ref + properties: + label: + type: string + x-airr: + miairr: important + description: The accepted name for this allele, taken from the GermlineSet + germline_set_ref: + type: string + x-airr: + miairr: important + description: GermlineSet from which it was taken, referenced in standardized form (Repo:Label:Version) + example: OGRDB:Human_IGH:2021.11 + phasing: + type: integer + nullable: true + description: > + Chromosomal phasing indicator. Alleles with the same value are inferred to be located on the + same chromosome. 
+ +# Undocumented Allele +# This describes an 'undocumented' allele found in a genotype +# It is 'undocumented' in the sense that it was not found in reference sets consulted for the analysis + +UndocumentedAllele: + required: + - allele_name + - sequence + type: object + properties: + allele_name: + type: string + x-airr: + miairr: important + description: Allele name as allocated by the inference pipeline + sequence: + type: string + x-airr: + miairr: essential + nullable: false + description: nt sequence of the allele, as provided by the inference pipeline + phasing: + type: integer + nullable: true + description: > + Chromosomal phasing indicator. Alleles with the same value are inferred to be located on the + same chromosome. + +# Deleted Gene +# It is regarded as 'deleted' in the sense that it was not identified during inference of the genotype + +DeletedGene: + required: + - label + - germline_set_ref + type: object + properties: + label: + type: string + x-airr: + miairr: essential + nullable: false + description: The accepted name for this gene, taken from the GermlineSet + germline_set_ref: + type: string + x-airr: + miairr: important + description: GermlineSet from which it was taken (issuer/name/version) + phasing: + type: integer + nullable: true + description: > + Chromosomal phasing indicator. Alleles with the same value are inferred to be located on the + same chromosome. 
+ + +# List of MHCGenotypes describing a subject's genotype +MHCGenotypeSet: + type: object + required: + - mhc_genotype_set_id + - mhc_genotype_list + properties: + mhc_genotype_set_id: + type: string + description: A unique identifier for this MHCGenotypeSet + x-airr: + identifier: true + miairr: important + mhc_genotype_list: + description: List of MHCGenotypes included in this set + type: array + items: + $ref: '#/MHCGenotype' + x-airr: + miairr: important + +# Genotype of major histocompatibility complex (MHC) class I, class II and non-classical loci +MHCGenotype: + type: object + required: + - mhc_genotype_id + - mhc_class + - mhc_alleles + properties: + mhc_genotype_id: + type: string + description: A unique identifier for this MHCGenotype, assumed to be unique in the context of the study + x-airr: + identifier: true + miairr: important + mhc_class: + type: string + enum: + - MHC-I + - MHC-II + - MHC-nonclassical + description: Class of MHC alleles described by the MHCGenotype + example: MHC-I + x-airr: + miairr: essential + nullable: false + adc-query-support: true + format: controlled_vocabulary + mhc_alleles: + type: array + description: List of MHC alleles of the indicated mhc_class identified in an individual + items: + $ref: '#/MHCAllele' + x-airr: + miairr: important + adc-query-support: true + mhc_genotyping_method: + type: string + description: > + Information on how the genotype was determined. The content of this field should come from a list of + recommended terms provided in the AIRR Schema documentation. 
+ title: MHC genotyping method + example: pcr_low_resolution + x-airr: + miairr: important + adc-query-support: true + + +# Allele of an MHC gene +MHCAllele: + type: object + properties: + allele_designation: + type: string + description: > + The accepted designation of an allele, usually its gene symbol plus allele/sub-allele/etc + identifiers, if provided by the mhc_typing method + x-airr: + miairr: important + gene: + $ref: '#/Ontology' + description: The MHC gene to which the described allele belongs + title: MHC gene + example: + id: MRO:0000046 + label: HLA-A + x-airr: + miairr: important + adc-query-support: false + format: ontology + ontology: + draft: true + top_node: + id: MRO:0000004 + label: MHC gene + reference_set_ref: + type: string + description: Repository and list from which it was taken (issuer/name/version) + x-airr: + miairr: important + +# +# Repertoire metadata schema +# # The overall study with a globally unique study_id Study: - discriminator: AIRR type: object required: - study_id @@ -159,10 +1491,13 @@ properties: study_id: type: string - description: Unique ID assigned by study registry + description: > + Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database + Collaboration (INSDC) repositories. title: Study ID example: PRJNA001 x-airr: + identifier: true miairr: important nullable: true adc-query-support: true @@ -234,13 +1569,25 @@ set: 1 subset: study name: Grant funding agency + study_contact: + type: string + description: > + Full contact information of the contact persons for this study This should include an e-mail address + and a persistent identifier such as an ORCID ID. + title: Contact information (study) + example: Dr. P. Stibbons, p.stibbons@unseenu.edu, https://orcid.org/0000-0002-1825-0097 + x-airr: + nullable: true + adc-query-support: true + name: Contact information (study) collected_by: type: string description: > - Full contact information of the data collector, i.e. 
the person who is legally responsible for - data collection and release. This should include an e-mail address. + Full contact information of the data collector, i.e. the person who is legally responsible for data + collection and release. This should include an e-mail address and a persistent identifier such as an + ORCID ID. title: Contact information (data collection) - example: Dr. P. Stibbons, p.stibbons@unseenu.edu + example: Dr. P. Stibbons, p.stibbons@unseenu.edu, https://orcid.org/0000-0002-1825-0097 x-airr: miairr: important nullable: true @@ -275,10 +1622,11 @@ submitted_by: type: string description: > - Full contact information of the data depositor, i.e. the person submitting the data to a repository. - This is supposed to be a short-lived and technical role until the submission is relased. + Full contact information of the data depositor, i.e., the person submitting the data to a repository. + This should include an e-mail address and a persistent identifier such as an ORCID ID. This is + supposed to be a short-lived and technical role until the submission is relased. title: Contact information (data deposition) - example: Adrian Turnipseed, a.turnipseed@unseenu.edu + example: Adrian Turnipseed, a.turnipseed@unseenu.edu, https://orcid.org/0000-0002-1825-0097 x-airr: miairr: important nullable: true @@ -288,7 +1636,9 @@ name: Contact information (data deposition) pub_ids: type: string - description: Publications describing the rationale and/or outcome of the study + description: > + Publications describing the rationale and/or outcome of the study. 
Where ever possible, a persistent + identifier should be used such as a DOI or a Pubmed ID title: Relevant publications example: "PMID:85642" x-airr: @@ -304,14 +1654,23 @@ type: string enum: - contains_ig - - contains_tcr - - contains_single_cell + - contains_tr - contains_paired_chain - description: Keywords describing properties of one or more data sets in a study + - contains_schema_rearrangement + - contains_schema_clone + - contains_schema_cell + - contains_schema_receptor + description: > + Keywords describing properties of one or more data sets in a study. "contains_schema" keywords indicate that + the study contains data objects from the AIRR Schema of that type (Rearrangement, Clone, Cell, Receptor) while + the other keywords indicate that the study design considers the type of data indicated (e.g. it is possible to have + a study that "contains_paired_chain" but does not "contains_schema_cell"). title: Keywords for study example: - - contains_ig - - contains_paired_chain + - contains_ig + - contains_schema_rearrangement + - contains_schema_clone + - contains_schema_cell x-airr: miairr: important nullable: true @@ -319,12 +1678,43 @@ set: 1 subset: study name: Keywords for study - format: controlled vocabulary + format: controlled_vocabulary + adc_publish_date: + type: string + format: date-time + description: > + Date the study was first published in the AIRR Data Commons. + title: ADC Publish Date + example: "2021-02-02" + x-airr: + nullable: true + adc-query-support: true + name: ADC Publish Date + adc_update_date: + type: string + format: date-time + description: > + Date the study data was updated in the AIRR Data Commons. + title: ADC Update Date + example: "2021-02-02" + x-airr: + nullable: true + adc-query-support: true + name: ADC Update Date + +SubjectGenotype: + type: object + properties: + receptor_genotype_set: + $ref: '#/GenotypeSet' + description: Immune receptor genotype set for this subject. 
+ mhc_genotype_set: + $ref: '#/MHCGenotypeSet' + description: MHC genotype set for this subject. # 1-to-n relationship between a study and its subjects # subject_id is unique within a study Subject: - discriminator: AIRR type: object required: - subject_id @@ -344,10 +1734,13 @@ properties: subject_id: type: string - description: Subject ID assigned by submitter, unique within study + description: > + Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to + an INSDC or similar repository study should be used. title: Subject ID example: SUB856413 x-airr: + identifier: true miairr: important nullable: true adc-query-support: true @@ -401,8 +1794,7 @@ - pooled - hermaphrodite - intersex - - "not collected" - - "not applicable" + - null description: Biological sex of subject title: Sex example: female @@ -413,7 +1805,7 @@ set: 1 subset: subject name: Sex - format: controlled vocabulary + format: controlled_vocabulary age_min: type: number description: Specific age or lower boundary of age range. 
@@ -564,10 +1956,12 @@ x-airr: nullable: false adc-query-support: true + genotype: + $ref: '#/SubjectGenotype' + title: SubjectGenotype # 1-to-n relationship between a subject and its diagnoses Diagnosis: - discriminator: AIRR type: object required: - study_group_description @@ -623,7 +2017,7 @@ set: 1 subset: diagnosis and intervention name: Length of disease - format: physical quantity + format: physical_quantity disease_stage: type: string description: Stage of disease at current intervention @@ -688,7 +2082,6 @@ # 1-to-n relationship between a subject and its samples # sample_id is unique within a study Sample: - discriminator: AIRR type: object required: - sample_id @@ -697,15 +2090,19 @@ - anatomic_site - disease_state_sample - collection_time_point_relative + - collection_time_point_relative_unit - collection_time_point_reference - biomaterial_provider properties: sample_id: type: string - description: Sample ID assigned by submitter, unique within study + description: > + Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to + INSDC or similar repository study should be used. 
title: Biological sample ID example: SUP52415 x-airr: + identifier: true miairr: important nullable: true adc-query-support: true @@ -769,10 +2166,10 @@ subset: sample name: Disease state of sample collection_time_point_relative: - type: string + type: number description: Time point at which sample was taken, relative to `Collection time event` title: Sample collection time - example: "14 d" + example: 14 x-airr: miairr: important nullable: true @@ -780,7 +2177,26 @@ set: 2 subset: sample name: Sample collection time - format: physical quantity + collection_time_point_relative_unit: + $ref: '#/Ontology' + description: Unit of Sample collection time + title: Sample collection time unit + example: + id: UO:0000033 + label: day + x-airr: + miairr: important + nullable: true + adc-query-support: true + set: 2 + subset: sample + name: Sample collection time unit + format: ontology + ontology: + draft: false + top_node: + id: UO:0000003 + label: time unit collection_time_point_reference: type: string description: Event in the study schedule to which `Sample collection time` relates to @@ -808,7 +2224,6 @@ # 1-to-n relationship between a sample and processing of its cells CellProcessing: - discriminator: AIRR type: object required: - tissue_processing @@ -870,8 +2285,8 @@ $ref: '#/Ontology' description: > Binomial designation of the species from which the analyzed cells originate. Typically, this value - should be identical to `species`, if which case it SHOULD NOT be set explicitly. Howver, there are - valid experimental setups in which the two might differ, e.g. chimeric animal models. If set, this + should be identical to `species`, in which case it SHOULD NOT be set explicitly. However, there are + valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the `species` information for all lower layers of the schema. 
title: Cell species example: @@ -979,7 +2394,6 @@ # object for PCR primer targets PCRTarget: - discriminator: AIRR type: object required: - pcr_target_locus @@ -997,6 +2411,7 @@ - TRB - TRD - TRG + - null description: > Designation of the target locus. Note that this field uses a controlled vocubulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to @@ -1010,7 +2425,7 @@ set: 3 subset: process (nucleic acid [pcr]) name: Target locus for PCR - format: controlled vocabulary + format: controlled_vocabulary forward_pcr_primer_target_location: type: string description: Position of the most distal nucleotide templated by the forward primer or primer mix @@ -1039,12 +2454,12 @@ # generally, a 1-to-1 relationship between a CellProcessing and processing of its nucleic acid # but may be 1-to-n for technical replicates. NucleicAcidProcessing: - discriminator: AIRR type: object required: - template_class - template_quality - template_amount + - template_amount_unit - library_generation_method - library_generation_protocol - library_generation_kit_version @@ -1067,7 +2482,7 @@ set: 3 subset: process (nucleic acid) name: Target substrate - format: controlled vocabulary + format: controlled_vocabulary template_quality: type: string description: Description and results of the quality control performed on the template material @@ -1081,10 +2496,10 @@ subset: process (nucleic acid) name: Target substrate quality template_amount: - type: string + type: number description: Amount of template that went into the process title: Template amount - example: 1000 ng + example: 1000 x-airr: miairr: important nullable: true @@ -1092,7 +2507,26 @@ set: 3 subset: process (nucleic acid) name: Template amount - format: physical quantity + template_amount_unit: + $ref: '#/Ontology' + description: Unit of template amount + title: Template amount time unit + example: + id: UO:0000024 + label: nanogram + x-airr: + miairr: important + 
nullable: true + adc-query-support: true + set: 3 + subset: process (nucleic acid) + name: Template amount time unit + format: ontology + ontology: + draft: false + top_node: + id: UO:0000002 + label: physical quantity library_generation_method: type: string enum: @@ -1118,7 +2552,7 @@ set: 3 subset: process (nucleic acid) name: Library generation method - format: controlled vocabulary + format: controlled_vocabulary library_generation_protocol: type: string description: Description of processes applied to substrate to obtain a library that is ready for sequencing @@ -1181,14 +2615,14 @@ set: 3 subset: process (nucleic acid) name: Complete sequences - format: controlled vocabulary + format: controlled_vocabulary physical_linkage: type: string enum: - none - - "hetero_head-head" - - "hetero_tail-head" - - "hetero_prelinked" + - hetero_head-head + - hetero_tail-head + - hetero_prelinked description: > In case an experimental setup is used that physically links nucleic acids derived from distinct `Rearrangements` before library preparation, this field describes the mode of that linkage. 
All @@ -1207,11 +2641,10 @@ set: 3 subset: process (nucleic acid) name: Physical linkage of different rearrangements - format: controlled vocabulary + format: controlled_vocabulary # 1-to-n relationship between a NucleicAcidProcessing and SequencingRun with resultant raw sequence file(s) SequencingRun: - discriminator: AIRR type: object required: - sequencing_run_id @@ -1227,6 +2660,7 @@ title: Batch number example: 160101_M01234 x-airr: + identifier: true miairr: important nullable: true adc-query-support: true @@ -1295,17 +2729,17 @@ subset: process (sequencing) name: Sequencing kit sequencing_files: - $ref: '#/RawSequenceData' + $ref: '#/SequencingData' description: Set of sequencing files produced by the sequencing run x-airr: nullable: false adc-query-support: true # Resultant raw sequencing files from a SequencingRun -RawSequenceData: - discriminator: AIRR +SequencingData: type: object required: + - sequencing_data_id - file_type - filename - read_direction @@ -1314,6 +2748,21 @@ - paired_read_direction - paired_read_length properties: + sequencing_data_id: + type: string + description: > + Persistent identifier of raw data stored in an archive (e.g. INSDC run ID). Data archive should + be identified in the CURIE prefix. + title: Raw sequencing data persistent identifier + example: "SRA:SRR11610494" + x-airr: + identifier: true + miairr: important + nullable: true + adc-query-support: true + set: 4 + subset: data (raw reads) + format: CURIE file_type: type: string description: File format for the raw reads or sequences @@ -1321,6 +2770,7 @@ enum: - fasta - fastq + - null x-airr: miairr: important nullable: true @@ -1328,7 +2778,7 @@ set: 4 subset: data (raw reads) name: Raw sequencing data file type - format: controlled vocabulary + format: controlled_vocabulary filename: type: string description: File name for the raw reads or sequences. The first file in paired-read sequencing. 
@@ -1350,6 +2800,7 @@ - forward - reverse - mixed + - null x-airr: miairr: important nullable: true @@ -1357,7 +2808,7 @@ set: 4 subset: data (raw reads) name: Read direction - format: controlled vocabulary + format: controlled_vocabulary read_length: type: integer description: Read length in bases for the first file in paired-read sequencing @@ -1368,7 +2819,7 @@ nullable: true adc-query-support: true set: 4 - subset: process (sequencing) + subset: data (raw reads) name: Forward read length paired_filename: type: string @@ -1391,6 +2842,7 @@ - forward - reverse - mixed + - null x-airr: miairr: important nullable: true @@ -1398,7 +2850,7 @@ set: 4 subset: data (raw reads) name: Paired read direction - format: controlled vocabulary + format: controlled_vocabulary paired_read_length: type: integer description: Read length in bases for the second file in paired-read sequencing @@ -1409,15 +2861,30 @@ nullable: true adc-query-support: true set: 4 - subset: process (sequencing) + subset: data (raw reads) name: Paired read length + index_filename: + type: string + description: File name for the index file + title: Sequencing index file name + example: MS10R-NMonson-C7JR9_S1_R3_001.fastq + x-airr: + nullable: true + adc-query-support: true + index_length: + type: integer + description: Read length in bases for the index file + title: Index read length + example: 8 + x-airr: + nullable: true + adc-query-support: true # 1-to-n relationship between a repertoire and data processing # # Set of annotated rearrangement sequences produced by # data processing upon the raw sequence data for a repertoire. DataProcessing: - discriminator: AIRR type: object required: - software_versions @@ -1433,10 +2900,10 @@ description: Identifier for the data processing object. 
title: Data processing ID x-airr: + identifier: true nullable: true name: Data processing ID adc-query-support: true - identifier: true primary_annotation: type: boolean default: false @@ -1475,7 +2942,7 @@ name: Paired read assembly quality_thresholds: type: string - description: How sequences were removed from (4) based on base quality scores + description: How/if sequences were removed from (4) based on base quality scores title: Quality thresholds example: Average Phred score >=20 x-airr: @@ -1547,6 +3014,13 @@ set: 5 subset: data (processed sequence) name: V(D)J germline reference database + germline_set_ref: + type: string + description: Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) + example: OGRDB:Human_IGH:2021.11 + x-airr: + nullable: true + adc-query-support: true analysis_provenance_id: type: string description: Identifier for machine-readable PROV model of analysis provenance @@ -1556,21 +3030,25 @@ adc-query-support: true SampleProcessing: - discriminator: AIRR - type: object - properties: - sample_processing_id: - type: string - description: > - Identifier for the sample processing object. This field should be unique within the repertoire. - This field can be used to uniquely identify the combination of sample, cell processing, - nucleic acid processing and sequencing run information for the repertoire. - title: Sample processing ID - x-airr: - nullable: true - name: Sample processing ID - adc-query-support: true - identifier: true + allOf: + - type: object + properties: + sample_processing_id: + type: string + description: > + Identifier for the sample processing object. This field should be unique within the repertoire. + This field can be used to uniquely identify the combination of sample, cell processing, + nucleic acid processing and sequencing run information for the repertoire. 
+ title: Sample processing ID + x-airr: + identifier: true + nullable: true + name: Sample processing ID + adc-query-support: true + - $ref: '#/Sample' + - $ref: '#/CellProcessing' + - $ref: '#/NucleicAcidProcessing' + - $ref: '#/SequencingRun' # The composite schema for the repertoire object @@ -1579,7 +3057,6 @@ # and experimentally observed by raw sequence data. A repertoire # can only be for one subject but may include multiple samples. Repertoire: - discriminator: AIRR type: object required: - study @@ -1629,14 +3106,9 @@ adc-query-support: true sample: type: array - description: List of Sample objects + description: List of Sample Processing objects items: - allOf: - - $ref: '#/SampleProcessing' - - $ref: '#/Sample' - - $ref: '#/CellProcessing' - - $ref: '#/NucleicAcidProcessing' - - $ref: '#/SequencingRun' + $ref: '#/SampleProcessing' x-airr: nullable: false adc-query-support: true @@ -1649,9 +3121,52 @@ nullable: false adc-query-support: true +# A collection of repertoires for analysis purposes, includes optional time course +RepertoireGroup: + type: object + required: + - repertoire_group_id + - repertoires + properties: + repertoire_group_id: + type: string + description: Identifier for this repertoire collection + x-airr: + identifier: true + repertoire_group_name: + type: string + description: Short display name for this repertoire collection + repertoire_group_description: + type: string + description: Repertoire collection description + repertoires: + type: array + description: > + List of repertoires in this collection with an associated description and time point designation + items: + type: object + properties: + repertoire_id: + type: string + description: Identifier to the repertoire + x-airr: + nullable: false + adc-query-support: true + repertoire_description: + type: string + description: Description of this repertoire within the group + x-airr: + nullable: true + adc-query-support: true + time_point: + $ref: '#/TimePoint' + description: Time 
point designation for this repertoire within the group + x-airr: + nullable: true + adc-query-support: true + Alignment: - discriminator: AIRR type: object required: - sequence_id @@ -1666,6 +3181,8 @@ Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. + x-airr: + identifier: true segment: type: string description: > @@ -1743,7 +3260,6 @@ # The extended rearrangement object Rearrangement: - discriminator: AIRR type: object required: - sequence_id @@ -1764,7 +3280,7 @@ sequence_id: type: string description: > - Unique query sequence identifier for the Rearrangment. Most often this will be the input sequence + Unique query sequence identifier for the Rearrangement. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. When downloaded from an AIRR Data Commons repository, this will usually be a universally unique @@ -1778,6 +3294,11 @@ The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. + quality: + type: string + description: > + The Sanger/Phred quality scores for assessment of sequence quality. + Phred quality scores from 0 to 93 are encoded using ASCII 33 to 126 (Used by Illumina from v1.8.) sequence_aa: type: string description: > @@ -1820,6 +3341,7 @@ - TRB - TRD - TRG + - null description: > Gene locus (chain type). 
Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific @@ -1830,7 +3352,7 @@ nullable: true adc-query-support: true name: Gene locus - format: controlled vocabulary + format: controlled_vocabulary v_call: type: string description: > @@ -1899,6 +3421,11 @@ Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. + quality_alignment: + type: string + description: > + Sanger/Phred quality scores for assessment of sequence_alignment quality. + Phred quality scores from 0 to 93 are encoded using ASCII 33 to 126 (Used by Illumina from v1.8.) sequence_alignment_aa: type: string description: > @@ -2200,6 +3727,32 @@ description: > End position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). + c_sequence_start: + type: integer + description: > + Start position of the C gene in the query sequence (1-based closed interval). + c_sequence_end: + type: integer + description: > + End position of the C gene in the query sequence (1-based closed interval). + c_germline_start: + type: integer + description: > + Alignment start position in the C gene reference sequence (1-based closed interval). + c_germline_end: + type: integer + description: > + Alignment end position in the C gene reference sequence (1-based closed interval). + c_alignment_start: + type: integer + description: > + Start position of the C gene alignment in both the sequence_alignment and germline_alignment + fields (1-based closed interval). + c_alignment_end: + type: integer + description: > + End position of the C gene alignment in both the sequence_alignment and germline_alignment + fields (1-based closed interval). 
cdr1_start: type: integer description: CDR1 start position in the query sequence (1-based closed interval). @@ -2386,18 +3939,37 @@ p5j_length: type: integer description: Number of palindromic nucleotides 5' of the J gene alignment. + v_frameshift: + type: boolean + description: > + True if the V gene in the query nucleotide sequence contains a translational + frameshift relative to the frame of the V gene reference sequence. + j_frameshift: + type: boolean + description: > + True if the J gene in the query nucleotide sequence contains a translational + frameshift relative to the frame of the J gene reference sequence. + d_frame: + type: integer + description: > + Numerical reading frame (1, 2, 3) of the first or only D gene in the query nucleotide sequence, + where frame 1 is relative to the first codon of D gene reference sequence. + d2_frame: + type: integer + description: > + Numerical reading frame (1, 2, 3) of the second D gene in the query nucleotide sequence, + where frame 1 is relative to the first codon of D gene reference sequence. consensus_count: type: integer description: > - Number of reads contributing to the (UMI) consensus for this sequence. + Number of reads contributing to the UMI consensus or contig assembly for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence. duplicate_count: type: integer description: > Copy number or number of duplicate observations for the query sequence. - For example, the number of UMIs sharing an identical sequence or the number - of identical observations of this sequence absent UMIs. + For example, the number of identical reads observed for this sequence. title: Read count example: 123 x-airr: @@ -2406,6 +3978,12 @@ set: 6 subset: data (processed sequence) name: Read count + umi_count: + type: integer + description: > + Number of distinct UMIs represented by this sequence. 
+ For example, the total number of UMIs that contribute to + the contig assembly for the query sequence. cell_id: type: string description: > @@ -2489,7 +4067,6 @@ # A unique inferred clone object that has been constructed within a single data processing # for a single repertoire and a subset of its sequences and/or rearrangements. Clone: - discriminator: AIRR type: object required: - clone_id @@ -2498,6 +4075,8 @@ clone_id: type: string description: Identifier for the clone. + x-airr: + identifier: true repertoire_id: type: string description: Identifier to the associated repertoire in study metadata. @@ -2592,16 +4171,24 @@ junction_end: type: integer description: Junction region end position in the alignment (1-based closed interval). - sequence_count: + umi_count: + type: integer + description: > + Number of distinct UMIs observed across all sequences (Rearrangement records) in this clone. + clone_count: type: integer - description: Number of Rearrangement records (sequences) included in this clone + description: > + Absolute count of the size (number of members) of this clone in the repertoire. + This could simply be the number of sequences (Rearrangement records) observed in this clone, + the number of distinct cell barcodes (unique cell_id values), + or a more sophisticated calculation appropriate to the experimental protocol. + Absolute count is provided versus a frequency so that downstream analysis tools can perform their own normalization. seed_id: type: string description: sequence_id of the seed sequence. Empty string (or null) if there is no seed sequence. # 1-to-n relationship for a clone to its trees. Tree: - discriminator: AIRR type: object required: - tree_id @@ -2611,6 +4198,8 @@ tree_id: type: string description: Identifier for the tree. + x-airr: + identifier: true clone_id: type: string description: Identifier for the clone. 
@@ -2625,7 +4214,6 @@ # 1-to-n relationship between a tree and its nodes Node: - discriminator: AIRR type: object required: - sequence_id @@ -2635,6 +4223,8 @@ description: > Identifier for this node that matches the identifier in the newick string and, where possible, the sequence_id in the source repertoire. + x-airr: + identifier: true sequence_alignment: type: string description: > @@ -2653,10 +4243,9 @@ # The cell object acts as point of reference for all data that can be related # to an individual cell, either by direct observation or inference. Cell: - discriminator: AIRR type: object required: - - cell_id #redefined cell_id > how to centralize it in the yaml + - cell_id - rearrangements - repertoire_id - virtual_pairing @@ -2668,6 +4257,7 @@ title: Cell index example: W06_046_091 x-airr: + identifier: true miairr: defined nullable: false adc-query-support: true @@ -2675,7 +4265,7 @@ rearrangements: type: array description: > - Array of sequence identifiers defined for the Rearrangement object + Array of sequence identifiers defined for the Rearrangement object title: Cell-associated rearrangements items: type: string @@ -2688,7 +4278,7 @@ receptors: type: array description: > - Array of receptor identifiers defined for the Receptor object + Array of receptor identifiers defined for the Receptor object title: Cell-associated receptors items: type: string @@ -2719,47 +4309,31 @@ expression_study_method: type: string enum: - flow cytometry - single-cell transcriptome + - flow_cytometry + - single-cell_transcriptome + - null description: > - keyword describing the methodology used to assess expression. This values for this field MUST come from a controlled vocabulary + Keyword describing the methodology used to assess expression. This values for this field MUST + come from a controlled vocabulary. 
x-airr: miairr: defined nullable: true - adc-api-optional: true + adc-query-support: true expression_raw_doi: type: string description: > - DOI of raw data set containing the current event + DOI of raw data set containing the current event x-airr: miairr: defined nullable: true - adc-api-optional: true + adc-query-support: true expression_index: type: string description: > - Index addressing the current event within the raw data set. + Index addressing the current event within the raw data set. x-airr: miairr: defined nullable: true - adc-api-optional: true - expression_tabular: - type: array - description: > - Expression definitions for single-cell - items: - type: object - properties: - expression_marker: - type: string - description: > - standardized designation of the transcript or epitope - example: CD27 - expression_value: - type: integer - description: > - transformed and normalized expression level. - example: 14567 virtual_pairing: type: boolean description: > @@ -2767,6 +4341,372 @@ title: Virtual pairing x-airr: miairr: defined - nullable: true # assuming only done for sc experiments, otherwise does not exist + nullable: true adc-query-support: true name: Virtual pairing + +# The CellExpression object acts as a container to hold a single expression level measurement from +# an experiment. Expression data is associated with a cell_id and the related repertoire_id and +# data_processing_id as cell_id is not guaranteed to be unique outside the data processing for +# a single repertoire. +CellExpression: + type: object + required: + - expression_id + - repertoire_id + - data_processing_id + - cell_id + - property + - property_type + - value + properties: + expression_id: + type: string + description: > + Identifier of this expression property measurement. 
+ title: Expression property measurement identifier + x-airr: + identifier: true + miairr: defined + nullable: false + adc-query-support: true + name: Expression measurement identifier + cell_id: + type: string + description: > + Identifier of the cell to which this expression data is related. + title: Cell identifier + example: W06_046_091 + x-airr: + miairr: defined + nullable: false + adc-query-support: true + name: Cell identifier + repertoire_id: + type: string + description: Identifier for the associated repertoire in study metadata. + title: Parental repertoire of cell + x-airr: + miairr: defined + nullable: true + adc-query-support: true + name: Parental repertoire of cell + data_processing_id: + type: string + description: Identifier of the data processing object in the repertoire metadata for this clone. + title: Data processing for cell + x-airr: + miairr: defined + nullable: true + adc-query-support: true + name: Data processing for cell + property_type: + type: string + description: > + Keyword describing the property type and detection method used to measure the property value. + The following keywords are recommended, but custom property types are also valid: + "mrna_expression_by_read_count", + "protein_expression_by_fluorescence_intensity", "antigen_bait_binding_by_fluorescence_intensity", + "protein_expression_by_dna_barcode_count" and "antigen_bait_binding_by_dna_barcode_count". + title: Property type and detection method + x-airr: + miairr: defined + nullable: false + adc-query-support: true + name: Property type and detection method + property: + $ref: '#/Ontology' + title: Property information + description: > + Name of the property observed, typically a gene or antibody identifier (and label) from a + canonical resource such as Ensembl (e.g. ENSG00000275747, IGHV3-79) or + Antibody Registry (ABREG:1236456, Purified anti-mouse/rat/human CD27 antibody). 
+ example: + id: ENSG:ENSG00000275747 + label: IGHV3-79 + x-airr: + miairr: defined + adc-query-support: true + format: ontology + name: Property information + value: + type: number + description: Level at which the property was observed in the experiment (non-normalized). + title: Property value + example: 3 + x-airr: + miairr: defined + nullable: true + adc-query-support: true + name: Property value + + +# The Receptor object hold information about a receptor and its reactivity. +# +Receptor: + type: object + required: + - receptor_id + - receptor_hash + - receptor_type + - receptor_variable_domain_1_aa + - receptor_variable_domain_1_locus + - receptor_variable_domain_2_aa + - receptor_variable_domain_2_locus + properties: + receptor_id: + type: string + description: ID of the current Receptor object, unique within the local repository. + title: Receptor ID + example: TCR-MM-012345 + x-airr: + identifier: true + nullable: false + adc-query-support: true + receptor_hash: + type: string + description: > + The SHA256 hash of the receptor amino acid sequence, calculated on the concatenated + ``receptor_variable_domain_*_aa`` sequences and represented as base16-encoded string. + title: Receptor hash ID + example: aa1c4b77a6f4927611ab39f5267415beaa0ba07a952c233d803b07e52261f026 + x-airr: + nullable: false + adc-query-support: true + receptor_type: + type: string + enum: + - Ig + - TCR + description: The top-level receptor type, either Immunoglobulin (Ig) or T Cell Receptor (TCR). + x-airr: + nullable: false + adc-query-support: true + receptor_variable_domain_1_aa: + type: string + description: > + Complete amino acid sequence of the mature variable domain of the Ig heavy, TCR beta or TCR delta chain. + The mature variable domain is defined as encompassing all AA from and including first AA after the the + signal peptide to and including the last AA that is completely encoded by the J gene. 
+ example: > + QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLEWIGRIDPNSGGTKYNEKFKSKATLTVDKPSSTAYMQLSSLTSEDSAVYYCARYDYYGSSYFDYWGQGTTLTVSS + x-airr: + nullable: false + adc-query-support: true + receptor_variable_domain_1_locus: + type: string + enum: + - IGH + - TRB + - TRD + description: Locus from which the variable domain in receptor_variable_domain_1_aa originates + example: IGH + x-airr: + nullable: false + adc-query-support: true + receptor_variable_domain_2_aa: + type: string + description: > + Complete amino acid sequence of the mature variable domain of the Ig light, TCR alpha or TCR gamma chain. + The mature variable domain is defined as encompassing all AA from and including first AA after the the + signal peptide to and including the last AA that is completely encoded by the J gene. + example: > + QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLFTGLIGGTNNRAPGVPARFSGSLIGDKAALTITGAQTEDEAIYFCALWYSNHWVFGGGTKLTVL + x-airr: + nullable: false + adc-query-support: true + receptor_variable_domain_2_locus: + type: string + enum: + - IGI + - IGK + - IGL + - TRA + - TRG + description: Locus from which the variable domain in receptor_variable_domain_2_aa originates + example: IGL + x-airr: + nullable: false + adc-query-support: true + receptor_ref: + type: array + description: Array of receptor identifiers defined for the Receptor object + title: Receptor cross-references + items: + type: string + example: ["IEDB_RECEPTOR:10"] + x-airr: + nullable: true + adc-query-support: true + reactivity_measurements: + type: array + description: Records of reactivity measurement + items: + $ref: '#/ReceptorReactivity' + x-airr: + nullable: true + + +ReceptorReactivity: + type: object + required: + - ligand_type + - antigen_type + - antigen + - reactivity_method + - reactivity_readout + - reactivity_value + - reactivity_unit + properties: + ligand_type: + type: string + enum: + - "MHC:peptide" + - "MHC:non-peptide" + - protein + - peptide + - non-peptidic + description: 
Classification of ligand binding to receptor + example: non-peptide + x-airr: + nullable: false + antigen_type: + type: string + enum: + - protein + - peptide + - non-peptidic + description: > + The type of antigen before processing by the immune system. + example: protein + x-airr: + nullable: false + antigen: + $ref: '#/Ontology' + description: > + The substance against which the receptor was tested. This can be any substance that + stimulates an adaptive immune response in the host, either through antibody production + or by T cell activation after presentation via an MHC molecule. + title: Antigen + example: + id: UNIPROT:P19597 + label: Circumsporozoite protein + x-airr: + nullable: false + adc-query-support: true + format: ontology + antigen_source_species: + $ref: '#/Ontology' + description: The species from which the antigen was isolated + title: Source species of antigen + example: + id: NCBITAXON:5843 + label: Plasmodium falciparum NF54 + x-airr: + nullable: true + format: ontology + ontology: + draft: true + top_node: + id: NCBITAXON:1 + label: root + peptide_start: + type: integer + description: Start position of the peptide within the reference protein sequence + x-airr: + nullable: true + peptide_end: + type: integer + description: End position of the peptide within the reference protein sequence + x-airr: + nullable: true + mhc_class: + type: string + enum: + - MHC-I + - MHC-II + - MHC-nonclassical + - null + description: Class of MHC molecule, only present for MHC:x ligand types + example: MHC-II + x-airr: + nullable: true + mhc_gene_1: + $ref: '#/Ontology' + description: The MHC gene to which the mhc_allele_1 belongs + title: MHC gene 1 + example: + id: MRO:0000055 + label: HLA-DRA + x-airr: + nullable: true + format: ontology + ontology: + draft: true + top_node: + id: MRO:0000004 + label: MHC gene + mhc_allele_1: + type: string + description: Allele designation of the MHC alpha chain + example: HLA-DRA + x-airr: + nullable: true + mhc_gene_2: + 
$ref: '#/Ontology' + description: The MHC gene to which the mhc_allele_2 belongs + title: MHC gene 2 + example: + id: MRO:0000057 + label: HLA-DRB1 + x-airr: + nullable: true + format: ontology + ontology: + draft: true + top_node: + id: MRO:0000004 + label: MHC gene + mhc_allele_2: + type: string + description: > + Allele designation of the MHC class II beta chain or the invariant beta2-microglobin chain + example: HLA-DRB1*04:01 + x-airr: + nullable: true + reactivity_method: + type: string + enum: + - SPR + - ITC + - ELISA + - cytometry + - biological_activity + description: The methodology used to assess expression (assay implemented in experiment) + x-airr: + nullable: false + reactivity_readout: + type: string + enum: + - binding_strength + - cytokine_release + - dissociation_constant_kd + - on_rate + - off_rate + - pathogen_inhibition + description: Reactivity measurement read-out + example: cytokine release + x-airr: + nullable: false + reactivity_value: + type: number + description: The absolute (processed) value of the measurement + example: 162.26 + x-airr: + nullable: false + reactivity_unit: + type: string + description: The unit of the measurement + example: pg/ml + x-airr: + nullable: false diff -Nru python-airr-1.3.1/airr/specs/blank.airr.yaml python-airr-1.5.0/airr/specs/blank.airr.yaml --- python-airr-1.3.1/airr/specs/blank.airr.yaml 2020-10-13 21:31:46.000000000 +0000 +++ python-airr-1.5.0/airr/specs/blank.airr.yaml 1970-01-01 00:00:00.000000000 +0000 @@ -1,120 +0,0 @@ -# -# blank metadata template -# - -Repertoire: - - repertoire_id: null - study: - study_id: null - study_title: null - study_type: - id: null - label: null - study_description: null - inclusion_exclusion_criteria: null - lab_name: null - lab_address: null - submitted_by: null - collected_by: null - grants: null - pub_ids: null - keywords_study: null - subject: - subject_id: null - synthetic: false - species: - id: null - label: null - sex: null - age_min: null - age_max: null - 
age_unit: - id: null - label: null - age_event: null - ancestry_population: null - ethnicity: null - race: null - strain_name: null - linked_subjects: null - link_type: null - diagnosis: - - study_group_description: null - disease_diagnosis: - id: null - label: null - disease_length: null - disease_stage: null - prior_therapies: null - immunogen: null - intervention: null - medical_history: null - sample: - - sample_processing_id: null - sample_id: null - sample_type: null - tissue: - id: null - label: null - anatomic_site: null - disease_state_sample: null - collection_time_point_relative: null - collection_time_point_reference: null - biomaterial_provider: null - # cell processing - tissue_processing: null - cell_subset: - id: null - label: null - cell_phenotype: null - cell_species: - id: null - label: null - single_cell: false - cell_number: null - cells_per_reaction: null - cell_storage: false - cell_quality: null - cell_isolation: null - cell_processing_protocol: null - # nucleic acid processing - template_class: "" - template_quality: null - template_amount: null - library_generation_method: "" - library_generation_protocol: null - library_generation_kit_version: null - pcr_target: - - pcr_target_locus: null - forward_pcr_primer_target_location: null - reverse_pcr_primer_target_location: null - complete_sequences: "partial" - physical_linkage: "none" - # sequencing run - sequencing_run_id: null - total_reads_passing_qc_filter: null - sequencing_platform: null - sequencing_facility: null - sequencing_run_date: null - sequencing_kit: null - # raw data - sequencing_files: - file_type: null - filename: null - read_direction: null - read_length: null - paired_filename: null - paired_read_direction: null - paired_read_length: null - data_processing: - - data_processing_id: null - primary_annotation: false - software_versions: null - paired_reads_assembly: null - quality_thresholds: null - primer_match_cutoffs: null - collapsing_method: null - 
data_processing_protocols: null - data_processing_files: null - germline_database: null - analysis_provenance_id: null diff -Nru python-airr-1.3.1/airr/tools.py python-airr-1.5.0/airr/tools.py --- python-airr-1.3.1/airr/tools.py 2019-11-03 00:00:33.000000000 +0000 +++ python-airr-1.5.0/airr/tools.py 2022-08-28 17:34:43.000000000 +0000 @@ -20,6 +20,7 @@ # System imports import argparse import sys +from warnings import warn # Local imports from airr import __version__ @@ -43,7 +44,7 @@ return airr.interface.merge_rearrangement(out_file, airr_files, drop=drop, debug=debug) # internal wrapper function before calling validate interface method -def validate_cmd(airr_files, debug=True): +def validate_rearrangement_cmd(airr_files, debug=True): """ Validates one or more AIRR rearrangements files @@ -54,12 +55,46 @@ Returns: boolean: True if all files passed validation, otherwise False """ - try: - valid = [airr.interface.validate_rearrangement(f, debug=debug) for f in airr_files] - return all(valid) - except Exception as err: - sys.stderr.write('Error occurred while validating AIRR rearrangement files: ' + str(err) + '\n') - return False + valid = [] + for f in airr_files: + try: + v = airr.interface.validate_rearrangement(f, debug=debug) + valid.append(v) + except Exception as e: + sys.stderr.write('%s\n' % e) + sys.stderr.write('Validation failed for file: %s\n\n' % f) + valid.append(False) + else: + if not v: sys.stderr.write('Validation failed for file: %s\n\n' % f) + + return all(valid) + +def validate_airr_cmd(airr_files, debug=True): + """ + Validates one or more AIRR Data Model files + + Arguments: + airr_files (list): list of input files to validate. + debug (bool): debug flag. If True print debugging information to standard error. 
+ + Returns: + boolean: True if all files passed validation, otherwise False + """ + valid = [] + for f in airr_files: + if debug: sys.stderr.write('Validating: %s\n' % f) + try: + data = airr.interface.read_airr(f, validate=False, debug=debug) + v = airr.interface.validate_airr(data, debug=debug) + valid.append(v) + except Exception as e: + sys.stderr.write('%s\n' % e) + sys.stderr.write('Validation failed for file: %s\n\n' % f) + valid.append(False) + + return all(valid) + +#### Deprecated #### # internal wrapper function before calling validate interface method def validate_repertoire_cmd(airr_files, debug=True): @@ -73,12 +108,21 @@ Returns: boolean: True if all files passed validation, otherwise False """ - try: - valid = [airr.interface.validate_repertoire(f, debug=debug) for f in airr_files] - return all(valid) - except Exception as err: - sys.stderr.write('Error occurred while validating AIRR repertoire metadata files: ' + str(err) + '\n') - return False + # Deprecation + warn('validate_repertoire_cmd is deprecated and will be removed in a future release.\nUse validate_airr_cmd instead.\n', + DeprecationWarning, stacklevel=2) + + valid = [] + for f in airr_files: + try: + v = airr.interface.validate_repertoire(f, debug=debug) + valid.append(v) + except Exception as e: + sys.stderr.write('%s\n' % e) + sys.stderr.write('Validation failed for file: %s\n\n' % f) + valid.append(False) + + return all(valid) def define_args(): """ @@ -137,21 +181,11 @@ # Subparser to validate files parser_validate = subparsers.add_parser('validate', parents=[common_parser], add_help=False, - help='Validate AIRR files.', - description='Validate AIRR files.') + help='Validate files for AIRR Standards compliance.', + description='Validate files for AIRR Standards compliance.') validate_subparser = parser_validate.add_subparsers(title='subcommands', metavar='', help='Database operation') - # Subparser to validate repertoire files - parser_validate =
validate_subparser.add_parser('repertoire', parents=[common_parser], - add_help=False, - help='Validate AIRR repertoire metadata files.', - description='Validate AIRR repertoire metadata files.') - group_validate = parser_validate.add_argument_group('validate arguments') - group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, - help='A list of AIRR repertoire metadata files.') - parser_validate.set_defaults(func=validate_repertoire_cmd) - # Subparser to validate rearrangement files parser_validate = validate_subparser.add_parser('rearrangement', parents=[common_parser], add_help=False, @@ -160,7 +194,27 @@ group_validate = parser_validate.add_argument_group('validate arguments') group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR rearrangement files.') - parser_validate.set_defaults(func=validate_cmd) + parser_validate.set_defaults(func=validate_rearrangement_cmd) + + # Subparser to validate AIRR Data Model files + parser_validate = validate_subparser.add_parser('airr', parents=[common_parser], + add_help=False, + help='Validate AIRR Data Model files.', + description='Validate AIRR Data Model files.') + group_validate = parser_validate.add_argument_group('validate arguments') + group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, + help='A list of AIRR Data Model files.') + parser_validate.set_defaults(func=validate_airr_cmd) + + # Subparser to validate repertoire files + parser_validate = validate_subparser.add_parser('repertoire', parents=[common_parser], + add_help=False, + help='Validate AIRR repertoire metadata files.', + description='Validate AIRR repertoire metadata files.') + group_validate = parser_validate.add_argument_group('validate arguments') + group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, + help='A list of AIRR repertoire metadata files.') + 
parser_validate.set_defaults(func=validate_repertoire_cmd) return parser @@ -181,6 +235,11 @@ del args_dict['command'] del args_dict['func'] + # Deprecation warnings + if args.func is validate_repertoire_cmd: + print('The "validate repertoire" subcommand is deprecated and will be removed in a future release.', + '\nUse the "validate airr" subcommand instead.\n') + # Call tool function result = args.func(**args_dict) diff -Nru python-airr-1.3.1/debian/changelog python-airr-1.5.0/debian/changelog --- python-airr-1.3.1/debian/changelog 2020-12-07 16:28:55.000000000 +0000 +++ python-airr-1.5.0/debian/changelog 2023-12-22 13:42:56.000000000 +0000 @@ -1,3 +1,21 @@ +python-airr (1.5.0-1) unstable; urgency=medium + + * Team upload. + + [ Debian Janitor ] + * Remove constraints unnecessary since buster (oldstable): + + Build-Depends: Drop versioned constraint on python3-pandas and + python3-yaml. + + [ Andreas Tille ] + * Set DPT as maintainer + * Standards-Version: 4.6.2 (routine-update) + * Build-Depends: s/dh-python/dh-sequence-python3/ (routine-update) + * Replace SafeConfigParser deprecated in Python3.12 + Closes: #1058307 + + -- Andreas Tille Fri, 22 Dec 2023 14:42:56 +0100 + python-airr (1.3.1-1) unstable; urgency=medium [ Ondřej Nový ] diff -Nru python-airr-1.3.1/debian/control python-airr-1.5.0/debian/control --- python-airr-1.3.1/debian/control 2020-12-07 16:28:55.000000000 +0000 +++ python-airr-1.5.0/debian/control 2023-12-22 13:42:56.000000000 +0000 @@ -1,17 +1,17 @@ Source: python-airr -Maintainer: Debian Science Maintainers +Maintainer: Debian Python Team Uploaders: Steffen Moeller Section: science Testsuite: autopkgtest-pkg-python Priority: optional Build-Depends: debhelper-compat (= 13), - dh-python, + dh-sequence-python3, python3-all, python3-setuptools, - python3-pandas (>= 0.18.0), - python3-yaml (>=3.12), + python3-pandas, + python3-yaml, python3-yamlordereddictloader -Standards-Version: 4.5.1 +Standards-Version: 4.6.2 Vcs-Browser: 
https://salsa.debian.org/python-team/packages/airr Vcs-Git: https://salsa.debian.org/python-team/packages/airr.git Homepage: https://docs.airr-community.org/en/latest/packages/airr-python/overview.html diff -Nru python-airr-1.3.1/debian/patches/python3.12.patch python-airr-1.5.0/debian/patches/python3.12.patch --- python-airr-1.3.1/debian/patches/python3.12.patch 1970-01-01 00:00:00.000000000 +0000 +++ python-airr-1.5.0/debian/patches/python3.12.patch 2023-12-22 13:42:56.000000000 +0000 @@ -0,0 +1,19 @@ +Description: Replace SafeConfigParser deprecated in Python3.12 +Bug-Debian: https://bugs.debian.org/1058307 +Author: Andreas Tille +Last-Update: Fri, 22 Dec 2023 07:07:29 +0100 + +--- a/versioneer.py ++++ b/versioneer.py +@@ -339,9 +339,9 @@ def get_config_from_root(root): + # configparser.NoOptionError (if it lacks "VCS="). See the docstring at + # the top of versioneer.py for instructions on writing your setup.cfg . + setup_cfg = os.path.join(root, "setup.cfg") +- parser = configparser.SafeConfigParser() ++ parser = configparser.ConfigParser() + with open(setup_cfg, "r") as f: +- parser.readfp(f) ++ parser.read_file(f) + VCS = parser.get("versioneer", "VCS") # mandatory + + def get(parser, name): diff -Nru python-airr-1.3.1/debian/patches/series python-airr-1.5.0/debian/patches/series --- python-airr-1.3.1/debian/patches/series 1970-01-01 00:00:00.000000000 +0000 +++ python-airr-1.5.0/debian/patches/series 2023-12-22 13:42:56.000000000 +0000 @@ -0,0 +1 @@ +python3.12.patch diff -Nru python-airr-1.3.1/debian/rules python-airr-1.5.0/debian/rules --- python-airr-1.3.1/debian/rules 2020-12-07 16:28:39.000000000 +0000 +++ python-airr-1.5.0/debian/rules 2023-12-22 13:42:56.000000000 +0000 @@ -4,7 +4,7 @@ export PYBUILD_NAME=airr %: - dh $@ --with python3 --buildsystem=pybuild + dh $@ --buildsystem=pybuild override_dh_auto_test: ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS))) diff -Nru python-airr-1.3.1/requirements.txt python-airr-1.5.0/requirements.txt --- 
python-airr-1.3.1/requirements.txt 2018-12-07 20:49:15.000000000 +0000 +++ python-airr-1.5.0/requirements.txt 2021-05-02 19:43:41.000000000 +0000 @@ -1,4 +1,4 @@ -pandas>= 0.18.0 +pandas>=0.24.0 pyyaml>=3.12 yamlordereddictloader>=0.4.0 setuptools>=2.0 diff -Nru python-airr-1.3.1/setup.cfg python-airr-1.5.0/setup.cfg --- python-airr-1.3.1/setup.cfg 2020-10-13 21:51:22.000000000 +0000 +++ python-airr-1.5.0/setup.cfg 2023-08-31 18:00:52.852955300 +0000 @@ -1,5 +1,5 @@ [versioneer] -vcs = git +VCS = git style = pep440 versionfile_source = airr/_version.py versionfile_build = airr/_version.py diff -Nru python-airr-1.3.1/setup.py python-airr-1.5.0/setup.py --- python-airr-1.3.1/setup.py 2020-05-27 18:43:54.000000000 +0000 +++ python-airr-1.5.0/setup.py 2023-08-07 21:09:52.000000000 +0000 @@ -2,7 +2,6 @@ AIRR community formats for adaptive immune receptor data. """ import sys -import os import versioneer try: @@ -14,11 +13,7 @@ long_description = ip.read() # Parse requirements -if os.environ.get('READTHEDOCS', None) == 'True': - # Set empty install_requires to get install to work on readthedocs - install_requires = [] -else: - with open('requirements.txt') as req: +with open('requirements.txt') as req: install_requires = req.read().splitlines() # Setup @@ -41,6 +36,5 @@ classifiers=['Intended Audience :: Science/Research', 'Natural Language :: English', 'Operating System :: OS Independent', - 'Programming Language :: Python :: 2.7', 'Programming Language :: Python :: 3', 'Topic :: Scientific/Engineering :: Bio-Informatics']) diff -Nru python-airr-1.3.1/tests/test_interface.py python-airr-1.5.0/tests/test_interface.py --- python-airr-1.3.1/tests/test_interface.py 2020-10-13 21:31:46.000000000 +0000 +++ python-airr-1.5.0/tests/test_interface.py 2023-08-07 21:34:29.000000000 +0000 @@ -5,8 +5,10 @@ import os import time import unittest +import jsondiff +import sys -# Load imports +# airr imports import airr from airr.schema import ValidationError @@ -20,10 +22,21 @@ 
print('-------> %s()' % self.id()) # Test data - self.data_good = os.path.join(data_path, 'good_data.tsv') - self.data_bad = os.path.join(data_path, 'bad_data.tsv') - self.rep_good = os.path.join(data_path, 'good_repertoire.airr.yaml') - self.rep_bad = os.path.join(data_path, 'bad_repertoire.airr.yaml') + self.rearrangement_good = os.path.join(data_path, 'good_rearrangement.tsv') + self.rearrangement_bad = os.path.join(data_path, 'bad_rearrangement.tsv') + self.rep_good = os.path.join(data_path, 'good_repertoire.yaml') + self.rep_bad = os.path.join(data_path, 'bad_repertoire.yaml') + self.germline_good = os.path.join(data_path, 'good_germline_set.json') + self.germline_bad = os.path.join(data_path, 'bad_germline_set.json') + self.genotype_good = os.path.join(data_path, 'good_genotype_set.json') + self.genotype_bad = os.path.join(data_path, 'bad_genotype_set.json') + self.combined_yaml = os.path.join(data_path, 'good_combined_airr.yaml') + self.combined_json = os.path.join(data_path, 'good_combined_airr.json') + + # Output data + self.output_rep = os.path.join(data_path, 'output_rep.json') + self.output_good = os.path.join(data_path, 'output_data.json') + self.output_blank = os.path.join(data_path, 'output_blank.json') # Expected output self.shape_good = (9, 44) @@ -37,58 +50,321 @@ print('<- %.3f %s()' % (t, self.id())) # @unittest.skip('-> load(): skipped\n') - def test_load(self): + def test_load_rearrangement(self): # Good data - result = airr.load_rearrangement(self.data_good) + result = airr.load_rearrangement(self.rearrangement_good) self.assertTupleEqual(result.shape, self.shape_good, 'load(): good data failed') # Bad data - result = airr.load_rearrangement(self.data_bad) + result = airr.load_rearrangement(self.rearrangement_bad) self.assertTupleEqual(result.shape, self.shape_bad, 'load(): bad data failed') # @unittest.skip('-> repertoire_template(): skipped\n') def test_repertoire_template(self): try: - rep = airr.repertoire_template() - result = 
airr.schema.RepertoireSchema.validate_object(rep) - self.assertTrue(result, 'repertoire_template(): repertoire template failed validation') + with self.assertWarns(DeprecationWarning, msg='repertoire_template(): failed to issue DeprecationWarning'): + rep = airr.repertoire_template() + airr.write_airr(self.output_blank, {'Repertoire': rep}, validate=False, debug=True) except: - self.assertTrue(False, 'repertoire_template(): repertoire template failed validation') + pass + + # @unittest.skip('-> schema.template(): skipped\n') + def test_schema_template(self): + # Repertoire template + try: + data = airr.schema.RepertoireSchema.template() + valid = airr.schema.RepertoireSchema.validate_object(data) + self.assertTrue(valid, 'Schema.template("Repertoire"): repertoire template failed validation') + except: + self.assertTrue(False, 'Schema.template("Repertoire"): repertoire template failed validation') + + # GermlineSet template + try: + data = airr.schema.GermlineSetSchema.template() + valid = airr.schema.GermlineSetSchema.validate_object(data) + self.assertTrue(valid, 'Schema.template("GermlineSet"): repertoire template failed validation') + except: + self.assertTrue(False, 'Schema.template("GermlineSet"): repertoire template failed validation') + + # GenotypeSet template + try: + data = airr.schema.GenotypeSetSchema.template() + valid = airr.schema.GenotypeSetSchema.validate_object(data) + self.assertTrue(valid, 'Schema.template("GenotypeSet"): repertoire template failed validation') + except: + self.assertTrue(False, 'Schema.template("GenotypeSet"): repertoire template failed validation') # @unittest.skip('-> validate(): skipped\n') - def test_validate(self): + def test_validate_rearrangement(self): # Good data try: - result = airr.validate_rearrangement(self.data_good) + result = airr.validate_rearrangement(self.rearrangement_good) self.assertTrue(result, 'validate(): good data failed') except: self.assertTrue(False, 'validate(): good data failed') # Bad data try: - 
result = airr.validate_rearrangement(self.data_bad) + result = airr.validate_rearrangement(self.rearrangement_bad) self.assertFalse(result, 'validate(): bad data failed') except Exception as inst: print(type(inst)) raise inst + # @unittest.skip('-> read_airr(): skipped\n') + def test_read_airr(self): + # Good data + print('--> Good data') + try: + data = airr.read_airr(self.rep_good, validate=True, debug=True) + except: + self.fail('read_airr(): good data failed') + + # Bad data + print('--> Bad data') + with self.assertRaises(ValidationError, msg="read_airr(): bad data passed validation"): + data = airr.read_airr(self.rep_bad, validate=True, debug=True) + + # Combined yaml + print('--> Combined YAML') + try: + data_yaml = airr.read_airr(self.combined_yaml, validate=True, debug=True) + except: + self.fail('read_airr(): combined yaml failed') + + # Combined json + print('--> Combined JSON') + try: + data_json = airr.read_airr(self.combined_json, validate=True, debug=True) + except: + self.fail('read_airr(): combined json failed') + + # Check equality of yaml and json + self.assertDictEqual(data_yaml, data_json, msg="read_airr(): yaml and json imports are not equal") + + + # @unittest.skip('-> validate_airr(): skipped\n') + def test_validate_airr(self): + # Good data + print('--> Good data') + # As array + try: + data = airr.read_airr(self.rep_good, validate=True, debug=True) + valid = airr.validate_airr(data, debug=True) + self.assertTrue(valid, 'validate_airr(): good data array failed') + except: + self.assertTrue(False, 'validate_airr(): good data array failed') + + # As dict + try: + array = airr.read_airr(self.rep_good, validate=False, debug=False) + data = {'Repertoire': {x['repertoire_id']: x for x in array['Repertoire']}} + valid = airr.validate_airr(data, debug=True) + self.assertTrue(valid, 'validate_airr(): good data dict failed') + except: + self.assertTrue(False, 'validate_airr(): good data dict failed') + + # Bad data + print('--> Bad data') + # As 
array + try: + data = airr.read_airr(self.rep_bad, validate=True, debug=True) + valid = airr.validate_airr(data, debug=True) + self.assertFalse(valid, 'validate_airr(): bad data array failed') + except ValidationError: + pass + except Exception as inst: + print(type(inst)) + raise inst + + # As dict + try: + array = airr.read_airr(self.rep_bad, validate=False, debug=False) + data = {'Repertoire': {x['repertoire_id']: x for x in array['Repertoire']}} + valid = airr.validate_airr(data, debug=True) + self.assertFalse(valid, 'validate_airr(): bad data dict failed') + except ValidationError: + pass + except Exception as inst: + print(type(inst)) + raise inst + # @unittest.skip('-> load_repertoire(): skipped\n') def test_load_repertoire(self): # Good data try: - data = airr.load_repertoire(self.rep_good, validate=True) + with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'): + data = airr.load_repertoire(self.rep_good, validate=True, debug=True) except: self.assertTrue(False, 'load_repertoire(): good data failed') # Bad data try: - data = airr.load_repertoire(self.rep_bad, validate=True, debug=True) - self.assertFalse(True, 'load_repertoire(): bad data failed') + with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'): + data = airr.load_repertoire(self.rep_bad, validate=True, debug=True) + self.assertFalse(True, 'load_repertoire(): bad data passed') except ValidationError: pass except Exception as inst: print(type(inst)) raise inst + # @unittest.skip('-> write_repertoire(): skipped\n') + def test_write_repertoire(self): + # Good data + try: + with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'): + data = airr.load_repertoire(self.rep_good, validate=True, debug=True) + with self.assertWarns(DeprecationWarning, msg='write_repertoire(): failed to issue DeprecationWarning'): + result = airr.write_repertoire(self.output_rep, 
data['Repertoire'], debug=True) + with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'): + # verify we can read it + obj = airr.load_repertoire(self.output_rep, validate=True, debug=True) + + # is the data identical? + if jsondiff.diff(obj['Repertoire'], data['Repertoire']) != {}: + print('Output data does not match', file=sys.stderr) + print(jsondiff.diff(obj, data), file=sys.stderr) + self.assertTrue(False, 'write_repertoire(): Output data does not match') + except: + self.assertTrue(False, 'write_repertoire(): good data failed') + + # @unittest.skip('-> load_germline(): skipped\n') + def test_read_germline(self): + # Good data + try: + result = airr.read_airr(self.germline_good, validate=True, debug=True) + except ValidationError: + self.assertTrue(False, 'load_germline(): good data failed') + + # Bad data + try: + result = airr.read_airr(self.germline_bad, validate=True, debug=True) + self.assertFalse(True, 'load_germline(): bad data succeeded') + except ValidationError: + pass + + # @unittest.skip('-> validate_germline(): skipped\n') + def test_validate_germline(self): + # Good data + print('--> Good data') + try: + result = airr.read_airr(self.germline_good, validate=True, debug=True) + valid = airr.validate_airr(result, debug=True) + self.assertTrue(valid, 'validate_germline(): good data failed') + except ValidationError: + self.assertTrue(False, 'validate_germline(): good data failed') + + # Bad data + print('--> Bad data') + try: + result = airr.read_airr(self.germline_bad, validate=True, debug=True) + valid = airr.validate_airr(result, debug=True) + self.assertFalse(valid, 'validate_germline(): bad data succeeded') + except ValidationError: + pass + + # @unittest.skip('-> load_genotype(): skipped\n') + def test_read_genotype(self): + # Good data + print('--> Good data') + try: + result = airr.read_airr(self.genotype_good, validate=True, debug=True) + except ValidationError: + self.assertTrue(False, 
'load_genotype(): good data failed') + + # Bad data + print('--> Bad data') + try: + result = airr.read_airr(self.genotype_bad, validate=True, debug=True) + self.assertFalse(True, 'load_genotype(): bad data succeeded') + except ValidationError: + pass + + # @unittest.skip('-> validate_genotype(): skipped\n') + def test_validate_genotype(self): + # Good data + print('--> Good data') + try: + result = airr.read_airr(self.genotype_good, validate=True, debug=True) + valid = airr.validate_airr(result, debug=True) + self.assertTrue(valid, 'validate_genotype(): good data failed') + except ValidationError: + self.assertTrue(False, 'validate_genotype(): good data failed') + + # Bad data + print('--> Bad data') + try: + result = airr.read_airr(self.genotype_bad, validate=True, debug=True) + valid = airr.validate_airr(result, debug=True) + self.assertFalse(valid, 'validate_genotype(): bad data succeeded') + except ValidationError: + pass + + # @unittest.skip('-> load_genotype(): skipped\n') + def test_write_airr(self): + # Good data as array + try: + repertoire_data = airr.read_airr(self.rep_good, validate=True, debug=True) + germline_data = airr.read_airr(self.germline_good, validate=True, debug=True) + genotype_data = airr.read_airr(self.genotype_good, validate=True, debug=True) + + # combine together and write + obj = {} + obj['Repertoire'] = repertoire_data['Repertoire'] + obj['GermlineSet'] = germline_data['GermlineSet'] + obj['GenotypeSet'] = genotype_data['GenotypeSet'] + airr.write_airr(self.output_good, obj, validate=True, debug=True) + + # verify we can read it + data = airr.read_airr(self.output_good, validate=True, debug=True) + + # is the data identical? 
+ del data['Info'] + if jsondiff.diff(obj, data) != {}: + print('Output data does not match', file=sys.stderr) + print(jsondiff.diff(obj, data), file=sys.stderr) + self.assertTrue(False, 'write_airr_data(): Output data does not match') + + except Exception as inst: + self.assertTrue(False, 'write_airr_data(): good data failed') + print(type(inst)) + raise inst + + # Good data as dict + try: + # Load data + repertoire_array = airr.read_airr(self.rep_good, validate=True, debug=True) + germline_array = airr.read_airr(self.germline_good, validate=True, debug=True) + genotype_array = airr.read_airr(self.genotype_good, validate=True, debug=True) + + # Build keyed representation + repertoire_data = {'Repertoire': {x['repertoire_id']: x for x in repertoire_array['Repertoire']}} + germline_data = {'GermlineSet': {x['germline_set_id']: x for x in germline_array['GermlineSet']}} + genotype_data = {'GenotypeSet': {x['receptor_genotype_set_id']: x for x in genotype_array['GenotypeSet']}} + + # combine together and write + obj = {} + obj['Repertoire'] = repertoire_data['Repertoire'] + obj['GermlineSet'] = germline_data['GermlineSet'] + obj['GenotypeSet'] = genotype_data['GenotypeSet'] + airr.write_airr(self.output_good, obj, validate=True, debug=True) + + # verify we can read it + data = airr.read_airr(self.output_good, validate=True, debug=True) + + # is the data identical? 
+ del data['Info'] + if jsondiff.diff(obj, data) != {}: + print('Output data does not match', file=sys.stderr) + print(jsondiff.diff(obj, data), file=sys.stderr) + self.assertTrue(False, 'write_airr_data(): Output data does not match') + + except Exception as inst: + self.assertTrue(False, 'write_airr_data(): good data failed') + print(type(inst)) + raise inst + + if __name__ == '__main__': unittest.main() diff -Nru python-airr-1.3.1/tests/test_io.py python-airr-1.5.0/tests/test_io.py --- python-airr-1.3.1/tests/test_io.py 2020-10-13 21:31:46.000000000 +0000 +++ python-airr-1.5.0/tests/test_io.py 2022-08-28 17:34:43.000000000 +0000 @@ -19,9 +19,9 @@ print('-------> %s()' % self.id()) # Test data - self.data_good = os.path.join(data_path, 'good_data.tsv') - self.data_bad = os.path.join(data_path, 'bad_data.tsv') - self.data_extra = os.path.join(data_path, 'extra_data.tsv') + self.data_good = os.path.join(data_path, 'good_rearrangement.tsv') + self.data_bad = os.path.join(data_path, 'bad_rearrangement.tsv') + self.data_extra = os.path.join(data_path, 'extra_rearrangement.tsv') # Start timer self.start = time.time() @@ -67,5 +67,6 @@ print(type(inst)) raise inst + if __name__ == '__main__': unittest.main()