Welcome to PyRice’s documentation!

Installation

Install via pip

Install the latest version of the PyRice package with pip:

$ pip install pyrice

To install an older version of the PyRice package with pip:

$ pip install pyrice==0.2.0
The following versions are available (the latest version is recommended):
  • Version 0.2.0: Updated the gene references from Oryzabase. Added 2 new databases, including PlantTFDB, to PyRice.
    • If you install PyRice on your local machine, please follow these steps:

      • Check which version of Chrome is installed on your computer before downloading the driver

      • Download the Chrome driver

      • After downloading, set the path to the Chrome driver before querying:

        >>> from pyrice import utils
        >>> utils.chrome_path = "the path of your Chrome driver"
        
  • Version 0.1.9: Supports running PyRice on Google Colab or other cloud platforms. Changed the output format.

  • Version 0.1.8: Addition of crawling JavaScript data with Selenium.

  • Version 0.1.5: Crawl data without Selenium (unsupported).
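
To confirm which version ended up installed after one of the pip commands above, a small check with the standard library can help. This is only a sketch and not part of PyRice: installed_version is a hypothetical helper name, and the code assumes Python 3.8 or later, where importlib.metadata is available.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "pyrice" is the distribution name used in the pip commands above.
version = installed_version("pyrice")
if version is None:
    print("pyrice is not installed in this environment")
else:
    print("pyrice version:", version)
```

If the printed version is not the one you expected, rerun pip with the pinned form shown above.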

In progress: to install the newest demo of PyRice from the test PyPI index with pip:

$ pip install -i https://test.pypi.org/simple/ pyrice

Install via GitHub

Clone the PyRice project (the newest version) with git:

$ git clone https://github.com/SouthGreenPlatform/PyRice.git

To use an older version of the PyRice package, choose the release pyrice-0.1.8.

Download package

Download and extract the compressed archive from PyPI.

Instructions

Before using the package

If you want to use Selenium with Chrome to crawl JavaScript data, set the path to the Chrome driver before querying:

>>> from pyrice import utils

>>> utils.chrome_path = "the path of your Chrome driver"

Using the multi_query module

The core of the PyRice package is the multi_query module. The MultiQuery class in this module is the main object for querying gene information across many databases. Users can create instances of this class in several ways.

Search for genes on a chromosome

To search for genes on a chromosome, use the search_on_chromosome() function in the MultiQuery class:

>>> from pyrice.multi_query import MultiQuery

>>> query = MultiQuery()
>>> result = query.search_on_chromosome(chro="chr01", start_pos="1", end_pos="20000",
                                        number_process = 4, dbs="all", save_path="./result/")

The function returns its output as a dictionary. To also save the data to a file (in .csv format), set a destination with the save_path argument.

>>> print("Output database:", result)
Output database:
{'OsNippo01g010050': {
        'msu7Name': {'LOC_Os01g01010'},
        'raprepName': {'Os01g0100100'},
        'contig': 'chr01', 'fmin': 2982,
        'fmax': 10815},
'OsNippo01g010150': {
        'msu7Name': {'LOC_Os01g01019'},
        'raprepName': {'Os01g0100200'},
        'contig': 'chr01',
        'fmin': 11217,
        'fmax': 12435},
...
'OsNippo01g010300': {
        'msu7Name': {'LOC_Os01g01040'},
        'raprepName': {'Os01g0100500'},
        'contig': 'chr01',
        'fmin': 16398,
        'fmax': 20144}
}
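
The dictionary above can be turned into a compact per-gene summary with plain Python. The sketch below uses a two-gene sample copied from the output shown; summarize_genes is a hypothetical helper, not a PyRice function, and it assumes that msu7Name and raprepName are Python sets, as the printed output suggests.

```python
# Two entries copied from the sample output above.
result = {
    "OsNippo01g010050": {
        "msu7Name": {"LOC_Os01g01010"},
        "raprepName": {"Os01g0100100"},
        "contig": "chr01", "fmin": 2982, "fmax": 10815,
    },
    "OsNippo01g010150": {
        "msu7Name": {"LOC_Os01g01019"},
        "raprepName": {"Os01g0100200"},
        "contig": "chr01", "fmin": 11217, "fmax": 12435,
    },
}


def summarize_genes(result):
    """Return (iric_name, msu_loci, rap_ids, fmax - fmin) tuples sorted by fmin."""
    rows = []
    for iric_name, info in sorted(result.items(), key=lambda item: item[1]["fmin"]):
        rows.append((
            iric_name,
            ",".join(sorted(info["msu7Name"])),    # sets are sorted for stable output
            ",".join(sorted(info["raprepName"])),
            info["fmax"] - info["fmin"],           # span between the two coordinates
        ))
    return rows


summary = summarize_genes(result)
for row in summary:
    print(row)
```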

Search gene information by chromosome

PyRice supports searching for gene information by the start and end positions of genes on a chromosome. Use the query_by_chromosome() function in the MultiQuery class:

>>> from pyrice.multi_query import MultiQuery

>>> query = MultiQuery()
>>> result = query.query_by_chromosome(chro="chr01", start_pos="1", end_pos="20000",
                                       number_process = 4, multi_processing=True,
                                       multi_threading=True, dbs="all")

This function returns a dictionary.

>>> print("Output database:", result)
Output database:
{'OsNippo01g010050': {
        'rapdb': {
                'Locus_ID': 'Os01g0100100',
                'Description': 'RabGAP/TBC domain containing protein.',
                'Oryzabase Gene Name Synonym(s)': 'Molecular Function: Rab GTPase activator activity (GO:0005097)',
                ...},
        'gramene': {
                '_id': 'Os01g0100100',
                'name': 'Os01g0100100',
                'biotype': 'protein_coding',
                ...},
        ...},
'OsNippo01g010150': {
        'rapdb': {...},
        'gramene': {...},
        ...},
...
}
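
Because the result is nested as gene, then database, then attribute, pulling one attribute for every gene takes two lookups per gene. The following sketch shows one way to do that safely when a database has no entry for a gene. collect_field is a hypothetical helper, and the second sample entry is invented for illustration to show the missing-data case.

```python
# First entry trimmed from the sample output above; the second one is hypothetical.
result = {
    "OsNippo01g010050": {
        "rapdb": {
            "Locus_ID": "Os01g0100100",
            "Description": "RabGAP/TBC domain containing protein.",
        },
        "gramene": {"_id": "Os01g0100100", "biotype": "protein_coding"},
    },
    "OsNippo01g010150": {
        "gramene": {"biotype": "protein_coding"},  # hypothetical: no rapdb entry
    },
}


def collect_field(result, db, field, default=None):
    """Map each gene to result[gene][db][field], using default when it is missing."""
    return {
        gene: per_db.get(db, {}).get(field, default)
        for gene, per_db in result.items()
    }


descriptions = collect_field(result, "rapdb", "Description")
print(descriptions)
```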

To save the result, use the save() function of the MultiQuery class, which supports the file types html, pkl, json and csv:

>>> query.save(result, save_path="./result/",
               format=["csv", "html", "json", "pkl"], hyper_link=False)

Search gene information by IDs

PyRice supports searching for gene information by three kinds of gene identifiers: IDs on Oryzabase, loci on MSU and iric_name on SNP-SEEK. Use the query_by_ids() function in the MultiQuery class as follows:

>>> from pyrice.multi_query import MultiQuery

>>> query = MultiQuery()
>>> result = query.query_by_ids(ids=["Os08g0164400", "Os07g0586200"],
                                locs=["LOC_Os10g01006", "LOC_Os07g39750"],
                                irics=["OsNippo01g010050", "OsNippo01g010300"],
                                number_process = 4, multi_processing=True, multi_threading=True, dbs="all")

This function returns a dictionary where the key is iric_name:

>>> print("Output database:",result)
Output database:
{'OsNippo01g010050': {
        'rapdb': {
                'Locus_ID': 'Os01g0100100',
                'Description': 'RabGAP/TBC domain containing protein.',
                'Position': '',
                ...},
        'ic4r': {
                'Anther_Normal': {'expression_value': '0.699962'},
                'Anther_WT': {'expression_value': '13.9268'},
                ...},
        ...},
'OsNippo01g010300': {
        'rapdb': {...},
        'ic4r': {...},
        ...},
...
}
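
When identifiers arrive as one mixed list, they have to be split into the ids, locs and irics arguments before calling query_by_ids(). The sketch below does this with regular expressions. The patterns are assumptions inferred only from the example identifiers in this section, and group_identifiers is a hypothetical helper, not part of PyRice.

```python
import re

# Patterns inferred from the examples above (assumptions, not an official spec).
PATTERNS = {
    "ids": re.compile(r"^Os\d{2}g\d{7}$"),          # e.g. Os08g0164400
    "locs": re.compile(r"^LOC_Os\d{2}g\d{5}$"),     # e.g. LOC_Os10g01006
    "irics": re.compile(r"^OsNippo\d{2}g\d{6}$"),   # e.g. OsNippo01g010050
}


def group_identifiers(names):
    """Split a mixed list of identifiers into ids, locs and irics lists."""
    groups = {key: [] for key in PATTERNS}
    unknown = []
    for name in names:
        for key, pattern in PATTERNS.items():
            if pattern.match(name):
                groups[key].append(name)
                break
        else:
            unknown.append(name)
    if unknown:
        raise ValueError("Unrecognized identifiers: {}".format(unknown))
    return groups


groups = group_identifiers(["Os08g0164400", "LOC_Os07g39750", "OsNippo01g010050"])
print(groups)
# Intended use (requires PyRice): query.query_by_ids(dbs="all", **groups)
```

Rejecting unknown strings up front avoids sending malformed identifiers to the remote databases.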

To save the result, use the save() function of the MultiQuery class, which supports the file types html, pkl, json and csv:

>>> query.save(result, save_path = "./result/",
               format=["csv", "html", "json", "pkl"], hyper_link=False)

Using the build_dictionary module

PyRice stores two databases, Oryzabase and RapDB, locally, together with three dictionaries of gene identifiers. To keep them up to date, use the update_gene_dictionary() and update_local_database() functions in the build_dictionary module:

>>> from pyrice.build_dictionary import update_gene_dictionary, update_local_database

>>> update_gene_dictionary()
>>> # rapdb_url (str) and oryzabase_url (list: url of genes, url of refs) must be defined first
>>> update_local_database(rapdb_url, oryzabase_url)

Using the search function and SQL queries

PyRice provides a function for searching text in the result file produced by the query functions. Use the search() function in the utils module:

>>> from pyrice import utils
>>> import pandas as pd

>>> df1 = pd.read_pickle("./result1/data/db.pkl")
>>> df2 = pd.read_pickle("./result2/data/db.pkl")
>>> df = pd.concat([df1,df2])
>>> result = utils.search(df,"Amino acid ")

You can also execute an SQL query over a pandas dataframe. This requires the pandasql package. Once it is installed, run the SQL query as follows:

>>> import pandas as pd
>>> from pandasql import sqldf

>>> data = pd.read_pickle("./result/data/db.pkl")
>>> data = data.astype(str)
>>> sql = "SELECT * FROM data WHERE `oryzabase.CGSNL Gene Symbol` = 'TLP27' or `gramene.system_name` = 'oryza_sativa'"
>>> pysqldf = lambda q: sqldf(q, globals())
>>> print(pysqldf(sql))

Note

To use the search() function, save the result as a .pkl file and then reload it.

The name of the dataframe variable must be the same as the table name used in the SQL query.
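
If pandasql is not available, the same kind of query can be tried with the standard library's sqlite3 module. This is a swapped-in technique, not what PyRice or pandasql does internally. The rows are hypothetical stand-ins for a saved result table, and run_sql is a hypothetical helper. Column names that contain dots or spaces must be quoted in the SQL text.

```python
import sqlite3

# Hypothetical rows standing in for a saved PyRice result table.
rows = [
    {"gene": "g1", "oryzabase.CGSNL Gene Symbol": "TLP27", "gramene.system_name": "oryza_sativa"},
    {"gene": "g2", "oryzabase.CGSNL Gene Symbol": "", "gramene.system_name": "oryza_sativa"},
    {"gene": "g3", "oryzabase.CGSNL Gene Symbol": "", "gramene.system_name": ""},
]


def run_sql(rows, sql, params=()):
    """Load rows into an in-memory table named data and run one SQL query on it."""
    columns = list(rows[0])
    quoted = ", ".join('"{}"'.format(name.replace('"', '""')) for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE data ({})".format(quoted))
        connection.executemany(
            "INSERT INTO data VALUES ({})".format(placeholders),
            [tuple(str(row.get(name, "")) for name in columns) for row in rows],
        )
        return connection.execute(sql, params).fetchall()
    finally:
        connection.close()


sql = (
    'SELECT gene FROM data '
    'WHERE "oryzabase.CGSNL Gene Symbol" = ? OR "gramene.system_name" = ? '
    'ORDER BY gene'
)
matches = run_sql(rows, sql, ("TLP27", "oryza_sativa"))
print(matches)
```

Passing the values as parameters instead of pasting them into the SQL string avoids quoting mistakes.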

Structure of the database wrapper file

PyRice contains a file, database_description.xml, that holds the wrappers for all databases and manages the information about each of them:

<database dbname="name of the database" type="Type of the response" method="GET or POST">
        <link stern="the link section before the query" aft="section behind the query"/>
        <headers>
                <header type="">Column number 1</header>
                <header type="">Column number 2</header>
                etc.
        </headers>
        <fields>
                <field>Query argument number 1</field>
        </fields>
        <data_struct indicator="indicator of return data segment" identifier="the attribute to identify data section" identification_string="value of said identifier" line_separator="indicator of a line of data" cell_separator="indicator of a cell of data"/>
        <prettify>Regular expression of unwanted character</prettify>
</database>

Example: here is the description of the Oryzabase database:

<database dbname="oryzabase" type="text/html" method="POST">
        <link stern="https://shigen.nig.ac.jp/rice/oryzabase/gene/advanced/list"/>
        <headers>
                <header type="">CGSNL Gene Symbol</header>
                <header type="">Gene symbol synonym(s)</header>
                <header type="">CGSNL Gene Name</header>
                <header type="">Gene name synonym(s)</header>
                <header type="">Chr. No.</header>
                <header type="">Trait Class</header>
                <header type="">Gene Ontology</header>
                <header type="">Trait Ontology</header>
                <header type="">Plant Ontology</header>
                <header type="">RAP ID</header>
                <header type="">Mutant Image</header>
        </headers>
        <fields>
                <field>rapId</field>
        </fields>
        <data_struct indicator="table" identifier="class" identification_string="table_summery_list table_nowrapTh max_width_element" line_separator="tr" cell_separator="td"/>
        <prettify>\n>LOC_.*\n|\n|\r|\t</prettify>
</database>
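
To see how such a description can be read programmatically, the sketch below parses a shortened copy of the Oryzabase entry with the standard library's xml.etree.ElementTree. It is not PyRice's own loader. read_description is a hypothetical helper, the header list is trimmed to two entries, and applying the prettify pattern with re.sub to delete matches is an assumption about its purpose.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the Oryzabase description above (only two headers kept).
DESCRIPTION = r"""
<database dbname="oryzabase" type="text/html" method="POST">
    <link stern="https://shigen.nig.ac.jp/rice/oryzabase/gene/advanced/list"/>
    <headers>
        <header type="">CGSNL Gene Symbol</header>
        <header type="">RAP ID</header>
    </headers>
    <fields>
        <field>rapId</field>
    </fields>
    <data_struct indicator="table" identifier="class" identification_string="table_summery_list table_nowrapTh max_width_element" line_separator="tr" cell_separator="td"/>
    <prettify>\n>LOC_.*\n|\n|\r|\t</prettify>
</database>
"""


def read_description(xml_text):
    """Collect the main attributes of one <database> element into a dictionary."""
    root = ET.fromstring(xml_text)
    link = root.find("link")
    struct = root.find("data_struct")
    return {
        "dbname": root.get("dbname"),
        "type": root.get("type"),
        "method": root.get("method", "GET"),
        "stern": link.get("stern"),
        "aft": link.get("aft", ""),
        "headers": [header.text for header in root.findall("headers/header")],
        "fields": [field.text or "" for field in root.findall("fields/field")],
        "row_tag": struct.get("line_separator") if struct is not None else None,
        "cell_tag": struct.get("cell_separator") if struct is not None else None,
        "prettify": root.findtext("prettify", default=""),
    }


info = read_description(DESCRIPTION)
print(info["dbname"], info["method"], info["fields"])

# Assumed use of <prettify>: delete every match from a scraped cell.
cleaned = re.sub(info["prettify"], "", "Os01g0100100\t\n")
print(repr(cleaned))
```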

Search new attributes on new databases

Add new database

PyRice supports querying a new database once you add its description manually to database_description.xml. Here is an example for an API that returns JSON, the SNP-SEEK database, with the API URL https://snp-seek.irri.org/ws/genomics/gene/osnippo/chr01?start=1&end=15000&model=iric:

<database dbname="snpseek" type="text/JSON" method="GET" normalize="false">
    <link stern="https://snp-seek.irri.org/ws/genomics/gene/osnippo/" aft=""/>
    <fields>
        <field></field>
        <field op="=">start</field>
        <field op="=">end</field>
        <field op="=">model</field>
    </fields>
</database>

For more details:
  • dbname : name of the database
  • type : format of the result returned by the API
  • method : GET/POST (default GET)
  • normalize : whether to normalize the name of the database, true/false (default false)
  • stern : URL of the API
  • op : parameters (see the API URL above)

For example, with an API from Planteome: http://browser.planteome.org/api/search/annotation?bioentity=AT4G32150:

<database dbname="planteome" type="text/JSON" method="GET" normalize="false">
    <link stern="http://browser.planteome.org/api/search/annotation?" aft=""></link>
    <fields>
        <field op="=">bioentity</field>
    </fields>
</database>
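
To make the roles of stern, field and op concrete, the sketch below rebuilds the Planteome URL from the pieces of this description. How PyRice actually assembles the request is not documented here, so the joining rule (name, then op, then value, with pairs separated by "&") is an assumption, and build_get_url is a hypothetical helper.

```python
from urllib.parse import quote


def build_get_url(stern, fields, values, aft=""):
    """Join stern, the name/op/value triples and aft into one GET URL.

    fields is a list of (name, op) pairs taken from the <field> elements;
    values holds one query value per field, in the same order.
    """
    if len(fields) != len(values):
        raise ValueError("each field needs exactly one value")
    parts = [
        "{}{}{}".format(name, op, quote(str(value), safe=""))
        for (name, op), value in zip(fields, values)
    ]
    return stern + "&".join(parts) + aft


url = build_get_url(
    "http://browser.planteome.org/api/search/annotation?",
    [("bioentity", "=")],
    ["AT4G32150"],
)
print(url)
```

The result matches the Planteome API URL quoted above, which is the only case this sketch is checked against.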

Use the new query function

Use the query_new_database() function in the MultiQuery class:

>>> from pyrice.multi_query import MultiQuery

>>> query = MultiQuery()
>>> result = query.query_new_database(atts=['AT4G32150'], number_process= 4,
                                      multi_processing=True,multi_threading=True,dbs=['planteome'])

This function returns a dictionary:

>>> print("Output database:",result)
Output database:
{'AT4G32150': {
        'planteome': {
                'service': '/api/search/annotation',
                'status': 'success.',
                'arguments': '{}',
                'comments': ['Results found for: annotation; queries: ; filters: '],
                'data': [{...}]
                ...},
    ...}
}

To save the result, use the save() function of the MultiQuery class, which supports the file types html, pkl, json and csv:

>>> query.save(result, save_path="./result/",
               format=["csv", "html", "json", "pkl"], hyper_link=False)

Note

For APIs that return results in HTML or JavaScript format, problems may occur because of differences in the GUI (JavaScript) or the tags (HTML). We are working on simplifying how the package handles these two formats so that new databases are easier to add.

List of supported databases

Database_name: keywords

Note

Keywords are the values passed as arguments in the query module.

Result of PyRice

Here is the result produced by the package as an .html file.

_images/html.png

Note

You can click a gene name to see more information about that gene.

_images/html_detail.png

Pyrice module

pyrice.build_dictionary module

pyrice.build_dictionary.gunzip_shutil(source_filepath, dest_filepath, block_size=65536)

Function to unzip a file

Parameters:
  • source_filepath – (str) path of the source .zip file
  • dest_filepath – (str) path of the destination file
  • block_size – (int) block size
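
Judging only by its name and signature, gunzip_shutil probably streams a gzip-compressed file to disk in blocks. That is an assumption, since the parameter description mentions .zip. Under it, a minimal equivalent with the standard library could look like the sketch below. gunzip_sketch is a hypothetical name, and this is not PyRice's source code.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_sketch(source_filepath, dest_filepath, block_size=65536):
    """Decompress a gzip file to dest_filepath, copying block_size bytes at a time."""
    with gzip.open(source_filepath, "rb") as source, open(dest_filepath, "wb") as dest:
        shutil.copyfileobj(source, dest, block_size)


# Round trip on a temporary file to show the call.
with tempfile.TemporaryDirectory() as folder:
    packed = os.path.join(folder, "sample.txt.gz")
    unpacked = os.path.join(folder, "sample.txt")
    with gzip.open(packed, "wb") as handle:
        handle.write(b"Os01g0100100\n")
    gunzip_sketch(packed, unpacked)
    with open(unpacked, "rb") as handle:
        restored = handle.read()

print(restored)
```

Copying in blocks keeps memory use flat even for large database dumps.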

pyrice.build_dictionary.update_gene_dictionary()

Update function for the gene dictionary

pyrice.build_dictionary.update_local_database(rapdb_url, oryzabase_url)

Update function for the RapDB and Oryzabase databases

Parameters:
  • rapdb_url – (str) URL for downloading the RapDB database
  • oryzabase_url – (list) URLs for downloading the Oryzabase database (1st: url of genes, 2nd: url of refs)

pyrice.multi_query module

class pyrice.multi_query.MultiQuery

Bases: object

This class represents gene queries against rice databases

query(iricname, db, qfields=[], verbose=False)

Query one gene by id or loc on each database

Parameters:
  • iricname – (str) iricname or id of the gene
  • db – (str) name of the database (one of 8 databases)
  • qfields – (list) list of loc, id
  • verbose – (bool) if True, print debugging output
Returns:

a list with format: [iricname,name_db,iric_on_db]

query_by_chromosome(chro, start_pos, end_pos, number_process=1, multi_processing=False, multi_threading=True, dbs='all', query_expansion=False)

Query genes by chromosome

Parameters:
  • chro – (str) chromosome (ex: “chr01”)
  • start_pos – (str) start position on the chromosome
  • end_pos – (str) end position on the chromosome
  • number_process – (int) number of processes or number of threads
  • multi_processing – (bool) if True, use multi_processing
  • multi_threading – (bool) if True, use multi_threading
  • dbs – (list) list of databases (10 databases are supported)
  • query_expansion – (bool) if True, find the list of associated genes
Returns:

a dictionary, format : gene:{database: attributes}
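
The nested gene:{database: attributes} format is convenient for lookups but awkward for tables. The sketch below flattens it into one row per gene with column names of the form database.attribute, which is the naming the SQL example in this documentation relies on. flatten_result and the gene column are hypothetical, and the sample values are invented for illustration.

```python
# Hypothetical sample in the gene -> database -> attribute layout.
result = {
    "OsNippo01g010050": {
        "rapdb": {"Locus_ID": "Os01g0100100"},
        "gramene": {"system_name": "oryza_sativa", "biotype": "protein_coding"},
    },
}


def flatten_result(result):
    """Return one flat dictionary per gene with 'database.attribute' keys."""
    rows = []
    for gene, per_db in result.items():
        row = {"gene": gene}
        for db, attributes in per_db.items():
            for name, value in attributes.items():
                row["{}.{}".format(db, name)] = str(value)  # strings, as in astype(str)
        rows.append(row)
    return rows


rows = flatten_result(result)
print(rows[0])
```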

query_by_ids(ids=None, locs=None, irics=None, number_process=1, multi_processing=False, multi_threading=True, dbs='all', query_expansion=False)

Query genes using id, loc or iric

Parameters:
  • ids – (list) list of gene ids
  • locs – (list) list of gene locs
  • irics – (list) list of gene iric names
  • number_process – (int) number of processes or number of threads
  • multi_processing – (bool) if True, use multi_processing
  • multi_threading – (bool) if True, use multi_threading
  • dbs – (list) list of databases (10 databases are supported)
  • query_expansion – (bool) if True, find the list of associated genes
Returns:

a dictionary, format: gene:{database: attribute}

query_expansion(ids=None, locs=None, irics=None, number_process=1)

Query genes using id, loc or iric

Parameters:
  • ids – (list) list of gene ids
  • locs – (list) list of gene locs
  • irics – (list) list of gene iric names
  • number_process – (int) number of processes or number of threads
Returns:

a dictionary, format: gene:{database: attribute}

query_multi_threading(list_key, list_dbs, list_ids, number_threading=2)

Query function used when both multi_processing and multi_threading are enabled

Parameters:
  • list_key – (list) list of iricnames
  • list_dbs – (list) list of databases
  • list_ids – (list) list of ids or loci
  • number_threading – (int) number of threads per core
Returns:

a dictionary, format: gene:{database:attributes}
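
The general idea of spreading gene and database pairs over a pool of threads and collecting the answers into the gene:{database:attributes} shape can be sketched with concurrent.futures. This illustrates the pattern only, not PyRice's implementation: fan_out is a hypothetical helper and fake_query is a stand-in that returns canned data instead of contacting a database.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_query(iricname, db):
    """Hypothetical stand-in for one request; a real version would call a database."""
    return {"source": db, "gene": iricname}


def fan_out(list_key, list_dbs, fetch, number_threading=2):
    """Run fetch(key, db) for every pair on a thread pool and nest the answers."""
    tasks = [(key, db) for key in list_key for db in list_dbs]
    result = {key: {} for key in list_key}
    with ThreadPoolExecutor(max_workers=number_threading) as pool:
        # pool.map yields answers in the same order as tasks.
        answers = pool.map(lambda task: fetch(*task), tasks)
        for (key, db), answer in zip(tasks, answers):
            result[key][db] = answer
    return result


result = fan_out(["OsNippo01g010050", "OsNippo01g010300"], ["rapdb", "gramene"], fake_query)
print(result["OsNippo01g010050"]["rapdb"])
```

Threads suit this workload because each task mostly waits on a network response.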

query_new_database(atts, number_process=1, multi_processing=False, multi_threading=True, dbs=None)

Query new attributes on new databases

Parameters:
  • atts – (list) list of new attributes
  • number_process – (int) number of processes or number of threads
  • multi_processing – (bool) if True, use multi_processing
  • multi_threading – (bool) if True, use multi_threading
  • dbs – (list) list of new databases
Returns:

a dictionary, format : attribute:{database: information of attribute}

save(result, save_path, format=None, hyper_link=False)

Save the query result in different file types

Parameters:
  • result – (dictionary) result returned by the query functions
  • save_path – (str) path where the result is saved
  • format – (list) any of the 4 formats: html, csv, json, pkl
  • hyper_link – (bool) hyperlinks in the csv file

search_on_chromosome(chro, start_pos, end_pos, number_process=1, save_path=None, dbs='all')

Search genes by position on a chromosome

Parameters:
  • chro – (str) chromosome (ex: “chr01”)
  • start_pos – (str) start position on the chromosome
  • end_pos – (str) end position on the chromosome
  • number_process – (int) number of threads
  • dbs – (list) list of databases (3 databases are supported)
  • save_path – (str) path where the result is saved
Returns:

a dictionary, format: iricname:{{msu7Name:LOC_Os..},{raprepName:Os..},{contig:chr0..},{fmin:12..},{fmax:22…}}
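
If you want to serialize this dictionary yourself rather than through save_path, note that the sample output in this documentation prints msu7Name and raprepName as Python sets, which the json module cannot encode directly. The sketch below converts sets to sorted lists first. to_jsonable is a hypothetical helper, and the assumption that these values are sets comes only from that printed output.

```python
import json


def to_jsonable(value):
    """Recursively replace sets with sorted lists so json.dumps accepts the value."""
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


# One entry copied from the sample output of search_on_chromosome().
result = {
    "OsNippo01g010050": {
        "msu7Name": {"LOC_Os01g01010"},
        "raprepName": {"Os01g0100100"},
        "contig": "chr01",
        "fmin": 2982,
        "fmax": 10815,
    },
}

text = json.dumps(to_jsonable(result), indent=2, sort_keys=True)
print(text)
```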

pyrice.utils module

pyrice.utils.connection_error(link, data='', type=None, db=None, gene_id=None)

Get the result of a POST or GET request, including pages in JavaScript format

Parameters:
  • link – (str) url
  • data – (str) data to give to the form
  • type – (str) use with JavaScript format
  • db – (str) database name - use with JavaScript format
  • gene_id – (str) gene id - use with JavaScript format
Returns:

object of requests

pyrice.utils.execute_query(db, qfields=[], verbose=False)

Get the URL and the result from the database API

Parameters:
  • db – (str) name of the database
  • qfields – (list) list of loc, id
  • verbose – (bool) if True, print debugging output
Returns:

information about the gene returned after sending the request to the API URL

pyrice.utils.search(df, text)

Search function on the result (.pkl file)

Parameters:
  • df – (dataframe) pandas dataframe
  • text – (str) text to search for
Returns:

a pandas dataframe of the results that include the text
