Welcome to PyRice’s documentation!
Installation
Install via pip
Install the newest version of the PyRice package with pip:
$ pip install pyrice
To install an older version of the PyRice package with pip:
$ pip install pyrice==0.2.0
The following versions are available (use the latest version):
- Version 0.2.0: Updated the gene reference from Oryzabase. Added 2 new PlantTFDB databases to PyRice.
  If you install PyRice on your local machine, please follow these steps:
  - Check the version of Chrome currently installed on your computer before downloading.
  - Download the Chrome driver for that version.
  - After downloading, set the path to the Chrome driver before querying:
>>> from pyrice import utils
>>> utils.chrome_path = "the path of your Chrome driver"
- Version 0.1.9: PyRice on Google Colab or other cloud platforms. Changed the output format.
- Version 0.1.8: Added crawling of JavaScript data with Selenium.
- Version 0.1.5: Crawls data without Selenium (unsupported).
In progress: to install the newest demo of PyRice with pip:
$ pip install -i https://test.pypi.org/simple/ pyrice
Install via GitHub
Clone the PyRice project (the newest version) with git:
$ git clone https://github.com/SouthGreenPlatform/PyRice.git
To use an older version of the PyRice package, choose the release pyrice-0.1.8.
Download package
Download and extract the compressed archive from PyPI.
Instructions
Before using the package
If you want to use Selenium with Chrome to crawl JavaScript data, set the path to the Chrome driver before querying:
>>> from pyrice import utils
>>> utils.chrome_path = "the path of your Chrome driver"
Using the multi_query module
The core of the PyRice package is the multi_query module. The MultiQuery class in this module is the main object for querying gene information across many databases.
Users can create instances of this class in several ways.
Search genes on a chromosome
To search for genes on a chromosome, use the search_on_chromosome() function in the MultiQuery class:
>>> from pyrice.multi_query import MultiQuery
>>> query = MultiQuery()
>>> result = query.search_on_chromosome(chro="chr01", start_pos="1", end_pos="20000",
number_process = 4, dbs="all", save_path="./result/")
The function returns its output as a dictionary.
In addition, to save the data to a file (in .csv format), set a destination through the save_path argument.
>>> print("Output database:", result)
Output database:
{'OsNippo01g010050': {
'msu7Name': {'LOC_Os01g01010'},
'raprepName': {'Os01g0100100'},
'contig': 'chr01', 'fmin': 2982,
'fmax': 10815},
'OsNippo01g010150': {
'msu7Name': {'LOC_Os01g01019'},
'raprepName': {'Os01g0100200'},
'contig': 'chr01',
'fmin': 11217,
'fmax': 12435},
...
'OsNippo01g010300': {
'msu7Name': {'LOC_Os01g01040'},
'raprepName': {'Os01g0100500'},
'contig': 'chr01',
'fmin': 16398,
'fmax': 20144}
}
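The returned dictionary can be post-processed with plain Python. The sketch below does not call PyRice: it uses a hand-copied excerpt of the output above, and genes_overlapping is a hypothetical helper (not part of the package) that lists the genes whose [fmin, fmax] range overlaps a region of interest.

```python
# Excerpt of the dictionary shown above, copied by hand (no live query).
result = {
    "OsNippo01g010050": {
        "msu7Name": {"LOC_Os01g01010"},
        "raprepName": {"Os01g0100100"},
        "contig": "chr01", "fmin": 2982, "fmax": 10815,
    },
    "OsNippo01g010150": {
        "msu7Name": {"LOC_Os01g01019"},
        "raprepName": {"Os01g0100200"},
        "contig": "chr01", "fmin": 11217, "fmax": 12435,
    },
    "OsNippo01g010300": {
        "msu7Name": {"LOC_Os01g01040"},
        "raprepName": {"Os01g0100500"},
        "contig": "chr01", "fmin": 16398, "fmax": 20144,
    },
}


def genes_overlapping(result, start, end):
    """Hypothetical helper: genes whose [fmin, fmax] overlaps [start, end]."""
    hits = []
    for iric_name, info in result.items():
        if info["fmin"] <= end and info["fmax"] >= start:
            # msu7Name is a set in the output above; sort it for stable display.
            hits.append((iric_name, sorted(info["msu7Name"])))
    # Order the hits by their start position on the chromosome.
    return sorted(hits, key=lambda hit: result[hit[0]]["fmin"])


overlapping = genes_overlapping(result, 10000, 12000)
for iric_name, loci in overlapping:
    print(iric_name, loci)
```

Note that msu7Name and raprepName are sets, so they need converting (for example with sorted()) before being written to formats such as JSON.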
Search gene information by chromosome
The PyRice package supports searching by the start and end positions of genes on a chromosome. Use the query_by_chromosome() function in the MultiQuery class:
>>> from pyrice.multi_query import MultiQuery
>>> query = MultiQuery()
>>> result = query.query_by_chromosome(chro="chr01", start_pos="1", end_pos="20000",
number_process = 4, multi_processing=True,
multi_threading=True, dbs="all")
This function returns a dictionary:
>>> print("Output database:", result)
Output database:
{'OsNippo01g010050': {
'rapdb': {
'Locus_ID': 'Os01g0100100',
'Description': 'RabGAP/TBC domain containing protein.',
'Oryzabase Gene Name Synonym(s)': 'Molecular Function: Rab GTPase activator activity (GO:0005097)',
...},
'gramene': {
'_id': 'Os01g0100100',
'name': 'Os01g0100100',
'biotype': 'protein_coding',
...},
...},
'OsNippo01g010150': {
'rapdb': {...},
'gramene': {...},
...},
...
}
To save the result, use the save() function of the MultiQuery class, which supports the html, pkl, json, and csv file formats:
>>> query.save(result, save_path="./result/",
format=["csv", "html", "json", "pkl"], hyper_link=False)
Search gene information by IDs
The PyRice package supports searching gene information by three kinds of gene identifier: IDs on Oryzabase, locus on MSU, and iric_name on SNP-SEEK.
Use the query_by_ids() function in the MultiQuery class as follows:
>>> from pyrice.multi_query import MultiQuery
>>> query = MultiQuery()
>>> result = query.query_by_ids(ids=["Os08g0164400", "Os07g0586200"],
locs=["LOC_Os10g01006", "LOC_Os07g39750"],
irics=["OsNippo01g010050", "OsNippo01g010300"],
number_process = 4, multi_processing=True, multi_threading=True, dbs="all")
This function returns a dictionary where the key is iric_name:
>>> print("Output database:",result)
Output database:
{'OsNippo01g010050': {
'rapdb': {
'Locus_ID': 'Os01g0100100',
'Description': 'RabGAP/TBC domain containing protein.',
'Position': '',
...},
'ic4r': {
'Anther_Normal': {'expression_value': '0.699962'},
'Anther_WT': {'expression_value': '13.9268'},
...},
...},
'OsNippo01g010300': {
'rapdb': {...},
'ic4r': {...},
...},
...
}
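The nested gene:{database: attributes} layout is convenient for lookups but awkward for tabular work. As a minimal sketch, assuming a hand-written excerpt of the output above, the hypothetical helper flatten_result (not a PyRice function) turns each gene into one flat record whose keys follow a database.attribute pattern, similar to the column names used in the SQL example on this page.

```python
# Hand-copied excerpt of the output above (no live query).
result = {
    "OsNippo01g010050": {
        "rapdb": {
            "Locus_ID": "Os01g0100100",
            "Description": "RabGAP/TBC domain containing protein.",
            "Position": "",
        },
        "ic4r": {
            "Anther_Normal": {"expression_value": "0.699962"},
            "Anther_WT": {"expression_value": "13.9268"},
        },
    },
}


def flatten_result(result):
    """Hypothetical helper: one flat dict per gene, keyed 'database.attribute'."""
    records = []
    for gene, databases in result.items():
        record = {"gene": gene}
        for db_name, attributes in databases.items():
            for attr_name, value in attributes.items():
                record[f"{db_name}.{attr_name}"] = value
        records.append(record)
    return records


records = flatten_result(result)
print(records[0]["rapdb.Description"])
```

Only the first two levels are flattened here; deeper values such as the ic4r expression dictionaries are kept as they are.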
To save the result, use the save() function of the MultiQuery class, which supports the html, pkl, json, and csv file formats:
>>> query.save(result, save_path = "./result/",
format=["csv", "html", "json", "pkl"], hyper_link=False)
Using the build_dictionary module
The PyRice package stores local copies of 2 databases, Oryzabase and RapDB, along with three dictionaries of gene identifiers.
To keep them up to date, use the update_gene_dictionary() and update_rapdb_oryzabase() functions in the build_dictionary module:
>>> from pyrice.build_dictionary import update_gene_dictionary, update_rapdb_oryzabase
>>> update_gene_dictionary()
>>> update_rapdb_oryzabase(rapdb_url, oryzabase_url)
Using the search function and SQL queries
The PyRice package has a function for searching text in the result file produced by the query functions.
Use the search() function in the utils module:
>>> from pyrice import utils
>>> import pandas as pd
>>> df1 = pd.read_pickle("./result1/data/db.pkl")
>>> df2 = pd.read_pickle("./result2/data/db.pkl")
>>> df = pd.concat([df1,df2])
>>> result = utils.search(df,"Amino acid ")
You can also execute an SQL query over a pandas DataFrame. This requires the pandasql package to be installed. Then run the SQL query as follows:
>>> import pandas as pd
>>> from pandasql import sqldf
>>> data = pd.read_pickle("./result/data/db.pkl")
>>> data = data.astype(str)
>>> sql = "SELECT * FROM data WHERE `oryzabase.CGSNL Gene Symbol` = 'TLP27' or `gramene.system_name` = 'oryza_sativa'"
>>> pysqldf = lambda q: sqldf(q, globals())
>>> print(pysqldf(sql))
Note
You have to save the result as a .pkl file and load it again to use the search() function.
In the SQL query, the table name must be the same as the name of the DataFrame variable.
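pandasql uses SQLite syntax, so the quoting of dotted column names can be tried with the standard-library sqlite3 module alone. The sketch below builds a tiny in-memory table named data with two of the column names used in the query above; the three rows are placeholders, not real query results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names contain dots and spaces, so they must be quoted.
conn.execute(
    'CREATE TABLE data ('
    '"gene" TEXT, '
    '"oryzabase.CGSNL Gene Symbol" TEXT, '
    '"gramene.system_name" TEXT)'
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        ("gene_a", "TLP27", "oryza_sativa"),  # placeholder rows
        ("gene_b", "", "oryza_sativa"),
        ("gene_c", "", "other_species"),
    ],
)

# Same WHERE clause as the pandasql example; backticks quote the identifiers.
sql = (
    "SELECT gene FROM data "
    "WHERE `oryzabase.CGSNL Gene Symbol` = 'TLP27' "
    "OR `gramene.system_name` = 'oryza_sativa' "
    "ORDER BY gene"
)
selected = [row[0] for row in conn.execute(sql)]
print(selected)
conn.close()
```

Without the backticks, SQLite would read oryzabase.CGSNL as a table-qualified column reference and the query would fail.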
Structure of the database wrapper file
The PyRice package contains a file (database_description.xml) that holds the wrappers for all databases and manages their information:
<database dbname="name of the database" type="Type of the response" method="GET or POST">
<link stern="the link section before the query" aft="section behind the query"/>
<headers>
<header type="">Column number 1</header>
<header type="">Column number 2</header>
etc.
</headers>
<fields>
<field>Query argument number 1</field>
</fields>
<data_struct indicator="indicator of return data segment" identifier="the attribute to identify data section" identification_string="value of said identifier" line_separator="indicator of a line of data" cell_separator="indicator of a cell of data"/>
<prettify>Regular expression of unwanted character</prettify>
</database>
Example: here is the description of the Oryzabase database:
<database dbname="oryzabase" type="text/html" method="POST">
<link stern="https://shigen.nig.ac.jp/rice/oryzabase/gene/advanced/list"/>
<headers>
<header type="">CGSNL Gene Symbol</header>
<header type="">Gene symbol synonym(s)</header>
<header type="">CGSNL Gene Name</header>
<header type="">Gene name synonym(s)</header>
<header type="">Chr. No.</header>
<header type="">Trait Class</header>
<header type="">Gene Ontology</header>
<header type="">Trait Ontology</header>
<header type="">Plant Ontology</header>
<header type="">RAP ID</header>
<header type="">Mutant Image</header>
</headers>
<fields>
<field>rapId</field>
</fields>
<data_struct indicator="table" identifier="class" identification_string="table_summery_list table_nowrapTh max_width_element" line_separator="tr" cell_separator="td"/>
<prettify>\n>LOC_.*\n|\n|\r|\t</prettify>
</database>
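To make the role of each element concrete, the following sketch reads a shortened copy of the Oryzabase description with the standard-library xml.etree.ElementTree module. It is not PyRice's own parser; read_description and the keys of the dictionary it returns are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Oryzabase description above (fewer headers, no prettify).
DESCRIPTION = """
<database dbname="oryzabase" type="text/html" method="POST">
  <link stern="https://shigen.nig.ac.jp/rice/oryzabase/gene/advanced/list"/>
  <headers>
    <header type="">CGSNL Gene Symbol</header>
    <header type="">RAP ID</header>
  </headers>
  <fields>
    <field>rapId</field>
  </fields>
  <data_struct indicator="table" identifier="class" identification_string="table_summery_list table_nowrapTh max_width_element" line_separator="tr" cell_separator="td"/>
</database>
"""


def read_description(xml_text):
    """Hypothetical helper: collect the main settings of one <database> entry."""
    db = ET.fromstring(xml_text.strip())
    link = db.find("link")
    data_struct = db.find("data_struct")
    return {
        "dbname": db.get("dbname"),
        "method": db.get("method", "GET"),
        # "aft" is optional in the entry above, so default to an empty string.
        "url": link.get("stern") + link.get("aft", ""),
        "headers": [header.text for header in db.iter("header")],
        "fields": [field.text for field in db.iter("field")],
        "row_tag": data_struct.get("line_separator"),
        "cell_tag": data_struct.get("cell_separator"),
    }


info = read_description(DESCRIPTION)
print(info["dbname"], info["method"], info["fields"])
```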
Search new attributes on new databases
Add new database
The PyRice package supports queries on new databases once their description is added manually to database_description.xml. Here is an example for the SNP-SEEK database, which returns JSON from the API https://snp-seek.irri.org/ws/genomics/gene/osnippo/chr01?start=1&end=15000&model=iric:
<database dbname="snpseek" type="text/JSON" method="GET" normalize="false">
<link stern="https://snp-seek.irri.org/ws/genomics/gene/osnippo/" aft=""/>
<fields>
<field></field>
<field op="=">start</field>
<field op="=">end</field>
<field op="=">model</field>
</fields>
</database>
For more details:
- dbname : database name
- type : format of the result returned by the API
- method : GET/POST (default GET)
- normalize : whether to normalize the name of the database, true/false (default false)
- stern : URL of the API
- op : operator placed between a parameter and its value (see the API URL above)
For example, with an API from Planteome: http://browser.planteome.org/api/search/annotation?bioentity=AT4G32150:
<database dbname="planteome" type="text/JSON" method="GET" normalize="false">
<link stern="http://browser.planteome.org/api/search/annotation?" aft=""></link>
<fields>
<field op="=">bioentity</field>
</fields>
</database>
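The relationship between such a description and the request URL can be illustrated with a small sketch. This is an assumption about how the pieces fit together, not PyRice's actual implementation: build_query_url is a hypothetical helper that appends each field name, its op, and the supplied value to stern, and it is only checked here against the Planteome URL given above.

```python
import xml.etree.ElementTree as ET

# The Planteome description from above.
PLANTEOME = """
<database dbname="planteome" type="text/JSON" method="GET" normalize="false">
  <link stern="http://browser.planteome.org/api/search/annotation?" aft=""></link>
  <fields>
    <field op="=">bioentity</field>
  </fields>
</database>
"""


def build_query_url(xml_text, values):
    """Hypothetical helper: join stern, the name/op/value pairs, and aft."""
    db = ET.fromstring(xml_text.strip())
    link = db.find("link")
    pairs = []
    for field, value in zip(db.iter("field"), values):
        # e.g. "bioentity" + "=" + "AT4G32150"
        pairs.append(f"{field.text}{field.get('op', '=')}{value}")
    return link.get("stern") + "&".join(pairs) + link.get("aft", "")


url = build_query_url(PLANTEOME, ["AT4G32150"])
print(url)
```

The sketch does no URL encoding and does not handle the empty first field of the SNP-SEEK description, which would need extra logic.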
Use the new query function
Use the query_new_database() function in the MultiQuery class:
>>> from pyrice.multi_query import MultiQuery
>>> query = MultiQuery()
>>> result = query.query_new_database(atts=['AT4G32150'], number_process= 4,
multi_processing=True,multi_threading=True,dbs=['planteome'])
This function returns a dictionary:
>>> print("Output database:",result)
Output database:
{'AT4G32150': {
'planteome': {
'service': '/api/search/annotation',
'status': 'success.',
'arguments': '{}',
'comments': ['Results found for: annotation; queries: ; filters: '],
'data': [{...}]
...},
...}
}
To save the result, use the save() function of the MultiQuery class, which supports the html, pkl, json, and csv file formats:
>>> query.save(result, save_path="./result/",
format=["csv", "html", "json", "pkl"], hyper_link=False)
Note
APIs that return results in HTML or JavaScript format may cause problems because of differences in the GUI (JavaScript) or the tags (HTML). We are working on simplifying how the package handles these two formats to make it easier to add new databases.
List of supported databases
Database name : keywords
- Oryzabase : oryzabase
- RapDB : rapdb
- Gramene : gramene
- IC4R : ic4r
- SNP-Seek : snpseek
- Funricegene : funricegene_genekeywords, funricegene_faminfo, funricegene_geneinfo
- MSU : msu
- EMBL-EBI Expression Atlas : embl_ebi
- GWAS-ATLAS : gwas_atlas
- Planteome : planteome
- PlantTFDB : plantfdb_tf, plantfdb_target_gene
Note
Keywords are the values passed to the dbs argument of the query functions.
Result of PyRice
Here is the result of the package as an .html file.

Note
You can click a gene name to see more information about the gene.

Pyrice module
pyrice.build_dictionary module
pyrice.multi_query module
class pyrice.multi_query.MultiQuery
Bases: object
This class represents a rice gene query across databases.
query(iricname, db, qfields=[], verbose=False)
Query one gene by id or loc on each database.
Parameters:
- iricname – (str) iricname or id of the gene
- db – (str) name of one of the 8 databases
- qfields – (list) list of loc, id
- verbose – (bool) if True, print debugging output
Returns: a list with the format [iricname, name_db, iric_on_db]
query_by_chromosome(chro, start_pos, end_pos, number_process=1, multi_processing=False, multi_threading=True, dbs='all', query_expansion=False)
Query genes by chromosome.
Parameters:
- chro – (str) chromosome (ex: “chr01”)
- start_pos – (str) start position on the chromosome
- end_pos – (str) end position on the chromosome
- number_process – (int) number of processes or threads
- multi_processing – (bool) if True, use multi_processing
- multi_threading – (bool) if True, use multi_threading
- dbs – (list) list of databases, or 'all' (10 databases are supported)
- query_expansion – (bool) if True, find the list of associated genes
Returns: a dictionary, format: gene:{database: attributes}
query_by_ids(ids=None, locs=None, irics=None, number_process=1, multi_processing=False, multi_threading=True, dbs='all', query_expansion=False)
Query genes using id, loc or iric.
Parameters:
- ids – (list) list of gene ids
- locs – (list) list of gene locs
- irics – (list) list of gene iric names
- number_process – (int) number of processes or threads
- multi_processing – (bool) if True, use multi_processing
- multi_threading – (bool) if True, use multi_threading
- dbs – (list) list of databases, or 'all' (10 databases are supported)
- query_expansion – (bool) if True, find the list of associated genes
Returns: a dictionary, format: gene:{database: attributes}
query_expansion(ids=None, locs=None, irics=None, number_process=1)
Query genes using id, loc or iric.
Parameters:
- ids – (list) list of gene ids
- locs – (list) list of gene locs
- irics – (list) list of gene iric names
- number_process – (int) number of processes or threads
Returns: a dictionary, format: gene:{database: attributes}
query_multi_threading(list_key, list_dbs, list_ids, number_threading=2)
Query function used when both multi_processing and multi_threading are enabled.
Parameters:
- list_key – (list) list of iricnames
- list_dbs – (list) list of databases
- list_ids – (list) list of ids or loci
- number_threading – (int) number of threads per core
Returns: a dictionary, format: gene:{database: attributes}
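As a rough illustration of the threaded fan-out idea, and not of PyRice's actual code, the sketch below sends one task per (gene, database) pair to a thread pool and collects the answers into the gene:{database: attributes} layout. fake_query and query_all are hypothetical names, and fake_query returns made-up attributes instead of contacting any database.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_query(gene, db):
    """Stand-in for a network request; returns made-up attributes."""
    return gene, db, {"source": db, "gene": gene}


def query_all(genes, dbs, number_threading=2):
    """Hypothetical helper: run every (gene, db) pair on a thread pool."""
    result = {gene: {} for gene in genes}
    tasks = [(gene, db) for gene in genes for db in dbs]
    with ThreadPoolExecutor(max_workers=number_threading) as pool:
        # pool.map yields results in the order of the submitted tasks.
        for gene, db, attributes in pool.map(lambda task: fake_query(*task), tasks):
            result[gene][db] = attributes
    return result


result = query_all(["OsNippo01g010050", "OsNippo01g010300"], ["rapdb", "gramene"])
print(sorted(result["OsNippo01g010050"]))
```

Threads suit this kind of work because each task mostly waits on a remote server; the results are merged in the main thread, so no locking is needed.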
query_new_database(atts, number_process=1, multi_processing=False, multi_threading=True, dbs=None)
Query for new attributes on new databases.
Parameters:
- atts – (list) list of new attributes
- number_process – (int) number of processes or threads
- multi_processing – (bool) if True, use multi_processing
- multi_threading – (bool) if True, use multi_threading
- dbs – (list) list of new databases
Returns: a dictionary, format: attribute:{database: information of attribute}
save(result, save_path, format=None, hyper_link=False)
Save the result of a query in different file types.
Parameters:
- result – (dictionary) result returned by one of the query functions
- save_path – (str) path where the result is saved
- format – (list) any of 4 formats: html, csv, json, pkl
- hyper_link – (bool) hyper_link in csv file
search_on_chromosome(chro, start_pos, end_pos, number_process=1, save_path=None, dbs='all')
Search genes by position on a chromosome.
Parameters:
- chro – (str) chromosome (ex: “chr01”)
- start_pos – (str) start position on the chromosome
- end_pos – (str) end position on the chromosome
- number_process – (int) number of threads
- dbs – (list) list of databases, or 'all' (3 databases are supported)
- save_path – (str) path where the result is saved
Returns: a dictionary, format: iricname:{msu7Name:LOC_Os.., raprepName:Os.., contig:chr0.., fmin:12.., fmax:22…}
pyrice.utils module
pyrice.utils.connection_error(link, data='', type=None, db=None, gene_id=None)
Get the result of a POST or GET request, with JavaScript support.
Parameters:
- link – (str) URL
- data – (str) data to give to the form
- type – (str) used with the JavaScript format
- db – (str) database name, used with the JavaScript format
- gene_id – (str) gene id, used with the JavaScript format
Returns: the response object from requests