Retrieve UniProt data using Python

In this mini-tool I will show you how to retrieve data from UniProt using a PDB ID or a UniProt ID directly in Python. I wrote this function based on UniProt's programmatic access example for Python. Although the idea is essentially the same, I made minor modifications to retrieve the data as plain text, from which it is easier to extract useful information.

Let's get hands-on.

The Python function

In [1]:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

def get_uniprot(query='', query_type='PDB_ID'):
    # query_type must be "PDB_ID" or "ACC"
    url = 'https://www.uniprot.org/uploadlists/'  # Web service used to retrieve the UniProt data
    params = {
        'from': query_type,
        'to': 'ACC',
        'format': 'txt',
        'query': query
    }

    # Encode the parameters and send them as the body of a POST request
    data = urllib.parse.urlencode(params)
    data = data.encode('ascii')
    request = urllib.request.Request(url, data)
    with urllib.request.urlopen(request) as response:
        res = response.read()
        # Strip any markup and split the plain-text entry into lines
        page = BeautifulSoup(res, 'html.parser').get_text()
        page = page.splitlines()
    return page

This very simple function lets us store the UniProt data for a PDB entry or a UniProt entry in a list, one line of the record per element.
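Since every annotated line of the record starts with a two-letter line code (ID, AC, DE, DR, ...), the list can also be regrouped by that code. The following is only a minimal sketch: `group_by_line_code` is a hypothetical helper, not part of the function above, and it assumes the layout visible in the outputs shown in this post (code, three spaces, content).

```python
from collections import defaultdict

def group_by_line_code(lines):
    """Group flat-file lines by their two-letter line code (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        # Assumed layout: 'XX   content'. Lines that do not match are skipped.
        if len(line) > 5 and line[:2].isalpha() and line[2:5] == '   ':
            grouped[line[:2]].append(line[5:].strip())
    return dict(grouped)

# A few lines copied from the record of P04058
sample = [
    'ID   ACES_TETCF              Reviewed;         586 AA.',
    'AC   P04058;',
    'DE   RecName: Full=Acetylcholinesterase;',
    'DE            Short=AChE;',
    'GN   Name=ache;',
]
records = group_by_line_code(sample)
print(records['AC'])
print(records['DE'])
```

With the real data, the same call would be `group_by_line_code(x)` on the list returned by get_uniprot.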

PDB entry

In [2]:
x=get_uniprot(query='1eve',query_type='PDB_ID')
x[:10] #Just get the first 10 lines of data
Out[2]:
['ID   ACES_TETCF              Reviewed;         586 AA.',
 'AC   P04058;',
 'DT   01-NOV-1986, integrated into UniProtKB/Swiss-Prot.',
 'DT   01-JUN-1994, sequence version 2.',
 'DT   16-OCT-2019, entry version 170.',
 'DE   RecName: Full=Acetylcholinesterase;',
 'DE            Short=AChE;',
 'DE            EC=3.1.1.7;',
 'DE   Flags: Precursor;',
 'GN   Name=ache;']

UniProt entry

In [3]:
y=get_uniprot(query='P04058',query_type='ACC')
y[:10] #Just get the first 10 lines of data
Out[3]:
['ID   ACES_TETCF              Reviewed;         586 AA.',
 'AC   P04058;',
 'DT   01-NOV-1986, integrated into UniProtKB/Swiss-Prot.',
 'DT   01-JUN-1994, sequence version 2.',
 'DT   16-OCT-2019, entry version 170.',
 'DE   RecName: Full=Acetylcholinesterase;',
 'DE            Short=AChE;',
 'DE            EC=3.1.1.7;',
 'DE   Flags: Precursor;',
 'GN   Name=ache;']

More examples

Because the whole record is available as a list of lines, it is very easy to find the information we are interested in, for instance the annotated Gene Ontology terms.

In [4]:
for line in y:
    if 'DR   GO;' in line:
        print (line)
DR   GO; GO:0031225; C:anchored component of membrane; IEA:UniProtKB-KW.
DR   GO; GO:0030054; C:cell junction; IEA:UniProtKB-KW.
DR   GO; GO:0005886; C:plasma membrane; IEA:UniProtKB-SubCell.
DR   GO; GO:0043083; C:synaptic cleft; IEA:GOC.
DR   GO; GO:0003990; F:acetylcholinesterase activity; IEA:UniProtKB-EC.
DR   GO; GO:0001507; P:acetylcholine catabolic process in synaptic cleft; IEA:InterPro.
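Each of these lines packs several fields separated by semicolons, so it can be split into its parts. This is a rough sketch with a hypothetical helper, `parse_go_line`; it assumes the field order seen in the output above (GO ID, aspect:term, evidence:source) and that the term itself contains no semicolon.

```python
def parse_go_line(line):
    """Split a 'DR   GO;' line into (go_id, aspect, term, evidence). Hypothetical helper."""
    # Drop the 'DR   ' prefix and the trailing period, then split on ';'
    fields = [f.strip() for f in line[len('DR   '):].rstrip('.').split(';')]
    _, go_id, aspect_term, evidence = fields[:4]
    # Aspect is a single letter: C (component), F (function) or P (process)
    aspect, term = aspect_term.split(':', 1)
    # Keep only the evidence code (e.g. IEA), not its source
    return go_id, aspect, term, evidence.split(':', 1)[0]

go_line = 'DR   GO; GO:0003990; F:acetylcholinesterase activity; IEA:UniProtKB-EC.'
print(parse_go_line(go_line))
```

Applied to the list y, something like `[parse_go_line(l) for l in y if l.startswith('DR   GO;')]` would give one tuple per annotation.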

Or all the PDB structures reported for the protein of interest, including the experimental method, the resolution, and the chains with the residue range they cover.

In [5]:
for line in y:
    if 'DR   PDB;' in line:
        print (line)
DR   PDB; 1ACJ; X-ray; 2.80 A; A=22-556.
DR   PDB; 1ACL; X-ray; 2.80 A; A=22-556.
DR   PDB; 1AMN; X-ray; 2.80 A; A=22-558.
DR   PDB; 1AX9; X-ray; 2.80 A; A=22-558.
DR   PDB; 1CFJ; X-ray; 2.60 A; A=22-558.
DR   PDB; 1DX6; X-ray; 2.30 A; A=22-564.
DR   PDB; 1E3Q; X-ray; 2.85 A; A=22-564.
DR   PDB; 1E66; X-ray; 2.10 A; A=22-564.
DR   PDB; 1EA5; X-ray; 1.80 A; A=22-558.
DR   PDB; 1EEA; X-ray; 4.50 A; A=22-555.
DR   PDB; 1EVE; X-ray; 2.50 A; A=22-564.
DR   PDB; 1FSS; X-ray; 3.00 A; A=22-558.
DR   PDB; 1GPK; X-ray; 2.10 A; A=22-558.
DR   PDB; 1GPN; X-ray; 2.35 A; A=22-558.
DR   PDB; 1GQR; X-ray; 2.20 A; A=25-556.
DR   PDB; 1GQS; X-ray; 3.00 A; A=25-556.
DR   PDB; 1H22; X-ray; 2.15 A; A=22-564.
DR   PDB; 1H23; X-ray; 2.15 A; A=22-564.
DR   PDB; 1HBJ; X-ray; 2.50 A; A=22-564.
DR   PDB; 1JGA; Model; -; A=1-586.
DR   PDB; 1JGB; Model; -; A=1-586.
DR   PDB; 1JJB; X-ray; 2.30 A; A=25-556.
DR   PDB; 1OCE; X-ray; 2.70 A; A=22-558.
DR   PDB; 1ODC; X-ray; 2.20 A; A=22-564.
DR   PDB; 1QID; X-ray; 2.05 A; A=22-558.
DR   PDB; 1QIE; X-ray; 2.10 A; A=22-558.
DR   PDB; 1QIF; X-ray; 2.10 A; A=22-558.
DR   PDB; 1QIG; X-ray; 2.30 A; A=22-558.
DR   PDB; 1QIH; X-ray; 2.50 A; A=22-558.
DR   PDB; 1QII; X-ray; 2.65 A; A=22-558.
DR   PDB; 1QIJ; X-ray; 2.80 A; A=22-558.
DR   PDB; 1QIK; X-ray; 2.90 A; A=22-558.
DR   PDB; 1QIM; X-ray; 3.00 A; A=22-558.
DR   PDB; 1QTI; X-ray; 2.50 A; A=22-558.
DR   PDB; 1SOM; X-ray; 2.20 A; A=22-564.
DR   PDB; 1U65; X-ray; 2.61 A; A=22-564.
DR   PDB; 1UT6; X-ray; 2.40 A; A=22-556.
DR   PDB; 1VOT; X-ray; 2.50 A; A=22-558.
DR   PDB; 1VXO; X-ray; 2.40 A; A=22-558.
DR   PDB; 1VXR; X-ray; 2.20 A; A=22-558.
DR   PDB; 1W4L; X-ray; 2.16 A; A=22-564.
DR   PDB; 1W6R; X-ray; 2.05 A; A=22-564.
DR   PDB; 1W75; X-ray; 2.40 A; A/B=22-564.
DR   PDB; 1W76; X-ray; 2.30 A; A/B=22-564.
DR   PDB; 1ZGB; X-ray; 2.30 A; A=22-564.
DR   PDB; 1ZGC; X-ray; 2.10 A; A/B=22-564.
DR   PDB; 2ACE; X-ray; 2.50 A; A=22-558.
DR   PDB; 2ACK; X-ray; 2.40 A; A=22-558.
DR   PDB; 2BAG; X-ray; 2.40 A; A=22-564.
DR   PDB; 2C4H; X-ray; 2.15 A; A=22-558.
DR   PDB; 2C58; X-ray; 2.30 A; A=22-558.
DR   PDB; 2C5F; X-ray; 2.60 A; A=22-558.
DR   PDB; 2C5G; X-ray; 1.95 A; A=22-558.
DR   PDB; 2CEK; X-ray; 2.20 A; A=22-556.
DR   PDB; 2CKM; X-ray; 2.15 A; A=22-564.
DR   PDB; 2CMF; X-ray; 2.50 A; A=22-564.
DR   PDB; 2DFP; X-ray; 2.30 A; A=23-556.
DR   PDB; 2J3D; X-ray; 2.60 A; A=22-564.
DR   PDB; 2J3Q; X-ray; 2.80 A; A=22-564.
DR   PDB; 2J4F; X-ray; 2.80 A; A=22-564.
DR   PDB; 2V96; X-ray; 2.40 A; A/B=22-558.
DR   PDB; 2V97; X-ray; 2.40 A; A/B=22-558.
DR   PDB; 2V98; X-ray; 3.00 A; A/B=22-558.
DR   PDB; 2VA9; X-ray; 2.40 A; A/B=22-558.
DR   PDB; 2VJA; X-ray; 2.30 A; A/B=22-558.
DR   PDB; 2VJB; X-ray; 2.39 A; A/B=22-558.
DR   PDB; 2VJC; X-ray; 2.20 A; A/B=22-558.
DR   PDB; 2VJD; X-ray; 2.30 A; A/B=22-558.
DR   PDB; 2VQ6; X-ray; 2.71 A; A=22-564.
DR   PDB; 2VT6; X-ray; 2.40 A; A/B=22-558.
DR   PDB; 2VT7; X-ray; 2.20 A; A/B=22-558.
DR   PDB; 2W6C; X-ray; 2.69 A; X=1-586.
DR   PDB; 2WFZ; X-ray; 1.95 A; A=22-558.
DR   PDB; 2WG0; X-ray; 2.20 A; A=22-558.
DR   PDB; 2WG1; X-ray; 2.20 A; A=22-558.
DR   PDB; 2WG2; X-ray; 1.95 A; A=22-558.
DR   PDB; 2XI4; X-ray; 2.30 A; A/B=22-558.
DR   PDB; 3ACE; Model; -; A=22-558.
DR   PDB; 3GEL; X-ray; 2.39 A; A=25-556.
DR   PDB; 3I6M; X-ray; 2.26 A; A=23-556.
DR   PDB; 3I6Z; X-ray; 2.19 A; A=23-556.
DR   PDB; 3M3D; X-ray; 2.34 A; A=22-564.
DR   PDB; 3ZV7; X-ray; 2.26 A; A=22-564.
DR   PDB; 4ACE; Model; -; A=22-558.
DR   PDB; 4TVK; X-ray; 2.30 A; A=23-556.
DR   PDB; 4W63; X-ray; 2.80 A; A=23-556.
DR   PDB; 4X3C; X-ray; 2.60 A; A=23-556.
DR   PDB; 5BWB; X-ray; 2.57 A; A=22-558.
DR   PDB; 5BWC; X-ray; 2.45 A; A=22-558.
DR   PDB; 5DLP; X-ray; 2.70 A; A=22-564.
DR   PDB; 5E2I; X-ray; 2.65 A; A=25-556.
DR   PDB; 5E4J; X-ray; 2.54 A; A=25-556.
DR   PDB; 5E4T; X-ray; 2.43 A; A=22-564.
DR   PDB; 5EHX; X-ray; 2.10 A; A=25-556.
DR   PDB; 5EI5; X-ray; 2.10 A; A=23-556.
DR   PDB; 5IH7; X-ray; 2.40 A; A=23-556.
DR   PDB; 5NAP; X-ray; 2.17 A; A=22-564.
DR   PDB; 5NAU; X-ray; 2.25 A; A=22-564.
DR   PDB; 5NUU; X-ray; 2.50 A; A=22-564.
DR   PDB; 6EUC; X-ray; 2.22 A; A=25-556.
DR   PDB; 6EUE; X-ray; 2.00 A; A=24-556.
DR   PDB; 6EWK; X-ray; 2.22 A; A=25-556.
DR   PDB; 6EZG; X-ray; 2.20 A; A/B=22-558.
DR   PDB; 6EZH; X-ray; 2.60 A; A/B=22-558.
DR   PDB; 6FLD; X-ray; 2.40 A; A=25-556.
DR   PDB; 6FQN; X-ray; 2.30 A; A=25-556.
DR   PDB; 6G17; X-ray; 2.20 A; A=22-558.
DR   PDB; 6G1U; X-ray; 1.79 A; A/B=22-586.
DR   PDB; 6G1V; X-ray; 1.82 A; A/B=22-586.
DR   PDB; 6G1W; X-ray; 1.90 A; A/B=22-586.
DR   PDB; 6G4M; X-ray; 2.63 A; A/B=22-558.
DR   PDB; 6G4N; X-ray; 2.90 A; A/B=22-558.
DR   PDB; 6G4O; X-ray; 2.78 A; A/B=22-558.
DR   PDB; 6G4P; X-ray; 2.83 A; A/B=22-558.
DR   PDB; 6H12; X-ray; 2.20 A; A/B=22-586.
DR   PDB; 6H13; X-ray; 2.80 A; A/B=22-586.
DR   PDB; 6H14; X-ray; 1.86 A; A/B=22-586.
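With this many structures, it can be handy to turn the lines into records and pick, for example, the best-resolved one. The sketch below uses a hypothetical helper, `parse_pdb_line`, and assumes the field order shown in the output above; entries without a resolution (printed as '-') get None.

```python
def parse_pdb_line(line):
    """Turn a 'DR   PDB;' line into a dict (hypothetical helper)."""
    fields = [f.strip() for f in line[len('DR   '):].rstrip('.').split(';')]
    _, pdb_id, method, resolution, chains = fields[:5]
    # '2.50 A' -> 2.5 ; '-' (no resolution reported) -> None
    value = None if resolution == '-' else float(resolution.split()[0])
    return {'pdb_id': pdb_id, 'method': method,
            'resolution': value, 'chains': chains}

# Three lines copied from the output above
pdb_lines = [
    'DR   PDB; 1EVE; X-ray; 2.50 A; A=22-564.',
    'DR   PDB; 1EA5; X-ray; 1.80 A; A=22-558.',
    'DR   PDB; 1JGA; Model; -; A=1-586.',
]
structures = [parse_pdb_line(l) for l in pdb_lines]

# Best (lowest) resolution among the entries that report one
resolved = [s for s in structures if s['resolution'] is not None]
best = min(resolved, key=lambda s: s['resolution'])
print(best['pdb_id'], best['resolution'])
```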

The main reason I created this function is to store the data for a list of proteins (PDB entries or UniProt entries) in a single table. For example:

In [6]:
import pandas as pd #To create our table
In [7]:
prots=['P40926','O43175','Q9UM73']
In [8]:
table=pd.DataFrame()
for index,entry in enumerate(prots):
    pdbs=[]
    functions=[]
    process=[]
    organism=[]
    data=get_uniprot(query=entry,query_type='ACC')

    table.loc[index,'Uniprot_entry']=entry

    for line in data:
        #Organism (OS lines)
        if 'OS   ' in line:
            line=line.strip().replace('OS   ','').replace('.','')
            organism.append(line)
            table.loc[index,'Organism']=(", ".join(list(set(organism))))

        #PDB cross-references, stored as ID:resolution
        if 'DR   PDB;' in line:
            line=line.strip().replace('DR   ','').replace(';','')
            pdbs.append ((line.split()[1]+':'+line.split()[3]))
            table.loc[index,'PDB:Resol']=(", ".join(list(set(pdbs))))

        #GO annotations: F terms go to GO_function,
        #every other term (C and P) goes to GO_process
        if 'DR   GO; GO:' in line:
            line=line.strip().replace('DR   GO; GO:','').replace(';','').split(':')
            if 'F' in line[0]:
                functions.append(line[1])
                table.loc[index,'GO_function']=(", ".join(list(set(functions))))
            else:
                process.append (line[1])
                table.loc[index,'GO_process']=(", ".join(list(set(process))))
In [9]:
table
Out[9]:
   Uniprot_entry  Organism              PDB:Resol                                          GO_process                                         GO_function
0  P40926         Homo sapiens (Human)  2DFD:1.90, 4WLV:2.40, 4WLF:2.20, 4WLU:2.14, 4W...  tricarboxylic acid cycle IBA, aerobic respirat...  L-malate dehydrogenase activity IDA, malate de...
1  O43175         Homo sapiens (Human)  6RJ5:1.89, 6RIH:2.15, 6RJ3:1.42, 6RJ2:2.00, 5N...  glycine metabolic process IEA, L-serine biosyn...  phosphoglycerate dehydrogenase activity IBA, e...
2  Q9UM73         Homo sapiens (Human)  3L9P:1.80, 4CMT:1.73, 4FNY:2.45, 5AAB:2.20, 2X...  activation of MAPK activity TAS, adult behavio...  ATP binding IEA, identical protein binding IPI...
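The loop above fills the table cell by cell. A possible alternative, shown here only as a sketch, is to collect one dictionary per protein and build the table at the end. `summarise_entry` is a hypothetical helper, and the sample lines are a hand-made mix taken from different outputs in this post for illustration, not a real record. Unlike the loop above, this version matches line prefixes exactly, keeps the three GO aspects apart, and drops the evidence code.

```python
def summarise_entry(accession, lines):
    """Collect organism, PDB and GO data of one record into a dict (hypothetical helper)."""
    organisms, pdbs = [], []
    go = {'F': [], 'P': [], 'C': []}
    for line in lines:
        if line.startswith('OS   '):
            organisms.append(line[5:].strip().rstrip('.'))
        elif line.startswith('DR   PDB;'):
            fields = [f.strip() for f in line.rstrip('.').split(';')]
            # fields[1] is the PDB ID, fields[3] the resolution ('2.50 A' or '-')
            pdbs.append(fields[1] + ':' + fields[3].split()[0])
        elif line.startswith('DR   GO;'):
            fields = [f.strip() for f in line.rstrip('.').split(';')]
            aspect, term = fields[2].split(':', 1)
            go.setdefault(aspect, []).append(term)
    return {
        'Uniprot_entry': accession,
        'Organism': ', '.join(organisms),
        'PDB:Resol': ', '.join(pdbs),
        'GO_function': ', '.join(go['F']),
        'GO_process': ', '.join(go['P']),
        'GO_component': ', '.join(go['C']),
    }

# Illustrative input only: lines mixed from different records shown in this post
sample_lines = [
    'OS   Homo sapiens (Human).',
    'DR   PDB; 1EVE; X-ray; 2.50 A; A=22-564.',
    'DR   GO; GO:0003990; F:acetylcholinesterase activity; IEA:UniProtKB-EC.',
    'DR   GO; GO:0005886; C:plasma membrane; IEA:UniProtKB-SubCell.',
]
row = summarise_entry('EXAMPLE', sample_lines)
print(row)
```

A list of such dictionaries, one per entry in prots, can then be passed to pd.DataFrame to obtain the table in a single step.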

Saving the table

In [10]:
table.to_csv('Uniprot_search.csv')
