Retrieve UniProt data using Python
In this mini-tool I will show you how to retrieve data from UniProt using a PDB ID or a UniProt ID directly in Python. I wrote this function based on the UniProt programmatic access example for Python. Although the idea is essentially the same, I made minor modifications to retrieve the data as plain text, from which it is easier to extract useful information.
Let's get hands-on.
The Python function
[1]:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

def get_uniprot(query='', query_type='PDB_ID'):
    # query_type must be "PDB_ID" or "ACC"
    url = 'https://www.uniprot.org/uploadlists/'  # Web service used to retrieve the UniProt data
    params = {
        'from': query_type,
        'to': 'ACC',
        'format': 'txt',
        'query': query
    }
    data = urllib.parse.urlencode(params)
    data = data.encode('ascii')
    request = urllib.request.Request(url, data)
    with urllib.request.urlopen(request) as response:
        res = response.read()
    # Keep only the text of the response, one list item per line
    page = BeautifulSoup(res, 'html.parser').get_text()
    page = page.splitlines()
    return page
This very simple function lets us store the UniProt data for a PDB entry or a UniProt entry as a list of text lines.
PDB entry
[2]:
x=get_uniprot(query='1eve',query_type='PDB_ID')
x[:10] #Just get the first 10 lines of data
[2]:
['ID ACES_TETCF Reviewed; 586 AA.',
'AC P04058;',
'DT 01-NOV-1986, integrated into UniProtKB/Swiss-Prot.',
'DT 01-JUN-1994, sequence version 2.',
'DT 16-OCT-2019, entry version 170.',
'DE RecName: Full=Acetylcholinesterase;',
'DE Short=AChE;',
'DE EC=3.1.1.7;',
'DE Flags: Precursor;',
'GN Name=ache;']
Uniprot entry
[3]:
y=get_uniprot(query='P04058',query_type='ACC')
y[:10] #Just get the first 10 lines of data
[3]:
['ID ACES_TETCF Reviewed; 586 AA.',
'AC P04058;',
'DT 01-NOV-1986, integrated into UniProtKB/Swiss-Prot.',
'DT 01-JUN-1994, sequence version 2.',
'DT 16-OCT-2019, entry version 170.',
'DE RecName: Full=Acetylcholinesterase;',
'DE Short=AChE;',
'DE EC=3.1.1.7;',
'DE Flags: Precursor;',
'GN Name=ache;']
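As the output above shows, every line starts with a two-letter line code (ID, AC, DT, DE, GN, ...). A minimal sketch, assuming the list returned by get_uniprot keeps that layout, groups the lines by code so that each kind of record can be looked up directly. The helper name group_by_line_code is hypothetical and not part of the original function; the sample lines are copied from the output above.

```python
from collections import defaultdict

def group_by_line_code(lines):
    """Group flat-file lines by their two-letter line code (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        # Lines without a two-letter uppercase code (e.g. sequence lines) are skipped
        if len(parts) == 2 and len(parts[0]) == 2 and parts[0].isupper():
            code, content = parts
            groups[code].append(content.strip())
    return dict(groups)

# Sample lines taken from the output shown above
sample = [
    'ID ACES_TETCF Reviewed; 586 AA.',
    'AC P04058;',
    'DT 01-NOV-1986, integrated into UniProtKB/Swiss-Prot.',
    'DT 01-JUN-1994, sequence version 2.',
    'DE RecName: Full=Acetylcholinesterase;',
    'GN Name=ache;',
]
records = group_by_line_code(sample)
print(records['AC'])       # accession line(s)
print(len(records['DT']))  # number of date lines
```

With the real data, the same call would be group_by_line_code(y), after which records['DR'] holds all the cross-references.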
More examples
Because all the data are available as a list of lines, it is easy to find the information we are interested in, for instance the annotated Gene Ontology terms.
[4]:
for line in y:
    # Keep only the Gene Ontology cross-reference (DR) lines
    if line.startswith('DR') and 'GO;' in line:
        print(line)
DR GO; GO:0031225; C:anchored component of membrane; IEA:UniProtKB-KW.
DR GO; GO:0030054; C:cell junction; IEA:UniProtKB-KW.
DR GO; GO:0005886; C:plasma membrane; IEA:UniProtKB-SubCell.
DR GO; GO:0043083; C:synaptic cleft; IEA:GOC.
DR GO; GO:0003990; F:acetylcholinesterase activity; IEA:UniProtKB-EC.
DR GO; GO:0001507; P:acetylcholine catabolic process in synaptic cleft; IEA:InterPro.
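Each GO line above carries the GO identifier, the term prefixed with its aspect (C for cellular component, F for molecular function, P for biological process) and the evidence code. A minimal sketch, assuming the semicolon-separated layout shown in the output, sorts the terms into the three aspects. The helper name split_go_by_aspect is hypothetical; the sample lines are copied from the output above.

```python
def split_go_by_aspect(lines):
    """Sort GO cross-references into the three GO aspects (hypothetical helper)."""
    aspects = {'C': [], 'F': [], 'P': []}
    for line in lines:
        if not line.startswith('DR'):
            continue
        # Fields after the line code: 'GO', 'GO:0005886', 'C:plasma membrane', 'IEA:...'
        fields = [f.strip() for f in line[2:].split(';')]
        if len(fields) < 4 or fields[0] != 'GO':
            continue
        go_id, term, evidence = fields[1], fields[2], fields[3]
        aspect, _, name = term.partition(':')
        if aspect in aspects:
            aspects[aspect].append((go_id, name, evidence.rstrip('.')))
    return aspects

# Sample lines taken from the outputs in this post
sample = [
    'DR GO; GO:0005886; C:plasma membrane; IEA:UniProtKB-SubCell.',
    'DR GO; GO:0003990; F:acetylcholinesterase activity; IEA:UniProtKB-EC.',
    'DR GO; GO:0001507; P:acetylcholine catabolic process in synaptic cleft; IEA:InterPro.',
    'DR PDB; 1EVE; X-ray; 2.50 A; A=22-564.',
]
go = split_go_by_aspect(sample)
for aspect, terms in go.items():
    print(aspect, terms)
```

Keeping the three aspects apart makes it easy to decide later whether cellular components should be reported together with processes or in their own column.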
Or all the PDB structures reported for the protein of interest, including the experimental method, the resolution, and the chain and residue range covered.
[5]:
for line in y:
    # Keep only the PDB cross-reference (DR) lines
    if line.startswith('DR') and 'PDB;' in line:
        print(line)
DR PDB; 1ACJ; X-ray; 2.80 A; A=22-556.
DR PDB; 1ACL; X-ray; 2.80 A; A=22-556.
DR PDB; 1AMN; X-ray; 2.80 A; A=22-558.
DR PDB; 1AX9; X-ray; 2.80 A; A=22-558.
DR PDB; 1CFJ; X-ray; 2.60 A; A=22-558.
DR PDB; 1DX6; X-ray; 2.30 A; A=22-564.
DR PDB; 1E3Q; X-ray; 2.85 A; A=22-564.
DR PDB; 1E66; X-ray; 2.10 A; A=22-564.
DR PDB; 1EA5; X-ray; 1.80 A; A=22-558.
DR PDB; 1EEA; X-ray; 4.50 A; A=22-555.
DR PDB; 1EVE; X-ray; 2.50 A; A=22-564.
DR PDB; 1FSS; X-ray; 3.00 A; A=22-558.
DR PDB; 1GPK; X-ray; 2.10 A; A=22-558.
DR PDB; 1GPN; X-ray; 2.35 A; A=22-558.
DR PDB; 1GQR; X-ray; 2.20 A; A=25-556.
DR PDB; 1GQS; X-ray; 3.00 A; A=25-556.
DR PDB; 1H22; X-ray; 2.15 A; A=22-564.
DR PDB; 1H23; X-ray; 2.15 A; A=22-564.
DR PDB; 1HBJ; X-ray; 2.50 A; A=22-564.
DR PDB; 1JGA; Model; -; A=1-586.
DR PDB; 1JGB; Model; -; A=1-586.
DR PDB; 1JJB; X-ray; 2.30 A; A=25-556.
DR PDB; 1OCE; X-ray; 2.70 A; A=22-558.
DR PDB; 1ODC; X-ray; 2.20 A; A=22-564.
DR PDB; 1QID; X-ray; 2.05 A; A=22-558.
DR PDB; 1QIE; X-ray; 2.10 A; A=22-558.
DR PDB; 1QIF; X-ray; 2.10 A; A=22-558.
DR PDB; 1QIG; X-ray; 2.30 A; A=22-558.
DR PDB; 1QIH; X-ray; 2.50 A; A=22-558.
DR PDB; 1QII; X-ray; 2.65 A; A=22-558.
DR PDB; 1QIJ; X-ray; 2.80 A; A=22-558.
DR PDB; 1QIK; X-ray; 2.90 A; A=22-558.
DR PDB; 1QIM; X-ray; 3.00 A; A=22-558.
DR PDB; 1QTI; X-ray; 2.50 A; A=22-558.
DR PDB; 1SOM; X-ray; 2.20 A; A=22-564.
DR PDB; 1U65; X-ray; 2.61 A; A=22-564.
DR PDB; 1UT6; X-ray; 2.40 A; A=22-556.
DR PDB; 1VOT; X-ray; 2.50 A; A=22-558.
DR PDB; 1VXO; X-ray; 2.40 A; A=22-558.
DR PDB; 1VXR; X-ray; 2.20 A; A=22-558.
DR PDB; 1W4L; X-ray; 2.16 A; A=22-564.
DR PDB; 1W6R; X-ray; 2.05 A; A=22-564.
DR PDB; 1W75; X-ray; 2.40 A; A/B=22-564.
DR PDB; 1W76; X-ray; 2.30 A; A/B=22-564.
DR PDB; 1ZGB; X-ray; 2.30 A; A=22-564.
DR PDB; 1ZGC; X-ray; 2.10 A; A/B=22-564.
DR PDB; 2ACE; X-ray; 2.50 A; A=22-558.
DR PDB; 2ACK; X-ray; 2.40 A; A=22-558.
DR PDB; 2BAG; X-ray; 2.40 A; A=22-564.
DR PDB; 2C4H; X-ray; 2.15 A; A=22-558.
DR PDB; 2C58; X-ray; 2.30 A; A=22-558.
DR PDB; 2C5F; X-ray; 2.60 A; A=22-558.
DR PDB; 2C5G; X-ray; 1.95 A; A=22-558.
DR PDB; 2CEK; X-ray; 2.20 A; A=22-556.
DR PDB; 2CKM; X-ray; 2.15 A; A=22-564.
DR PDB; 2CMF; X-ray; 2.50 A; A=22-564.
DR PDB; 2DFP; X-ray; 2.30 A; A=23-556.
DR PDB; 2J3D; X-ray; 2.60 A; A=22-564.
DR PDB; 2J3Q; X-ray; 2.80 A; A=22-564.
DR PDB; 2J4F; X-ray; 2.80 A; A=22-564.
DR PDB; 2V96; X-ray; 2.40 A; A/B=22-558.
DR PDB; 2V97; X-ray; 2.40 A; A/B=22-558.
DR PDB; 2V98; X-ray; 3.00 A; A/B=22-558.
DR PDB; 2VA9; X-ray; 2.40 A; A/B=22-558.
DR PDB; 2VJA; X-ray; 2.30 A; A/B=22-558.
DR PDB; 2VJB; X-ray; 2.39 A; A/B=22-558.
DR PDB; 2VJC; X-ray; 2.20 A; A/B=22-558.
DR PDB; 2VJD; X-ray; 2.30 A; A/B=22-558.
DR PDB; 2VQ6; X-ray; 2.71 A; A=22-564.
DR PDB; 2VT6; X-ray; 2.40 A; A/B=22-558.
DR PDB; 2VT7; X-ray; 2.20 A; A/B=22-558.
DR PDB; 2W6C; X-ray; 2.69 A; X=1-586.
DR PDB; 2WFZ; X-ray; 1.95 A; A=22-558.
DR PDB; 2WG0; X-ray; 2.20 A; A=22-558.
DR PDB; 2WG1; X-ray; 2.20 A; A=22-558.
DR PDB; 2WG2; X-ray; 1.95 A; A=22-558.
DR PDB; 2XI4; X-ray; 2.30 A; A/B=22-558.
DR PDB; 3ACE; Model; -; A=22-558.
DR PDB; 3GEL; X-ray; 2.39 A; A=25-556.
DR PDB; 3I6M; X-ray; 2.26 A; A=23-556.
DR PDB; 3I6Z; X-ray; 2.19 A; A=23-556.
DR PDB; 3M3D; X-ray; 2.34 A; A=22-564.
DR PDB; 3ZV7; X-ray; 2.26 A; A=22-564.
DR PDB; 4ACE; Model; -; A=22-558.
DR PDB; 4TVK; X-ray; 2.30 A; A=23-556.
DR PDB; 4W63; X-ray; 2.80 A; A=23-556.
DR PDB; 4X3C; X-ray; 2.60 A; A=23-556.
DR PDB; 5BWB; X-ray; 2.57 A; A=22-558.
DR PDB; 5BWC; X-ray; 2.45 A; A=22-558.
DR PDB; 5DLP; X-ray; 2.70 A; A=22-564.
DR PDB; 5E2I; X-ray; 2.65 A; A=25-556.
DR PDB; 5E4J; X-ray; 2.54 A; A=25-556.
DR PDB; 5E4T; X-ray; 2.43 A; A=22-564.
DR PDB; 5EHX; X-ray; 2.10 A; A=25-556.
DR PDB; 5EI5; X-ray; 2.10 A; A=23-556.
DR PDB; 5IH7; X-ray; 2.40 A; A=23-556.
DR PDB; 5NAP; X-ray; 2.17 A; A=22-564.
DR PDB; 5NAU; X-ray; 2.25 A; A=22-564.
DR PDB; 5NUU; X-ray; 2.50 A; A=22-564.
DR PDB; 6EUC; X-ray; 2.22 A; A=25-556.
DR PDB; 6EUE; X-ray; 2.00 A; A=24-556.
DR PDB; 6EWK; X-ray; 2.22 A; A=25-556.
DR PDB; 6EZG; X-ray; 2.20 A; A/B=22-558.
DR PDB; 6EZH; X-ray; 2.60 A; A/B=22-558.
DR PDB; 6FLD; X-ray; 2.40 A; A=25-556.
DR PDB; 6FQN; X-ray; 2.30 A; A=25-556.
DR PDB; 6G17; X-ray; 2.20 A; A=22-558.
DR PDB; 6G1U; X-ray; 1.79 A; A/B=22-586.
DR PDB; 6G1V; X-ray; 1.82 A; A/B=22-586.
DR PDB; 6G1W; X-ray; 1.90 A; A/B=22-586.
DR PDB; 6G4M; X-ray; 2.63 A; A/B=22-558.
DR PDB; 6G4N; X-ray; 2.90 A; A/B=22-558.
DR PDB; 6G4O; X-ray; 2.78 A; A/B=22-558.
DR PDB; 6G4P; X-ray; 2.83 A; A/B=22-558.
DR PDB; 6H12; X-ray; 2.20 A; A/B=22-586.
DR PDB; 6H13; X-ray; 2.80 A; A/B=22-586.
DR PDB; 6H14; X-ray; 1.86 A; A/B=22-586.
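The PDB lines above are regular enough to turn into structured records, which helps when picking, for example, the structure with the best resolution. This is a minimal sketch, assuming the field order shown in the output (PDB ID; method; resolution; chains). The helper name parse_pdb_xrefs is hypothetical, and the sample lines are copied from the output above.

```python
def parse_pdb_xrefs(lines):
    """Turn 'DR PDB; ...' lines into dictionaries (hypothetical helper)."""
    records = []
    for line in lines:
        if not line.startswith('DR'):
            continue
        fields = [f.strip() for f in line[2:].split(';')]
        if len(fields) < 5 or fields[0] != 'PDB':
            continue
        pdb_id, method, resolution, chains = fields[1:5]
        # Resolution looks like '2.80 A'; models report '-' instead of a number
        try:
            res_value = float(resolution.split()[0])
        except (ValueError, IndexError):
            res_value = None
        records.append({
            'pdb_id': pdb_id,
            'method': method,
            'resolution': res_value,
            'chains': chains.rstrip('.'),
        })
    return records

# Sample lines taken from the output shown above
sample = [
    'DR PDB; 1EVE; X-ray; 2.50 A; A=22-564.',
    'DR PDB; 1EA5; X-ray; 1.80 A; A=22-558.',
    'DR PDB; 1JGA; Model; -; A=1-586.',
    'DR PDB; 6G1U; X-ray; 1.79 A; A/B=22-586.',
]
records = parse_pdb_xrefs(sample)
# Structure with the lowest (best) reported resolution
best = min((r for r in records if r['resolution'] is not None),
           key=lambda r: r['resolution'])
print(best['pdb_id'], best['resolution'])
```

Entries without a numeric resolution get None, so they can be filtered out explicitly rather than breaking the comparison.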
I created this function mainly to gather the data for a list of proteins (PDB entries or UniProt entries) into a single table. For example:
[6]:
import pandas as pd #To create our table
[7]:
prots=['P40926','O43175','Q9UM73']
[8]:
table = pd.DataFrame()
for index, entry in enumerate(prots):
    pdbs = []
    functions = []
    process = []
    organism = []
    data = get_uniprot(query=entry, query_type='ACC')
    table.loc[index, 'Uniprot_entry'] = entry
    for line in data:
        # Organism (OS) lines
        if line.startswith('OS '):
            organism.append(line[2:].strip().replace('.', ''))
            table.loc[index, 'Organism'] = ", ".join(set(organism))
        # PDB cross-references, stored as "PDB ID:resolution"
        if line.startswith('DR') and 'PDB;' in line:
            fields = line.replace(';', '').split()  # ['DR', 'PDB', ID, method, resolution, ...]
            pdbs.append(fields[2] + ':' + fields[4])
            table.loc[index, 'PDB:Resol'] = ", ".join(set(pdbs))
        # GO cross-references, split into the term aspect and the term name
        if line.startswith('DR') and 'GO; GO:' in line:
            fields = line.split('GO:', 1)[1].replace(';', '').split(':')
            if 'F' in fields[0]:
                # Molecular function terms
                functions.append(fields[1])
                table.loc[index, 'GO_funtion'] = ", ".join(set(functions))
            else:
                # All other terms (aspects C and P) are stored as processes
                process.append(fields[1])
                table.loc[index, 'GO_process'] = ", ".join(set(process))
[9]:
table
[9]:
| | Uniprot_entry | Organism | PDB:Resol | GO_process | GO_funtion |
|---|---|---|---|---|---|
| 0 | P40926 | Homo sapiens (Human) | 2DFD:1.90, 4WLV:2.40, 4WLF:2.20, 4WLU:2.14, 4W... | tricarboxylic acid cycle IBA, aerobic respirat... | L-malate dehydrogenase activity IDA, malate de... |
| 1 | O43175 | Homo sapiens (Human) | 6RJ5:1.89, 6RIH:2.15, 6RJ3:1.42, 6RJ2:2.00, 5N... | glycine metabolic process IEA, L-serine biosyn... | phosphoglycerate dehydrogenase activity IBA, e... |
| 2 | Q9UM73 | Homo sapiens (Human) | 3L9P:1.80, 4CMT:1.73, 4FNY:2.45, 5AAB:2.20, 2X... | activation of MAPK activity TAS, adult behavio... | ATP binding IEA, identical protein binding IPI... |
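The loop above fills the table one cell at a time. An alternative sketch, under the assumption that one dictionary per protein is enough, collects the rows first and builds the table in a single step. The names summarize_entry and build_rows are hypothetical, and the fetch argument stands in for get_uniprot so that the example runs without network access. The entry 'DEMO1' and its lines are made up for illustration and are not real UniProt data.

```python
def summarize_entry(entry, lines):
    """Condense one entry's flat-file lines into a single row (hypothetical helper)."""
    organisms, pdbs = [], []
    for line in lines:
        if line.startswith('OS '):
            organisms.append(line[2:].strip().rstrip('.'))
        elif line.startswith('DR') and 'PDB;' in line:
            fields = line.replace(';', '').split()  # ['DR', 'PDB', ID, method, resolution, ...]
            pdbs.append(fields[2] + ':' + fields[4])
    return {
        'Uniprot_entry': entry,
        'Organism': ', '.join(sorted(set(organisms))),
        'PDB:Resol': ', '.join(sorted(set(pdbs))),
    }

def build_rows(entries, fetch):
    """Call fetch(entry) for every entry and return one row per entry."""
    return [summarize_entry(entry, fetch(entry)) for entry in entries]

# Made-up stand-in for get_uniprot: NOT real UniProt data
fake_entries = {
    'DEMO1': [
        'OS Example organism.',
        'DR PDB; 1ABC; X-ray; 2.00 A; A=1-100.',
        'DR PDB; 2XYZ; NMR; -; A=1-100.',
    ],
}
rows = build_rows(['DEMO1'], fetch=lambda entry: fake_entries[entry])
print(rows)
```

With the real function, the call would look like build_rows(prots, fetch=lambda e: get_uniprot(query=e, query_type='ACC')), and pd.DataFrame(rows) then turns the list of dictionaries into a table. Sorting the joined values also keeps the cell contents in a reproducible order, which a plain set does not guarantee.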
Saving the table
[10]:
table.to_csv('Uniprot_search.csv')