{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrieve Uniprot data using python \n", "\n", "In this Mini-tool I will show you to retrieve data from Uniprot using a PDB ID or an Uniprot ID directly on python. I wrote this function based on the Uniprot [programmatic access for python](https://www.uniprot.org/help/api_idmapping). Despite in essence is the same idea, I did minor modifications to retrieve the data as a plain text from which would be easier to extract useful data.\n", "\n", "Let's put the hands on it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The python funtion" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import urllib\n", "from bs4 import BeautifulSoup\n", "\n", "def get_uniprot (query='',query_type='PDB_ID'):\n", " #query_type must be: \"PDB_ID\" or \"ACC\"\n", " url = 'https://www.uniprot.org/uploadlists/' #This is the webser to retrieve the Uniprot data\n", " params = {\n", " 'from':query_type,\n", " 'to':'ACC',\n", " 'format':'txt',\n", " 'query':query\n", " }\n", "\n", " data = urllib.parse.urlencode(params)\n", " data = data.encode('ascii')\n", " request = urllib.request.Request(url, data)\n", " with urllib.request.urlopen(request) as response:\n", " res = response.read()\n", " page=BeautifulSoup(res).get_text()\n", " page=page.splitlines()\n", " return page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This very simple function will allow us to save Uniprot data for a PDB entry or Uniprot entry into a list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PDB entry" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ID ACES_TETCF Reviewed; 586 AA.',\n", " 'AC P04058;',\n", " 'DT 01-NOV-1986, integrated into UniProtKB/Swiss-Prot.',\n", " 'DT 01-JUN-1994, sequence version 2.',\n", " 'DT 16-OCT-2019, entry version 170.',\n", " 'DE RecName: Full=Acetylcholinesterase;',\n", " 'DE Short=AChE;',\n", " 'DE EC=3.1.1.7;',\n", " 'DE Flags: Precursor;',\n", " 'GN Name=ache;']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x=get_uniprot(query='1eve',query_type='PDB_ID')\n", "x[:10] #Just get the first 10 lines of data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uniprot entry" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ID ACES_TETCF Reviewed; 586 AA.',\n", " 'AC P04058;',\n", " 'DT 01-NOV-1986, integrated into UniProtKB/Swiss-Prot.',\n", " 'DT 01-JUN-1994, sequence version 2.',\n", " 'DT 16-OCT-2019, entry version 170.',\n", " 'DE RecName: Full=Acetylcholinesterase;',\n", " 'DE Short=AChE;',\n", " 'DE EC=3.1.1.7;',\n", " 'DE Flags: Precursor;',\n", " 'GN Name=ache;']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y=get_uniprot(query='P04058',query_type='ACC')\n", "y[:10] #Just get the first 10 lines of data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Another examples\n", "\n", "Because all the data info is available through a list, it is very easy to find the info that we are interested in, the annotated Gene Ontology for instance." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DR GO; GO:0031225; C:anchored component of membrane; IEA:UniProtKB-KW.\n", "DR GO; GO:0030054; C:cell junction; IEA:UniProtKB-KW.\n", "DR GO; GO:0005886; C:plasma membrane; IEA:UniProtKB-SubCell.\n", "DR GO; GO:0043083; C:synaptic cleft; IEA:GOC.\n", "DR GO; GO:0003990; F:acetylcholinesterase activity; IEA:UniProtKB-EC.\n", "DR GO; GO:0001507; P:acetylcholine catabolic process in synaptic cleft; IEA:InterPro.\n" ] } ], "source": [ "for line in y:\n", " if 'DR GO;' in line:\n", " print (line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or all reported PDB's for desired protein including experimental methodology, resolution, and length." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DR PDB; 1ACJ; X-ray; 2.80 A; A=22-556.\n", "DR PDB; 1ACL; X-ray; 2.80 A; A=22-556.\n", "DR PDB; 1AMN; X-ray; 2.80 A; A=22-558.\n", "DR PDB; 1AX9; X-ray; 2.80 A; A=22-558.\n", "DR PDB; 1CFJ; X-ray; 2.60 A; A=22-558.\n", "DR PDB; 1DX6; X-ray; 2.30 A; A=22-564.\n", "DR PDB; 1E3Q; X-ray; 2.85 A; A=22-564.\n", "DR PDB; 1E66; X-ray; 2.10 A; A=22-564.\n", "DR PDB; 1EA5; X-ray; 1.80 A; A=22-558.\n", "DR PDB; 1EEA; X-ray; 4.50 A; A=22-555.\n", "DR PDB; 1EVE; X-ray; 2.50 A; A=22-564.\n", "DR PDB; 1FSS; X-ray; 3.00 A; A=22-558.\n", "DR PDB; 1GPK; X-ray; 2.10 A; A=22-558.\n", "DR PDB; 1GPN; X-ray; 2.35 A; A=22-558.\n", "DR PDB; 1GQR; X-ray; 2.20 A; A=25-556.\n", "DR PDB; 1GQS; X-ray; 3.00 A; A=25-556.\n", "DR PDB; 1H22; X-ray; 2.15 A; A=22-564.\n", "DR PDB; 1H23; X-ray; 2.15 A; A=22-564.\n", "DR PDB; 1HBJ; X-ray; 2.50 A; A=22-564.\n", "DR PDB; 1JGA; Model; -; A=1-586.\n", "DR PDB; 1JGB; Model; -; A=1-586.\n", "DR PDB; 1JJB; X-ray; 2.30 A; A=25-556.\n", "DR PDB; 1OCE; X-ray; 2.70 A; A=22-558.\n", "DR PDB; 1ODC; X-ray; 2.20 A; A=22-564.\n", "DR PDB; 1QID; X-ray; 2.05 A; A=22-558.\n", "DR PDB; 1QIE; X-ray; 2.10 A; A=22-558.\n", "DR PDB; 1QIF; X-ray; 2.10 A; A=22-558.\n", "DR PDB; 1QIG; X-ray; 2.30 A; A=22-558.\n", "DR PDB; 1QIH; X-ray; 2.50 A; A=22-558.\n", "DR PDB; 1QII; X-ray; 2.65 A; A=22-558.\n", "DR PDB; 1QIJ; X-ray; 2.80 A; A=22-558.\n", "DR PDB; 1QIK; X-ray; 2.90 A; A=22-558.\n", "DR PDB; 1QIM; X-ray; 3.00 A; A=22-558.\n", "DR PDB; 1QTI; X-ray; 2.50 A; A=22-558.\n", "DR PDB; 1SOM; X-ray; 2.20 A; A=22-564.\n", "DR PDB; 1U65; X-ray; 2.61 A; A=22-564.\n", "DR PDB; 1UT6; X-ray; 2.40 A; A=22-556.\n", "DR PDB; 1VOT; X-ray; 2.50 A; A=22-558.\n", "DR PDB; 1VXO; X-ray; 2.40 A; A=22-558.\n", "DR PDB; 1VXR; X-ray; 2.20 A; A=22-558.\n", "DR PDB; 1W4L; X-ray; 2.16 A; A=22-564.\n", "DR PDB; 1W6R; X-ray; 2.05 A; A=22-564.\n", "DR PDB; 1W75; X-ray; 2.40 A; A/B=22-564.\n", "DR PDB; 1W76; X-ray; 2.30 A; A/B=22-564.\n", "DR PDB; 1ZGB; X-ray; 2.30 A; A=22-564.\n", "DR PDB; 1ZGC; X-ray; 2.10 A; A/B=22-564.\n", "DR PDB; 2ACE; X-ray; 2.50 A; A=22-558.\n", "DR PDB; 2ACK; X-ray; 2.40 A; A=22-558.\n", "DR PDB; 2BAG; X-ray; 2.40 A; A=22-564.\n", "DR PDB; 2C4H; X-ray; 2.15 A; A=22-558.\n", "DR PDB; 2C58; X-ray; 2.30 A; A=22-558.\n", "DR PDB; 2C5F; X-ray; 2.60 A; A=22-558.\n", "DR PDB; 2C5G; X-ray; 1.95 A; A=22-558.\n", "DR PDB; 2CEK; X-ray; 2.20 A; A=22-556.\n", "DR PDB; 2CKM; X-ray; 2.15 A; A=22-564.\n", "DR PDB; 2CMF; X-ray; 2.50 A; A=22-564.\n", "DR PDB; 2DFP; X-ray; 2.30 A; A=23-556.\n", "DR PDB; 2J3D; X-ray; 2.60 A; A=22-564.\n", "DR PDB; 2J3Q; X-ray; 2.80 A; A=22-564.\n", "DR PDB; 2J4F; X-ray; 2.80 A; A=22-564.\n", "DR PDB; 2V96; X-ray; 2.40 A; A/B=22-558.\n", "DR PDB; 2V97; X-ray; 2.40 A; A/B=22-558.\n", "DR PDB; 2V98; X-ray; 3.00 A; A/B=22-558.\n", "DR PDB; 2VA9; X-ray; 2.40 A; A/B=22-558.\n", "DR PDB; 2VJA; X-ray; 2.30 A; A/B=22-558.\n", "DR PDB; 2VJB; X-ray; 2.39 A; A/B=22-558.\n", "DR PDB; 2VJC; X-ray; 2.20 A; A/B=22-558.\n", "DR PDB; 2VJD; X-ray; 2.30 A; A/B=22-558.\n", "DR PDB; 2VQ6; X-ray; 2.71 A; A=22-564.\n", "DR PDB; 2VT6; X-ray; 2.40 A; A/B=22-558.\n", "DR PDB; 2VT7; X-ray; 2.20 A; A/B=22-558.\n", "DR PDB; 2W6C; X-ray; 2.69 A; X=1-586.\n", "DR PDB; 2WFZ; X-ray; 1.95 A; A=22-558.\n", "DR PDB; 2WG0; X-ray; 2.20 A; A=22-558.\n", "DR PDB; 2WG1; X-ray; 2.20 A; A=22-558.\n", "DR PDB; 2WG2; X-ray; 1.95 A; A=22-558.\n", "DR PDB; 2XI4; X-ray; 2.30 A; A/B=22-558.\n", "DR PDB; 3ACE; Model; -; A=22-558.\n", "DR PDB; 3GEL; X-ray; 2.39 A; A=25-556.\n", "DR PDB; 3I6M; X-ray; 2.26 A; A=23-556.\n", "DR PDB; 3I6Z; X-ray; 2.19 A; A=23-556.\n", "DR PDB; 3M3D; X-ray; 2.34 A; A=22-564.\n", "DR PDB; 3ZV7; X-ray; 2.26 A; A=22-564.\n", "DR PDB; 4ACE; Model; -; A=22-558.\n", "DR PDB; 4TVK; X-ray; 2.30 A; A=23-556.\n", "DR PDB; 4W63; X-ray; 2.80 A; A=23-556.\n", "DR PDB; 4X3C; X-ray; 2.60 A; A=23-556.\n", "DR PDB; 5BWB; X-ray; 2.57 A; A=22-558.\n", "DR PDB; 5BWC; X-ray; 2.45 A; A=22-558.\n", "DR PDB; 5DLP; X-ray; 2.70 A; A=22-564.\n", "DR PDB; 5E2I; X-ray; 2.65 A; A=25-556.\n", "DR PDB; 5E4J; X-ray; 2.54 A; A=25-556.\n", "DR PDB; 5E4T; X-ray; 2.43 A; A=22-564.\n", "DR PDB; 5EHX; X-ray; 2.10 A; A=25-556.\n", "DR PDB; 5EI5; X-ray; 2.10 A; A=23-556.\n", "DR PDB; 5IH7; X-ray; 2.40 A; A=23-556.\n", "DR PDB; 5NAP; X-ray; 2.17 A; A=22-564.\n", "DR PDB; 5NAU; X-ray; 2.25 A; A=22-564.\n", "DR PDB; 5NUU; X-ray; 2.50 A; A=22-564.\n", "DR PDB; 6EUC; X-ray; 2.22 A; A=25-556.\n", "DR PDB; 6EUE; X-ray; 2.00 A; A=24-556.\n", "DR PDB; 6EWK; X-ray; 2.22 A; A=25-556.\n", "DR PDB; 6EZG; X-ray; 2.20 A; A/B=22-558.\n", "DR PDB; 6EZH; X-ray; 2.60 A; A/B=22-558.\n", "DR PDB; 6FLD; X-ray; 2.40 A; A=25-556.\n", "DR PDB; 6FQN; X-ray; 2.30 A; A=25-556.\n", "DR PDB; 6G17; X-ray; 2.20 A; A=22-558.\n", "DR PDB; 6G1U; X-ray; 1.79 A; A/B=22-586.\n", "DR PDB; 6G1V; X-ray; 1.82 A; A/B=22-586.\n", "DR PDB; 6G1W; X-ray; 1.90 A; A/B=22-586.\n", "DR PDB; 6G4M; X-ray; 2.63 A; A/B=22-558.\n", "DR PDB; 6G4N; X-ray; 2.90 A; A/B=22-558.\n", "DR PDB; 6G4O; X-ray; 2.78 A; A/B=22-558.\n", "DR PDB; 6G4P; X-ray; 2.83 A; A/B=22-558.\n", "DR PDB; 6H12; X-ray; 2.20 A; A/B=22-586.\n", "DR PDB; 6H13; X-ray; 2.80 A; A/B=22-586.\n", "DR PDB; 6H14; X-ray; 1.86 A; A/B=22-586.\n" ] } ], "source": [ "for line in y:\n", " if 'DR PDB;' in line:\n", " print (line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main use for which I created this function is to store data of a list of proteins (PDB entries or Uniprot entries) into a single table. For example:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd #To create our table" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "prots=['P40926','O43175','Q9UM73']" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "table=pd.DataFrame()\n", "for index,entry in enumerate(prots):\n", " pdbs=[]\n", " funtions=[]\n", " process=[]\n", " organism=[]\n", " data=get_uniprot(query=entry,query_type='ACC')\n", " \n", " table.loc[index,'Uniprot_entry']=entry\n", " \n", " for line in data:\n", " if 'OS ' in line:\n", " line=line.strip().replace('OS ','').replace('.','')\n", " organism.append(line)\n", " table.loc[index,'Organism']=(\", \".join(list(set(organism))))\n", "\n", " if 'DR PDB;' in line:\n", " line=line.strip().replace('DR ','').replace(';','')\n", " pdbs.append ((line.split()[1]+':'+line.split()[3]))\n", " table.loc[index,'PDB:Resol']=(\", \".join(list(set(pdbs))))\n", "\n", " if 'DR GO; GO:' in line:\n", " line=line.strip().replace('DR GO; GO:','').replace(';','').split(':')\n", " if 'F' in line[0]:\n", " funtions.append(line[1])\n", " table.loc[index,'GO_funtion']=(\", \".join(list(set(funtions))))\n", " else:\n", " process.append (line[1])\n", " table.loc[index,'GO_process']=(\", \".join(list(set(process))))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Uniprot_entryOrganismPDB:ResolGO_processGO_funtion
0P40926Homo sapiens (Human)2DFD:1.90, 4WLV:2.40, 4WLF:2.20, 4WLU:2.14, 4W...tricarboxylic acid cycle IBA, aerobic respirat...L-malate dehydrogenase activity IDA, malate de...
1O43175Homo sapiens (Human)6RJ5:1.89, 6RIH:2.15, 6RJ3:1.42, 6RJ2:2.00, 5N...glycine metabolic process IEA, L-serine biosyn...phosphoglycerate dehydrogenase activity IBA, e...
2Q9UM73Homo sapiens (Human)3L9P:1.80, 4CMT:1.73, 4FNY:2.45, 5AAB:2.20, 2X...activation of MAPK activity TAS, adult behavio...ATP binding IEA, identical protein binding IPI...
\n", "
" ], "text/plain": [ " Uniprot_entry Organism \\\n", "0 P40926 Homo sapiens (Human) \n", "1 O43175 Homo sapiens (Human) \n", "2 Q9UM73 Homo sapiens (Human) \n", "\n", " PDB:Resol \\\n", "0 2DFD:1.90, 4WLV:2.40, 4WLF:2.20, 4WLU:2.14, 4W... \n", "1 6RJ5:1.89, 6RIH:2.15, 6RJ3:1.42, 6RJ2:2.00, 5N... \n", "2 3L9P:1.80, 4CMT:1.73, 4FNY:2.45, 5AAB:2.20, 2X... \n", "\n", " GO_process \\\n", "0 tricarboxylic acid cycle IBA, aerobic respirat... \n", "1 glycine metabolic process IEA, L-serine biosyn... \n", "2 activation of MAPK activity TAS, adult behavio... \n", "\n", " GO_funtion \n", "0 L-malate dehydrogenase activity IDA, malate de... \n", "1 phosphoglycerate dehydrogenase activity IBA, e... \n", "2 ATP binding IEA, identical protein binding IPI... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving the table" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "table.to_csv('Uniprot_search.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ".. disqus::" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }