No, that doesn't work :-(
Here's some BLAST output with the database formatted with local (lcl) ids:
=====================================================================
Database: uniprot_sprot.fasta
466,739 sequences; 165,389,953 total letters
Searching..................................................done
                                                                      Score    E
Sequences producing significant alignments:                           (bits) Value
sp|Q4U9M9|104K_THEAN Unknown                                            421   e-117
sp|P15711|104K_THEPA Unknown                                            265   6e-70
sp|Q2SPQ2|CHED_HAHCH Unknown                                             33   4.2
Score = 421 bits (1083), Expect = e-117, Method: Compositional matrix adjust.
Identities = 0/209 (0%), Positives = 0/209 (0%)
Query: 1 VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60
Query: 61 QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120
Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180
Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209
===========================================================================
If I tweak the fasta and change the ids from lcl to gi and re-formatdb, all works correctly:
===========================================================================
Query= test
(612 letters)
Database: uniprot_sprot.fasta
466,739 sequences; 165,389,953 total letters
Searching..................................................done
                                                                      Score    E
Sequences producing significant alignments:                           (bits) Value
sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theile...     421   e-117
sp|P15711|104K_THEPA 104 kDa microneme/rhoptry antigen OS=Theile...     265   6e-70
sp|Q2SPQ2|CHED_HAHCH Probable chemoreceptor glutamine deamidase ...      33   4.2
sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425
PE=3 SV=1
Length = 893
Score = 421 bits (1083), Expect = e-117, Method: Compositional matrix adjust.
Identities = 201/209 (96%), Positives = 201/209 (96%)
Query: 1   VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60
           VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED
Sbjct: 72  VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 131
Query: 61  QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120
           QYLA        IHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD
Sbjct: 132 QYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 191
Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180
           KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY
Sbjct: 192 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 251
Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209
           VATIPKLKDFAEPYHPIILDISDIDYVNF
Sbjct: 252 VATIPKLKDFAEPYHPIILDISDIDYVNF 280
============================================================================
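For anyone wanting to reproduce the tweak, it amounts to swapping the defline prefix before running formatdb again. A minimal Python sketch, assuming the deflines start with '>lcl|<integer>' as in Bill's example; the function name lcl_to_gi is only illustrative, and the layout of the rest of the defline is an assumption:

```python
import re

# Matches a defline that starts with ">lcl|<digits>" (assumed layout).
_LCL_DEFLINE = re.compile(r"^>lcl\|(\d+)")


def lcl_to_gi(lines):
    """Yield FASTA lines with a leading 'lcl|N' id rewritten as 'gi|N'.

    Sequence lines and deflines that do not match pass through unchanged.
    """
    for line in lines:
        yield _LCL_DEFLINE.sub(r">gi|\1", line)
```

Only the id prefix changes; the integer and the rest of the defline are kept as they were.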
To my mind, this is a bug in formatdb but NCBI don't see it that way.
--Russell
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of bill at genenformics.com
Sent: Tuesday, 19 May 2009 12:20 p.m.
To: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Uniprot/Swiss accessions?
Hi, Smithies,
Using an integer local id should work as well.
A defline will look like '>lcl|12345 ...'
Bill
Post by Smithies, Russell
Hi guys,
Thanx for your suggestions.
With the magic of awk and comm, I split the amalgamated accessions and
created lists of swissprot IDs for both the file from NCBI and the file
from Uniprot.
sp_ncbi_accessions.txt 458,377 ids
sp_uniprot_accessions.txt 466,739 ids
* The NCBI file has 95 ids that don't appear in the Uniprot list
* The Uniprot file has 8,457 ids that don't appear in the NCBI list
* There are 458,282 ids that appear on both lists.
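The same three-way comparison can be sketched in Python with set operations instead of awk and comm. This is only an illustration, assuming one accession per line in each file; the function names are made up for the example:

```python
def read_ids(path):
    """Read whitespace-separated accessions from a text file."""
    with open(path) as handle:
        return handle.read().split()


def compare_id_lists(ncbi_ids, uniprot_ids):
    """Return (only_ncbi, only_uniprot, shared) as sets of accessions."""
    # Strip whitespace and drop empty entries before comparing.
    ncbi = {i.strip() for i in ncbi_ids if i.strip()}
    uniprot = {i.strip() for i in uniprot_ids if i.strip()}
    return ncbi - uniprot, uniprot - ncbi, ncbi & uniprot
```

Calling compare_id_lists(read_ids("sp_ncbi_accessions.txt"), read_ids("sp_uniprot_accessions.txt")) and taking len() of each returned set should give the three counts listed above.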
I did a quick random sample of the 8,457 ids unique to Uniprot and none
could be found in the "protein" database at NCBI but all were in the
"gene" database as "reference sequences that belong to a specific genome
build" and all belonged to recently sequenced bacterial genomes. As none
are in the "protein" database, they don't have GI numbers.
The 95 ids that were at NCBI but not in Uniprot were usually (random
sample again) described as "putative protein" (or "very putative protein"
in one case) and are the result of gene predictions, e.g.
http://www.ncbi.nlm.nih.gov/protein/48429254
So what I'll do is use the NCBI database and add in the extra 8,457 ids
unique to Uniprot and assign them fake GI numbers so I can formatdb them
with the "-o T" option.
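Roughly, the fake GI step could look like the Python sketch below. It assumes the extra Uniprot records have deflines such as '>sp|ACCESSION|NAME ...' and simply prefixes each with 'gi|N|'. The function name add_fake_gis and the start value are hypothetical; the start has to be picked so it cannot collide with any real GI number in the NCBI file:

```python
def add_fake_gis(lines, start):
    """Prefix each FASTA defline with 'gi|N|', numbering from `start` upward.

    `start` is a placeholder chosen by the caller; sequence lines are
    passed through unchanged.
    """
    next_gi = start
    for line in lines:
        if line.startswith(">"):
            yield ">gi|%d|%s" % (next_gi, line[1:])
            next_gi += 1
        else:
            yield line
```

Each record gets the next integer in turn, so the numbering follows the order of the records in the input.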
Thanx again for your help,
Russell Smithies
Bioinformatics Applications Developer
T +64 3 489 9085
E russell.smithies at agresearch.co.nz
Invermay Research Centre
Puddle Alley,
Mosgiel,
New Zealand
T +64 3 489 3809
F +64 3 489 9174
www.agresearch.co.nz
Toitu te whenua, Toitu te tangata
Sustain the land, Sustain the people
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l