[Biopython-dev] Project ideas for GSoC (or other student projects)
Peter Cock
2013-02-12 17:51:15 UTC
Hello all,

Google recently confirmed they will be running Google Summer of Code 2013,
and we (Biopython and the other Bio* projects) would hope to be accepted again
under the Open Bioinformatics Foundation as in previous years:
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html

It would be great to start coming up with potential project ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for other project
students, or 'low hanging fruit' for potential contributors to cut
their teeth on.

See also http://biopython.org/wiki/Active_projects and the ideas list there.

Regards,

Peter
Wibowo Arindrarto
2013-02-12 18:29:02 UTC
Hi everyone,

It's more or less a 'low hanging fruit', but I've been thinking
perhaps it may be useful if we have our own interface to the HMMER3
online service? The corresponding SearchIO parsers may be written for
this as well (they return different formats for which we haven't any
parsers currently).

And I think there are more things being worked on, not yet mentioned
in the wiki:

1. Porting our docs to Sphinx[1]
2. Converting some/all of the print and compare tests to unit tests.
For example, our Bio.Seq's tests are still print and compare tests.
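
As a rough illustration of item 2: a print-and-compare test prints values and diffs them against a stored expected-output file, while a unit test asserts each expectation directly. The sketch below uses a hypothetical stand-in function, `reverse_complement`, written only for this example; it is not the Bio.Seq code or its actual test suite.

```python
import unittest

# Hypothetical stand-in for the code under test (not the Bio.Seq implementation).
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence):
    """Return the reverse complement of an unambiguous DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(sequence.upper()))


class ReverseComplementTests(unittest.TestCase):
    """Each expectation is asserted directly instead of being printed."""

    def test_simple(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_lower_case_input(self):
        self.assertEqual(reverse_complement("aacg"), "CGTT")

    def test_invalid_letter(self):
        # A print-and-compare test would only show a traceback in the diff;
        # here the expected failure is stated explicitly.
        with self.assertRaises(KeyError):
            reverse_complement("ATXG")
```

Saved as a module, this can be run with `python -m unittest`, and a failure points at the exact assertion rather than at a line in a diffed output file.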

regards,
Bow

[1] See the original feature request here:
https://redmine.open-bio.org/issues/3221
https://redmine.open-bio.org/issues/3220
https://redmine.open-bio.org/issues/3219
Peter Cock
2013-03-21 17:29:44 UTC
On Tue, Feb 12, 2013 at 6:29 PM, Wibowo Arindrarto
Post by Wibowo Arindrarto
Hi everyone,
It's more or less a 'low hanging fruit', but I've been thinking
perhaps it may be useful if we have our own interface to the HMMER3
online service? The corresponding SearchIO parsers may be written for
this as well (they return different formats for which we haven't any
parsers currently).
Worth adding to the projects list here (or filing an enhancement bug)
http://biopython.org/wiki/Active_projects#Project_ideas - but not
enough to base a whole GSoC project around.
Post by Wibowo Arindrarto
And I think there are more things being worked on, not yet mentioned
1. Porting our docs to Sphinx[1]
2. Converting some/all of the print and compare tests to unit tests.
For example, our Bio.Seq's tests are still print and compare tests.
regards,
Bow
https://redmine.open-bio.org/issues/3221
https://redmine.open-bio.org/issues/3220
https://redmine.open-bio.org/issues/3219
I don't think a purely documentation focused project is eligible
for GSoC. But both ideas make sense separately from GSoC.

Regards,

Peter
Eric Talevich
2013-02-12 20:00:11 UTC
Post by Peter Cock
Hello all,
Google recently confirmed they will be running Google Summer of Code 2013,
and we (Biopython and the other Bio* projects) would hope to be accepted again
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
It would be great to start coming up with potential project ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for other project
students, or 'low hanging fruit' for potential contributors to cut
their teeth on.
One interesting GSoC project would be to implement support for phylogenetic
placements. The programs pplacer and EPA (part of RAxML) can place sequence
reads from metagenomic samples onto a reference phylogeny:
http://matsen.fhcrc.org/pplacer/
http://sysbio.oxfordjournals.org/content/60/3/291

The output format of those programs has been standardized as something I
suppose we could call the "jplace" format:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0031009
http://arxiv.org/abs/1201.3397

It's based on JSON and Newick, with a small extension to Newick that
shouldn't be too hard to support. The GSoC project would be to implement a
parser for this and implement querying as well as integration with the rest
of Bio.Phylo to some reasonable extent. I would be available to mentor this.
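
To give a feel for the parsing side, here is a minimal sketch using only the standard library. The example document is made up, and the key names (`tree`, `fields`, `placements`, `p`, `n`) and the `{N}` edge labels reflect my reading of the format description, so check them against the specification before relying on them.

```python
import json
import re

# Made-up example document for illustration; the key names follow my reading
# of the jplace description and should be checked against the specification.
EXAMPLE = """
{
  "version": 3,
  "tree": "((A:0.2{0},B:0.09{1}):0.7{2},C:0.5{3}){4};",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"p": [[1, -2578.16, 0.78, 0.004, 0.0006],
           [0, -2580.15, 0.11, 0.00001, 0.15]],
     "n": ["read1", "read2"]}
  ]
}
"""

# The Newick extension: an edge number in curly braces after each branch.
EDGE_LABEL = re.compile(r"\{(\d+)\}")


def parse_jplace(text):
    """Return (plain_newick, edge_numbers, placements) from jplace text."""
    data = json.loads(text)
    tree = data["tree"]
    edge_numbers = [int(number) for number in EDGE_LABEL.findall(tree)]
    plain_newick = EDGE_LABEL.sub("", tree)
    fields = data["fields"]
    placements = []
    for entry in data["placements"]:
        # Turn each placement row into a dict keyed by the declared fields.
        rows = [dict(zip(fields, row)) for row in entry["p"]]
        for name in entry.get("n", []):
            placements.append((name, rows))
    return plain_newick, edge_numbers, placements


newick, edges, placements = parse_jplace(EXAMPLE)
print(newick)
for name, rows in placements:
    best = max(rows, key=lambda row: row["like_weight_ratio"])
    print(name, "best edge:", best["edge_num"])
```

With the edge labels stripped, the remaining string is ordinary Newick, so it could presumably be handed to the existing Bio.Phylo Newick parser; a real implementation would need to keep the edge numbers attached to the clades instead of discarding them.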

In terms of low-hanging fruit, there are some small but important functions
that could be added to Bio.Phylo. My top three: Robinson-Foulds distance,
majority-rules consensus, draw an unrooted tree using Felsenstein's Equal
Daylight algorithm (which starts by computing the layout for a radial tree).
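
For the first of these, a minimal sketch of the Robinson-Foulds distance as the number of non-trivial bipartitions found in one tree but not the other. It assumes trees are given as nested tuples of leaf names, which is a stand-in for illustration and not the Bio.Phylo tree class.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree, treated as unrooted.

    Each split is stored as the side that does not contain a fixed reference
    leaf, so the same split found from different rootings compares equal.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits (a single leaf on one side) and the root.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf names")
    return len(splits(tree_a) ^ splits(tree_b))


tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree_1, tree_2))  # prints 2
```

The two example trees share the split separating D and E from the rest, and differ in whether A groups with B or with C, which gives a distance of 2.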

-Eric
Saket Choudhary
2013-02-12 20:45:46 UTC
Hi,

I was thinking of a synteny viewer along the lines of
GSV <http://cas-bioinfo.cas.unt.edu/gsv/homepage.php>, if
that makes sense.

Saket
Michiel de Hoon
2013-02-13 02:08:26 UTC
It would be great to have better support for microarray analysis in Biopython. Something like lumi/limma in R. Perhaps this is an option for the GSoC?

Best,
-Michiel.
Saket Choudhary
2013-03-05 17:26:57 UTC
I had an idea for an online Biopython shell along the lines of the BioRuby shell:
http://bioruby.open-bio.org/wiki/BioRubyOnRails
Peter Cock
2013-03-08 16:08:46 UTC
Post by Saket Choudhary
http://bioruby.open-bio.org/wiki/BioRubyOnRails
That screenshot makes me think of http://ipython.org/ - is that similar?

Peter
Saket Choudhary
2013-03-08 18:30:03 UTC
It is essentially an online Ruby on Rails application that lets you
try BioRuby in your browser without needing a native BioRuby
install. I was thinking of a Django/Flask application that would
essentially be a playground for trying out Biopython.


Saket
Brad Chapman
2013-03-09 16:06:34 UTC
Saket and Peter;
What you're describing is what IPython provides, a web-based way to edit
and interact with Python code. There are some projects that build on top
of it to provide more of a playground environment like you're describing:

http://continuum.io/wakari.html
https://github.com/Exhibitionist/Exhibitionist

Hope this helps,
Brad
Eric Talevich
2013-03-13 18:32:25 UTC
Post by Michiel de Hoon
It would be great to have better support for microarray analysis in
Biopython. Something like lumi/limma in R. Perhaps this is an option for
the GSoC?
Best,
-Michiel.
I like Michiel's idea, and I'll suggest two more:

1. Codon alignment & analysis:
- PAL2NAL-style conversion of unaligned nucleic acid sequences and a
protein sequence alignment to a codon alignment. (Previously discussed)
- dN/dS and the related functions needed to calculate it.
- Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of
codon alignments, including validation (testing for frame shifts etc.)

2. Phylo enhancements:
2a. Tree drawing:
- A proper draw_unrooted function to perform radial layout, with an
optional "iterations" argument to use Felsenstein's Equal Daylight
algorithm -- I feel this layout approach is neglected in most libraries.
- Better matplotlib/pylab integration, so the plot components can be
tweaked using matplotlib functions.
- Other common layout approaches, e.g. circular.
2b. A "Phylo.consensus" module:
- strict consensus, like Bio.Nexus already implements.
- other consensus methods, time permitting.
2c. A "Phylo.distance" module:
- Robinson-Foulds distance -- though others might be working on this
already.
2d. Simple tree inference:
- Straightforward algorithms exist for neighbor-joining and parsimony tree
estimation. For small alignments (and perhaps medium-sized ones with PyPy),
it would be nice to run these without an external program, e.g. to
construct a guide tree for another algorithm or quickly view a phylogenetic
clustering of sequences.
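
To make the first bullet of idea 1 concrete, here is a minimal sketch of the PAL2NAL-style step: threading an unaligned coding sequence onto its gapped protein row, three bases per residue. The function name is hypothetical, and the sketch deliberately skips what a real implementation would need, such as checking that each codon translates to the aligned residue, stop codons, and proper frame-shift detection.

```python
def protein_to_codon_alignment(aligned_protein, nucleotides, gap="-"):
    """Thread an unaligned coding sequence onto a gapped protein row.

    Every residue consumes the next three bases; every protein gap becomes
    a three-column gap. The only validation is a length check.
    """
    residues = len(aligned_protein) - aligned_protein.count(gap)
    if len(nucleotides) != 3 * residues:
        raise ValueError("nucleotide length must be 3x the ungapped protein length")
    codons = []
    position = 0
    for letter in aligned_protein:
        if letter == gap:
            codons.append(gap * 3)
        else:
            codons.append(nucleotides[position:position + 3])
            position += 3
    return "".join(codons)


# Two rows of a toy protein alignment with their unaligned coding sequences.
protein_rows = ["MK-L", "M-AL"]
coding_sequences = ["ATGAAACTG", "ATGGCTCTG"]
codon_rows = [
    protein_to_codon_alignment(protein, dna)
    for protein, dna in zip(protein_rows, coding_sequences)
]
for row in codon_rows:
    print(row)
```

The resulting rows have equal length and keep codons in register, which is the property that dN/dS calculations rely on.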

Any interest in either of these? Shall I add them to the wiki?

-Eric
Bartek Wilczynski
2013-03-15 23:06:57 UTC
Hi All,
I would add one more (old) idea for a GSoC pool, i.e. adding support
for different biological ontologies to biopython.

This was already discussed some time ago
(http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no)
mostly in the context of gene ontology, and to some extent this is
addressed by the development of GOAtools
(https://github.com/tanghaibao/goatools), but I think it would be
worth having decent support in Biopython for OBO-file-based ontologies
(not only the Gene Ontology: I'm also interested in anatomical
ontologies myself, and others are available at obofoundry.org).

I think it would need to include support for IO operations on both OBO
and annotation files, as well as statistical enrichment measures and
potentially some visualisation.

Would anyone be interested in co-mentoring this project? There is one
student in my department who would be interested in applying to GSoC
for this project, but I think it would be great if other people joined
the discussion on the functionality and having more people involved is
always better...

best
Bartek Wilczynski
Peter Cock
2013-03-21 16:11:44 UTC
On Fri, Mar 15, 2013 at 11:06 PM, Bartek Wilczynski
Post by Bartek Wilczynski
Hi All,
I would add one more (old) idea for a GSoC pool, i.e. adding support
for different biological ontologies to biopython.
This was already discussed some time ago
(http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no)
mostly in the context of gene ontology, and to some extent this is
addressed by the development of GOAtools
(https://github.com/tanghaibao/goatools), but I think it would be
worth having decent support in Biopython for OBO-file-based ontologies
(not only the Gene Ontology: I'm also interested in anatomical
ontologies myself, and others are available at obofoundry.org).
I think it would need to include support for IO operations on both OBO
and annotation files, as well as statistical enrichment measures and
potentially some visualisation.
Would anyone be interested in co-mentoring this project? There is one
student in my department who would be interested in applying to GSoC
for this project, but I think it would be great if other people joined
the discussion on the functionality and having more people involved is
always better...
best
Bartek Wilczynski
That's a good idea - I would have used this recently with some GO
stuff (e.g. given a GO term, is it a molecular function, biological
process, or cellular component - you can solve this easily by traversing
up any branch of the DAG).
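
A minimal sketch of that traversal, assuming the ontology has already been loaded into a child-to-parents mapping and that only is_a edges are followed. The three root identifiers are the GO namespace roots; the `TOY:` terms and the mapping are invented for illustration.

```python
# The three roots of the Gene Ontology and the namespace each one defines.
ROOT_NAMESPACES = {
    "GO:0008150": "biological_process",
    "GO:0003674": "molecular_function",
    "GO:0005575": "cellular_component",
}

# Toy is_a edges (child -> parents); the TOY identifiers are invented.
PARENTS = {
    "TOY:0000001": ["GO:0003674"],
    "TOY:0000002": ["TOY:0000001"],
    "TOY:0000003": ["TOY:0000002", "TOY:0000001"],
    "TOY:0000004": ["GO:0005575"],
}


def namespace_of(term, parents=PARENTS):
    """Walk up one branch of the DAG until a root term is reached.

    Any branch will do, so the first listed parent is always taken.
    """
    seen = set()
    while term not in ROOT_NAMESPACES:
        if term in seen:
            raise ValueError("cycle detected at %s" % term)
        seen.add(term)
        try:
            term = parents[term][0]
        except (KeyError, IndexError):
            raise ValueError("no path to a root from %s" % term)
    return ROOT_NAMESPACES[term]


print(namespace_of("TOY:0000003"))  # prints molecular_function
```

OBO files also carry a namespace tag on each term, so a parser could answer this directly; the traversal is mainly useful as a check, or when only the graph is available.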

Right now we need to put this list of ideas on the wiki page (ready
for combining into the OBF page which will be shown to Google
to make our case for taking part in the GSoC 2013 program).
http://biopython.org/wiki/Google_Summer_of_Code

If any of you as a potential mentor want to put up an outline
proposal, even better.

Peter
Peter Cock
2013-03-21 16:29:29 UTC
Post by Peter Cock
Right now we need to put this list of ideas on the wiki page (ready
for combining into the OBF page which will be shown to Google
to make our case for taking part in the GSoC 2013 program).
http://biopython.org/wiki/Google_Summer_of_Code
If any of you as a potential mentor want to put up an outline
proposal, even better.
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.

I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
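
As a rough sketch of the FASTA case, assuming every sequence line in a record has the same width (as faidx-style indexes require): record the byte offset and line geometry once, then seek straight to the requested region. The function names are hypothetical and this is not the Bio.SeqIO API.

```python
import os
import tempfile


def build_index(path):
    """Map record id -> (length, sequence offset, bases per line, bytes per line)."""
    index = {}
    name = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The sequence starts right after the header line.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch(path, index, name, start, end):
    """Return bases [start, end) of a record (0-based), reading only that slice."""
    length, offset, bases_per_line, bytes_per_line = index[name]
    if not 0 <= start <= end <= length:
        raise ValueError("region outside the sequence")
    if start == end:
        return ""
    last_base = end - 1
    first = offset + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    last = offset + (last_base // bases_per_line) * bytes_per_line + last_base % bases_per_line
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Drop the line breaks that fall inside the requested region.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Demonstration on a throwaway file with 10 bases per line.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">chr1 demo\nACGTACGTAC\nGTACGTACGT\nACGT\n>chr2\nTTTT\n")
demo_path = tmp.name
demo_index = build_index(demo_path)
region = fetch(demo_path, demo_index, "chr1", 8, 13)
whole = fetch(demo_path, demo_index, "chr1", 0, 24)
os.remove(demo_path)
print(region)  # prints ACGTA
```

A lazy-loading Seq object could wrap `fetch` so that slicing triggers the seek; whether the index lives in memory, in SQLite as with index_db, or in a tabix-style file is a separate design decision.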

The same idea applies to richer file formats too, like EMBL
and GenBank. Here lazy loading the sequence is actually
easier (the number of bases per line is strictly defined),
but you can apply the same ideas to lazy loading features
too. This means indexing both the sequence and the feature
table.

Likewise, this makes sense for GTF/GFF/GFF3 where you
would index the features, and also if present index the
embedded FASTA sequence at the end of the file. Clearly
handling this would ideally build on Lenna and Brad's work
with the underlying parser.

With what I have in mind, there are two technical sides to
this. First, the index format (binning strategies etc) for
which we should review tabix and BAM's indexing and its
planned replacement CSI (able to handle longer references).
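
For reference when reviewing binning strategies, this is a Python transcription of the region-to-bin calculation as I understand it from the SAM/BAM specification (the UCSC-style scheme also used by tabix). It assumes 0-based half-open coordinates; the fixed 14-bit smallest bin and five levels are what cap coordinates at 2^29, which is the limit CSI is meant to lift by making those parameters configurable.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains [beg, end).

    Bins are numbered level by level: bin 0 spans 512 Mbp, bins 1-8 span
    64 Mbp each, and so on down to 16 kbp bins starting at 4681.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


print(reg2bin(0, 1))        # a single base in the first 16 kbp bin: 4681
print(reg2bin(0, 16385))    # spans two 16 kbp bins, so one level up: 585
```

A feature is stored under the bin returned here, and a region query only has to inspect the handful of bins that can overlap the region instead of every record.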

Second, to avoid code duplication, this would mean some
re-factoring of the existing parser code to ensure that if
a record is loaded in full via the traditional API, it would
go through the same code as if it were loaded via the new
lazy loading approach. Potentially the existing parsers
could optionally also become lazy loaders (contingent
on this requiring ownership of the file handle as it will
use seek and tell to move the file pointer). That in theory
could make our parsers much faster (depending on the
overheads) for tasks where only a minority of the data
is ever used. I've had some fun chats with Pjotr Prins
from BioRuby about this at a CodeFest/BOSC meeting.

Brad and Lenna, I've CC'd you explicitly as I'm guessing
from the GFF work you are most likely to have considered
some of these issues.

Does this sound like something worth exploring further,
and worth proposing as an outline GSoC project? I think
it would be quite a challenging project - but like last year,
it is something I would like to try myself if I had the time.

Regards,

Peter
Peter Cock
2013-03-21 17:36:24 UTC
Post by Peter Cock
Post by Peter Cock
Right now we need to put this list of ideas on the wiki page (ready
for combining into the OBF page which will be shown to Google
to make our case for taking part in the GSoC 2013 program).
http://biopython.org/wiki/Google_Summer_of_Code
If any of you as a potential mentor want to put up an outline
proposal, even better.
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.
I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
The same idea applies to richer file formats too, like EMBL
and GenBank. ...
Likewise, this makes sense for GTF/GFF/GFF3 ...
P.S. An example use case, http://www.biostars.org/p/64363/

Part of this work could include enhancements to the SeqRecord
handling of SeqFeatures - offering more than just the current
simple list - for example lookup by ID, dbxref, or position. That
would be nice to have now with the current in-memory parsers.
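
A minimal sketch of what such lookups could look like. The `Feature` tuple and `FeatureIndex` class are hypothetical stand-ins, not the SeqRecord/SeqFeature API, and the identifiers are invented; coordinates are assumed to be 0-based and half-open.

```python
from collections import defaultdict, namedtuple

# Hypothetical, simplified stand-in for a sequence feature.
Feature = namedtuple("Feature", "id type start end dbxrefs")


class FeatureIndex:
    """Wrap a plain feature list with lookups by ID, dbxref, and position."""

    def __init__(self, features):
        self._features = list(features)
        self._by_id = {}
        self._by_dbxref = defaultdict(list)
        for feature in self._features:
            self._by_id[feature.id] = feature
            for reference in feature.dbxrefs:
                self._by_dbxref[reference].append(feature)

    def by_id(self, feature_id):
        return self._by_id[feature_id]

    def by_dbxref(self, reference):
        return list(self._by_dbxref.get(reference, []))

    def overlapping(self, start, end):
        # Linear scan for brevity; binning or an interval tree would scale better.
        return [f for f in self._features if f.start < end and start < f.end]


features = FeatureIndex([
    Feature("gene1", "gene", 100, 900, ("EXAMPLE:1",)),
    Feature("cds1", "CDS", 150, 850, ("EXAMPLE:1", "EXAMPLE:2")),
    Feature("gene2", "gene", 1200, 1500, ("EXAMPLE:3",)),
])
print([f.id for f in features.overlapping(0, 120)])
print([f.id for f in features.by_dbxref("EXAMPLE:1")])
```

Keeping the original list alongside the dictionaries means existing code that iterates over the features in file order would keep working.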

An old but still relevant example use case:
http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features

Regards,

Peter
Brad Chapman
2013-03-22 12:48:34 UTC
Peter;
Post by Peter Cock
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.
I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
This sounds incredibly useful. It's definitely worthwhile writing up if
you'll have time this summer to mentor it.
Post by Peter Cock
Likewise, this makes sense for GTF/GFF/GFF3 where you
would index the features, and also if present index the
embedded FASTA sequence at the end of the file.
I'm cc'ing Ryan, who has been thinking about similar work as part of
gffutils. We're planning now on an approach that takes the BCBio.GFF
parsing and rolls it into gffutils so we can parse, index in a SQLite
database and expose as Biopython objects. Here is some initial
discussion and planning:

https://github.com/daler/gffutils/issues/2
https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing

Brad
Ryan Dale
2013-03-22 16:20:45 UTC
Hi Brad & Peter -
Post by Brad Chapman
Peter;
Post by Peter Cock
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.
I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
This sounds incredibly useful. It's definitely worthwhile writing up if
you'll have time this summer to mentor it.
Agreed - a general, lazy-loading/lazy-parsing, indexed mechanism for
accessing data in annotation-like file formats would be fantastic.
Post by Brad Chapman
Post by Peter Cock
Likewise, this makes sense for GTF/GFF/GFF3 where you
would index the features, and also if present index the
embedded FASTA sequence at the end of the file.
I'm cc'ing Ryan, who has been thinking about similar work as part of
gffutils. We're planning now on an approach that takes the BCBio.GFF
parsing and rolls it into gffutils so we can parse, index in a SQLite
database and expose as Biopython objects. Here is some initial
https://github.com/daler/gffutils/issues/2
https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing
As Peter pointed out on the GitHub issues page, what he has in mind is
more general than just GFF/GTF, and I see gffutils as extending upon a
specific subset of the functionality he proposes.

For example, there are common use-cases that I think make sense for a
GFF/GTF-only library (say, adding new annotations for introns, as
inferred from the isoform + exon annotations) that might not be readily
generalizable to all annotation-like file formats. But if this general
indexing approach were already available, then gffutils could just be a
wrapper around that, adding the specific GFF/GTF functionality as
another layer.
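
To illustrate the intron example, a minimal sketch that infers introns as the gaps between consecutive exons of a single transcript. It assumes GFF-style 1-based inclusive coordinates and non-overlapping exons, and it is not gffutils code; the function name is made up.

```python
def infer_introns(exons):
    """Return (start, end) introns lying between the sorted exons.

    Coordinates are 1-based and inclusive, as in GFF. Exons that touch
    or overlap produce no intron between them.
    """
    ordered = sorted(exons)
    introns = []
    for (_, previous_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > previous_end + 1:
            introns.append((previous_end + 1, next_start - 1))
    return introns


transcript_exons = [(500, 650), (100, 200), (300, 400)]
print(infer_introns(transcript_exons))  # prints [(201, 299), (401, 499)]
```

In a database-backed library the inferred rows can simply be inserted next to the exons, whereas with a file index they would have to be written out and indexed again, which is the trade-off discussed next.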

Then again . . . currently gffutils imports GFF data into a sqlite3
database, so data are persistent and both read/write. For the
intron-inferring example, we simply add new records to the db, but with
an indexing approach, the file would presumably have to be re-indexed
before reading again. So how you'd like to use your GFF files
(read-only vs read/write) would influence which strategy you'd choose.

So I think there's actually smaller-than-expected overlap between
gffutils and Peter's general indexing idea, and in the context of GSoC,
I'm not sure you'd have to take gffutils into account. But gffutils
would certainly benefit from general indexing, especially when
retrieving sequences for features!

-ryan
Ryan Dale
2013-03-22 16:20:45 UTC
Permalink
Hi Brad & Peter -
Post by Brad Chapman
Peter;
Post by Peter Cock
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.
I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
This sounds incredibly useful. It's definitely worthwhile writing up if
you'll have time this summer to mentor it.
Agreed - a general, lazy-loading/lazy-parsing, indexed mechanism for
accessing data annotation-like file formats would be fantastic.
Post by Brad Chapman
Post by Peter Cock
Likewise, this makes sense for GTF/GFF/GFF3 where you
would index the features, and also if present index the
embedded FASTA sequence at the end of the file.
I'm cc'ing Ryan, who has been thinking about similar work as part of
gffutils. We're planning now on an approach that takes the BCBio.GFF
parsing and rolls it into gffutils so we can parse, index in a SQLite
database and expose as Biopython objects. Here is some initial
https://github.com/daler/gffutils/issues/2
https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing
As Peter pointed out on the GitHub issues page, what he has in mind is
more general than just GFF/GTF, and I see gffutils as extending upon a
specific subset of the functionality he proposes.

For example, there are common use-cases that I think make sense for a
GFF/GTF-only library (say, adding new annotations for introns, as
inferred from the isoform + exon annotations) that might not be readily
generalizable to all annotation-like file formats. But if this general
indexing approach were already available, then gffutils could just be a
wrapper around that, adding the specific GFF/GTF functionality as
another layer.

Then again . . . currently gffutils imports GFF data into a sqlite3
database, so data are persistent and both read/write. For the
intron-inferring example, we simply add new records to the db, but with
an indexing approach, the file would presumably have to be re-indexed
before reading again. So how you'd like to use your GFF files
(read-only vs read/write) would influence which strategy you'd choose.

So I think there's actually smaller-than-expected overlap between
gffutils and Peter's general indexing idea, and in the context of GSoC,
I'm not sure you'd have to take gffutils into account. But gffutils
would certainly benefit from general indexing, especially when
retrieving sequences for features!

-ryan

Peter Cock
2013-03-21 17:36:24 UTC
Permalink
Post by Peter Cock
Post by Peter Cock
Right now we need to put this list of ideas on the wiki page (ready
for combining into the OBF page which will be shown to Google
to make our case for taking part in the GSoC 2013 program).
http://biopython.org/wiki/Google_Summer_of_Code
If any of you as a potential mentor want to put up an outline
proposal, even better.
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.
I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
The same idea applies to richer file formats too, like EMBL
and GenBank. ...
Likewise, this makes sense for GTF/GFF/GFF3 ...
P.S. An example use case, http://www.biostars.org/p/64363/

Part of this work could include enhancements to the SeqRecord
handling of SeqFeatures - offering more than just the current
simple list - for example lookup by ID, dbxref, or position. That
would be nice to have now with the current in-memory parsers.
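
A rough sketch of what lookup by ID, dbxref, or position could look like. Everything here is hypothetical (the Feature tuple, the FeatureIndex class and its attribute names are not Biopython API), and positions are assumed to be 0-based, half-open, Python style.

```python
from collections import defaultdict, namedtuple

# Stand-in for a SeqFeature; hypothetical, for illustration only.
Feature = namedtuple("Feature", "id start end dbxrefs")


class FeatureIndex:
    """Wrap a plain list of features with a few lookup tables."""

    def __init__(self, features):
        self.by_id = {f.id: f for f in features}
        self.by_dbxref = defaultdict(list)
        for f in features:
            for ref in f.dbxrefs:
                self.by_dbxref[ref].append(f)
        self._sorted = sorted(features, key=lambda f: f.start)

    def overlapping(self, start, end):
        """Features overlapping the half-open interval [start, end)."""
        return [f for f in self._sorted if f.start < end and f.end > start]


features = [
    Feature("geneB", 500, 900, ["Example:2"]),
    Feature("geneA", 100, 400, ["Example:1"]),
]
index = FeatureIndex(features)
print(index.by_id["geneA"].start)
print([f.id for f in index.overlapping(350, 600)])
```

The position lookup here is a linear scan; a binning or interval-tree scheme would be the obvious refinement for large feature tables.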

An old but still relevant example use case:
http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features

Regards,

Peter
Brad Chapman
2013-03-22 12:48:34 UTC
Permalink
Peter;
Post by Peter Cock
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.
I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
This sounds incredibly useful. It's definitely worthwhile writing up if
you'll have time this summer to mentor it.
Post by Peter Cock
Likewise, this makes sense for GTF/GFF/GFF3 where you
would index the features, and also if present index the
embedded FASTA sequence at the end of the file.
I'm cc'ing Ryan, who has been thinking about similar work as part of
gffutils. We're planning now on an approach that takes the BCBio.GFF
parsing and rolls it into gffutils so we can parse, index in a SQLite
database and expose as Biopython objects. Here is some initial
discussion and planning:

https://github.com/daler/gffutils/issues/2
https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing

Brad
Peter Cock
2013-03-21 16:29:29 UTC
Permalink
Post by Peter Cock
Right now we need to put this list of ideas on the wiki page (ready
for combining into the OBF page which will be shown to Google
to make our case for taking part in the GSoC 2013 program).
http://biopython.org/wiki/Google_Summer_of_Code
If any of you as a potential mentor want to put up an outline
proposal, even better.
I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.

I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
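
To make the idea concrete, here is a minimal sketch of region extraction from an indexed FASTA file. It is not the proposed Biopython API: build_index and fetch are hypothetical names, and the sketch assumes every sequence line of a record (except the last) has the same length, which is the same assumption faidx-style indexes make.

```python
import io


def build_index(handle):
    """Scan a FASTA file opened in binary mode.

    Returns {name: [seq_start, seq_len, bases_per_line, bytes_per_line]}.
    """
    index = {}
    name = None
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = [offset + len(line), 0, 0, 0]
        elif name is not None:
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                entry[2], entry[3] = bases, len(line)
            entry[1] += bases
        offset += len(line)
    return index


def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based) by seeking, not by parsing."""
    seq_start, seq_len, per_line, line_bytes = index[name]
    end = min(end, seq_len)
    if start >= end:
        return ""
    first = seq_start + (start // per_line) * line_bytes + start % per_line
    last = seq_start + ((end - 1) // per_line) * line_bytes + (end - 1) % per_line
    handle.seek(first)
    chunk = handle.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


data = io.BytesIO(b">chr1 test\nACGTACGTAC\nGGGGCCCCTT\nAA\n>chr2\nTTTT\n")
idx = build_index(data)
print(fetch(data, idx, "chr1", 8, 13))
```

A lazy Seq-like object would wrap fetch so that slicing triggers the seek; only the index needs to be built (or loaded) up front.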

The same idea applies to richer file formats too, like EMBL
and GenBank. Here lazy loading the sequence is actually
easier (the number of bases per line is strictly defined),
but you can apply the same ideas to lazy loading features
too. This means indexing both the sequence and the feature
table.
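
As an illustration of why the fixed layout helps, the byte offset of any base in a GenBank sequence block can be computed directly. This is a sketch under stated assumptions: 60 bases per line in six blocks of ten, a 9-character position counter, and single "\n" line endings. The helper names are hypothetical, and offsets are relative to the first sequence line after ORIGIN.

```python
BASES_PER_LINE = 60
BYTES_PER_LINE = 76  # 9-char counter + 6 * (space + 10 bases) + "\n"


def origin_offset(i):
    """Byte offset of base i (0-based) from the first sequence line."""
    line, within = divmod(i, BASES_PER_LINE)
    block, col = divmod(within, 10)
    return line * BYTES_PER_LINE + 10 + block * 11 + col


def format_origin(seq):
    """Lay out a sequence GenBank-style, to check the arithmetic."""
    lines = []
    for start in range(0, len(seq), BASES_PER_LINE):
        chunk = seq[start:start + BASES_PER_LINE]
        blocks = " ".join(chunk[j:j + 10] for j in range(0, len(chunk), 10))
        lines.append("%9d %s\n" % (start + 1, blocks))
    return "".join(lines)


seq = "".join("acgt"[(i * i + i // 7) % 4] for i in range(130))
text = format_origin(seq)
print(all(text[origin_offset(i)] == seq[i] for i in range(len(seq))))
```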

Likewise, this makes sense for GTF/GFF/GFF3 where you
would index the features, and also if present index the
embedded FASTA sequence at the end of the file. Clearly
handling this would ideally build on Lenna and Brad's work
with the underlying parser.

With what I have in mind, there are two technical sides to
this. First, the index format (binning strategies etc) for
which we should review tabix and BAM's indexing and its
planned replacement CSI (able to handle longer references).
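
For reference when reviewing the binning strategies, the bin calculation used by the BAM index (and, as far as I recall, by tabix as well) is small enough to sketch here. This is a Python transcription of the reg2bin function given in the SAM/BAM specification, for a 0-based half-open interval [beg, end); treat it as a reading aid rather than a tested implementation of the full index.

```python
def reg2bin(beg, end):
    """Smallest bin fully containing the 0-based interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0                                       # whole 512 Mbp range


print(reg2bin(0, 1 << 14))
print(reg2bin(0, (1 << 14) + 1))
```

The fixed 512 Mbp top-level bin is the limit that motivates CSI for longer references.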

Second, to avoid code duplication, this would mean some
re-factoring of the existing parser code to ensure that if
a record is loaded in full via the traditional API, it would
go through the same code as if it were loaded via the new
lazy loading approach. Potentially the existing parsers
could optionally also become lazy loaders (contingent
on this requiring ownership of the file handle as it will
use seek and tell to move the file pointer). That in theory
could make our parsers much faster (depending on the
overheads) for tasks where only a minority of the data
is ever used. I've had some fun chats with Pjotr Prins
from BioRuby about this at a CodeFest/BOSC meeting.

Brad and Lenna, I've CC'd you explicitly as I'm guessing
from the GFF work you are most likely to have considered
some of these issues.

Does this sound like something worth exploring further,
and worth proposing as an outline GSoC project? I think
it would be quite a challenging project - but like last year,
it is something I would like to try myself if I had the time.

Regards,

Peter
Peter Cock
2013-03-21 16:11:44 UTC
Permalink
On Fri, Mar 15, 2013 at 11:06 PM, Bartek Wilczynski
Post by Bartek Wilczynski
Hi All,
I would add one more (old) idea for a GSoC pool, i.e. adding support
for different biological ontologies to biopython.
This was already discussed some time ago
(http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no)
mostly in the context of gene ontology, and to some extent this is
addressed by the development of GOAtools
(https://github.com/tanghaibao/goatools), but I think it would be
worth having decent support for OBO-file-based ontologies (not only
gene ontology, I'm also interested myself in anatomical ontologies,
there are also others available at obofoundry.org) in biopython.
I think it would need to include support for IO operations on both OBO
and annotation files, as well as statistical enrichment measures and
potentially some visualisation.
Would anyone be interested in co-mentoring this project? There is one
student in my department who would be interested in applying to GSoC
for this project, but I think it would be great if other people joined
the discussion on the functionality and having more people involved is
always better...
best
Bartek Wilczynski
That's a good idea - I would have used this recently with some GO
stuff (e.g. given a GO term, is it a molecular function, biological
process, or cellular compartment - can solve this easily by traversing
up any branch of the DAG).
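
A toy sketch of that traversal, assuming only is_a style parent links are followed. The three root identifiers are the real GO roots; the TOY: terms and the parents mapping are made up for illustration, and namespace_of is a hypothetical name.

```python
ROOTS = {
    "GO:0008150": "biological_process",
    "GO:0003674": "molecular_function",
    "GO:0005575": "cellular_component",
}


def namespace_of(term, parents):
    """Walk up any one branch of the DAG until a root term is reached."""
    while term not in ROOTS:
        term = parents[term][0]  # any parent will do
    return ROOTS[term]


# Hypothetical DAG: child -> list of parents.
parents = {
    "TOY:3": ["TOY:2", "TOY:1"],
    "TOY:2": ["TOY:1"],
    "TOY:1": ["GO:0003674"],
}
print(namespace_of("TOY:3", parents))
```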

Right now we need to put this list of ideas on the wiki page (ready
for combining into the OBF page which will be shown to Google
to make our case for taking part in the GSoC 2013 program).
http://biopython.org/wiki/Google_Summer_of_Code

If any of you as a potential mentor want to put up an outline
proposal, even better.

Peter
Peter Cock
2013-03-21 17:01:51 UTC
Permalink
Already up on the wiki :)
- A proper draw_unrooted function to perform radial layout, with an optional
"iterations" argument to use Felsenstein's Equal Daylight algorithm -- I
feel this layout approach is neglected in most libraries.
- Better matplotlib/pylab integration, so the plot components can be tweaked
using matplotlib functions.
- Other common layout approaches, e.g. circular.
- strict consensus, like Bio.Nexus already implements.
- other consensus methods, time permitting.
- Robinson-Foulds distance -- though others might be working on this
already.
- Straightforward algorithms exist for neighbor-joining and parsimony tree
estimation. For small alignments (and perhaps medium-sized ones with PyPy),
it would be nice to run these without an external program, e.g. to construct
a guide tree for another algorithm or quickly view a phylogenetic clustering
of sequences.
One more idea for a sub-task?

2e. Using multiple trees for bootstrapping a master tree. Take the master
tree and for each edge you have a partition of the leaves, which can be
used as a dictionary hash (e.g. as a binary representation). Then for
each of the bootstrap runs, look at each edge, compute the hash for
that split of the leaves, and increment the count. Then at the end, you
have a dictionary of counts which are the branch bootstrap supports.

I wrote that once in Python some time back, and used it to take a set
of bootstrap trees generated on a cluster and give the support values
to the master tree.
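
The counting scheme described above can be sketched as follows. This is a fresh illustration, not the original script: trees are plain nested tuples of leaf names rather than Bio.Phylo objects, frozensets stand in for the binary representation as dictionary keys, and each split is stored as the side that does not contain a fixed reference leaf so that rooting does not matter.

```python
from collections import Counter


def leaf_sets(tree):
    """Return (leaves below this node, list of leaf sets of all clades)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
        clades.append(child_leaves)
    return leaves, clades


def splits(tree):
    """Non-trivial leaf bipartitions, one hashable key per internal edge."""
    all_leaves, clades = leaf_sets(tree)
    ref = min(all_leaves)
    result = set()
    for clade in clades:
        side = clade if ref not in clade else all_leaves - clade
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def bootstrap_support(master, replicates):
    """Count how many replicate trees contain each split of the master."""
    counts = Counter()
    for rep in replicates:
        counts.update(splits(rep))
    return {split: counts[split] for split in splits(master)}


master = (("A", "B"), ("C", "D"), "E")
replicates = [
    (("A", "B"), ("C", "D"), "E"),
    (("A", "B"), ("C", "E"), "D"),
    (("A", "C"), ("B", "D"), "E"),
]
support = bootstrap_support(master, replicates)
for split, count in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(split), count)
```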
Any interest in either of these? Shall I add them to the wiki?
They both seem worth posting on the wiki, although we may not have
enough mentors for both to go ahead :(

Peter
Peter Cock
2013-03-21 16:55:30 UTC
Permalink
- PAL2NAL-style conversion of unaligned nucleic acid sequences and a protein
sequence alignment to a codon alignment. (Previously discussed)
e.g. https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
- dN/dS and the related functions needed to calculate it.
- Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of
codon alignments, including validation (testing for frame shifts etc.)
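
For anyone unfamiliar with the PAL2NAL-style conversion, the core step is small. The sketch below is neither PAL2NAL nor the picobio script: back_translate is a hypothetical name, the nucleotide sequence is assumed to be exactly three times the ungapped protein length, and no check is made that each codon actually translates to the aligned residue.

```python
def back_translate(aligned_protein, nucleotides, gap="-"):
    """Thread an unaligned coding sequence onto a gapped protein row."""
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == gap:
            codons.append(gap * 3)  # one protein gap becomes three
        else:
            codons.append(nucleotides[pos:pos + 3])
            pos += 3
    if pos != len(nucleotides):
        raise ValueError("nucleotide length does not match protein length")
    return "".join(codons)


print(back_translate("M-K", "ATGAAA"))
```

The validation mentioned in the last bullet (frame shifts, mismatched translations) would sit on top of this step.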
http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis

I see you've started fleshing this idea out on the wiki, which is great.
Right now it seems a little on the lightweight side - or is that deliberate
(to see if a student can take this idea and come up with a solid
project proposal in this area)? Things like model selection might
be a fun extension - I can think of a local expert who would be
great to get involved on the science side if he's interested.

Alternatively this could include doing some more general work
on the alignment object - for instance per-column-annotation
for things like a consensus sequence - or an array-of-char
implementation as an alternative to the list-of-SeqRecords
we have now (with its poor column access speed).

Peter
Eric Talevich
2013-03-21 17:42:19 UTC
Permalink
Post by Eric Talevich
Post by Eric Talevich
- PAL2NAL-style conversion of unaligned nucleic acid sequences and a
protein
Post by Eric Talevich
sequence alignment to a codon alignment. (Previously discussed)
e.g.
https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
Well, check you out. Would you be interested in mentoring this project?
Post by Eric Talevich
Post by Eric Talevich
- dN/dS and the related functions needed to calculate it.
- Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage
of
Post by Eric Talevich
codon alignments, including validation (testing for frame shifts etc.)
http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis
I see you've started fleshing this idea out on the wiki, which is great.
Right now it seems a little on the lightweight side - or is that deliberate
(to see if a student can take this idea and come up with a solid
project proposal in this area)? Things like model selection might
be a fun extension - I can think of a local expert who would be
great to get involved on the science side if he's interested.
I put up a quick sketch to avoid locking the wiki page for too long, but
also deliberately left it vague to see where the applicants take it. Model
selection would be cool, I added it. Local expert, also great.
Post by Eric Talevich
Alternatively this could include doing some more general work
on the alignment object - for instance per-column-annotation
for things like a consensus sequence - or an array-of-char
implementation as an alternative to the list-of-SeqRecords
we have now (with its poor column access speed).
Peter
I wonder if that's something we could just do incrementally -- change the
MultipleSeqAlignment class to store a list-of-lists-of chars (or
list-of-strings), a list of SeqRecord-like husks (all the annotations, but
without the Seq itself) for each row, a list of column annotations, and a
single alphabet for the whole alignment.

How do you suppose the speed of that would compare to the current
list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would it be
a significant enough speed improvement to justify both replacing the
current implementation, and to make the NumPy approach less tempting (given
PyPy's progress toward including a compliant implementation)?
Alternatively, we could post a GSoC project for creating a separate
TurboAlignment class/module based on NumPy which would be mostly
interchangeable and interconvertible with the pure-Python version in the
Biopython core.

Speaking of which, should we also post the idea of storing sequences as an
efficient byte array, BioJava-style?

-Eric
Peter Cock
2013-03-21 17:59:10 UTC
Permalink
On Thu, Mar 21, 2013 at 12:55 PM, Peter Cock <p.j.a.cock at googlemail.com>
Post by Peter Cock
- PAL2NAL-style conversion of unaligned nucleic acid sequences and a protein
sequence alignment to a codon alignment. (Previously discussed)
e.g.
https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
Well, check you out. Would you be interested in mentoring this project?
If I'm not primary mentor on another project, I'd be open to co-mentoring
something on the alignment side.
Post by Peter Cock
- dN/dS and the related functions needed to calculate it.
- Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of
codon alignments, including validation (testing for frame shifts etc.)
http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis
I see you've started fleshing this idea out on the wiki, which is great.
Right now it seems a little on the lightweight side - or is that deliberate
(to see if a student can take this idea and come up with a solid
project proposal in this area)? Things like model selection might
be a fun extension - I can think of a local expert who would be
great to get involved on the science side if he's interested.
I put up a quick sketch to avoid locking the wiki page for too long, but
also deliberately left it vague to see where the applicants take it. Model
selection would be cool, I added it. Local expert, also great.
If he's available and willing, yes. I've not mentioned this to him
yet so no promises - the idea only occurred to me while writing
that email ;)
Post by Peter Cock
Alternatively this could include doing some more general work
on the alignment object - for instance per-column-annotation
for things like a consensus sequence - or an array-of-char
implementation as an alternative to the list-of-SeqRecords
we have now (with its poor column access speed).
Peter
I wonder if that's something we could just do incrementally -- change the
MultipleSeqAlignment class to store a list-of-lists-of chars (or
list-of-strings), a list of SeqRecord-like husks (all the annotations, but
without the Seq itself) for each row, a list of column annotations, and a
single alphabet for the whole alignment.
How do you suppose the speed of that would compare to the current
list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would it be
a significant enough speed improvement to justify both replacing the current
implementation, and to make the NumPy approach less tempting (given PyPy's
progress toward including a compliant implementation)? Alternatively, we
could post a GSoC project for creating a separate TurboAlignment
class/module based on NumPy which would be mostly interchangeable and
interconvertible with the pure-Python version in the Biopython core.
When I said array-of-char I did have NumPy in mind, and PyPy does now
cope with two or more dimensional arrays in NumPyPy. Note that NumPy
handles both row and column orientated arrays with a simple class init
option, so this can easily be set up to favour row or column access.

Last time I did anything with the alignment object where column access
was a bottleneck (calculating mutual information between columns), I
just loaded all the columns into memory as a list of strings, and computed
on that. It worked very nicely.
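
Roughly, the list-of-strings workaround looks like the sketch below. The alignment rows are plain strings here rather than SeqRecord objects, the function names are hypothetical, and the mutual information formula is the standard one in bits with no small-sample correction.

```python
from collections import Counter
from math import log2


def columns_of(rows):
    """Transpose equal-length alignment rows into a list of column strings."""
    return ["".join(col) for col in zip(*rows)]


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


rows = ["ACGT", "ACGA", "TCCT", "TCCA"]
cols = columns_of(rows)
print(cols)
print(mutual_information(cols[0], cols[2]))
```

Building the column strings is a one-off cost; after that every column is a single list lookup.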
Speaking of which, should we also post the idea of storing sequences as an
efficient byte array, BioJava-style?
I'd wondered about that (in combination with the discussion about strict
alphabet checking), but is there enough for a whole GSoC project?
Related to this one could look at something with k-mer hashes...

(It's good to see lots of possible project ideas bouncing around)

Peter
Bartek Wilczynski
2013-03-15 23:06:57 UTC
Permalink
Hi All,
I would add one more (old) idea for a GSoC pool, i.e. adding support
for different biological ontologies to biopython.

This was already discussed some time ago
(http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no)
mostly in the context of gene ontology, and to some extent this is
addressed by the development of GOAtools
(https://github.com/tanghaibao/goatools), but I think it would be
worth having decent support for OBO-file-based ontologies (not only
gene ontology, I'm also interested myself in anatomical ontologies,
there are also others available at obofoundry.org) in biopython.

I think it would need to include support for IO operations on both OBO
and annotation files, as well as statistical enrichment measures and
potentially some visualisation.

Would anyone be interested in co-mentoring this project? There is one
student in my department who would be interested in applying to GSoC
for this project, but I think it would be great if other people joined
the discussion on the functionality and having more people involved is
always better...

best
Bartek Wilczynski
Post by Eric Talevich
Post by Michiel de Hoon
It would be great to have better support for microarray analysis in
Biopython. Something like lumi/limma in R. Perhaps this is an option for
the GSoC?
Best,
-Michiel.
- PAL2NAL-style conversion of unaligned nucleic acid sequences and a
protein sequence alignment to a codon alignment. (Previously discussed)
- dN/dS and the related functions needed to calculate it.
- Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of
codon alignments, including validation (testing for frame shifts etc.)
- A proper draw_unrooted function to perform radial layout, with an
optional "iterations" argument to use Felsenstein's Equal Daylight
algorithm -- I feel this layout approach is neglected in most libraries.
- Better matplotlib/pylab integration, so the plot components can be
tweaked using matplotlib functions.
- Other common layout approaches, e.g. circular.
- strict consensus, like Bio.Nexus already implements.
- other consensus methods, time permitting.
- Robinson-Foulds distance -- though others might be working on this
already.
- Straightforward algorithms exist for neighbor-joining and parsimony tree
estimation. For small alignments (and perhaps medium-sized ones with PyPy),
it would be nice to run these without an external program, e.g. to
construct a guide tree for another algorithm or quickly view a phylogenetic
clustering of sequences.
Any interest in either of these? Shall I add them to the wiki?
-Eric
Post by Michiel de Hoon
From: Peter Cock <p.j.a.cock at googlemail.com>
Subject: [Biopython-dev] Project ideas for GSoC (or other student
projects)
To: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
Date: Tuesday, February 12, 2013, 12:51 PM
Hello all,
Google recently confirmed they will be running Google Summer of Code 2013,
and we (Biopython and the other Bio* projects) would hope to be accepted again
under the Open Bioinformatics Foundation as in previous years:
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
It would be great to start coming up with potential project
ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for
other project
students, or 'low hanging fruit' for potential contributors
to cut
their teeth on.
See also http://biopython.org/wiki/Active_projects
and the ideas list there.
Regards,
Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
Bartek Wilczynski
Saket Choudhary
2013-03-05 17:26:57 UTC
Permalink
I had this idea of an online Biopython shell along the lines of the BioRuby shell:
http://bioruby.open-bio.org/wiki/BioRubyOnRails
Post by Michiel de Hoon
It would be great to have better support for microarray analysis in Biopython. Something like lumi/limma in R. Perhaps this is an option for the GSoC?
Best,
-Michiel.
From: Peter Cock <p.j.a.cock at googlemail.com>
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
To: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
Date: Tuesday, February 12, 2013, 12:51 PM
Hello all,
Google recently confirmed they will be running Google Summer
of Code 2013,
and we (Biopython and the other Bio* projects) would hope to
be accepted again
under the Open Bioinformatics Foundation as in previous years:
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
It would be great to start coming up with potential project
ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for
other project
students, or 'low hanging fruit' for potential contributors
to cut
their teeth on.
See also http://biopython.org/wiki/Active_projects
and the ideas list there.
Regards,
Peter
Eric Talevich
2013-03-13 18:32:25 UTC
Permalink
Post by Michiel de Hoon
It would be great to have better support for microarray analysis in
Biopython. Something like lumi/limma in R. Perhaps this is an option for
the GSoC?
Best,
-Michiel.

I like Michiel's idea, and I'll suggest two more:

1. Codon alignment & analysis:
- PAL2NAL-style conversion of unaligned nucleic acid sequences and a
protein sequence alignment to a codon alignment. (Previously discussed)
- dN/dS and the related functions needed to calculate it.
- Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of
codon alignments, including validation (testing for frame shifts etc.)

2. Phylo enhancements:
2a. Tree drawing:
- A proper draw_unrooted function to perform radial layout, with an
optional "iterations" argument to use Felsenstein's Equal Daylight
algorithm -- I feel this layout approach is neglected in most libraries.
- Better matplotlib/pylab integration, so the plot components can be
tweaked using matplotlib functions.
- Other common layout approaches, e.g. circular.
2b. A "Phylo.consensus" module:
- strict consensus, like Bio.Nexus already implements.
- other consensus methods, time permitting.
2c. A "Phylo.distance" module:
- Robinson-Foulds distance -- though others might be working on this
already.
2d. Simple tree inference:
- Straightforward algorithms exist for neighbor-joining and parsimony tree
estimation. For small alignments (and perhaps medium-sized ones with PyPy),
it would be nice to run these without an external program, e.g. to
construct a guide tree for another algorithm or quickly view a phylogenetic
clustering of sequences.
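
For the PAL2NAL-style conversion in item 1, the core step can be sketched as below. This is a minimal illustration under stated assumptions, not PAL2NAL's actual implementation: the function name `thread_codons` is hypothetical, gaps are assumed to be "-", a single trailing stop codon is silently dropped, and the codons are not checked against the protein residues (that would need a codon table).

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Thread an unaligned coding sequence onto one aligned protein row.

    Each residue consumes the next three nucleotides; each protein gap
    becomes a three-character gap in the codon alignment.
    """
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == gap:
            codons.append(gap * 3)
            continue
        codon = nucleotides[pos:pos + 3]
        if len(codon) < 3:
            raise ValueError("nucleotide sequence is too short for the protein")
        codons.append(codon)
        pos += 3
    # Assumption: leftover of exactly one codon is a trailing stop codon.
    leftover = len(nucleotides) - pos
    if leftover not in (0, 3):
        raise ValueError("nucleotide length does not match the protein")
    return "".join(codons)


print(thread_codons("M-K", "ATGAAA"))  # prints ATG---AAA
```

Applying this to every row of the protein alignment gives rows of equal length (three times the protein alignment length), which is what dN/dS code would then consume. The length check is also the simplest form of the frame-shift validation mentioned above.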

Any interest in either of these? Shall I add them to the wiki?

-Eric
Post by Michiel de Hoon
From: Peter Cock <p.j.a.cock at googlemail.com>
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
To: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
Date: Tuesday, February 12, 2013, 12:51 PM
Hello all,
Google recently confirmed they will be running Google Summer of Code 2013,
and we (Biopython and the other Bio* projects) would hope to be accepted again
under the Open Bioinformatics Foundation as in previous years:
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
It would be great to start coming up with potential project ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for other project
students, or 'low hanging fruit' for potential contributors to cut
their teeth on.
See also http://biopython.org/wiki/Active_projects and the ideas list there.
Regards,
Peter
Eric Talevich
2013-02-12 20:00:11 UTC
Permalink
Post by Peter Cock
Hello all,
Google recently confirmed they will be running Google Summer of Code 2013,
and we (Biopython and the other Bio* projects) would hope to be accepted again
under the Open Bioinformatics Foundation as in previous years:
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
It would be great to start coming up with potential project ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for other project
students, or 'low hanging fruit' for potential contributors to cut
their teeth on.

One interesting GSoC project would be to implement support for phylogenetic
placements. The programs pplacer and EPA (part of RAxML) can place sequence
reads from metagenomic samples onto a reference phylogeny:
http://matsen.fhcrc.org/pplacer/
http://sysbio.oxfordjournals.org/content/60/3/291

The output format of those programs has been standardized as something I
suppose we could call the "jplace" format:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0031009
http://arxiv.org/abs/1201.3397

It's based on JSON and Newick, with a small extension to Newick that
shouldn't be too hard to support. The GSoC project would be to implement a
parser for this and implement querying as well as integration with the rest
of Bio.Phylo to some reasonable extent. I would be available to mentor this.
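
To give a feel for the size of the parsing task, here is a rough sketch using only the standard library. The key names ("tree", "fields", "placements", "p", "n") and the `{N}` edge labels in the tree string are written from my reading of the format description and should be treated as assumptions to check against the spec; `parse_jplace` is a hypothetical helper, not an existing Bio.Phylo function, and the example values are made up.

```python
import json
import re

# Assumption: edge numbers appear in the Newick string as "{N}".
EDGE_LABEL = re.compile(r"\{(\d+)\}")

EXAMPLE = """{
  "version": 3,
  "tree": "((A:0.2{0},B:0.09{1}):0.7{2},C:0.5{3}){4};",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"p": [[1, -2578.16, 0.78, 0.004, 0.02]], "n": ["read1"]}
  ],
  "metadata": {}
}"""


def parse_jplace(text):
    """Return (plain Newick, edge numbers, placements keyed by read name)."""
    doc = json.loads(text)
    fields = doc["fields"]
    tree = doc["tree"]
    edge_numbers = [int(n) for n in EDGE_LABEL.findall(tree)]
    # Stripping the labels leaves ordinary Newick for an existing parser.
    plain_newick = EDGE_LABEL.sub("", tree)
    placements = {}
    for entry in doc["placements"]:
        # Each row in "p" is positional; pair it with the declared fields.
        rows = [dict(zip(fields, row)) for row in entry["p"]]
        for name in entry.get("n", []):
            placements[name] = rows
    return plain_newick, edge_numbers, placements


newick, edges, placed = parse_jplace(EXAMPLE)
print(newick)
```

A real parser would keep the edge numbers attached to the corresponding clades rather than stripping them, so that placements can be mapped back onto branches of the Bio.Phylo tree; this sketch only shows that the JSON side is straightforward.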

In terms of low-hanging fruit, there are some small but important functions
that could be added to Bio.Phylo. My top three: Robinson-Foulds distance,
majority-rules consensus, draw an unrooted tree using Felsenstein's Equal
Daylight algorithm (which starts by computing the layout for a radial tree).
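
As an indication of how small the Robinson-Foulds piece is, here is a minimal sketch on trees written as nested tuples of leaf names. It does not use Bio.Phylo objects, the function names are hypothetical, and it computes the unnormalized unrooted distance (the number of non-trivial splits found in exactly one tree).

```python
def _collect_clades(tree, out):
    """Return the leaf set below `tree`, appending each internal node's set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_collect_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Return (all leaves, set of non-trivial splits) for a nested-tuple tree."""
    clades = []
    all_leaves = _collect_clades(tree, clades)
    ref = min(all_leaves)  # fixed reference leaf makes each split canonical
    result = set()
    for clade in clades:
        # Represent every split by the side that does not contain `ref`.
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (empty, single leaf, or all but one leaf).
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return all_leaves, result


def robinson_foulds(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaf names")
    return len(splits_a ^ splits_b)


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2))  # prints 2
```

Because splits are canonicalized against a reference leaf, the result does not depend on where either tree is rooted. The same split sets would also be the natural starting point for a majority-rule consensus.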

-Eric
Saket Choudhary
2013-02-12 20:45:46 UTC
Permalink
Hi,

I was thinking of a synteny viewer along the lines of GSV
(http://cas-bioinfo.cas.unt.edu/gsv/homepage.php), if that makes sense.

Saket
Post by Peter Cock
Hello all,
Google recently confirmed they will be running Google Summer of Code 2013,
and we (Biopython and the other Bio* projects) would hope to be accepted again
under the Open Bioinformatics Foundation as in previous years:
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
It would be great to start coming up with potential project ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for other project
students, or 'low hanging fruit' for potential contributors to cut
their teeth on.
See also http://biopython.org/wiki/Active_projects and the ideas list there.
Regards,
Peter
Michiel de Hoon
2013-02-13 02:08:26 UTC
Permalink
It would be great to have better support for microarray analysis in Biopython. Something like lumi/limma in R. Perhaps this is an option for the GSoC?

Best,
-Michiel.
From: Peter Cock <p.j.a.cock at googlemail.com>
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
To: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
Date: Tuesday, February 12, 2013, 12:51 PM
Hello all,
Google recently confirmed they will be running Google Summer of Code 2013,
and we (Biopython and the other Bio* projects) would hope to be accepted again
under the Open Bioinformatics Foundation as in previous years:
http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
It would be great to start coming up with potential project ideas, both larger
pieces of work suitable for GSoC but also smaller tasks for other project
students, or 'low hanging fruit' for potential contributors to cut
their teeth on.
See also http://biopython.org/wiki/Active_projects and the ideas list there.
Regards,
Peter