Discussion:
[Biopython-dev] Code review request for phyloxml branch
Eric Talevich
2009-09-24 03:48:49 UTC
Folks,

I've fixed a couple of remaining issues in the Bio.Tree and Bio.TreeIO
modules and I'd like your opinion on what else should be done before merging
this into the mainline.

First, the wiki documentation for PhyloXML has an example pipeline showing
how to build a phylogeny in Biopython, from a raw protein sequence to a
lightly annotated phyloXML file.
http://biopython.org/wiki/PhyloXML#Example_pipeline

Does this look right? I copied the first few steps from the official
docs.

The source code, for your review, is here:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py

Discussion:

*TreeIO*
The read, parse, write and convert functions work essentially the same as in
SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues:

(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.

(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?

*Tree*
The BaseTree module is meant to be the basis for Newick trees eventually,
so I'd like to get the design right with the minimum number of public
methods:

(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.

(2) Should find() be based on depth_first_search or breadth_first_search
(not checked in yet)? DFS would potentially find a leaf node faster, but BFS
seems more common in phylogenetics. Note that iteration can easily be
reversed with the standard reversed() function, so we don't need extra
functions for those cases.

(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?

(4) There's some mention in the literature of a relationship-matrix
representation for phylogenies. Does anyone here know how to work with this
representation, or know if it would let us perform complex calculations with
blinding speed behind the scenes? If so, should there be a function in
Bio.Tree.Utils to export a tree to a NumPy array represented this way? If
not, I'll forget about it.

*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.

Best regards,
Eric
Peter
2009-09-24 09:57:12 UTC
Post by Eric Talevich
*TreeIO*
The read, parse, write and convert functions work essentially the same as in
SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'.
Great.

One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry. Easy.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).
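
A minimal sketch of the behaviour described here, not the actual Bio.TreeIO code: parse() is a generator that behaves the same for zero, one or many trees, and read() is built on top of it. The toy "parser" below simply treats each ';'-terminated chunk as one Newick tree; the function bodies and error messages are assumptions for illustration.

```python
import io


def parse(handle):
    """Yield one Newick string per ';'-terminated tree in the handle.

    Toy stand-in for a real parser: it behaves the same way whether the
    handle contains zero, one or many trees.
    """
    buffer = []
    for line in handle:
        buffer.append(line.strip())
        if line.rstrip().endswith(";"):
            yield "".join(buffer)
            buffer = []


def read(handle):
    """Return exactly one tree, or raise ValueError otherwise."""
    trees = parse(handle)
    try:
        tree = next(trees)
    except StopIteration:
        raise ValueError("No trees found in handle")
    try:
        next(trees)
    except StopIteration:
        return tree
    raise ValueError("More than one tree found in handle")


# The same loop works regardless of how many trees are present.
for text in ("", "(A,B);\n", "(A,B);\n(C,(D,E));\n"):
    for tree in parse(io.StringIO(text)):
        print(tree)
```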

On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and
Bio.TreeIO.write() functions *only* and have the underlying format specific
code just use handles. This avoids the code duplication.

[In fact, as I have said before, I prefer the simplicity of just allowing
handles - and we should make TreeIO and SeqIO/AlignIO consistent]
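
One way this could look, as a rough sketch under stated assumptions: a single helper at the top level turns a filename into a handle, and the format-specific parsers only ever receive handles. The names as_handle, _parse_newick and _PARSERS are hypothetical and are not taken from the phyloxml branch.

```python
import contextlib
import io


@contextlib.contextmanager
def as_handle(source, mode="r"):
    """Yield a file handle for a filename, or pass an open handle through."""
    if isinstance(source, str):
        with open(source, mode) as handle:
            yield handle
    else:
        yield source  # already a handle; the caller remains responsible for it


def _parse_newick(handle):
    """Format-specific parser: it only ever sees a handle."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


# Hypothetical registry mapping format names to handle-only parsers.
_PARSERS = {"newick": _parse_newick}


def parse(source, fmt):
    """Top-level entry point: the only place that opens files."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError("Unknown format: %r" % fmt)
    with as_handle(source) as handle:
        for tree in parser(handle):
            yield tree


print(list(parse(io.StringIO("(A,B);\n(C,D);\n"), "newick")))
```

Because parse() is itself a generator, the file stays open only while the caller is iterating, and an unknown format is reported on the first iteration rather than at call time.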
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).

Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.
Post by Eric Talevich
(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?
To get the Newick tree, you can just call str(tree), which is basically what
you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
more complicated. You'll need to create a minimal Nexus file - have a
look at the Bio.AlignIO.NexusIO code. An alternative is to have
a hard coded nexus template, and just insert the tree as a Newick string
(and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
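
A sketch of the hard-coded template idea, assuming each tree is already available as a ';'-terminated Newick string (for example from str(tree)). The template text, the tree naming scheme and the helper names are illustrative assumptions; the regular expression only handles simple unquoted leaf labels.

```python
import re

# Hypothetical minimal NEXUS skeleton: a taxa block plus a trees block.
NEXUS_TEMPLATE = """#NEXUS
Begin Taxa;
 Dimensions NTax=%(ntax)d;
 TaxLabels %(labels)s;
End;
Begin Trees;
%(trees)s
End;
"""


def taxa_from_newick(newick):
    """Collect leaf labels: unquoted names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def write_nexus(newick_trees):
    """Return a NEXUS document containing the given Newick strings."""
    taxa = []
    for newick in newick_trees:
        for name in taxa_from_newick(newick):
            if name not in taxa:
                taxa.append(name)
    tree_lines = "\n".join(
        " Tree tree%d = %s" % (number, newick)
        for number, newick in enumerate(newick_trees, 1)
    )
    return NEXUS_TEMPLATE % {
        "ntax": len(taxa),
        "labels": " ".join(taxa),
        "trees": tree_lines,
    }


print(write_nexus(["((A:0.1,B:0.2):0.3,C:0.4);"]))
```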
Post by Eric Talevich
*Tree*
The BaseTree module is meant to be the basis for Newick trees eventually,
so I'd like to get the design right with the minimum number of public
methods:
(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.
I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.
Post by Eric Talevich
(2) Should find() be based on depth_first_search or breadth_first_search
(not checked in yet)? DFS would potentially find a leaf node faster, but BFS
seems more common in phylogenetics. Note that iteration can easily be
reversed with the standard reversed() function, so we don't need extra
functions for those cases.
You could do both, either via an argument or having two methods, say
depth_first_search and breadth_first_search instead of find.
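
To make the options concrete, here is a minimal sketch with a stand-in Node class (not the BaseTree code from the branch). It offers both traversal orders, a find() with a 'terminal' filter and an 'order' argument, and get_leaf_nodes() as a thin convenience wrapper. The 'order' argument and the is_terminal() helper are assumptions made for this example.

```python
from collections import deque


class Node(object):
    """Stand-in tree node with a name and a list of child nodes."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

    def is_terminal(self):
        return not self.children

    def depth_first_search(self):
        """Yield this node, then each subtree in turn (preorder)."""
        yield self
        for child in self.children:
            for node in child.depth_first_search():
                yield node

    def breadth_first_search(self):
        """Yield nodes level by level, starting from this node."""
        queue = deque([self])
        while queue:
            node = queue.popleft()
            yield node
            queue.extend(node.children)

    def find(self, terminal=None, order="depth"):
        """Iterate over nodes; terminal=True/False filters, None keeps all."""
        if order == "depth":
            search = self.depth_first_search
        elif order == "breadth":
            search = self.breadth_first_search
        else:
            raise ValueError("order must be 'depth' or 'breadth'")
        return (node for node in search()
                if terminal is None or node.is_terminal() == terminal)

    def get_leaf_nodes(self):
        """Explicit spelling of the common case."""
        return list(self.find(terminal=True))


root = Node("root", [Node("X", [Node("A"), Node("B")]), Node("C")])
print([n.name for n in root.find(order="depth")])
print([n.name for n in root.find(order="breadth")])
print([n.name for n in root.get_leaf_nodes()])
```

Keeping get_leaf_nodes() as a one-line wrapper costs little and leaves find() as the single place where traversal and filtering are implemented.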
Post by Eric Talevich
(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?
A partial implementation doesn't seem helpful, and wastes memory
allocating unused properties. I would remove it from the base Node,
but a full implementation might be useful for something (would it be
possible via a subclass?).
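
As an illustration of the nested-set idea, the sketch below computes the left and right indexes on demand with one depth-first pass instead of storing them on each node, which sidesteps the question of keeping them up to date. The Node class and function names are stand-ins for this example, not BioSQL or Bio.Tree code.

```python
class Node(object):
    """Stand-in tree node with a name and a list of child nodes."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])


def nested_set_index(root):
    """Return {node: (left, right)} using nested-set numbering.

    A counter is incremented on entering and on leaving each node, so a
    descendant's interval always lies strictly inside its ancestor's.
    """
    index = {}
    counter = [0]

    def visit(node):
        counter[0] += 1
        left = counter[0]
        for child in node.children:
            visit(child)
        counter[0] += 1
        index[node] = (left, counter[0])

    visit(root)
    return index


def is_descendant(index, node, ancestor):
    """True if node lies strictly within ancestor's interval."""
    node_left, node_right = index[node]
    anc_left, anc_right = index[ancestor]
    return anc_left < node_left and node_right < anc_right


a, b, c = Node("A"), Node("B"), Node("C")
x = Node("X", [a, b])
root = Node("root", [x, c])
index = nested_set_index(root)
print(sorted((node.name, span) for node, span in index.items()))
```

Any edit to the tree invalidates the dictionary, so a caller would recompute it after modifications. The recursion depth also grows with tree depth, which may matter for very deep trees.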

On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.
Post by Eric Talevich
(4) There's some mention in the literature of a relationship-matrix
representation for phylogenies. Does anyone here know how to work with this
representation, or know if it would let us perform complex calculations with
blinding speed behind the scenes? If so, should there be a function in
Bio.Tree.Utils to export a tree to a NumPy array represented this way? If
not, I'll forget about it.
I don't know.
Post by Eric Talevich
*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.
I still haven't tracked down my old Reportlab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...

Peter
Jaime Huerta Cepas
2009-09-24 10:45:21 UTC
Hi,

( I'm the developer of ETE. )
I agree that PyQt4 is a significant dependency. I chose it because the
Qt4 QGraphicsScene environment offers many possibilities, like OpenGL
rendering, unlimited image size, performance, and good bindings to Python.
However, I am working on my code to allow the rendering algorithm to use any
other graphical library, so you could render the same tree images using
different backends. If you think this is useful for you, please let me know
and we can think about how to integrate it with Biopython.
Regarding the GUI, it is not a standalone application but one more method
within the Tree objects. The GUI can be started at any point of the
execution and the main program will continue after you close it. I did it
like this because I think it is quite useful for working within interactive
Python sessions.

I develop a lot of code around tree handling, so if you think I can help,
please tell me.
jaime.
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
Eric Talevich
2009-09-25 03:54:05 UTC
Hello, Jaime,

Sorry I didn't respond directly to your earlier post -- I wrote half of an
e-mail, then realized I had no good suggestions on what to do so I scrapped
it.

My Tree and TreeIO code is basically a complete parser for the phyloXML
format, plus a few base classes extracted out in hopes of eventually
creating a unified set of format-independent objects, as in SeqIO and
AlignIO. Your code for working with trees looks much more complete than
mine, so if some of it can be incorporated into Biopython, I think that
would be great.

I see these issues with integration:
1. It's GPL, while Biopython uses a more permissive custom license
resembling the BSD and MIT licenses. Would you be willing and able to
relicense parts of your work for Biopython?

2. Python 2.5 dependency: Biopython still supports Py2.4, so this will
require some compatibility fixes -- not a huge problem.

3. Scipy and numpy dependencies: Numpy is considered a semi-optional
dependency in Biopython, so if it can be imported on the fly by just the
functions that need it (hopefully no core ones), that would be best. If
not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so
it would be better to make that an optional, on-the-fly import, too.

4. PyQt4 is a big package and I'm not sure it's as common in scientists'
Python installations as numpy and scipy, so if the underlying algorithms for
tree layout could be ported to Reportlab, matplotlib or PIL, that would be
ideal. I personally would like to be able to pair sequence snippets with the
leaves of a standard phylogram, so if you need me to do some additional work
to get this section ported to Biopython, I'd consider it time well spent.

5. Presumably, the tree object type in ETE is different from Bio.Tree or
Bio.Nexus, so porting the core tree manipulation code to Biopython would
require a substantial effort somewhere.

6. The PhylomeDB connector is cool and, browsing the source, it looks like it
wouldn't require much effort at all to drop into Biopython.
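
For point 3, an "on-the-fly" import could look roughly like the sketch below: the optional package is imported inside the function that needs it, so the rest of the module still loads when the package is absent. This assumes a modern Python with importlib; MissingDependencyError, require() and distance_matrix() are hypothetical names for illustration, not existing Biopython or ETE functions.

```python
import importlib


class MissingDependencyError(ImportError):
    """Raised when an optional package is needed but not installed."""


def require(module_name, feature):
    """Import and return an optional module, or explain what needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise MissingDependencyError(
            "Install %s to use %s" % (module_name, feature))


def distance_matrix(n_taxa):
    """Hypothetical function that needs NumPy only when it is called."""
    numpy = require("numpy", "distance_matrix()")
    return numpy.zeros((n_taxa, n_taxa))


# Importing this module never touches NumPy; only the call above would.
print(require("math", "a demonstration").sqrt(4.0))
```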

Thanks for letting us know about this.

Cheers,
Eric
Jaime Huerta Cepas
2009-09-25 15:28:36 UTC
Hi Eric,

Thanks for your comments.
I see a lot of parts of ETE that could be used from Biopython. However, for
the moment, we would prefer not to change ETE's current GPL license. As far
as I know, the main difference between GPL and BSD-like licenses is that,
with the latter, you could relicense the code at any moment under any other
policy, including private and closed licenses. The GPL protects against this
by ensuring that any code based on GPL sources must always be GPL-compatible,
and that's why we chose it. Moreover, using a BSD-like license would prevent
us from using a lot of the great GPL code out there.

It is not my purpose to open a debate about licenses. I just wonder if
Biopython could provide a way to link or bind external software, perhaps as
add-ons or plugins. This would be great, since many extra features (not only
from ETE but from other sources) could be added on demand. This would also
mitigate the problem of very specific dependencies, since many of them would
be optional. From my side, I could work on providing bindings between
Biopython and ETE's graphical tree rendering features, inline visualization
GUI, extended Newick support, tree manipulation and the other methods within
the ETE package.

I will be out of the office for several weeks, but if you see any way to
collaborate I will be happy to discuss this a bit more in detail...

Cheers!
Jaime
Eric Talevich
2009-09-25 15:51:15 UTC
Hi Jaime,

Just working on bindings would certainly be easier. The best way to transfer
tree information from Biopython to ETE would be serializing the trees in
phyloXML format (to preserve the annotations) and loading that file in ETE.
I see that ETE allows rich annotation of tree objects, but I don't see
phyloXML or NeXML listed as supported file formats -- is there another
standard format you're using to store this information? If not, I think ETE
would benefit from a phyloXML parser. Since Biopython's license is
GPL-compatible (I believe), you could borrow Bio.TreeIO.PhyloXMLIO directly
and just port the Phylogeny and Clade classes to ETE's base classes instead
of Bio.Tree.BaseTree's Tree and Node classes.
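
To give a feel for what such an interchange file contains, here is a small sketch that writes a nested structure as phyloXML using only the standard library's ElementTree. It is not the Bio.TreeIO.PhyloXMLIO code: the dict-based tree layout and the function names are assumptions, and only the name and branch_length elements are emitted.

```python
import xml.etree.ElementTree as ET

PHYLOXML_NS = "http://www.phyloxml.org"


def clade_element(clade):
    """Convert a dict with optional name, branch_length and clades keys."""
    elem = ET.Element("clade")
    if clade.get("name"):
        ET.SubElement(elem, "name").text = clade["name"]
    if clade.get("branch_length") is not None:
        ET.SubElement(elem, "branch_length").text = str(clade["branch_length"])
    for child in clade.get("clades", []):
        elem.append(clade_element(child))
    return elem


def to_phyloxml(root_clade, rooted=True):
    """Return a phyloXML document (as text) holding a single phylogeny."""
    root = ET.Element("phyloxml", xmlns=PHYLOXML_NS)
    phylogeny = ET.SubElement(root, "phylogeny", rooted=str(rooted).lower())
    phylogeny.append(clade_element(root_clade))
    return ET.tostring(root, encoding="unicode")


tree = {"clades": [
    {"name": "A", "branch_length": 0.1},
    {"name": "B", "branch_length": 0.2},
]}
print(to_phyloxml(tree))
```

A receiving program would parse the same elements back into its own node classes; richer annotations (taxonomy, sequences, confidence values) have their own elements in the phyloXML schema and are left out of this sketch.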

Beyond that, some support for BioSQL to store sequences etc. would also help
link ETE to any of the other Bio* projects. There's some example code in
Biopython's top-level BioSQL directory, if you're interested.

Cheers,
Eric
Jaime Huerta Cepas
2009-09-25 16:13:44 UTC
Hi,
Post by Eric Talevich
Just working on bindings would certainly be easier. The best way to
transfer tree information from Biopython to ETE would be serializing the
trees in phyloXML format (to preserve the annotations) and loading that file
in ETE. I see that ETE allows rich annotation of tree objects, but I don't
see phyloXML or NeXML listed as supported file formats -- is there another
standard format you're using to store this information?
Extended Newick (http://www.phylosoft.org/NHX/) is the only rich format
currently supported by ETE. However, this standard only allows text-string
representations of tree node annotations. Beyond this, you would need a
cPickle approach to save complex annotated trees. I'm certainly interested
in PhyloXML and NeXML support, so this could be a nice starting point.

Post by Eric Talevich
If not, I think ETE would benefit from a phyloXML parser. Since Biopython
license is GPL-compatible (I believe), you could borrow
Bio.TreeIO.PhyloXMLIO directly and just port the Phylogeny and Clade classes
to ETE's base classes instead of Bio.Tree.BaseTree's Tree and Node classes.
I think there is no problem using BSD-licensed code in GPL sources; the
problem would be the other way around. I will take a look at your
phyloxml code to find the best way to bind both packages through phyloXML
serialization.
Post by Eric Talevich
Beyond that, some support for BioSQL to store sequences etc. would also
help link ETE to any of the other Bio* projects. There's some example code
in Biopython's top-level BioSQL directory, if you're interested.
Ok. I'll take a look also. Thanks.

cheers,
Jaime.
ideal. I personally would like to be able to pair sequence snippets with the
leaves of a standard phylogram, so if you need me to do some additional work
to get this section ported to Biopython, I'd consider it time well spent.
5. Presumably, the tree object type in ETE is different from Bio.Tree or
Bio.Nexus, so porting the core tree manipulation code to Biopython would
require a substantial effort somewhere.
6. The PhylomeDB connector is cool, and browsing the source, looks like
it wouldn't require much effort at all to drop into Biopython.
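
Point 3 above, importing numpy on the fly in only the functions that need it,
might look like the following sketch. The function name branch_length_mean is
made up for illustration, and the pure-Python fallback is an assumption, not
something Biopython is known to do:

```python
def branch_length_mean(lengths):
    """Return the mean of a non-empty sequence of branch lengths.

    numpy is imported inside the function, so merely importing this
    module never requires numpy to be installed.
    """
    try:
        import numpy  # optional dependency, loaded only when needed
    except ImportError:
        numpy = None

    if numpy is not None:
        return float(numpy.mean(lengths))
    # Fallback path for installations without numpy.
    return sum(lengths) / float(len(lengths))


print(branch_length_mean([1.0, 2.0, 3.0]))
```

The same pattern would apply to scipy; a function with no sensible fallback
could instead raise an ImportError with a message naming the missing package.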
Thanks for letting us know about this.
Cheers,
Eric
Post by Jaime Huerta Cepas
Hi,
( I'm the developer of ETE. )
I agree that PyQt4 is an important dependence. I chose it because
Qt4-QGraphicsScene environment offers many possibilities like openGL
rendering, unlimited image size, performance, and good bindings to python.
However, I am working on my code to allow the rendering algorithm to use any
other graphical library. So, you could render the same tree images using
different backends. If you think this is useful for you, please let me know
and we can think about how to integrate it with biopython.
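
One way to picture a rendering algorithm that is independent of the graphics
library is a layout routine that only talks to a small backend interface.
This is a toy sketch, not ETE's design: RecordingBackend, its line/text
methods, and the (name, children) tuple tree are all hypothetical.

```python
class RecordingBackend(object):
    """Hypothetical backend that records drawing calls instead of drawing.

    A Qt, ReportLab or matplotlib backend would expose the same two methods.
    """

    def __init__(self):
        self.calls = []

    def line(self, x1, y1, x2, y2):
        self.calls.append(("line", x1, y1, x2, y2))

    def text(self, x, y, label):
        self.calls.append(("text", x, y, label))


def render(tree, backend, x=0, y=0, step=10):
    """Crude layout of a (name, children) tree; returns the next free row.

    Leaves are labelled one per row; each child gets a horizontal branch.
    """
    name, children = tree
    if not children:
        backend.text(x, y, name)
        return y + step
    for child in children:
        backend.line(x, y, x + step, y)
        y = render(child, backend, x + step, y, step)
    return y


backend = RecordingBackend()
next_row = render(("root", [("A", []), ("B", [])]), backend)
labels = [call[3] for call in backend.calls if call[0] == "text"]
print(labels, next_row)
```

Swapping the backend object changes the output format without touching the
layout code, which is the property described above.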
Regarding the GUI, it is not a standalone application but one more
method within the Tree objects. The GUI can be started at any point of the
execution and the main program will continue after you close it. I did it
like this because I think it is quite useful for working within interactive
python sessions.
I develop a lot of code around tree handling, so if you think I can
help, please tell me.
jaime.
Post by Eric Talevich
*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.
I still haven't tracked down my old ReportLab code, but it wasn't object
oriented and would need a lot of work to bring up to standard...
Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
Peter
2009-09-25 16:22:40 UTC
Permalink
Post by Jaime Huerta Cepas
I think there is no problem in using BSD license from GPL sources, the
problem would be in the other way around.
Yes, that way round is fine from a license point of view (taking Biopython's
BSD/MIT style licensed code and using it in a GPL project). But we can't
take your GPL code into Biopython unless you re-license it more liberally.

I can see the appeal of the (L)GPL for forcing the code to stay open, but
Biopython (like Python) went for the other option of basically letting anyone
use the code in any way they like.

Peter
Hilmar Lapp
2009-09-25 20:58:36 UTC
Permalink
Post by Jaime Huerta Cepas
As far as I know, the main difference between GPL and BSD-like
licenses is that, with the second, you could relicense the code at
any moment under any other policy, including private and close
licenses.
This is not true. None of the open-source licenses that I'm aware of
allows anyone to relicense code under a license that is less liberal,
or to relicense code at all. It is the copyright owner who can
relicense code, not the distributor.

One of the differences between GPL and BSD is that GPL is viral.
Specifically, code that links to GPL-licensed code must also be GPL-
licensed *when it is distributed.*

(It is a common misconception that GPL is unconditionally viral. I can
take GPL code and link to it and keep my code closed source for as
long as I please if I never redistribute it. GPL was written with
software vendors in mind, whose business consists of distributing
software for commercial gain. GPL has therefore sometimes been called
anti-commercial. This is wrong, too, but I won't go into the details
here.)

Biopython can freely utilize GPL-licensed (or closed source, for that
matter) software if it doesn't link to it. IANAL but I think it can
also redistribute GPL-licensed code along with Biopython so long as
Biopython doesn't link to it, and it is made clear that some of the
distribution falls under a different license than BSD. (Linux
distributions mix BSD and GPL software, too.)

As for ETE itself, a BSD/MIT style license seems to be by far the most
widely used license for Python modules. If you want to facilitate
adoption of the software as a library by other programmers, GPL is
going to stand in the way of that. Also, really all that you are
accomplishing with GPL is that a software company can't take advantage
of ETE. Is that your chief concern? GPL won't prevent any scientific
lab from writing closed source code that builds on ETE and publishing
the results, so long as they don't distribute their closed source code.


-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
Brad Chapman
2009-09-25 21:48:00 UTC
Permalink
Hi all;
Hilmar -- thanks for writing up a nice summary of the license
details. Jaime, I think it's a shame we would let these issues
prevent working together. It sounds like you and Eric have some
shared goals and it would be great to see that evolve into some
useful functionality in Biopython.

Generally, the BSD-like license which Biopython uses encourages
cooperation and keeps people in both academia and industry happy. As
scientists, our goal should be to avoid letting these types of issues
prevent collaboration. Truthfully, there is very little opportunity
for exploitation of bioinformatics software; the economics are just not
there for companies to sell code.
Post by Hilmar Lapp
(It is a common misconception that GPL is unconditionally viral. I can
take GPL code and link to it and keep my code closed source for as
long as I please if I never redistribute it. GPL was written with
software vendors in mind, whose business consists of distributing
software for commercial gain. GPL has therefore sometimes been called
anti-commercial. This is wrong, too, but I won't go into the details
here.)
I agree 100%, but in practical terms it is very difficult to have this
argument at a company. Speaking from experience, GPL creates all kinds
of nasty thoughts in people's heads which prevent adoption of code in
corporate environments. For Biopython and other bioinformatics projects,
we should be actively encouraging contributions from companies as
well as academia.
Post by Hilmar Lapp
Biopython can freely utilize GPL-licensed (or closed source, for that
matter) software if it doesn't link to it. IANAL but I think it can
also redistribute GPL-licensed code along with Biopython so long as
Biopython doesn't link to it, and it is made clear that some of the
distribution falls under a different license than BSD. (Linux
distributions mix BSD and GPL software, too.)
Yes, but this complication is bad. Let's keep it simple,
Brad
Hilmar Lapp
2009-09-26 15:25:41 UTC
Permalink
Post by Brad Chapman
I agree 100%, but in practical terms it is very difficult to have this
argument at a company.
Yes, I know.
Post by Brad Chapman
For Biopython and other bioinformatics projects, we should be
actively encouraging contributions from companies as well as academia.
Having worked in the commercial and private sector for almost a decade, I
couldn't agree more. There is a huge amount of open-source code
development contributed by people working in the private sector, which
is hence sponsored by companies.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
Jaime Huerta Cepas
2009-09-26 17:28:02 UTC
Permalink
Hi Brad,

Post by Brad Chapman
Jaime, I think it's a shame we would let these issues
prevent working together. It sounds like you and Eric have some
shared goals and it would be great to see that evolve into some
useful functionality in Biopython.
Sure!! My only intention was to find the best way to contribute!
However, the "viral" GPL license was chosen specifically for
this reason: encouraging free software and academic scientific
resources.
We have a lot of shared goals, so I trust we will find a happy way to
collaborate.

Jaime.
--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
Jaime Huerta Cepas
2009-09-26 17:12:59 UTC
Permalink
Hey! Sorry, it was not my intention to start a flame war about licences, nor
to sound rude. I apologize if I did.
Post by Jaime Huerta Cepas
As far as I know, the main difference between GPL and BSD-like licenses is
that, with the second, you could relicense the code at any moment under any
other policy, including private and close licenses.
Post by Hilmar Lapp
This is not true. None of the open-source licenses that I'm aware of allows
anyone to relicense code under a license that is less liberal, or to
relicense code at all. It is the copyright owner who can relicense code, not
the distributor.
I'm not an expert on software licences, so I cannot go into this issue
very deeply. What I said in my previous email is what I understood
from these pages: http://www.gnu.org/philosophy/license-list.html,
http://www.gnu.org/philosophy/categories.html#Non-CopyleftedFreeSoftware
If I was wrong and modified BSD-like sources cannot be relicensed under
other less liberal licenses, then we will gladly consider changing the
ETE license in the future.
Post by Hilmar Lapp
One of the differences between GPL and BSD is that GPL is viral.
Specifically, code that links to GPL-licensed code must also be GPL-licensed
*when it is distributed.*
(It is a common misconception that GPL is unconditionally viral. I can take
GPL code and link to it and keep my code closed source for as long as I
please if I never redistribute it. GPL was written with software vendors in
mind, whose business consists of distributing software for commercial gain.
GPL has therefore sometimes been called anti-commercial. This is wrong, too,
but I won't go into the details here.)
I see, so the only problem is about distribution...


Post by Hilmar Lapp
Biopython can freely utilize GPL-licensed (or closed source, for that
matter) software if it doesn't link to it. IANAL but I think it can also
redistribute GPL-licensed code along with Biopython so long as Biopython
doesn't link to it, and it is made clear that some of the distribution falls
under a different license than BSD. (Linux distributions mix BSD and GPL
software, too.)
Yes, I agree. This is what I meant by Biopython addons. With this in mind,
Biopython could be aware of much other software out there and benefit from
it. Is there any work along these lines in Biopython?


Post by Hilmar Lapp
As for ETE itself, a BSD/MIT style license seems to be by far the most
widely used license for Python modules. If you want to facilitate adoption
of the software as a library by other programmers, GPL is going to stand in
the way of that. Also, really all that you are accomplishing with GPL is
that a software company can't take advantage of ETE. Is that your chief
concern?
Well, our intention was that code based on ETE sources (other tools or
improvements) would also be distributed/published as free software. We also
wanted to leave the door open to using other GPL software from ETE.
Post by Hilmar Lapp
GPL won't prevent any scientific lab from writing closed source code that
builds on ETE and publishing the results, so long as they don't distribute
their closed source code.
Yes. You are right. We don't want to prevent this.

In any case, thanks for your comments. I will try to get more info about
what you say and, if we have to modify something, we will do it. :)

cheers,
Jaime
--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
Eric Talevich
2009-09-25 15:51:15 UTC
Permalink
Hi Jaime,

Just working on bindings would certainly be easier. The best way to transfer
tree information from Biopython to ETE would be serializing the trees in
phyloXML format (to preserve the annotations) and loading that file in ETE.
I see that ETE allows rich annotation of tree objects, but I don't see
phyloXML or NeXML listed as supported file formats -- is there another
standard format you're using to store this information? If not, I think ETE
would benefit from a phyloXML parser. Since Biopython's license is
GPL-compatible (I believe), you could borrow Bio.TreeIO.PhyloXMLIO directly
and just port the Phylogeny and Clade classes to ETE's base classes instead
of Bio.Tree.BaseTree's Tree and Node classes.

Beyond that, some support for BioSQL to store sequences etc. would also help
link ETE to any of the other Bio* projects. There's some example code in
Biopython's top-level BioSQL directory, if you're interested.

Cheers,
Eric
Hilmar Lapp
2009-09-25 20:58:36 UTC
Permalink
As far as I know, the main difference between GPL and BSD-like
licenses is that, with the second, you could relicense the code at
any moment under any other policy, including private and close
licenses.
This is not true. None of the open-source licenses that I'm aware of
allows anyone to relicense code under a license that is less liberal,
or to relicense code at all. It is the copyright owner who can
relicense code, not the distributor.

One of the differences between GPL and BSD is that GPL is viral.
Specifically, code that links to GPL-licensed code must also be GPL-
licensed *when it is distributed.*

(It is a common misconception that GPL is unconditionally viral. I can
take GPL code and link to it and keep my code closed source for as
long as I please if I never redistribute it. GPL was written with
software vendors in mind, whose business consists of distributing
software for commercial gain. GPL has therefore sometimes been called
anti-commercial. This is wrong, too, but I won't go into the details
here.)

Biopython can freely utilize GPL-licensed (or closed source, for that
matter) software if it doesn't link to it. IANAL but I think it can
also redistribute GPL-licensed code along with Biopython so long as
Biopython doesn't link to it, and it is made clear that some of the
distribution falls under a different license than BSD. (Linux
distributions mix BSD and GPL software, too.)

As for ETE itself, a BSD/MIT style license seems to be by far the most
widely used license for Python modules. If you want to facilitate
adoption of the software as a library by other programmers, GPL is
going to stand in the way of that. Also, really all that you are
accomplishing with GPL is that a software company can't take advantage
of ETE. Is that your chief concern? GPL won't prevent any scientific
lab from writing closed source code that builds on ETE and publishing
the results, so long as they don't distribute their closed source code.


-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
Jaime Huerta Cepas
2009-09-25 15:28:36 UTC
Permalink
Hi Eric,

Thanks for your comments,
I really see a lot of parts of ETE that could potentially be used from
biopython; however, for the moment, we would prefer not to modify
ETE's current GPL license. As far as I know, the main difference between
GPL and BSD-like licenses is that, with the second, you could relicense the
code at any moment under any other policy, including private and closed
licenses. GPL includes a protection against this by ensuring that any code based
on GPL sources must always be GPL compatible, and that's why we have chosen
it. Moreover, the use of a BSD-like license would prevent us from using a lot of
great GPL code out there.

It is not my purpose to open a debate about licenses. I just wonder if
biopython could provide any way to link/bind external software, perhaps as
addons or plugins. This would be great, since many extra features (not only
from ETE but from other sources) could be added on demand. This
would also mitigate the problem of very specific dependencies, since many of
them would be optional. From my side, I could work on providing bindings
between biopython and ETE's tree graphical rendering features, inline
visualization GUI, extended newick support, tree manipulation and the
methods within the ETE package.

I will be out of the office for several weeks, but if you see any way to
collaborate I will be happy to discuss this a bit more in detail...

Cheers!
Jaime
--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
Eric Talevich
2009-09-25 03:54:05 UTC
Permalink
Hello, Jaime,

Sorry I didn't respond directly to your earlier post -- I wrote half of an
e-mail, then realized I had no good suggestions on what to do so I scrapped
it.

My Tree and TreeIO code is basically a complete parser for the phyloXML
format, plus a few base classes extracted out in hopes of eventually
creating a unified set of format-independent objects, as in SeqIO and
AlignIO. Your code for working with trees looks much more complete than
mine, so if some of it can be incorporated into Biopython, I think that
would be great.

I see these issues with integration:
1. It's GPL, while Biopython uses a more permissive custom license
resembling the BSD and MIT licenses. Would you be willing and able to
relicense parts of your work for Biopython?

2. Python 2.5 dependency: Biopython still supports Py2.4, so this will
require some compatibility fixes -- not a huge problem.

3. Scipy and numpy dependencies: Numpy is considered a semi-optional
dependency in Biopython, so if it can be imported on the fly by just the
functions that need it (hopefully no core ones), that would be best. If
not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so
it would be better to make that an optional, on-the-fly import, too.

4. PyQt4 is a big package and I'm not sure it's as common in scientists'
Python installations as numpy and scipy, so if the underlying algorithms for
tree layout could be ported to Reportlab, matplotlib or PIL, that would be
ideal. I personally would like to be able to pair sequence snippets with the
leaves of a standard phylogram, so if you need me to do some additional work
to get this section ported to Biopython, I'd consider it time well spent.

5. Presumably, the tree object type in ETE is different from Bio.Tree or
Bio.Nexus, so porting the core tree manipulation code to Biopython would
require a substantial effort somewhere.

6. The PhylomeDB connector is cool, and browsing the source, looks like it
wouldn't require much effort at all to drop into Biopython.
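
The "imported on the fly" approach from point 3 could look roughly like the sketch below. This is only an illustration under stated assumptions: `optional_import`, `MissingDependencyError` and `branch_length_stats` are hypothetical names, not existing Biopython or ETE functions.

```python
import importlib


class MissingDependencyError(ImportError):
    """Hypothetical error raised when an optional package is not installed."""


def optional_import(name):
    """Import a module by name only at the point where it is needed."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise MissingDependencyError(
            "This feature requires the optional package %r" % name
        ) from err


def branch_length_stats(lengths):
    """Hypothetical non-core helper: mean and std. dev. of branch lengths.

    numpy is imported inside the function, so importing this module
    never fails just because numpy is absent.
    """
    np = optional_import("numpy")
    values = np.asarray(lengths, dtype=float)
    return float(values.mean()), float(values.std())
```

With this layout only callers of `branch_length_stats()` need numpy; everything else in the module stays usable without it.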

Thanks for letting us know about this.

Cheers,
Eric
Post by Jaime Huerta Cepas
Hi,
( I'm the developer of ETE. )
I agree that PyQt4 is an important dependence. I chose it because
Qt4-QGraphicsScene environment offers many possibilities like openGL
rendering, unlimited image size, performance, and good bindings to python.
However, I am working on my code to allow the rendering algorithm to use any
other graphical library. So, you could render the same tree images using
different backends. If you think this is useful for you, please let me know
and we can think about how to integrate it with biopython.
Regarding the GUI, it is not a standalone application but one more method
within the Tree objects. The GUI can be started at any point of the
execution and the main program will continue after you close it. I did it
like this because I think it is quite useful for working within interactive
python sessions.
I develop a lot of code around tree handling, so if you think I can help,
please tell me.
jaime.
Post by Eric Talevich
*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.
I still haven't tracked down my old report lab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...
Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
Brad Chapman
2009-09-24 12:08:00 UTC
Permalink
Eric and Peter;
Looking forward to seeing the PhyloXML work merged into the main
branch. Eric, thanks for posting the summary of where things are at.
Post by Peter
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).
Agreed that this would be nice to have, but I'm not sure why it's
blocking getting the base TreeIO framework and all of PhyloXML into
the main branch. That's a major step forward from the format
specific phylogenetic code we had before and gets us a portion of
the way there.

Next up should be moving over Bio.Nexus to the new framework and
then conversions, but this is another project. I think we should
take this one step at a time.
Post by Peter
Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.
Agreed. As we move Nexus over, we should be sure to keep current
functionality.
Post by Peter
Post by Eric Talevich
(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.
I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.
I'm for keeping it as well, and just having the underlying
implementation of get_leaf_nodes call find with the right arguments.
This seems like an operation that should be dead obvious to do.
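
A minimal sketch of that arrangement, assuming a simplified stand-in `Node` class (not the actual Bio.Tree.BaseTree code) whose `find()` takes the 'terminal' argument described above:

```python
class Node:
    """Simplified stand-in for a tree node, for illustration only."""

    def __init__(self, label=None, children=None):
        self.label = label
        self.children = list(children or [])

    def is_terminal(self):
        return not self.children

    def find(self, terminal=None):
        """Yield nodes depth-first (preorder).

        terminal=True yields only leaves, terminal=False only internal
        nodes, and terminal=None yields every node.
        """
        if terminal is None or terminal == self.is_terminal():
            yield self
        for child in self.children:
            yield from child.find(terminal=terminal)

    def get_leaf_nodes(self):
        """Explicit, easy-to-find entry point; find() does the work."""
        return list(self.find(terminal=True))


tree = Node("root", [Node("A"), Node("inner", [Node("B"), Node("C")])])
leaf_labels = [leaf.label for leaf in tree.get_leaf_nodes()]
```

The explicit method adds one line of implementation but keeps the most common query visible in the API.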
Post by Peter
Post by Eric Talevich
(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?
Again I agree with Peter here -- this would be best supported as a
subclass that is database aware with an identical API, similar to
how the Seq objects and BioSQL Seq objects work. This avoids any
overhead for the in-memory case, which will be more common, but
gives you a point to implement the useful database representation
code in the future. If you don't have time to work on all of this
right now, I'd leave the nested-set stuff out and keep it in mind as
a future addition.
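
As a rough illustration of the subclass idea (an assumption-laden sketch, not BioSQL code): `Node`, `DBNode` and the `fetch_children` callback are hypothetical, and a plain dict stands in for the database table.

```python
class Node:
    """In-memory node: children are stored directly, no database overhead."""

    def __init__(self, label, children=None):
        self.label = label
        self._children = list(children or [])

    @property
    def children(self):
        return self._children

    def get_leaf_nodes(self):
        if not self.children:
            return [self]
        leaves = []
        for child in self.children:
            leaves.extend(child.get_leaf_nodes())
        return leaves


class DBNode(Node):
    """Same API as Node, but children are fetched on first access."""

    def __init__(self, label, fetch_children):
        Node.__init__(self, label)
        self._fetch_children = fetch_children  # hypothetical query callback
        self._loaded = False

    @property
    def children(self):
        if not self._loaded:
            names = self._fetch_children(self.label)
            self._children = [DBNode(n, self._fetch_children) for n in names]
            self._loaded = True
        return self._children


# A dict stands in for a taxon table keyed by parent label.
table = {"root": ["Bacteria", "Archaea"], "Bacteria": ["E. coli", "B. subtilis"]}
queries = []


def fetch(label):
    queries.append(label)
    return table.get(label, [])


root = DBNode("root", fetch)
```

Code written against `Node` (here `get_leaf_nodes()`) works unchanged on `DBNode`, and nothing is queried until a node's children are actually requested.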

Brad
Peter
2009-09-24 17:59:06 UTC
Permalink
Post by Brad Chapman
Eric and Peter;
Looking forward to seeing the PhyloXML work merged into the main
branch. Eric, thanks for posting the summary of where things are at.
Post by Peter
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).
Agreed that this would be nice to have, but I'm not sure why it's
blocking getting the base TreeIO framework and all of PhyloXML into
the main branch. That's a major step forward from the format
specific phylogenetic code we had before and gets us a portion of
the way there.
If the Newick/Nexus TreeIO parsers return one object type while the
PhyloXML TreeIO parser returns another *incompatible* object type,
then we don't have a unified tree input/output framework. Furthermore,
if you did release this and then later standardise on a single tree object,
you'd break backwards compatibility. All in all, best avoided.
Post by Brad Chapman
Next up should be moving over Bio.Nexus to the new framework and
then conversions, but this is another project. I think we should
take this one step at a time.
What we could do in the short term is ignore Bio.Nexus.Trees, and
just leave it as is. Instead of having the Newick/Nexus TreeIO code
calling the old Bio.Nexus.Trees code, we just write some new code
(possibly based on old code) which will use Eric's new objects.

We could then (gradually, perhaps by adding a runtime option to
the Nexus parsing API) move Bio.Nexus over from using the old
Bio.Nexus.Trees code to the new TreeIO, and eventually deprecate
and then remove Bio.Nexus.Trees.

Peter
Eric Talevich
2009-09-25 04:34:17 UTC
Permalink
Hi Peter,

Thanks for the feedback.
Post by Peter
One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry. Easy.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).
I'll delete that sentence. I don't know why it's there -- you're right, it's
easy to return an iterable regardless of what the format itself supports.

Post by Peter
On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and
Bio.TreeIO.write() functions *only* and have the underlying format specific
code just use handles. This avoids the code duplication.
I did the handle management case-by-case because some of the underlying
libraries already do filename-to-handle conversion -- ElementTree and
Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
ad-hoc handle management, but of course I can move it all to the top if you
think it's best. One day, perhaps we'll have a context manager that we can
reuse everywhere to make magic easy:

with maybe_open(file) as handle:
    tree = FooIO.parse(handle)

Not today, though.
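
For reference, such a helper can be sketched with contextlib. This is only one possible shape, assuming filenames are plain strings; `maybe_open` and the top-level `parse` wrapper below are hypothetical, not existing Biopython functions.

```python
import io
from contextlib import contextmanager


@contextmanager
def maybe_open(source, mode="r"):
    """Yield a handle for either a filename or an already-open handle."""
    if isinstance(source, str):
        handle = open(source, mode)
        try:
            yield handle
        finally:
            handle.close()  # we opened it, so we close it
    else:
        # The caller owns this handle; leave it open on exit.
        yield source


def parse(source, parser):
    """Hypothetical top-level function: handle logic lives only here,
    and the format-specific parser just receives a handle."""
    with maybe_open(source) as handle:
        for tree in parser(handle):
            yield tree


def line_parser(handle):
    """Toy format-specific parser: one Newick string per line."""
    for line in handle:
        yield line.strip()


demo_handle = io.StringIO("(A,B);\n(C,D);\n")
demo_trees = list(parse(demo_handle, line_parser))
```

Putting the filename-or-handle decision in one place is what would let the individual parsers drop their own copies of that logic.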
Post by Peter
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).
I could comment out the 'nexus' and 'newick' lines from the
supported_formats dict. That would disable the top-level functions but leave
the direct NexusIO and NewickIO equivalents intact until the port is
complete.


Post by Peter
Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.
That's intentional, I was just going to port those methods directly from
Bio.Nexus.Trees rather than invent a new API myself.

Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
combined parsers and object representations. My goal is to chop out the
pure-object parts and merge them into Bio.Tree, and let the remaining
parsers return objects built from the new Bio.Tree classes. This looks like
it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
done.

For backward compatibility, I'll leave some wrappers that trigger
DeprecationWarnings in the original places. Nexus.Trees can probably be
reduced to:

import warnings
warnings.warn("Use Bio.Tree and Bio.TreeIO instead", DeprecationWarning)

from Bio.Tree.Newick import *
from Bio.TreeIO.NewickIO import *

(more or less)
Post by Peter
Post by Eric Talevich
(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?
To get the Newick tree, you can just call str(tree), which is basically what
you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
more complicated. You'll need to create a minimal Nexus file - have a
look at the Bio.AlignIO.NexusIO code. An alternative to look at is having
a hard coded nexus template, and just insert the tree as a Newick string
(and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
OK, thanks, I'll give it a shot. I see some default Nexus template stuff in
Bio.Nexus.Nexus already.
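
The hard-coded template idea could be sketched as follows. This is an illustrative guess, not the template in Bio.Nexus.Nexus: `NEXUS_TEMPLATE`, `taxa_from_newick` and `write_nexus` are hypothetical names, and the label extraction only handles simple unquoted taxon names.

```python
import re

# Minimal NEXUS skeleton: a taxa block plus a trees block.
NEXUS_TEMPLATE = """\
#NEXUS
Begin Taxa;
 Dimensions NTax=%(count)d;
 TaxLabels %(labels)s;
End;
Begin Trees;
%(trees)s
End;
"""


def taxa_from_newick(newick):
    """Extract leaf labels from a Newick string (unquoted labels only).

    A leaf label directly follows "(" or ","; internal node labels
    follow ")" and branch lengths follow ":", so neither is matched.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def write_nexus(newick_trees):
    """Return NEXUS text embedding the given Newick strings."""
    taxa = []
    for newick in newick_trees:
        for label in taxa_from_newick(newick):
            if label not in taxa:
                taxa.append(label)
    tree_lines = "\n".join(
        " Tree tree%d = %s;" % (i, newick.strip().rstrip(";"))
        for i, newick in enumerate(newick_trees, start=1)
    )
    return NEXUS_TEMPLATE % {
        "count": len(taxa),
        "labels": " ".join(taxa),
        "trees": tree_lines,
    }


nexus_text = write_nexus(["(A:0.1,(B:0.2,C:0.3):0.4);"])
```

Because the trees go in as Newick strings, this route does not depend on the original Nexus module having parsed them.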
Post by Peter
Post by Eric Talevich
*Tree*
The BaseTree module is meant to be the basis for Newick trees eventually,
so I'd like to get the design right with the minimum number of public
methods:
(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.
I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.
I think .find(terminal=True) will do the right thing and looks reasonably
simple, but as Brad said, this is a ridiculously common operation so finding
it in the API should be ridiculously easy. I'll keep get_leaf_nodes() but
rename it to get_leaves(), and rename find() to findall() (to match
ElementTree and make it clear that it returns an iterable).
Post by Peter
Post by Eric Talevich
(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?
A partial implementation doesn't seem helpful, and wastes memory
allocating unused properties. I would remove it from the base Node,
but a full implementation might be useful for something (would it be
possible via a subclass?).
On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.
Doing BioSQL integration was on the original roadmap, but research hasn't
taken me back there lately. I would like to do it eventually... anyway, that
would solve the indexing issue nicely. I'll drop the extra attributes -- I
get the impression they're not meant to be accessed directly in BioSQL
either, so there's no use for them in Biopython.


Cheers,
Eric
Peter
2009-09-25 09:59:08 UTC
Permalink
Post by Eric Talevich
Post by Peter
On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.
Doing BioSQL integration was on the original roadmap, but research hasn't
taken me back there lately. I would like to do it eventually... anyway, that
would solve the indexing issue nicely. I'll drop the extra attributes -- I
get the impression they're not meant to be accessed directly in BioSQL
either, so there's no use for them in Biopython.
As things stand, there is no usage of the left/right index fields in
Biopython.

The current Biopython BioSQL code focusses on the database
variants of the Seq and SeqRecord objects. The only interaction
with the taxon tables is to load/retrieve the species annotations,
and for this we don't need the complications of the left/right index.
We leave them empty if we populate the taxonomy via Entrez
(recalculating the left/right values is computationally expensive).

However, any "DBTaxonTree" object (or whatever we call it) could
potentially offer us a way to (a) populate and (b) use these
alternative indexes as a way to speed up various subtree operations.

Peter
Hilmar Lapp
2009-09-25 11:39:03 UTC
Permalink
On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com
Post by Eric Talevich
Post by Peter
On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.
Doing BioSQL integration was on the original roadmap, but research hasn't
taken me back there lately. I would like to do it eventually... anyway, that
would solve the indexing issue nicely. I'll drop the extra attributes -- I
get the impression they're not meant to be accessed directly in BioSQL
either, so there's no use for them in Biopython.
As things stand, there is no usage of the left/right index fields in
Biopython.
The left/right fields are really a crutch for doing hierarchical
(recursive) queries in SQL more efficiently. SQL doesn't have native
support for recursive queries, and the left/right index values allow
you to rewrite an otherwise recursive query as a single-hit set.

Within an object-oriented programming language that supports recursion
these values are of no use - they don't let you traverse a tree faster
than you would already be able to do through recursing up or down your
tree data structure. If there's a natural order of nodes, you can
speed up finding nodes through binary search. But for pulling out
lineages or subtrees I doubt that this will help at all - it'll have
to be your data structure (such as having double links) that makes
those operations efficient.
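
To make the SQL side concrete, here is a minimal sketch of the nested-set idea using the standard-library sqlite3 module. The table is a simplified stand-in (three columns, toy taxon names), not the real BioSQL schema, and the helper names are hypothetical. Each node gets a left and right number from one depth-first walk; every descendant's pair then falls strictly inside its ancestor's pair, so a subtree comes back from a single non-recursive query.

```python
import sqlite3

# Toy taxonomy as child lists; the names are illustrative only.
children = {
    "root": ["Bacteria", "Eukaryota"],
    "Bacteria": ["E. coli", "B. subtilis"],
    "Eukaryota": ["H. sapiens"],
}


def assign_nested_set(node, counter, rows):
    """Depth-first walk giving each node a (left, right) pair."""
    left = counter[0]
    counter[0] += 1
    for child in children.get(node, []):
        assign_nested_set(child, counter, rows)
    right = counter[0]
    counter[0] += 1
    rows.append((node, left, right))


rows = []
assign_nested_set("root", [1], rows)

conn = sqlite3.connect(":memory:")
# Simplified stand-in for a taxon table; not the real BioSQL schema.
conn.execute("CREATE TABLE taxon (name TEXT, left_value INT, right_value INT)")
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", rows)


def subtree(name):
    """All descendants of `name`, fetched with one non-recursive query."""
    query = """
        SELECT child.name
        FROM taxon AS parent JOIN taxon AS child
          ON child.left_value > parent.left_value
         AND child.right_value < parent.right_value
        WHERE parent.name = ?
        ORDER BY child.left_value
    """
    return [row[0] for row in conn.execute(query, (name,))]


bacteria = subtree("Bacteria")
```

The catch is the one already raised in this thread: adding or moving a node shifts the numbers of many other rows, so the values have to be recalculated whenever the tree changes.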

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
Peter
2009-09-25 10:08:56 UTC
Permalink
Post by Eric Talevich
Post by Peter
One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).
I'll delete that sentence. I don't know why it's there -- you're right, it's
easy to return an iterable regardless of what the format itself supports.
OK.
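As an illustration of why an iterator-returning parse() keeps calling code generic, here is a small sketch. The function parse_trees() is hypothetical and only splits Newick text on ';' (it ignores quoted labels); it is not the Bio.TreeIO implementation.

```python
from io import StringIO


def parse_trees(handle):
    """Hypothetical parser: yield one Newick string per ';'-terminated tree.

    Being a generator, it yields zero, one or many trees with no special cases.
    """
    buffer = ""
    for line in handle:
        buffer += line.strip()
        while ";" in buffer:
            tree, buffer = buffer.split(";", 1)
            if tree:
                yield tree + ";"


def count_trees(text):
    # The same loop works however many trees the input holds.
    return sum(1 for _tree in parse_trees(StringIO(text)))


counts = [
    count_trees(""),
    count_trees("(A,B);\n"),
    count_trees("(A,B);\n(C,(D,E));\n"),
]
```

A caller that loops over the result never needs to know whether the format allows several trees per file.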
Post by Eric Talevich
Post by Peter
On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
and Bio.TreeIO.write() functions *only* and have the underlying format
specific code just use handles. This avoids the code duplication.
I did the handle management case-by-case because some of the underlying
libraries already do filename-to-handle conversion -- ElementTree and
Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
ad-hoc handle management, but of course I can move it all to the top if you
think it's best.
Having a single layer of handle/filename conversion in Bio.TreeIO does
seem cleanest to me (even if some of the back ends allow either) and
will ensure our code is consistent.
Post by Eric Talevich
One day, perhaps we'll have a context manager that we can
reuse everywhere to make magic easy:

with maybe_open(file) as handle:
    tree = FooIO.parse(handle)

Not today, though.
Not yet, no. For one thing we'll have to phase out Python 2.4 support.
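For reference, a sketch of what such a helper might look like, assuming the name maybe_open() from the message above and the standard-library contextlib module (added, like the with statement, in Python 2.5). The read_text() function is a hypothetical stand-in for a top-level TreeIO function; the backends underneath it would only ever see handles.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def maybe_open(file, mode="r"):
    """Yield a handle for either a filename or an already-open handle.

    A file opened here is closed on exit; a handle passed in by the
    caller is left open, since the caller owns it.
    """
    if isinstance(file, str):
        handle = open(file, mode)
        try:
            yield handle
        finally:
            handle.close()
    else:
        yield file


def read_text(file):
    """Hypothetical top-level function: the only place that sees filenames."""
    with maybe_open(file) as handle:
        return handle.read()


# Usage: the same call accepts a path or an open handle.
path = os.path.join(tempfile.mkdtemp(), "example.nwk")
with open(path, "w") as out:
    out.write("(A,B);")

from_name = read_text(path)
with open(path) as already_open:
    from_handle = read_text(already_open)
    still_open = not already_open.closed
```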
Post by Eric Talevich
Post by Peter
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two,
so converting between those formats is not possible until Nexus.Trees
is ported over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
would actually let you do phyloxml -> newick, and phyloxml -> nexus
(and assuming that phyloxml allows very minimal trees, the reverse
as well). It does look like the best plan is to use the same tree objects
for all three (updating Bio.Nexus if possible).
I could comment out the 'nexus' and 'newick' lines from the
supported_formats dict. That would disable the top-level functions
but leave the direct NexusIO and NewickIO equivalents intact until
the port is complete.
I guess shipping a "phyloxml" only Bio.TreeIO would work, but it
would be rather less useful. We could certainly start with just that
on the trunk (i.e. initially no Bio.TreeIO.NewickIO and also no
Bio.TreeIO.NexusIO modules - initially have just a single backend).
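A sketch of the registry idea being discussed, assuming supported_formats maps a format name to a backend parse function (it may map to modules in the actual branch). The backend functions here are placeholders that only return a label.

```python
def _parse_phyloxml(handle):
    # Placeholder backend; a real one would build tree objects.
    yield "phyloxml tree from %s" % handle


def _parse_newick(handle):
    # Placeholder backend, still callable directly while unregistered.
    yield "newick tree from %s" % handle


supported_formats = {
    "phyloxml": _parse_phyloxml,
    # "newick": _parse_newick,   # disabled until the port is complete
}


def parse(handle, format):
    """Top-level entry point: look the format up, then delegate."""
    try:
        backend = supported_formats[format]
    except KeyError:
        raise ValueError("Unknown or unsupported format: %r" % format)
    return backend(handle)


phyloxml_result = list(parse("handle", "phyloxml"))
direct_newick_result = list(_parse_newick("handle"))
```

Commenting out one entry removes the format from the public functions without touching the backend itself.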
Post by Eric Talevich
Post by Peter
Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and
distances between nodes.
That's intentional, I was just going to port those methods directly from
Bio.Nexus.Trees rather than invent a new API myself.
OK - sounds good.
Post by Eric Talevich
Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
combined parsers and object representations. My goal is to chop out the
pure-object parts and merge them into Bio.Tree, and let the remaining
parsers return objects built from the new Bio.Tree classes. This looks like
it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
done.
Sounds good - as with Bio.SeqIO and Bio.AlignIO, one of the goals has
been to separate the data object from the (many possible) parsers.
Post by Eric Talevich
For backward compatibility, I'll leave some wrappers that trigger
DeprecationWarnings in the original places. Nexus.Trees can
probably be reduced ...
Something like that, sure.
Post by Eric Talevich
Post by Peter
Post by Eric Talevich
(1) The find() function, named after the Unix utility that does the
same thing for directory trees, seems capable of all the iteration
and filtering necessary for locating data and automatically adding
annotations to a tree. There's a 'terminal' argument for selecting
internal nodes, external nodes, or both, and I think this means
get_leaf_nodes() is unnecessary. I'm going to remove it if no one
protests.
I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.
I think .find(terminal=True) will do the right thing and looks reasonably
simple, but as Brad said, this is a ridiculously common operation so
finding it in the API should be ridiculously easy. I'll rename this function
to get_leaves() and rename find() to findall() (to match ElementTree
and make it clear that it returns an iterable).
OK.

Peter
Jaime Huerta Cepas
2009-09-24 10:45:21 UTC
Permalink
Hi,

( I'm the developer of ETE. )
I agree that PyQt4 is a significant dependency. I chose it because the
Qt4 QGraphicsScene environment offers many possibilities, such as OpenGL
rendering, unlimited image size, good performance, and good Python bindings.
However, I am working on my code to allow the rendering algorithm to use any
other graphical library, so you could render the same tree images using
different backends. If you think this would be useful, please let me know
and we can think about how to integrate it with Biopython.
Regarding the GUI, it is not a standalone application but one more method
of the Tree objects. The GUI can be started at any point during
execution, and the main program continues after you close it. I did it
this way because I think it is quite useful when working in interactive
Python sessions.

I develop a lot of code around tree handling, so if you think I can help,
please tell me.
jaime.
Post by Eric Talevich
Post by Eric Talevich
*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.
I still haven't tracked down my old report lab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...
Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
Brad Chapman
2009-09-24 12:08:00 UTC
Permalink
Eric and Peter;
Looking forward to seeing the PhyloXML work merged into the main
branch. Eric, thanks for posting the summary of where things are at.
Post by Peter
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).
Agreed that this would be nice to have, but I'm not sure why it's
blocking getting the base TreeIO framework and all of PhyloXML into
the main branch. That's a major step forward from the format
specific phylogenetic code we had before and gets us a portion of
the way there.

Next up should be moving over Bio.Nexus to the new framework and
then conversions, but this is another project. I think we should
take this one step at a time.
Post by Peter
Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.
Agreed. As we move Nexus over, we should be sure to keep current
functionality.
Post by Peter
Post by Eric Talevich
(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.
I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.
I'm for keeping it as well, and just having the underlying
implementation of get_leaf_nodes call find with the right arguments.
This seems like an operation that should be dead obvious to do.
Post by Peter
Post by Eric Talevich
(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?
Again I agree with Peter here -- this would be best supported as a
subclass that is database aware with an identical API, similar to
how the Seq objects and BioSQL Seq objects work. This avoids any
overhead for the in-memory case, which will be more common, but
gives you a point to implement the useful database representation
code in the future. If you don't have time to work on all of this
right now, I'd leave the nested-set stuff out and keep it in mind as
a future addition.
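
A minimal sketch of the subclass approach, assuming hypothetical Node and LazyNode classes and a plain dict standing in for the database. The point is only that inherited code such as get_leaves() works unchanged when the children are fetched on first access.

```python
class Node(object):
    """Hypothetical in-memory node with a plain list of children."""

    def __init__(self, name, children=None):
        self.name = name
        self._children = list(children or [])

    @property
    def children(self):
        return self._children

    def get_leaves(self):
        if not self.children:
            return [self]
        leaves = []
        for child in self.children:
            leaves.extend(child.get_leaves())
        return leaves


class LazyNode(Node):
    """Same API, but children are fetched from a backing store on first use."""

    def __init__(self, name, store):
        Node.__init__(self, name)
        self._store = store      # stand-in for a database connection
        self._loaded = False

    @property
    def children(self):
        if not self._loaded:
            names = self._store.get(self.name, [])
            self._children = [LazyNode(n, self._store) for n in names]
            self._loaded = True
        return self._children


# Toy parent -> children mapping; a real store would query taxon tables.
store = {
    "root": ["Bacteria", "Eukaryota"],
    "Bacteria": ["E. coli", "B. subtilis"],
    "Eukaryota": ["H. sapiens"],
}
bacteria = LazyNode("Bacteria", store)
bacteria_leaves = [leaf.name for leaf in bacteria.get_leaves()]
```

Only the nodes under "Bacteria" are ever instantiated here, which is the behaviour wanted for a tree as large as the NCBI taxonomy.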

Brad
Eric Talevich
2009-09-25 04:34:17 UTC
Permalink
Hi Peter,

Thanks for the feedback.
Post by Peter
One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry. Easy.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).
I'll delete that sentence. I don't know why it's there -- you're right, it's
easy to return an iterable regardless of what the format itself supports.

Post by Peter
On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and
Bio.TreeIO.write() functions *only* and have the underlying format specific
code just use handles. This avoids the code duplication.
I did the handle management case-by-case because some of the underlying
libraries already do filename-to-handle conversion -- ElementTree and
Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
ad-hoc handle management, but of course I can move it all to the top if you
think it's best. One day, perhaps we'll have a context manager that we can
reuse everywhere to make magic easy:

with maybe_open(file) as handle:
    tree = FooIO.parse(handle)

Not today, though.
Post by Peter
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).
I could comment out the 'nexus' and 'newick' lines from the
supported_formats dict. That would disable the top-level functions but leave
the direct NexusIO and NewickIO equivalents intact until the port is
complete.


Post by Peter
Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.
That's intentional, I was just going to port those methods directly from
Bio.Nexus.Trees rather than invent a new API myself.

Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
combined parsers and object representations. My goal is to chop out the
pure-object parts and merge them into Bio.Tree, and let the remaining
parsers return objects built from the new Bio.Tree classes. This looks like
it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
done.

For backward compatibility, I'll leave some wrappers that trigger
DeprecationWarnings in the original places. Nexus.Trees can probably be
reduced to:

import warnings
warnings.warn("Use Bio.Tree and Bio.TreeIO instead", DeprecationWarning)

from Bio.Tree.Newick import *
from Bio.TreeIO.NewickIO import *

(more or less)
Post by Peter
Post by Eric Talevich
(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?
To get the Newick tree, you can just call str(tree), which is basically what
you are doing in Bio.TreeIO.NewickIO. Getting a Nexus file is going to be
more complicated. You'll need to create a minimal Nexus file - have a
look at the Bio.AlignIO.NexusIO code. An alternative is to have
a hard-coded Nexus template, and just insert the tree as a Newick string
(and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
OK, thanks, I'll give it a shot. I see some default Nexus template stuff in
Bio.Nexus.Nexus already.
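
A sketch of the hard-coded template idea, assuming a hypothetical write_nexus() helper; this is not the template in Bio.Nexus.Nexus. It assumes taxon labels need no quoting and that each Newick string already ends with ';'.

```python
NEXUS_TEMPLATE = """\
#NEXUS
Begin Taxa;
 Dimensions NTax=%(count)d;
 TaxLabels %(labels)s;
End;
Begin Trees;
%(trees)s
End;
"""


def write_nexus(newick_trees, taxa):
    """Fill a fixed Nexus skeleton with taxon labels and Newick strings."""
    tree_lines = [
        " Tree tree%d = %s" % (index, newick)
        for index, newick in enumerate(newick_trees, 1)
    ]
    return NEXUS_TEMPLATE % {
        "count": len(taxa),
        "labels": " ".join(taxa),
        "trees": "\n".join(tree_lines),
    }


text = write_nexus(["(A,(B,C));"], ["A", "B", "C"])
```

With this approach the writer only needs a Newick string and a taxon list from each tree, so it does not depend on how the tree was parsed.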
Peter
2009-09-24 09:57:12 UTC
Permalink
Post by Eric Talevich
*TreeIO*
The read, parse, write and convert functions work essentially the same as in
Great.

One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry. Easy.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).
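
A sketch of the behaviour described here, under the assumption that read() can be layered on top of parse() the way SeqIO does it. The Newick "parser" is a toy that splits on semicolons and returns strings (it would break on quoted labels containing ';'); it is only there to show the zero/one/many cases.

```python
import io


def parse(handle):
    """Yield one Newick string per tree in the handle (toy splitter)."""
    for chunk in handle.read().split(";"):
        chunk = chunk.strip()
        if chunk:
            yield chunk + ";"


def read(handle):
    """Return exactly one tree; raise ValueError for zero or several."""
    trees = parse(handle)
    try:
        first = next(trees)
    except StopIteration:
        raise ValueError("No trees found in handle") from None
    try:
        next(trees)
    except StopIteration:
        return first
    raise ValueError("More than one tree found in handle")


# The same loop works whether the file holds zero, one or many trees.
for text in ("", "(A,B);", "(A,B);\n(C,(D,E));"):
    for tree in parse(io.StringIO(text)):
        pass  # generic per-tree processing would go here
```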

On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and
Bio.TreeIO.write() functions *only* and have the underlying format specific
code just use handles. This avoids the code duplication.
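
One way this could look, as a rough sketch: a single helper at the top level turns a filename into a handle, and the format-specific writers only ever see handles. The names as_handle, write and newick_writer are hypothetical here, not the existing TreeIO functions.

```python
import contextlib
import io


@contextlib.contextmanager
def as_handle(source, mode="r"):
    """Yield a file handle for either a filename or an existing handle."""
    if isinstance(source, str):
        # We opened the file, so we are responsible for closing it.
        with open(source, mode) as handle:
            yield handle
    else:
        # The caller owns the handle; leave it open.
        yield source


def newick_writer(trees, handle):
    """Format-specific code: works on handles only. Returns trees written."""
    count = 0
    for tree in trees:
        handle.write(tree + "\n")
        count += 1
    return count


def write(trees, target, format_writer):
    """Top-level entry point: the only place that deals with filenames."""
    with as_handle(target, "w") as handle:
        return format_writer(trees, handle)


buffer = io.StringIO()
written = write(["(A,B);", "(C,D);"], buffer, newick_writer)
```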

[In fact, as I have said before, I prefer the simplicity of just allowing
handles - and we should make TreeIO and SeqIO/AlignIO consistent]
Post by Eric Talevich
(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.
I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).

Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.
Post by Eric Talevich
(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?
To get the Newick tree, you can just call str(tree), which is basically what
you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
more complicated. You'll need to create a minimal Nexus file - have a
look at the Bio.AlignIO.NexusIO code. An alternative is to have
a hard coded nexus template, and just insert the tree as a Newick string
(and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
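
A rough sketch of the template idea, assuming the trees are already available as Newick strings and the taxon labels are passed in separately. The template text and the function name are illustrative; labels that contain spaces or punctuation would need quoting, which this sketch does not do.

```python
NEXUS_TEMPLATE = """\
#NEXUS
Begin Taxa;
 Dimensions NTax=%(count)d;
 TaxLabels %(labels)s;
End;
Begin Trees;
%(trees)s
End;
"""


def nexus_from_newick(newick_strings, taxa):
    """Fill the hard-coded template with Newick trees and a taxa list."""
    tree_lines = []
    for number, newick in enumerate(newick_strings, 1):
        # Normalize so each tree statement ends with exactly one semicolon.
        newick = newick.strip().rstrip(";")
        tree_lines.append(" Tree tree%d=%s;" % (number, newick))
    return NEXUS_TEMPLATE % {
        "count": len(taxa),
        "labels": " ".join(taxa),
        "trees": "\n".join(tree_lines),
    }


nexus_text = nexus_from_newick(["(A,(B,C));"], ["A", "B", "C"])
```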
Post by Eric Talevich
*Tree*
The BaseTree module is meant to be the basis for Newick trees eventually,
so I'd like to get the design right with the minimum number of public
(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.
I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.
Post by Eric Talevich
(2) Should find() be based on depth_first_search or breadth_first_search
(not checked in yet)? DFS would potentially find a leaf node faster, but BFS
seems more common in phylogenetics. Note that iteration can easily be
reversed with the standard reversed() function, so we don't need extra
functions for those cases.
You could do both, either via an argument or having two methods, say
depth_first_search and breadth_first_search instead of find.
Post by Eric Talevich
(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?
A partial implementation doesn't seem helpful, and wastes memory
allocating unused properties. I would remove it from the base Node,
but a full implementation might be useful for something (would it be
possible via a subclass?).

On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.
Post by Eric Talevich
(4) There's some mention in the literature of a relationship-matrix
representation for phylogenies. Does anyone here know how to work with this
representation, or know if it would let us perform complex calculations with
blinding speed behind the scenes? If so, should there be a function in
Bio.Tree.Utils to export a tree to a NumPy array represented this way? If
not, I'll forget about it.
I don't know.
Post by Eric Talevich
*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.
I still haven't tracked down my old ReportLab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...

Peter
Eric Talevich
2009-12-29 01:51:40 UTC
Permalink
Hi folks,

Here's an update on the status of Bio.Tree and TreeIO. I think I've taken
care of most of the blockers since the last review in September.

First, some links:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py
http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py
http://biopython.org/wiki/PhyloXML

Discussion:

*TreeIO*
Conversion between Nexus, Newick and phyloXML tree file formats works; the
read/parse/write functions for each IO format use the same object types.
Neat!

The tree annotations (e.g. id) aren't preserved perfectly during conversions
-- I'll keep working on this, but I don't think it's a blocker. The taxon
names of terminal nodes are kept as "clade" names in phyloXML for
round-tripping. Tree topology and branch lengths seem OK.

Under the hood:
-- PhyloXMLIO is from GSoC
-- NewickIO is ported from the Bio.Nexus.Trees parser. I think it works the
same way.
-- NexusIO relies on Bio.Nexus.Nexus for parsing, then converts the
resulting Nexus.Trees.Tree objects to Bio.Tree.Newick objects. One day, when
Nexus.Trees is replaced by NewickIO in the main Nexus parser, then this
conversion can be dropped and NexusIO will be very simple.

*Tree*
The BaseTree object structure looks like this:

-- *BaseTree.Tree* contains global tree information, like whether the tree
is rooted, and a reference to the root clade. The phyloXML Phylogeny object
inherits from this.

-- *BaseTree.Subtree* contains local (clade- or node-specific) information,
and references to each of its direct descendants, recursively. The phyloXML
Clade object inherits from this. Nodes are implicit. I could add references
to the ancestor of each sub-tree without too much difficulty, but I haven't
needed them yet.

The same methods (get_terminals et al.) generally apply to both classes, so
I created a separate TreeMixin class from which both BaseTree.Tree and
BaseTree.Subtree inherit.
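
As a sketch of that layout (simplified, with assumed attribute names such as 'clades' and 'rooted' and a hypothetical iter_clades() helper; the real classes carry more than this): the mixin only needs to know where traversal starts, and each class answers that differently.

```python
class TreeMixin:
    """Traversal methods shared by whole trees and subtrees."""

    def _start_clade(self):
        raise NotImplementedError

    def iter_clades(self):
        """Yield every clade below the starting point, in preorder."""
        stack = [self._start_clade()]
        while stack:
            clade = stack.pop()
            yield clade
            stack.extend(reversed(clade.clades))

    def get_terminals(self):
        return [clade for clade in self.iter_clades() if not clade.clades]


class Subtree(TreeMixin):
    """Local information plus references to direct descendants."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = list(clades or [])

    def _start_clade(self):
        return self


class Tree(TreeMixin):
    """Global information plus a reference to the root clade."""

    def __init__(self, root, rooted=False):
        self.root = root
        self.rooted = rooted

    def _start_clade(self):
        return self.root


inner = Subtree(clades=[Subtree("B"), Subtree("C")])
tree = Tree(Subtree(clades=[Subtree("A"), inner]), rooted=True)
whole_tree_leaves = [clade.name for clade in tree.get_terminals()]
subtree_leaves = [clade.name for clade in inner.get_terminals()]
```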

Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an
incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O).
This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees,
as I imagine it:
(1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where
reasonable (since the node IDs and adjacency list lookup are no longer
needed)
(2) Implement methods in Bio.Tree.Newick with the original argument lists,
but triggering a deprecation warning indicating the newer replacement method
(3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more
shims to duplicate the original API -- so test_Nexus.py should still pass,
ideally (with deprecation warnings)
(4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of
NexusIO and Bio.Tree methods.
(5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick.
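
Step (2) could look roughly like the sketch below. The class and method names are illustrative stand-ins and are not claimed to match either Bio.Nexus.Trees or Bio.Tree; the point is only the shape of a shim that keeps an old argument list alive while steering callers to the new method.

```python
import warnings


class Tree:
    """Stand-in for the new-style tree with the simplified method."""

    def __init__(self, terminals):
        self._terminals = list(terminals)

    def get_terminals(self):
        return list(self._terminals)


class LegacyShimTree(Tree):
    """Stand-in for a Newick subclass carrying old-style shims."""

    def get_taxa(self, node_id=None):
        # node_id is accepted only so old call sites keep working;
        # this sketch ignores it.
        warnings.warn(
            "get_taxa() is deprecated; use get_terminals() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_terminals()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    taxa = LegacyShimTree(["A", "B", "C"]).get_taxa()
```

With shims like this, an old test suite can keep passing while the warnings show which calls still need to be migrated.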

I'm currently doing (1) and (2), with more emphasis on getting (1) right.
Not all of the important methods have been ported, but I'm happy with the
tree traversal methods.

*Tests*
I created test_Tree.py to test the methods in Bio.Tree.BaseTree;
test_PhyloXML.py tests Bio.Tree.PhyloXML objects and Bio.TreeIO.PhyloXMLIO
parsing/writing.

I noticed that in Tests/Nexus/, the example file for internal node labels is
actually in Newick/NH format, not Nexus. That was briefly confusing, so
maybe that file should be renamed.

What do you think?

All the best,
Eric
Brad Chapman
2010-01-04 13:16:31 UTC
Permalink
Hey Eric;
Happy New Year -- thanks for all the work on TreeIO. This sounds
great and looking forward to getting it in the main trunk. I'd like
to hear Peter's and others' thoughts, but just a few small comments
below.
Post by Eric Talevich
The tree annotations (e.g. id) aren't preserved perfectly during conversions
-- I'll keep working on this, but I don't think it's a blocker. The taxon
names of terminal nodes are kept as "clade" names in phyloXML for
round-tripping. Tree topology and branch lengths seem OK.
Are the annotations often used in real life cases or is this more of
a fringe problem? I'm not as familiar with tree work, but know this
is a pain in sequence space. A good goal is to capture the most
common use cases and then integrate the other issues as feasible.
Post by Eric Talevich
Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an
incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O).
This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees,
(1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where
reasonable (since the node IDs and adjacency list lookup are no longer
needed)
(2) Implement methods in Bio.Tree.Newick with the original argument lists,
but triggering a deprecation warning indicating the newer replacement method
(3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more
shims to duplicate the original API -- so test_Nexus.py should still pass,
ideally (with deprecation warnings)
(4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of
NexusIO and Bio.Tree methods.
(5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick.
I'm currently doing (1) and (2), with more emphasis on getting (1) right.
Not all of the important methods have been ported, but I'm happy with the
tree traversal methods.
Nice. This all sounds like a really good refactoring. It sounds like (1)
can happen once this all gets merged with the main branch, and
could benefit from others being able to more easily look at it and
make suggestions.
Post by Eric Talevich
I noticed that in Tests/Nexus/, the example file for internal node labels is
actually in Newick/NH format, not Nexus. That was briefly confusing, so
maybe that file should be renamed.
Oops, I think that may have been me. No problem, rename away.

Brad
Eric Talevich
2010-01-05 00:09:18 UTC
Permalink
Hi Brad, I hope the holidays treated you well.
Post by Brad Chapman
Are the annotations often used in real life cases or is this more of
a fringe problem? I'm not as familiar with tree work, but know this
is a pain in sequence space. A good goal is to capture the most
common use cases and then integrate the other issues as feasible.
The data that TreeIO preserves round-trip are:

- Branching structure (topology)
- Branch lengths
- Clade/taxon names
- Rooted-ness (for the whole tree)
- Tree ID

The troublesome parts are:

- The "confidences" attribute in PhyloXML trees should map onto the
"support" attribute in Nexus trees, but that's tricky -- the original Nexus
attribute seemed content with a little ambiguity in what that attribute's
numerical value actually meant (relative/absolute support), while PhyloXML
uses a list of Confidence objects containing both a numerical value and a
"type" string such as "bootstrap". Currently that information is dropped
when converting between PhyloXML and Nexus/Newick trees.
- Nexus also has a "comment" attribute for each node, while PhyloXML
doesn't directly support that.
- The branch length of the root node/clade is None in PhyloXML, but 0.0 in
Nexus. I prefer None because there is no meaningful branch leading to that
node, but there might be a reason 0.0 was chosen for Nexus that I'm not
aware of.
- The names of unlabeled internal nodes might change from None to "" in
some cases, since None is the PhyloXML default and "" is the Nexus default.
- Since PhyloXML supports more structured taxonomic information on each
node than Newick, it's possible to have a PhyloXML tree where a Clade has no
name, but instead one or more Taxonomy objects containing the scientific
name, common names, etc. -- so when this tree is converted to Newick format
the taxonomy info is lost for those nodes. I could squash the Taxonomy
object into a string for the sake of Nexus labels, but I think it would be
safer (less surprising) to just write a cookbook entry on how to collapse
PhyloXML Taxonomies into Clade names to aid format conversions.

If the support-vs-confidence issue can be resolved, then we can treat
PhyloXML as a rough superset of Newick, in terms of annotation, and then it
shouldn't be surprising to lose some annotation data in converting PhyloXML
to Newick.
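
For the cookbook idea, something along these lines might work. This is a sketch with simplified stand-in classes (Taxonomy, Confidence and Clade here are hypothetical and much smaller than the PhyloXML ones): it copies a taxonomy label into unnamed clades before conversion, and picks one numeric support value by confidence type.

```python
class Taxonomy:
    def __init__(self, scientific_name=None, common_names=()):
        self.scientific_name = scientific_name
        self.common_names = list(common_names)


class Confidence:
    def __init__(self, value, type="unknown"):
        self.value = value
        self.type = type


class Clade:
    def __init__(self, name=None, taxonomies=(), confidences=(), clades=()):
        self.name = name
        self.taxonomies = list(taxonomies)
        self.confidences = list(confidences)
        self.clades = list(clades)


def iter_clades(root):
    """Yield the root clade and everything below it, in preorder."""
    stack = [root]
    while stack:
        clade = stack.pop()
        yield clade
        stack.extend(reversed(clade.clades))


def collapse_taxonomy_names(root):
    """Give unnamed clades a name taken from their first usable taxonomy."""
    for clade in iter_clades(root):
        if clade.name:
            continue  # never overwrite an existing name
        for taxonomy in clade.taxonomies:
            label = taxonomy.scientific_name
            if not label and taxonomy.common_names:
                label = taxonomy.common_names[0]
            if label:
                clade.name = label
                break


def support_value(clade, preferred_type="bootstrap"):
    """Return the value of the first confidence of the given type, or None."""
    for confidence in clade.confidences:
        if confidence.type == preferred_type:
            return confidence.value
    return None


human = Clade(
    taxonomies=[Taxonomy(scientific_name="Homo sapiens")],
    confidences=[Confidence(0.9, "posterior"), Confidence(95, "bootstrap")],
)
named = Clade(name="B", taxonomies=[Taxonomy(common_names=["mouse"])])
yeast = Clade(taxonomies=[Taxonomy(common_names=["yeast"])])
root = Clade(clades=[human, named, yeast])
collapse_taxonomy_names(root)
names = [clade.name for clade in iter_clades(root)]
```

Keeping this as an explicit step the user calls, rather than something the writer does silently, matches the "less surprising" preference above.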

Cheers,
Eric
Eric Talevich
2009-12-29 01:51:40 UTC
Permalink
Hi folks,

Here's an update on the status of Bio.Tree and TreeIO. I think I've taken
care of most of the blockers since the last review in September.

First, some links:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py
http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py
http://biopython.org/wiki/PhyloXML

Discussion:

*TreeIO*
Conversion between Nexus, Newick and phyloXML tree file formats works; the
read/parse/write functions for each IO format use the same object types.
Neat!

The tree annotations (e.g. id) aren't preserved perfectly during conversions
-- I'll keep working on this, but I don't think it's a blocker. The taxon
names of terminal nodes are kept as "clade" names in phyloXML for
round-tripping. Tree topology and branch lengths seem OK.

Under the hood:
-- PhyloXMLIO is from GSoC
-- NewickIO is ported from the Bio.Nexus.Trees parser. I think it works the
same way.
-- NexusIO relies on Bio.Nexus.Nexus for parsing, then converts the
resulting Nexus.Trees.Tree objects to Bio.Tree.Newick objects. One day, when
Nexus.Trees is replaced by NewickIO in the main Nexus parser, then this
conversion can be dropped and NexusIO will be very simple.

*Tree*
The BaseTree object structure looks like this:*

-- BaseTree.**Tree* contains global tree information, like whether the tree
is rooted, and a reference to the root clade. The phyloXML Phylogeny object
inherits from this.*

-- BaseTree.**Subtree* contains local (clade- or node-specific) information,
and references to each of its direct descendents, recursively. The phyloXML
Clade object inherits from this. Nodes are implicit. I could add references
to the ancestor of each sub-tree without too much difficulty, but I haven't
needed them yet.

The same methods (get_terminals et al.) generally apply to both classes, so
I created a separate TreeMixin class from which both BaseTree.Tree and
BaseTree.Subtree inherit.

Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an
incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O).
This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees,
as I imagine it:
(1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where
reasonable (since the node IDs and adjacency list lookup are no longer
needed)
(2) Implement methods in Bio.Tree.Newick with the original argument lists,
but triggering a deprecation warning indicating the newer replacement method
(3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more
shims to duplicate the original API -- so test_Nexus.py should still pass,
ideally (with deprecation warnings)
(4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of
NexusIO and Bio.Tree methods.
(5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick.
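
Step (2) of this plan could look roughly like the sketch below. The Tree class here is a stand-in that stores leaves in nested lists, and get_taxa with its node_id argument is used only as an example of a legacy signature; neither is meant as the actual Bio.Tree.Newick code:

```python
import warnings


class Tree(object):
    """Stand-in for the new tree class; leaves are nested lists here."""

    def __init__(self, clades):
        self.clades = clades

    def get_terminals(self):
        """New-style method: no node IDs needed."""
        leaves = []
        stack = [self.clades]
        while stack:
            item = stack.pop()
            if isinstance(item, list):
                # Reverse so leaves are collected in left-to-right order.
                stack.extend(reversed(item))
            else:
                leaves.append(item)
        return leaves

    def get_taxa(self, node_id=None):
        """Old-style shim: keeps the legacy argument list, warns, delegates."""
        warnings.warn(
            "get_taxa() is deprecated; use get_terminals() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # node_id is accepted for compatibility but ignored in this sketch.
        return self.get_terminals()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    taxa = Tree(["A", ["B", "C"]]).get_taxa()
```

Old callers keep working and see a DeprecationWarning that names the replacement, which is what would let test_Nexus.py continue to pass during the transition.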

I'm currently doing (1) and (2), with more emphasis on getting (1) right.
Not all of the important methods have been ported, but I'm happy with the
tree traversal methods.

*Tests*
I created test_Tree.py to test the methods in Bio.Tree.BaseTree;
test_PhyloXML.py tests Bio.Tree.PhyloXML objects and Bio.TreeIO.PhyloXMLIO
parsing/writing.

I noticed that in Tests/Nexus/, the example file for internal node labels is
actually in Newick/NH format, not Nexus. That was briefly confusing, so
maybe that file should be renamed.

What do you think?

All the best,
Eric
Michiel de Hoon
2010-01-08 16:26:29 UTC
Permalink
I am not an expert in this area, but the code looks very well done and well organized. Thanks, Eric!

I have one suggestion though:
In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather have everything under Bio.Tree. This makes it easier to understand what each Bio.* module is about, and also agrees with the structure of the other modules in Biopython. The only exception is Bio.Seq, for which there is a closely related Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; I'd rather have a single Bio.Seq there too).

Thanks again,

--Michiel.
Peter Cock
2010-01-08 17:00:12 UTC
Permalink
Post by Michiel de Hoon
I am not an expert in this area, but the code looks very well done and well
organized. Thanks, Eric!
In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather
have everything under Bio.Tree. This makes it easier to understand what each
Bio.* module is about, and also agrees with the structure of the other modules
in Biopython. The only exception is Bio.Seq, for which there is a closely related
Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons;
I'd rather have a single Bio.Seq there too).
There is also Bio.AlignIO, which again might have been handled via Bio.Align
with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was
following the lead from BioPerl. I think there are some good points about making
the code for the common object (tree, SeqRecord, Alignment) clearly separate
from the code for parsing or writing it (although separate top level modules is
perhaps overkill). However, I agree, this isn't universal in Biopython (e.g.
Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO).

So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing
I don't like is that "Tree" could mean a class or a module (also a problem with
other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python
convention (PEP8) is to use lower case for the module ("tree") and title case
for the class ("Tree"), something most of Biopython does not follow (and
which we can't change without a lot of upheaval). Another option if we want
to try and keep the existing module name style might be Bio.Trees containing
a Tree class, or perhaps something different like Bio.Phylo instead?

Peter
Eric Talevich
2010-01-08 18:22:11 UTC
Permalink
Post by Peter Cock
Post by Michiel de Hoon
I am not an expert in this area, but the code looks very well done and well
organized. Thanks, Eric!
In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather
have everything under Bio.Tree. This makes it easier to understand what each
Bio.* module is about, and also agrees with the structure of the other modules
in Biopython. The only exception is Bio.Seq, for which there is a closely related
Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons;
I'd rather have a single Bio.Seq there too).
There is also Bio.AlignIO, which again might have been handled via Bio.Align
with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was
following the lead from BioPerl.
Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava do
something completely different.

I had the impression that pairing modules Foo & FooIO was an emerging
convention for organizing very general data types being fed by a
variety of file formats, while a single module Foo indicated support
for a particular program or source, like Entrez. But I think it would
be even cleaner if each Foo simply had a Foo.IO (or foo.io) sub-module
organizing the I/O for multiple file formats where applicable.

The TreeIO.* namespace is not crowded -- just read, write, parse,
convert. If that directory is moved under Bio.Tree and renamed to IO
or io, then Bio.Tree would still seem reasonably intuitive if
__init__.py contained:

from io import *
from utils import *

Then "from Bio import Tree" would be enough for most uses.
Post by Peter Cock
I think there are some good points about making
the code for the common object (tree, SeqRecord, Alignment) clearly separate
from the code for parsing or writing it (although separate top level modules is
perhaps overkill). However, I agree, this isn't universal in Biopython (e.g.
Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO).
PDB does its own thing, too -- and some consolidation there might be nice.
Post by Peter Cock
So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing
I don't like is that "Tree" could mean a class or a module (also a problem with
other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python
convention (PEP8) is to use lower case for the module ("tree") and title case
for the class ("Tree"), something most of Biopython does not follow (and
which we can't change without a lot of upheaval).
I could rename the modules inside Bio.Tree (or whatever we call it) to
follow the PEP8 convention:

Bio/Tree/
Bio/Tree/basetree.py
Bio/Tree/io.py
Bio/Tree/utils.py ...

The Biopython convention seems to be that directory names are title
case, file names are mostly title case if user-facing and lower case
otherwise, and C extensions are lower case. Most of the time there
won't be any need to import the sub-modules under Tree directly, so
the inconsistency shouldn't be too jarring.
Post by Peter Cock
perhaps something different like Bio.Phylo instead?
Sure, that sounds promising.


Thanks!
Eric
Michiel de Hoon
2010-01-09 15:15:56 UTC
Permalink
Post by Eric Talevich
Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava
do something completely different.
I had the impression that pairing modules Foo & FooIO
was an emerging convention for organizing very general
data types being fed by a variety of file formats, while
a single module Foo indicated support
for a particular program or source, like Entrez.
I think a workable convention, which is already followed by many Biopython modules, is the following:

1) Bio.SomeStuff is a module containing everything related to SomeStuff, where SomeStuff is some broadly-defined field within bioinformatics (Cluster for clustering algorithms, Phylo for phylogenetics, PopGen for population genetics, Entrez for NCBI Entrez related stuff, etc.).

2) Parsing SomeStuff files, which can be in a variety of formats, is done by a read() function (to parse a single record), and/or a parse() function (to parse multiple records). The implementation details of these functions are hidden in a submodule of Bio.SomeStuff. Typically, the user won't need to interact with the submodule directly.

3) The read() / parse() functions return Bio.SomeStuff.Record objects, where Bio.SomeStuff.Record is a class that represents the primary data structure of SomeStuff information.

This general framework may not be suitable in all aspects for all Biopython modules, and can be modified as needed. For example, I can imagine that the most important data structure in Bio.Phylo is a Tree object rather than a Record object.
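
The read()/parse() convention in points 2) and 3) can be sketched as follows. This is an assumption-laden toy: the "lines" format, the _parse_lines helper and the _PARSERS table are hypothetical, and plain strings stand in for Record (or Tree) objects:

```python
import io


def _parse_lines(handle):
    """Toy parser: one record per non-blank line (hypothetical format)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


# Format name -> parser generator; real parsers would live in a submodule.
_PARSERS = {"lines": _parse_lines}


def parse(handle, fmt):
    """Iterate over every record in the handle."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError("Unknown format: %r" % fmt)
    return parser(handle)


def read(handle, fmt):
    """Return exactly one record, or raise ValueError otherwise."""
    records = parse(handle, fmt)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")


single = read(io.StringIO("(A,B);\n"), "lines")
several = list(parse(io.StringIO("(A,B);\n(C,D);\n"), "lines"))
```

The caller only ever names a format string; which submodule does the parsing stays an implementation detail.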
Post by Eric Talevich
But I think it would
be even cleaner if each Foo simply had a Foo.IO (or foo.io)
sub-module organizing the I/O for multiple file formats where
applicable.
I agree.
Post by Eric Talevich
The TreeIO.* namespace is not crowded -- just read, write,
parse, convert. If that directory is moved under Bio.Tree and
renamed to IO or io, then Bio.Tree would still seem reasonably
intuitive if __init__.py contained:
from io import *
from utils import *
Then "from Bio import Tree" would be enough for most uses.
Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.

Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
Post by Eric Talevich
Post by Peter Cock
perhaps something different like Bio.Phylo instead?
Sure, that sounds promising.
I agree that Bio.Phylo is a good name. Note also that there already is a Tree class in Bio.Cluster (it represents hierarchical clustering trees). Having a Bio.Phylo.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees is not confusing. On the other hand, having a Bio.Tree.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees could potentially be confusing.

--Michiel
Eric Talevich
2010-01-09 23:38:29 UTC
Permalink
Hi,

Thanks for your comments. I've reorganized the modules like this:

Bio/Phylo/
__init__.py, BaseTree.py, Newick.py, PhyloXML.py, Utils.py
IO/
__init__.py, NexusIO.py, NewickIO.py, PhyloXMLIO.py

Now "from Bio import Phylo" works for the common cases, and "from
Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct access to the
parsers.

I renamed TreeIO to Phylo/IO -- keeping it uppercase because io is a
standard module in Py2.6+, Py2.7 changes the priority rules for
absolute vs. relative imports, and Py2.4 doesn't support the new
syntax for relative imports. I might change the other file names to
lower case before the next merge, though...
Post by Michiel de Hoon
Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.
Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
I'm trying to avoid having to update Phylo/__init__.py each time I add
or rename a public function in Utils.py or IO. So, how about this:
I've added "__all__" definitions to Utils.py and IO/__init__.py so
that only the relevant public functions are loaded when
Phylo/__init__.py imports * from those two sub-modules. Testing
manually, this seems to do the right thing.
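
The effect of "__all__" on a star import can be checked with a small self-contained experiment. The module name demo_utils and the functions in it are made up for the demonstration; the module is built in memory so nothing needs to exist on disk:

```python
import sys
import types

# Build a throwaway module in memory so the example is self-contained.
utils = types.ModuleType("demo_utils")
exec(
    "__all__ = ['pretty_print']\n"
    "def pretty_print(tree):\n"
    "    return 'tree: %s' % (tree,)\n"
    "def _indent(text):\n"
    "    return '  ' + text\n"
    "def helper_not_exported():\n"
    "    return None\n",
    utils.__dict__,
)
sys.modules["demo_utils"] = utils

# Equivalent of a package __init__.py doing a star import.
namespace = {}
exec("from demo_utils import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
```

Only the name listed in "__all__" reaches the importing namespace. Without "__all__", the star import would also pull in helper_not_exported, since only names with a leading underscore are skipped by default.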

Cheers,
Eric
Peter
2010-01-11 11:04:03 UTC
Permalink
Post by Eric Talevich
I'm trying to avoid having to update Phylo/__init__.py each time I add
or rename a public function in Utils.py or IO. So, how about this:
I've added "__all__" definitions to Utils.py and IO/__init__.py so
that only the relevant public functions are loaded when
Phylo/__init__.py imports * from those two sub-modules. Testing
manually, this seems to do the right thing.
Previously bits of Biopython have used __all__, and then
abandoned this as a long term maintenance load. This was before
my time, so I am not familiar with the full history, but it makes me
wary about using __all__ here.

Personally I don't see a big problem with having just explicit
manual imports within Bio/Phylo/__init__.py if and when you
decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py
should be made available at the top level. In general I would
think relatively few things should be exposed like that.

Peter
Peter
2010-01-11 13:42:32 UTC
Permalink
Post by Peter
Post by Eric Talevich
I'm trying to avoid having to update Phylo/__init__.py each time I add
or rename a public function in Utils.py or IO. So, how about this:
I've added "__all__" definitions to Utils.py and IO/__init__.py so
that only the relevant public functions are loaded when
Phylo/__init__.py imports * from those two sub-modules. Testing
manually, this seems to do the right thing.
Previously bits of Biopython have used __all__, and then
abandoned this as a long term maintenance load. This was before
my time, so I am not familiar with the full history, but it makes me
wary about using __all__ here.
Personally I don't see a big problem with having just explicit
manual imports within Bio/Phylo/__init__.py if and when you
decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py
should be made available at the top level. In general I would
think relatively few things should be exposed like that.
In fact, why even do this at all? What is wrong with leaving
the IO functions (read, parse, write) as Bio.Phylo.IO.read etc
e.g.
>>> from Bio import Phylo
>>> tree = Phylo.IO.read(open("int_node_labels.nwk"),"newick")
What is the benefit of having them also exposed under the
Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
there are two ways to access them which is confusing.

If we do want to use Bio.Phylo.IO instead of Bio.PhyloIO
(or Bio.TreeIO) then thinking long term we may want to
do something about Bio.SeqIO and Bio.AlignIO to match.
We could move the Bio.AlignIO functionality under
Bio.Align.IO (with a suitable transition period). We could
move Bio.SeqIO to Bio.Seq.IO perhaps. Or we could
even talk about introducing Bio.Sequences (or something)
then move Bio.SeqIO to Bio.Sequences.IO, and move
Bio.SeqUtils.* under there too, and perhaps even the
Seq, SeqRecord and SeqFeature objects as well.
On the other hand, all that upheaval would cause a
lot of pain for end users, for relatively little gain.

Peter
Michiel de Hoon
2010-01-10 02:50:21 UTC
Permalink
I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it. One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

Thanks!

--Michiel
Eric Talevich
2010-01-10 22:02:10 UTC
Permalink
Post by Michiel de Hoon
I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it.
OK -- I pulled the latest from biopython/biopython on GitHub, merged
my phyloxml branch into my master branch, and pushed it all back to
biopython. Bio.Phylo is now part of Biopython!

For documentation on the Biopython wiki, I moved the relevant parts of
the Tree, TreeIO and PhyloXML pages to a new page for Bio.Phylo:
http://biopython.org/wiki/Phylo

It's a little rough at the moment, but I'll refine it this week. Some
of the content can also be moved to separate cookbook entries.
Post by Michiel de Hoon
One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?
I went over all the docstrings and comments again before merging; it
should be free of Tree/TreeIO references now.

Thanks for your help!
Eric
Peter
2010-01-11 11:37:42 UTC
Permalink
Post by Eric Talevich
OK -- I pulled the latest from biopython/biopython on GitHub, merged
my phyloxml branch into my master branch, and pushed it all back to
biopython. Bio.Phylo is now part of Biopython!
Wow - that was quicker than I expected. As an aside, do you know
why there seem to be three main branches in the history now?
I guess this was the "original" master, your local master, and your
phyloxml branch?

One minor thing - test_Phylo.py needs to be tweaked to raise a
MissingExternalDependencyError if NetworkX isn't installed. That
way the run_tests.py script will treat it as a skipped test instead of
a failed test. Alternatively, if this is just a small part of the test,
maybe split test_Phylo.py into two files (e.g. add a new file
test_Phylo_NetworkX.py which needs the dependency).
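
A rough sketch of the first option. To keep it runnable without Biopython, the exception class below is a local stand-in for the Biopython exception of the same name, and the require() helper is hypothetical:

```python
import importlib


class MissingExternalDependencyError(Exception):
    """Stand-in for the Biopython exception of the same name."""


def require(module_name):
    """Import an optional dependency or signal that the test should be skipped."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise MissingExternalDependencyError(
            "Install %s if you want to run these tests." % module_name
        )
```

A test module would call something like networkx = require("networkx") at the top level, before any test case is defined, so that the error is raised at import time and the test runner can treat the whole file as skipped rather than failed.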

And how's this for a draft entry in the NEWS file?

New module Bio.Phylo includes support for reading, writing and working with
phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by
Eric Talevich on a Google Summer of Code 2009 project, under The National
Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and
Christian Zmasek.

Peter
Eric Talevich
2010-01-11 16:30:32 UTC
Permalink
Post by Peter
Post by Eric Talevich
OK -- I pulled the latest from biopython/biopython on GitHub, merged
my phyloxml branch into my master branch, and pushed it all back to
biopython. Bio.Phylo is now part of Biopython!
Wow - that was quicker than I expected. As an aside, do you know
why there seem to be three main branches in the history now?
I guess this was the "original" master, your local master, and your
phyloxml branch?
Er, sorry if I jumped the gun. I was eager to get this done before the
semester kicks in... anyway, these are the Git commands I used:

git checkout master
git pull upstream # remote: biopython master
git checkout phyloxml
git merge master # check that it merges cleanly
git checkout master
git merge phyloxml # fast-forward
git push upstream master
git push origin master # updating my own branches on github
git push origin phyloxml

It looks more reasonable in gitk; maybe the branches will separate
again later on GitHub when they're no longer equivalent, or when I
delete the phyloxml branch.
Post by Peter
One minor thing - test_Phylo.py needs to be tweaked to raise a
MissingExternalDependencyError if NetworkX isn't installed. That
way the run_tests.py script will treat it as a skipped test instead of
a failed test. Alternatively, if this is just a small part of the test,
maybe split test_Phylo.py into two files (e.g. add a new file
test_Phylo_NetworkX.py which needs the dependency).
I extracted test_Phylo_depend.py from test_Phylo and added tests at
the top level for networkx and either pygraphviz or pydot (since those
are also used by Bio/Phylo/Utils.py).
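
A rough sketch of what such a top-level check could look like, for anyone following along. This is not the code in test_Phylo_depend.py: `MissingExternalDependencyError` below is a local stand-in for the Biopython exception of the same name, and `require_any` and `check_dependencies` are hypothetical helpers.

```python
import importlib.util


class MissingExternalDependencyError(Exception):
    """Local stand-in for Biopython's exception of the same name."""


def require_any(*module_names):
    """Return the first installed module name, or raise if none is found.

    Hypothetical helper: it only looks the modules up, it does not import them.
    """
    for name in module_names:
        if importlib.util.find_spec(name) is not None:
            return name
    raise MissingExternalDependencyError(
        "Install %s to run these tests." % " or ".join(module_names)
    )


def check_dependencies():
    """Checks to run at the top of the test file, before any test is defined."""
    require_any("networkx")
    require_any("pygraphviz", "pydot")
```

Raising at import time is what lets a test runner tell "dependency missing" apart from a genuine failure.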
Post by Peter
And how's this for a draft entry in the NEWS file?
New module Bio.Phylo includes support for reading, writing and working with
phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by
Eric Talevich on a Google Summer of Code 2009 project, under The National
Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and
Christian Zmasek.
Great, thanks!

Eric
Brad Chapman
2010-01-11 13:18:40 UTC
Permalink
Hi all;
Post by Eric Talevich
OK -- I pulled the latest from biopython/biopython on GitHub, merged
my phyloxml branch into my master branch, and pushed it all back to
biopython. Bio.Phylo is now part of Biopython!
Awesome. Congrats Eric -- thanks for all the hard work on this
during the summer, and getting it in shape for inclusion. Peter and
Michiel, thanks for all the helpful feedback. Really happy to have
this integrated,
Brad
Michiel de Hoon
2010-01-11 15:02:46 UTC
Permalink
What is wrong with leaving the IO functions
(read, parse, write) as Bio.Phylo.IO.read etc
e.g.
Post by Eric Talevich
from Bio import Phylo
tree = Phylo.IO.read(open("int_node_labels.nwk"), "newick")
What is the benefit of having them also exposed under the
Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
there are two ways to access them which is confusing.
If we use Bio.Phylo.IO.read directly, then for consistency we'd have to do the same for all other modules. Otherwise, we'd be guessing each time whether the read() and parse() functions are in Bio.SomeModule, or Bio.SomeModule.IO.

For Bio.Phylo, a simple solution is to put whatever is in Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and remove Bio.Phylo.IO.__init__.py. Then there is only one way to access the read() etc. functions.

[About doing the same for Bio.Seq and Bio.Align]
On the other hand, all that upheaval would cause a
lot of pain for end users, for relatively little gain.
For new users, it may be confusing to have all those different modules dealing with sequences. At least, it was for me when I started with Biopython. Therefore, for a long term solution, I'd prefer a single Bio.Seq module that incorporates all (Seq, SeqRecord, SeqIO, SeqFeature).

I agree that that may cause a lot of upheaval for end users, but a suitably long transition period may mitigate those concerns. I'd prefer that to being stuck with a less-than-optimal code organization forever.

--Michiel
Peter
2010-01-11 16:17:36 UTC
Permalink
Post by Michiel de Hoon
Post by Peter
What is the benefit of having them also exposed under the
Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
there are two ways to access them which is confusing.
If we use Bio.Phylo.IO.read directly, then for consistency we'd have
to do the same for all other modules. Otherwise, we'd be guessing
each time whether the read() and parse() functions are in
Bio.SomeModule, or Bio.SomeModule.IO.
Fair point.
Post by Michiel de Hoon
For Bio.Phylo, a simple solution is to put whatever is in
Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and
remove Bio.Phylo.IO.__init__.py. Then there is only one
way to access the read() etc. functions.
Or (if the functions are reasonably complex) keep the
input/output code in a separate file, but make it explicit
that it is not a public interface - e.g. use Bio/Phylo/_IO.py?
Post by Michiel de Hoon
[About doing the same for Bio.Seq and Bio.Align]
Post by Peter
On the other hand, all that upheaval would cause a
lot of pain for end users, for relatively little gain.
For new users, it may be confusing to have all those
different modules dealing with sequences. At least, it
was for me when I started with Biopython. Therefore,
for a long term solution, I'd prefer a single Bio.Seq
module that incorporates all (Seq, SeqRecord, SeqIO,
SeqFeature).
I agree that for a long term solution a single module
makes sense here, although I'm not convinced that
Bio.Seq is the best name. We'd have to switch from
a single file Bio/Seq.py to a folder with multiple files
including Bio/Seq/__init__.py - I worry this may cause
problems with updating existing Biopython installations.
Post by Michiel de Hoon
I agree that that may cause a lot of upheaval for end
users, but a suitably long transition period may mitigate
those concerns. I'd prefer that to being stuck with a
less-than-optimal code organization forever.
In principle I agree with that.

Peter
Eric Talevich
2010-01-11 16:43:01 UTC
Permalink
Post by Peter
Post by Michiel de Hoon
Post by Peter
What is the benefit of having them also exposed under the
Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
there are two ways to access them which is confusing.
If we use Bio.Phylo.IO.read directly, then for consistency we'd have
to do the same for all other modules. Otherwise, we'd be guessing
each time whether the read() and parse() functions are in
Bio.SomeModule, or Bio.SomeModule.IO.
Fair point.
Post by Michiel de Hoon
For Bio.Phylo, a simple solution is to put whatever is in
Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and
remove Bio.Phylo.IO.__init__.py. Then there is only one
way to access the read() etc. functions.
Or (if the functions are reasonably complex) keep the
input/output code in a separate file, but make it explicit
that it is not a public interface - e.g. use Bio/Phylo/_IO.py?
Something like this?

Phylo/
    BaseTree.py
    Newick.py
    PhyloXML.py
    _IO.py
    _Utils.py
    PhyloXMLIO.py
    NewickIO.py
    NexusIO.py
This plays well with the expected import styles:

from Bio import Phylo # most common
from Bio.Phylo import PhyloXML # access the defined types
from Bio.Phylo import PhyloXMLIO # special parsing
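
To make the idea concrete, here is a small self-contained sketch of the "private `_IO.py`, public functions re-exported by `__init__.py`" arrangement. It builds a throwaway package in a temporary directory; the package name `demo_phylo` and the toy `read`/`parse` bodies are assumptions for illustration, not the Bio.Phylo code.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Hypothetical package name, chosen so it cannot clash with a real install.
PKG = "demo_phylo"

FILES = {
    "__init__.py": """
        # The public entry points live at the package level.
        from demo_phylo._IO import read, parse
    """,
    "_IO.py": """
        # The leading underscore marks this module as non-public.
        def parse(lines):
            return (line.strip() for line in lines if line.strip())

        def read(lines):
            return next(parse(lines))
    """,
}


def build_demo_package():
    """Write the toy package to a temporary directory and import it."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, PKG)
    os.mkdir(pkg_dir)
    for name, body in FILES.items():
        with open(os.path.join(pkg_dir, name), "w") as handle:
            handle.write(textwrap.dedent(body))
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    return importlib.import_module(PKG)


demo_phylo = build_demo_package()
tree = demo_phylo.read(["(A,B,(C,D));"])
```

Users only ever type `demo_phylo.read(...)`; the underscore-prefixed module can be reorganized later without changing that spelling.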
Peter
2010-01-12 14:51:58 UTC
Permalink
Post by Eric Talevich
Post by Peter
Or (if the functions are reasonably complex) keep the
input/output code in a separate file, but make it explicit
that it is not a public interface - e.g. use Bio/Phylo/_IO.py?
Something like this?
Phylo/
    BaseTree.py
    Newick.py
    PhyloXML.py
    _IO.py
    _Utils.py
    PhyloXMLIO.py
    NewickIO.py
    NexusIO.py
from Bio import Phylo  # most common
from Bio.Phylo import PhyloXML  # access the defined types
from Bio.Phylo import PhyloXMLIO  # special parsing
I'd forgotten Bio/Phylo/IO was a directory, and that the users may
want to access PhyloXMLIO directly. That suggested structure
looks reasonable... what do you think Michiel?

Peter
Michiel de Hoon
2010-01-08 16:26:29 UTC
Permalink
I am not an expert in this area, but the code looks very well done and well organized. Thanks, Eric!

I have one suggestion though:
In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather have everything under Bio.Tree. This makes it easier to understand what each Bio.* module is about, and also agrees with the structure of the other modules in Biopython. The only exception is Bio.Seq, for which there is a closely related Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; I'd rather have a single Bio.Seq there too).

Thanks again,

--Michiel.
From: Eric Talevich <eric.talevich at gmail.com>
Subject: Re: [Biopython-dev] Code review request for phyloxml branch
To: "BioPython-Dev Mailing List" <biopython-dev at biopython.org>
Date: Monday, December 28, 2009, 8:51 PM
Hi folks,
Here's an update on the status of Bio.Tree and TreeIO. I
think I've taken
care of most of the blockers since the last review in
September.
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py
http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py
http://biopython.org/wiki/PhyloXML
*TreeIO*
Conversion between Nexus, Newick and phyloXML tree file
formats works; the
read/parse/write functions for each IO format use the same
object types.
Neat!
The tree annotations (e.g. id) aren't preserved perfectly
during conversions
-- I'll keep working on this, but I don't think it's a
blocker. The taxon
names of terminal nodes are kept as "clade" names in
phyloXML for
round-tripping. Tree topology and branch lengths seem OK.
-- PhyloXMLIO is from GSoC
-- NewickIO is ported from the Bio.Nexus.Trees parser. I
think it works the
same way.
-- NexusIO relies on Bio.Nexus.Nexus for parsing, then
converts the
resulting Nexus.Trees.Tree objects to Bio.Tree.Newick
objects. One day, when
Nexus.Trees is replaced by NewickIO in the main Nexus
parser, then this
conversion can be dropped and NexusIO will be very simple.
*Tree*
The BaseTree object structure looks like this:
-- *BaseTree.Tree* contains global tree information, like
whether the tree
is rooted, and a reference to the root clade. The phyloXML
Phylogeny object
inherits from this.
-- *BaseTree.Subtree* contains local (clade- or
node-specific) information,
and references to each of its direct descendents,
recursively. The phyloXML
Clade object inherits from this. Nodes are implicit. I
could add references
to the ancestor of each sub-tree without too much
difficulty, but I haven't
needed them yet.
The same methods (get_terminals et al.) generally apply to
both classes, so
I created a separate TreeMixin class from which both
BaseTree.Tree and
BaseTree.Subtree inherit.
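
The structure described above can be sketched roughly as follows. This is a toy, not the BaseTree source: the class names, `get_terminals`, and the root/rooted attributes follow the description, while `clades`, `name`, `branch_length` and the `_start` hook are assumed names.

```python
class TreeMixin(object):
    """Traversal methods shared by whole trees and sub-trees."""

    def _start(self):
        """Return the sub-tree where traversal begins (assumed hook)."""
        raise NotImplementedError

    def get_terminals(self):
        """Return the leaf sub-trees in depth-first, left-to-right order."""
        leaves = []
        stack = [self._start()]
        while stack:
            node = stack.pop()
            if node.clades:
                # Push children reversed so the leftmost is visited first.
                stack.extend(reversed(node.clades))
            else:
                leaves.append(node)
        return leaves


class Subtree(TreeMixin):
    """Local (clade-specific) data plus references to direct descendants."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = clades or []

    def _start(self):
        return self


class Tree(TreeMixin):
    """Global tree data and a reference to the root clade."""

    def __init__(self, root, rooted=False):
        self.root = root
        self.rooted = rooted

    def _start(self):
        return self.root


tree = Tree(
    Subtree(clades=[
        Subtree("A", 1.0),
        Subtree(clades=[Subtree("B", 0.5), Subtree("C", 0.5)]),
    ]),
    rooted=True,
)
names = [leaf.name for leaf in tree.get_terminals()]
```

Because the traversal lives in the mixin, the same call works on the whole tree and on any sub-tree, e.g. `tree.root.clades[1].get_terminals()`.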
Bio.Tree.Newick contains simple subclasses of Tree and
Subtree, and an
incomplete set of shims that track Bio.Nexus.Trees.Tree
(minus the I/O).
This is to ease the deprecation and eventual replacement of
Bio.Nexus.Trees,
(1) Port methods from Nexus.Trees to Bio.Tree, simplifying
arguments where
reasonable (since the node IDs and adjacency list lookup
are no longer
needed)
(2) Implement methods in Bio.Tree.Newick with the original
argument lists,
but triggering a deprecation warning indicating the newer
replacement method
(3) Replace Nexus.Trees with an import of
Bio.Tree.Newick(IO) and a few more
shims to duplicate the original API -- so test_Nexus.py
should still pass,
ideally (with deprecation warnings)
(4) In Nexus.Nexus, replace all usage of Nexus.Trees with
proper usage of
NexusIO and Bio.Tree methods.
(5) Eventually delete Nexus.Trees and the shims in
Bio.Tree.Newick.
I'm currently doing (1) and (2), with more emphasis on
getting (1) right.
Not all of the important methods have been ported, but I'm
happy with the
tree traversal methods.
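
Step (2) of the plan above, a shim that keeps an old argument list but warns and delegates, might look something like this. All names here (`LegacyTree`, `old_get_leaves`, the dict-based toy tree) are hypothetical and are not taken from Bio.Nexus.Trees or Bio.Tree.Newick.

```python
import warnings


class Tree(object):
    """Toy tree stored as a mapping of parent name -> list of child names."""

    def __init__(self, children):
        self.children = children

    def get_terminals(self):
        """Newer-style method: names of nodes that have no children."""
        parents = set(self.children)
        kids = [kid for group in self.children.values() for kid in group]
        return sorted(kid for kid in kids if kid not in parents)


class LegacyTree(Tree):
    """Subclass carrying shims with the old-style signatures."""

    def old_get_leaves(self, node_id=None):
        """Old signature kept for compatibility; node_id is ignored here."""
        warnings.warn(
            "old_get_leaves() is deprecated; use get_terminals() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_terminals()


tree = LegacyTree({"root": ["A", "inner"], "inner": ["B", "C"]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    leaves = tree.old_get_leaves()
```

Existing callers keep working and see a warning naming the replacement, which is what would let an old test suite keep passing during the transition.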
*Tests*
I created test_Tree.py to test the methods in
Bio.Tree.BaseTree;
test_PhyloXML.py tests Bio.Tree.PhyloXML objects and
Bio.TreeIO.PhyloXMLIO
parsing/writing.
I noticed that in Tests/Nexus/, the example file for
internal node labels is
actually in Newick/NH format, not Nexus. That was briefly
confusing, so
maybe that file should be renamed.
What do you think?
All the best,
Eric
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
Michiel de Hoon
2010-01-09 15:15:56 UTC
Permalink
Post by Eric Talevich
Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava
do something completely different.
I had the impression that pairing modules Foo & FooIO
was an emerging convention for organizing very general
data types being fed by a variety of file formats, while
a single module Foo indicated support
for a particular program or source, like Entrez.
I think a workable convention, which is already followed by many Biopython modules, is the following:

1) Bio.SomeStuff is a module containing everything related to SomeStuff, where SomeStuff is some broadly-defined field within bioinformatics (Cluster for clustering algorithms, Phylo for phylogenetics, PopGen for population genetics, Entrez for NCBI Entrez related stuff, etc.).

2) Parsing SomeStuff files, which can be in a variety of formats, is done by a read() function (to parse a single record), and/or a parse() function (to parse multiple records). The implementation details of these functions are hidden in a submodule of Bio.SomeStuff. Typically, the user won't need to interact with the submodule directly.

3) The read() / parse() functions return Bio.SomeStuff.Record objects, where Bio.SomeStuff.Record is a class that represents the primary data structure of SomeStuff information.

This general framework may not be suitable in all aspects for all Biopython modules, and can be modified as needed. For example, I can imagine that the most important data structure in Bio.Phylo is a Tree object rather than a Record object.
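
As an illustration of point 2, here is a minimal sketch of a read() built on top of parse(). It assumes a toy format with one record per non-blank line and leaves out the format argument that the real functions take, so it is not the Biopython implementation.

```python
from io import StringIO


def parse(handle):
    """Yield one record per non-blank line (toy format)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = parse(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")


single = read(StringIO("(A,B);\n"))
many = list(parse(StringIO("(A,B);\n\n(C,D);\n")))
```

Keeping read() strict about "exactly one record" means a caller who expected a single tree finds out immediately when the file holds several.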
Post by Eric Talevich
But I think it would
be even cleaner if each Foo simply had a Foo.IO (or foo.io)
sub-module organizing the I/O for multiple file formats where
applicable.
I agree.
Post by Eric Talevich
The TreeIO.* namespace is not crowded -- just read, write,
parse, convert. If that directory is moved under Bio.Tree and
renamed to IO or io, then Bio.Tree would still seem reasonably
from io import *
from utils import *
Then "from Bio import Tree" would be enough for most uses.
Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.

Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those function access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
Post by Eric Talevich
Post by Peter Cock
perhaps something different like Bio.Phylo instead?
Sure, that sounds promising.
I agree that Bio.Phylo is a good name. Note also that there already is a Tree class in Bio.Cluster (it represents hierarchical clustering trees). Having a Bio.Phylo.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees is not confusing. On the other hand, having a Bio.Tree.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees could potentially be confusing.

--Michiel
Michiel de Hoon
2010-01-10 02:50:21 UTC
Permalink
I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it. One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

Thanks!

--Michiel
From: Eric Talevich <eric.talevich at gmail.com>
Subject: Re: [Biopython-dev] Code review request for phyloxml branch
To: "Michiel de Hoon" <mjldehoon at yahoo.com>
Cc: "Peter Cock" <p.j.a.cock at googlemail.com>, "BioPython-Dev Mailing List" <biopython-dev at biopython.org>
Date: Saturday, January 9, 2010, 6:38 PM
Hi,
Thanks for your comments. I've reorganized the modules like
Bio/Phylo/
    __init__.py, BaseTree.py, Newick.py, PhyloXML.py, Utils.py
    IO/
        __init__.py, NexusIO.py, NewickIO.py, PhyloXMLIO.py
Now "from Bio import Phylo" works for the common cases, and
"from
Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct
access to the
parsers.
I renamed TreeIO to Phylo/IO -- keeping it uppercase
because io is a
standard module in Py2.6+, Py2.7 changes the priority rules
for
absolute vs. relative imports, and Py2.4 doesn't support
the new
syntax for relative imports. I might change the other file
names to
lower case before the next merge, though...
On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon <mjldehoon at yahoo.com>
Post by Michiel de Hoon
Rather than importing *, can we import only those
functions that a user would actually use? We should avoid
importing stuff that is essentially used only locally in
each sub-module.
Post by Michiel de Hoon
Another option is to have all functions that are
intended to be used by the user in Bio.Phylo, and have those
function access (internally) any sub-module as needed. For
example, a user would not notice that Bio.Phylo.read
actually uses code from Bio.Phylo.io; the latter module
would not be accessed directly by the user.
I'm trying to avoid having to update Phylo/__init__.py each
time I add
or rename a public function in Utils.py or IO. So, how
I've added "__all__" definitions to Utils.py and
IO/__init__.py so
that only the relevant public functions are loaded when
Phylo/__init__.py imports * from those two sub-modules.
Testing
manually, this seems to do the right thing.
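
For readers unfamiliar with `__all__`, the effect can be shown with a small self-contained sketch. The module name `demo_utils` and the functions `pretty_print` and `helper` are made up for illustration; they are not the contents of Utils.py.

```python
import sys
import types

# Hypothetical stand-in for a sub-module such as Utils.py.
utils = types.ModuleType("demo_utils")
exec(
    "__all__ = ['pretty_print']\n"
    "def pretty_print(tree):\n"
    "    return 'tree: %s' % tree\n"
    "def helper():\n"
    "    return 'internal'\n",
    utils.__dict__,
)
sys.modules["demo_utils"] = utils

# This dict plays the role of the package's __init__.py namespace.
package_namespace = {}
exec("from demo_utils import *", package_namespace)

# Ignore dunder entries such as __builtins__ that exec() adds on its own.
exported = sorted(n for n in package_namespace if not n.startswith("__"))
```

Without the `__all__` line, `helper` would be pulled in as well, since its name has no leading underscore; with it, only `pretty_print` reaches the package namespace.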
Cheers,
Eric