Discussion:
[Wikisource-l] Upload/import wizard
Sam Wilson
2017-01-02 04:29:19 UTC
Hi all,

I've attempted to start a phab ticket about what the import wizard
should look like:
https://phabricator.wikimedia.org/T154413

There are plenty of unanswered questions I'm sure, and lots missing
still. Please edit the task or add comments about anything.

This is 2016 Wishlist #73, so I'm not sure it'll get much 'official'
comm-tech time (yet; there *is* a plan to address further-down wishes,
but they may take some time), but I'm keen to work on it in my own time
anyway.

One thing I'd love to have in a Wikisource upload wizard is a thing that
I can show to Glam people that makes it easier for them to see the value
(and ease) in getting their stuff online and ready for crowd-sourced
transcription. :-)

Thanks,
Sam.
Alex Brollo
2017-01-02 09:08:49 UTC
Very interesting.

About DjVu files on IA: they can be built simply by running pdf2djvu on the
PDF files from IA, but the quality is very poor; or they can be built, with
some more pain, from the _jp2.zip images merged with the _djvu.xml files, in
which case the quality is high but the resulting DjVu is heavy.

As Aubrey said some time ago, it.source uses a Python script to do the
latter job, but it is a DIY (do it yourself) script, just to prove that *it
can be done*.

Alex
Sam Wilson
2017-01-02 09:17:13 UTC
Yes, I've been wondering about the best approach with that. Obviously,
better quality is better, but we don't want to overwhelm the various
tools that deal with the DjVus.


And if we're building DjVus for Commons from IA files (either PDFs or
the Jpegs), should we also be adding those DjVus back to the IA item?
(Actually, can we even edit IA items that we haven't created ourselves?)
I'm leaning towards not doing so (but maybe adding a comment to the IA item
that links to the DjVu on Commons).


—sam
Andrea Zanni
2017-01-02 09:29:01 UTC
Ideally, we should talk to IA about this.
Adding a comment on the IA item is a very low-cost solution, and I think it is
important; adding the DjVu would be much better. We should check whether a
script can edit every kind of item and add files (I think not).

Aubrey
Sam Wilson
2017-01-02 13:37:58 UTC
Yes, good idea about talking to them.



I wonder about the workflow too, because what about the situation of
someone uploading a new work with our tool: the script creates a new
IA item then (I assume as the 'wikisource-import-tool' or whatever
user) and then it will have full permissions over that item. So the
update-DjVu scenario will only apply for IA items that already exist
but which don't have DjVu files (i.e. only the last few months'
worth). Which is good...


—sam
Alex Brollo
2017-01-02 15:49:59 UTC
Please take a look at https://archive.org/details/spinoza_etica_paravia_djvu;
this is precisely a DjVu-only item that I uploaded some days ago. I asked on
the IA forum for permission to create "DjVu-only items" and I got it; this
is the first item I created. As you can see, there's an "implicit convention"
too: the name of the item is the original one plus a _djvu suffix (it has been
derived from https://archive.org/details/spinoza_etica_paravia), and the
metadata are the same, except for a standard warning, "Derived from files into
L'Etica <https://archive.org/details/spinoza_etica_paravia>", added to the
description field.

So far I have not done the last step, i.e. adding a "backlink" from the
original item to the derived one.

internetarchive.py allows automating the whole job (downloading the metadata
of the source item, building the new item name, adding the warning to the
description field, and uploading the new item).
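
As a rough illustration of those four steps (a sketch only, not the script
actually used on it.source): the helper names below are hypothetical, the
choice of which metadata fields to copy is an assumption, and so is writing
the warning as an HTML link. The upload function relies on the third-party
internetarchive package (get_item and upload) and needs IA credentials, so
it is kept separate from the pure helpers.

```python
DERIVED_SUFFIX = "_djvu"  # naming convention described above


def derived_identifier(source_id):
    """Name of the derived item: the original identifier plus a _djvu suffix."""
    return source_id + DERIVED_SUFFIX


def derived_metadata(source_id, source_metadata):
    """Copy the source metadata and put the 'Derived from' warning first.

    Assumption: every field except 'identifier' is copied as-is; a real
    script would probably also drop IA system fields.
    """
    metadata = dict(source_metadata)
    metadata.pop("identifier", None)  # the new item gets its own identifier
    title = source_metadata.get("title", source_id)
    warning = (
        'Derived from files into '
        f'<a href="https://archive.org/details/{source_id}">{title}</a>'
    )
    description = source_metadata.get("description", "")
    if isinstance(description, list):  # IA fields may hold several values
        description = "\n".join(description)
    metadata["description"] = f"{warning}\n{description}".strip()
    return metadata


def upload_derived_item(source_id, djvu_path):
    """Fetch the source metadata and upload the DjVu as a new item."""
    import internetarchive  # third-party: pip install internetarchive

    source = internetarchive.get_item(source_id)
    return internetarchive.upload(
        derived_identifier(source_id),
        files=[djvu_path],
        metadata=derived_metadata(source_id, source.metadata),
    )
```

For example, derived_identifier("spinoza_etica_paravia") gives
"spinoza_etica_paravia_djvu", matching the item linked above.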

Alex
Alex Brollo
2017-01-02 15:52:37 UTC
Done :-)

Alex
Sam Wilson
2017-01-02 23:07:07 UTC
Good idea. I guess it's not ideal to end up with two items, but at least
the 2nd will be updateable from our end.


It looks like we can add HTML links to IA reviews too, which is nice:
https://archive.org/details/spinoza_etica_paravia
Sam Wilson
2017-01-02 23:19:14 UTC
I wonder if, rather than creating a new IA item, we should just link the
original IA item to the DjVu on Commons (via a review)? Or is there a
discoverability benefit to be had by having the DjVu also on IA?
Alex Brollo
2017-01-02 23:46:22 UTC
You can see a great advantage of DjVu files over PDF files in the current
file list of any IA item. You can see that IA has removed the DjVu files, but
it still builds and publishes the _djvu.xml file. Why? I presume that IA uses
that file to "map words" in its book viewer, since it has a good text
structure while being *pretty simple*. It can be translated into hOCR, and
after editing its text nodes, the edited text can be loaded back into the
DjVu file. It.source is testing, on some texts, tricks to mass-fix the DjVu
text layer (removing scannos etc.) *before* uploading the file to Commons.

It's a pity, IMHO, that this magic book format has been disregarded. Its
structure is *open* just as the PDF structure is *closed*.
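
To illustrate the point about editing text nodes (a minimal sketch, not the
it.source tooling): the sample below is a hand-written fragment that assumes
the usual DjVuXML layout of WORD elements carrying a coords attribute, and
fix_words is a hypothetical helper. It only rewrites the text of each WORD
node and leaves the coordinates untouched, which is what keeps the word
mapping intact.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to resemble a page of a _djvu.xml file.
SAMPLE = """<DjVuXML>
<BODY>
<OBJECT>
<HIDDENTEXT>
<PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="100,200,180,170">Ethica</WORD>
<WORD coords="190,200,260,170">ordinc</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>"""


def fix_words(xml_text, corrections):
    """Replace known scannos in WORD text nodes; coords stay unchanged.

    Returns the corrected XML as a string and the number of words changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for word in root.iter("WORD"):
        fixed = corrections.get(word.text)
        if fixed is not None:
            word.text = fixed
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


corrected_xml, count = fix_words(SAMPLE, {"ordinc": "ordine"})
```

Loading the corrected text back into the DjVu file is a separate step that
is not shown here.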

Alex
Sam Wilson
2017-01-05 00:50:41 UTC
There's also this new Phab task, that's looking at a more limited
first-step:


Investigation: Could we build a Tool Labs project to generate Djvu files for WikiSource
https://phabricator.wikimedia.org/T154538

Ankry
2017-01-02 11:52:25 UTC
Post by Alex Brollo
Very interesting.
About djvu files on IA, they can be built simply by pdf2djvu from pdf files
of IA, but quality is very poor;
[...]

Did you try to set the -d parameter to something higher than the default 300?
While converting PDF files from Polish digital libraries, I often use -d
450 or -d 600 with good results.
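
For anyone scripting this, a small sketch of the call: it assumes pdf2djvu
is installed and on the PATH, the function names are hypothetical, and the
default of 600 here is just one of the values mentioned above. Only the -d
(resolution) and -o (output file) options are used.

```python
import subprocess


def pdf2djvu_command(pdf_path, djvu_path, dpi=600):
    """Build the pdf2djvu argument list with an explicit -d resolution."""
    return ["pdf2djvu", "-d", str(dpi), "-o", djvu_path, pdf_path]


def convert(pdf_path, djvu_path, dpi=600):
    """Run pdf2djvu; raises CalledProcessError if the conversion fails."""
    subprocess.run(pdf2djvu_command(pdf_path, djvu_path, dpi), check=True)
```

Calling convert("book.pdf", "book.djvu", dpi=450) would be the equivalent of
running pdf2djvu -d 450 -o book.djvu book.pdf by hand.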

Ankry
Alex Brollo
2017-01-02 12:37:59 UTC
The problem is that many new IA PDF files have poor resolution or too much
compression from the beginning, so their quality can't be improved.

The IA viewer doesn't use the PDF or DjVu file; it uses JPG images derived
from the JP2 images. This explains why the images seen in the viewer are so
beautiful, while the PDF or DjVu files are poor.

@Sam: About uploading a DjVu into an IA item that lacks one: no, nobody but
the original contributor or a sysop can upload files into an item. But it can
be uploaded as a new item linked with the original one; its link could be
shown in the source item by adding a comment (a "review").