Discussion:
[Wikisource-l] Drop OAI-PMH repository of Index: pages
Thomas PT
2016-12-30 16:31:39 UTC
Permalink
Hello everyone,

The ProofreadPage MediaWiki extension provides an OAI-PMH API that allows retrieving the content of Index: pages in a more or less structured format.

According to the Wikimedia Pageviews statistics tool, nobody seems to use it (probably because of the low quality of the data it provides).

To ease maintenance of the extension, it is probably a good idea to drop this feature. The source code will still be in the Git history, so if we ever want to bring the feature back, we won't have to start from scratch.

Are there any strong concerns about this?

Thomas
Federico Leva (Nemo)
2016-12-30 17:06:10 UTC
Permalink
For those who haven't seen it, it's this thing:
https://it.wikisource.org/wiki/Speciale:ProofreadIndexOai?verb=ListRecords&metadataPrefix=oai_dc
https://it.wikisource.org/wiki/Speciale:ProofreadIndexOai?verb=ListRecords&metadataPrefix=prp_qdc

(Enter a Special:ProofreadIndexOai URL in http://validator.oaipmh.com/
for a clickable interface to get previews.)
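For anyone who wants a quick look at the data before deciding, something like this should work against the URLs above. A rough sketch only: it assumes the standard OAI-PMH 2.0 response layout and the oai_dc metadata prefix, and does no resumption-token paging.

import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = "https://it.wikisource.org/wiki/Speciale:ProofreadIndexOai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

url = ENDPOINT + "?verb=ListRecords&metadataPrefix=oai_dc"
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Each <record> has a header (identifier, datestamp) and, with oai_dc,
# a Dublin Core metadata block carrying the Index: page fields.
for record in tree.iter(OAI + "record"):
    identifier = record.findtext(OAI + "header/" + OAI + "identifier")
    title = record.findtext(".//" + DC + "title")
    print(identifier, title)
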
Post by Thomas PT
Are there any strong concerns about this?
Removing the OAI-PMH endpoint probably means that nobody interested in
OAI-PMH will ever learn that such a possibility existed or what it used
to do, so it's very unlikely that any improvement would ever be made.

Do we have a list of current issues which make the code expensive to
maintain in the short term?

Nemo
Federico Leva (Nemo)
2016-12-30 17:11:27 UTC
Permalink
Sorry for the double message.
Post by Thomas PT
According to the Wikimedia Pageviews statistics tool
Did you literally use https://tools.wmflabs.org/pageviews , or did you
ask for real request data? The pageviews API doesn't count requests
to the OAI-PMH endpoint at all, because they have "content-type:
text/xml" while text/html is required:
https://meta.wikimedia.org/wiki/Research:Page_view#Definition
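For comparison, this is roughly what a per-article lookup against the public Pageviews REST API looks like. Since only text/html responses are counted, a zero for the special page proves nothing about OAI-PMH usage; the title and the User-Agent string below are just illustrations.

import json
import urllib.error
import urllib.request

# Hypothetical check: per-article page views for the special page itself.
# Requests answered with content-type text/xml (like OAI-PMH responses)
# are never counted, so a zero here says nothing about real usage.
article = "Speciale:ProofreadIndexOai"
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "it.wikisource.org/all-access/all-agents/" + article + "/daily/"
    "20161201/20161230"
)
req = urllib.request.Request(url, headers={"User-Agent": "oai-usage-check/0.1"})
try:
    with urllib.request.urlopen(req) as response:
        data = json.load(response)
    print("counted views:", sum(item["views"] for item in data["items"]))
except urllib.error.HTTPError as err:
    # The API answers 404 when it has no pageview data at all for a title.
    print("no pageview data:", err.code)
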

Only people with access to
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#wmf.webrequest
can extract data on how much it's used.

Nemo
Thomas PT
2016-12-30 17:15:19 UTC
Permalink
I did indeed use the pageviews API, so I now understand why the count was 0. Sorry for the false information, and thank you for the correction.

But my proposal still stands, as I do not know of any actual user of the API.

Thomas
Andrea Zanni
2016-12-31 11:58:13 UTC
Permalink
Hi Thomas.

A year ago I used the API: I downloaded the data from the Index pages, and
I think it would be good to have it while we still don't have Wikidata.
I guess it could be very useful for importing those data into Wikidata.

The problem with that API is that it works only on Index pages, which cover
only a fraction of the "book" entities on Wikisource. Index pages are not
linked in a structured way to their ns0 pages, and this is a problem for
us.

Ideally, we would know when an Index page has only one ns0 page, and we
would use the same set of data to create one entity (or more) on Wikidata.

I know that Sam is trying to develop a similar tool:
https://tools.wmflabs.org/ws-search/
and I don't know if that uses your API.

Aubrey
Thomas PT
2016-12-31 12:09:02 UTC
Permalink
Hello Andrea,
I guess it could be very useful for importing those data into Wikidata.
Even after removing the OAI-PMH API, we could still extract data from the Index: page serialization. It's a bit more difficult, but not much more (and definitely far less difficult than the entity-matching problem).
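For example, a rough sketch of what that extraction could look like, using the standard MediaWiki API to fetch the raw wikitext. The page title and the field names below are just illustrative; each wiki's index template defines its own fields.

import json
import re
import urllib.parse
import urllib.request

API = "https://en.wikisource.org/w/api.php"
title = "Index:Example.djvu"  # hypothetical Index: page

params = urllib.parse.urlencode({
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "titles": title,
    "format": "json",
    "formatversion": "2",
})
with urllib.request.urlopen(API + "?" + params) as response:
    data = json.load(response)

wikitext = data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

# Very naive parse of the |Field=value parameters; a real importer would
# read the wiki's index template definition to know which fields exist.
fields = dict(re.findall(r"^\|\s*(\w+)\s*=\s*(.*)$", wikitext, re.MULTILINE))
print(fields.get("Title"), fields.get("Author"))
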
The problem with that API is that it works only on Index pages, which cover only a fraction of the "book" entities on Wikisource. Index pages are not linked in a structured way to their ns0 pages, and this is a problem for us.
It's possible to retrieve the ns0 pages that use a given Index: page via the <pages> tag (you just have to retrieve the list of transclusions of the Index: page as if it were a regular template).
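Something along these lines should work, a sketch against the standard MediaWiki API; the Index: title is a made-up example.

import json
import urllib.parse
import urllib.request

API = "https://en.wikisource.org/w/api.php"
params = urllib.parse.urlencode({
    "action": "query",
    "list": "embeddedin",
    "eititle": "Index:Example.djvu",  # hypothetical Index: page
    "einamespace": "0",               # only main-namespace (ns0) pages
    "eilimit": "max",
    "format": "json",
    "formatversion": "2",
})
with urllib.request.urlopen(API + "?" + params) as response:
    data = json.load(response)

# Pages transcluding the Index: page, exactly as for a regular template.
ns0_pages = [page["title"] for page in data["query"]["embeddedin"]]
print(ns0_pages)
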
Ideally, we would know when an Index page has only one ns0 page, and we would use the same set of data to create one entity (or more) on Wikidata.
Yes. What we could do is check whether the "Title" field of the Index page contains only one link to an ns0 page and treat that as the "one" ns0 page. Another option, when the header feature of the <pages> tag is used, is to retrieve the pages that use the automatic summary feature and, if there is only one, treat that as the "one".
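A rough sketch of the first heuristic; the helper and the wikitext sample are just illustrative, and a real matcher would also resolve redirects and subpages.

import re

def single_ns0_link(title_field):
    """Return the sole main-namespace link target in the field, or None."""
    targets = re.findall(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]", title_field)
    # Links with a namespace prefix (Author:, Index:, Page:, ...) are not ns0.
    ns0 = [t.strip() for t in targets if ":" not in t]
    return ns0[0] if len(ns0) == 1 else None

print(single_ns0_link("''[[The Example Book]]'' by [[Author:Some One|Some One]]"))
# -> The Example Book
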
and I don't know if that uses your API.
I believe it doesn't, but we should definitely ask him whether the API is useful for his use case.

Thomas