Discussion: Google crawlers bringing down the server
Christof Meerwald
2013-11-24 22:51:58 UTC
Hi,

as I have now noticed for the second time that Google is trying to
crawl the Perforce web interface and in the process is overloading the
server and bringing it to a standstill, could someone (with the right
permissions) please check in a robots.txt file (directly under
//depot) with something like:

User-agent: *
Disallow: /


This should then be returned when accessing
http://perforce.openwatcom.org:4000/robots.txt and hopefully stop the
useless Google crawler.
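
A quick way to confirm the file is actually being served (only a rough
Python sketch; the URL and the two expected lines are simply the ones
from above):

check_robots.py
------------
from urllib.request import urlopen

URL = "http://perforce.openwatcom.org:4000/robots.txt"

# Fetch the file exactly like a crawler would and show what comes back.
with urlopen(URL, timeout=10) as resp:
    body = resp.read().decode("utf-8", errors="replace")
print(body)

# A well-behaved crawler should see these two lines and stay away.
if "User-agent: *" not in body or "Disallow: /" not in body:
    print("robots.txt is NOT being served as expected")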


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org
Marty Stanquist
2013-11-25 00:16:34 UTC
"Christof Meerwald" wrote in message news:***@msgid.cmeerw.org...

[snip]

Please review the following change:

Change 37509 created with 1 open file(s).
Submitting change 37509.
Locking 1 files ...
add //depot/robots.txt#1
Change 37509 submitted.

robots.txt
------------
User-agent: *
Disallow: /

Hope this works. Thanks.

Marty
Marty Stanquist
2013-11-25 01:47:41 UTC
"Marty Stanquist" wrote in message news:l6u4t5$qqk$***@www.openwatcom.org...

"Christof Meerwald" wrote in message news:***@msgid.cmeerw.org...

Hi,

as I have now noticed for the second time that Google is trying to
crawl through the Perforce web interface and in the process bringing
down the server to a standstill due to overloading, could someone
(with the right permissions) please check in a file robots.txt
(directly under //depot) with something like:

User-agent: *
Disallow: /


This should then hopefully be returned when accessing
http://perforce.openwatcom.org:4000/robots.txt and stop the useless
Google crawler.


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org

Please review the following change:

Change 37509 created with 1 open file(s).
Submitting change 37509.
Locking 1 files ...
add //depot/robots.txt#1
Change 37509 submitted.

robots.txt
------------
User-agent: *
Disallow: /

Hope this works. Thanks.

Marty

I'm not sure why the changelist does not show the file //depot/robots.txt,
or why the file does not show up on Perforce Web, but I believe it is in
the correct spot in the depot. Here is the file:

c:\SwDev\OW>p4 print //depot/robots.txt
//depot/robots.txt#3 - edit change 37511 (text)
User-agent: *
Disallow: /

When you access the link http://perforce.openwatcom.org:4000/robots.txt, you
get the message:

//depot/robots.txt - protected namespace - access denied.

Is this what you are looking for?

Marty
Christof Meerwald
2013-11-25 07:21:11 UTC
Post by Marty Stanquist
When you access the link http://perforce.openwatcom.org:4000/robots.txt, you get
//depot/robots.txt - protected namespace - access denied.
Is this what you are looking for?
No, it should instead return the contents of the file.

Hmm, maybe the client workspace also needs to be updated to explicitly
include the robots.txt file: the p4webnotes.html says "P4Web now
translates the URL http://www.hostname.com:8080/robots.txt as
//depot/robots.txt. This change enables bots crawling websites served
by P4Web to always locate robots.txt in the same location, regardless
of the P4Web instance. To function properly there must be a depot
named "depot" in the Perforce Server targeted by P4Web. The client
workspace used by P4Web must also include the view
//depot/robots.txt."


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org
Paul S. Person
2013-11-27 17:44:32 UTC
On Mon, 25 Nov 2013 07:21:11 +0000 (UTC), Christof Meerwald
Post by Christof Meerwald
[snip]
Nonetheless:

1) Perforce was accessible yesterday at about 11 AM (PST).
2) The Usenet server is accessible this morning.

So, whether it is working as expected or not, it does appear to be
working in the most important sense.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Marty Stanquist
2013-11-27 22:52:41 UTC
"Paul S. Person" wrote in message news:***@4ax.com...

[snip]
Nonetheless

1) Perforce was accessible yesterday at about 11 AM (PST).
2) The Usenet server is accessible this morning.

So, whether it is working as expected or not, it does appear to be
working in the most important sense.
--
"Nature must be explained in
her own terms through
the experience of our senses."

Due credit goes to Perforce and their UK support group. They are also
researching the web crawler issue and will get back to us with a
recommendation.

Marty
Marty Stanquist
2013-12-13 05:34:52 UTC
"Christof Meerwald" wrote in message news:***@msgid.cmeerw.org...

[snip]

Perforce has confirmed this fix, but only for crawlers that conform to the
robots.txt protocol. Are there other crawlers that we should be concerned
about?
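
For what it's worth, this is roughly what a conforming crawler does with
the file (a sketch only; the user-agent names are just examples):

robots_check.py
------------
from urllib.robotparser import RobotFileParser

# Read the rules the same way a well-behaved crawler would.
rp = RobotFileParser("http://perforce.openwatcom.org:4000/robots.txt")
rp.read()

# With "User-agent: *" / "Disallow: /" every compliant bot is shut out.
for agent in ("Googlebot", "bingbot", "SomeOtherBot"):
    print(agent, rp.can_fetch(agent, "http://perforce.openwatcom.org:4000/"))

# Expected: False for each of them. Crawlers that ignore robots.txt
# would have to be blocked at the web server or firewall instead.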

Marty
Paul S. Person
2013-12-14 18:57:39 UTC
It's been doing this for a week. And I thought we did that.

Maybe it is time to recognize this as what it clearly is -- a DOS
attack.

And, if the server is in the USA, report it to the FBI.

Maybe Google will take its misbehavior seriously if the bosses are
facing jail-time.

On Thu, 12 Dec 2013 23:34:52 -0600, "Marty Stanquist"
[snip]
--
"Nature must be explained in
her own terms through
the experience of our senses."
Marty Stanquist
2013-12-14 21:13:19 UTC
"Paul S. Person" wrote in message news:***@4ax.com...

[snip]
--
"Nature must be explained in
her own terms through
the experience of our senses."

The server is in California, USA, and yes, we did implement robots.txt. So
it may be a different type of crawler. I'm looking into this.

Marty
Paul S. Person
2013-12-15 18:02:42 UTC
On Sat, 14 Dec 2013 15:13:19 -0600, "Marty Stanquist"
Post by Paul S. Person
It's been doing this for a week. And I thought we did that.
Maybe it is time to recognize this as what it clearly is -- a DOS
attack.
And, if the server is in the USA, report it to the FBI.
Maybe Google will take its misbehavior seriously if the bosses are
facing jail-time.
<snippo>

Note: by quoting my sig, you caused Agent to not copy your reply.

I recommend not copying sigs.
Post by Marty Stanquist
The server is in California, USA and yes, we did implement robots.txt. So,
it may be a different type of crawler. I'm looking into this.
Marty
I apologize if I got a little excited. However, any program that shuts
down a web site for a week is clearly malware, no matter who wrote it
or what their excuse may be.

And you are right -- it might not be Google. It might be the German
Intelligence service, checking to see if any terrorists are using our
code base to pass messages (what, you thought only the NSA did that
stuff?). Or some kids having fun.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Mike
2014-02-07 21:06:11 UTC
Paul S. Person wrote:

[...]
Post by Paul S. Person
However, any program that shuts
down a web site for a week is clearly malware, no matter who wrote it
or what their excuse may be.
While that could probably be argued, and this reply is quite late,
could a front-end cache help? I'm thinking of something like Varnish.
--
Mike