Discussion: Google crawlers bringing down the server
Christof Meerwald
2013-11-24 22:51:58 UTC
Hi,

as I have now noticed for the second time that Google is trying to
crawl the Perforce web interface and in the process is overloading the
server and bringing it to a standstill, could someone (with the right
permissions) please check in a robots.txt file (directly under
//depot) with something like:

User-agent: *
Disallow: /


This should then be returned when accessing
http://perforce.openwatcom.org:4000/robots.txt and hopefully stop the
useless Google crawler.
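
A quick way to confirm the file is actually being served (only a rough
Python sketch; the URL and the two expected lines are simply the ones
from above):

check_robots.py
------------
from urllib.request import urlopen

URL = "http://perforce.openwatcom.org:4000/robots.txt"

# Fetch the file exactly like a crawler would and show what comes back.
with urlopen(URL, timeout=10) as resp:
    body = resp.read().decode("utf-8", errors="replace")
print(body)

# A well-behaved crawler should see these two lines and stay away.
if "User-agent: *" not in body or "Disallow: /" not in body:
    print("robots.txt is NOT being served as expected")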


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org
Marty Stanquist
2013-11-25 00:16:34 UTC
"Christof Meerwald" wrote in message news:***@msgid.cmeerw.org...

[snip]

Please review the following change:

Change 37509 created with 1 open file(s).
Submitting change 37509.
Locking 1 files ...
add //depot/robots.txt#1
Change 37509 submitted.

robots.txt
------------
User-agent: *
Disallow: /

Hope this works. Thanks.

Marty
Marty Stanquist
2013-11-25 01:47:41 UTC
"Marty Stanquist" wrote in message news:l6u4t5$qqk$***@www.openwatcom.org...

"Christof Meerwald" wrote in message news:***@msgid.cmeerw.org...

Hi,

as I have now noticed for the second time that Google is trying to
crawl through the Perforce web interface and in the process bringing
down the server to a standstill due to overloading, could someone
(with the right permissions) please check in a file robots.txt
(directly under //depot) with something like:

User-agent: *
Disallow: /


This should then hopefully be returned when accessing
http://perforce.openwatcom.org:4000/robots.txt and stop the useless
Google crawler.


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org

Please review the following change:

Change 37509 created with 1 open file(s).
Submitting change 37509.
Locking 1 files ...
add //depot/robots.txt#1
Change 37509 submitted.

robots.txt
------------
User-agent: *
Disallow: /

Hope this works. Thanks.

Marty

I'm not sure why the changelist does not show the file //depot/robots.txt,
or why the file does not show up on Perforce Web, but I believe it is in
the correct spot in the depot. Here is the file:

c:\SwDev\OW>p4 print //depot/robots.txt
//depot/robots.txt#3 - edit change 37511 (text)
User-agent: *
Disallow: /

When you access the link http://perforce.openwatcom.org:4000/robots.txt, you
get the message:

//depot/robots.txt - protected namespace - access denied.

Is this what you are looking for?

Marty
Christof Meerwald
2013-11-25 07:21:11 UTC
Post by Marty Stanquist
When you access the link http://perforce.openwatcom.org:4000/robots.txt, you get
//depot/robots.txt - protected namespace - access denied.
Is this what you are looking for?
No, it should instead return the contents of the file.

Hmm, maybe the client workspace also needs to be updated to explicitly
include the robots.txt file: the p4webnotes.html says "P4Web now
translates the URL http://www.hostname.com:8080/robots.txt as
//depot/robots.txt. This change enables bots crawling websites served
by P4Web to always locate robots.txt in the same location, regardless
of the P4Web instance. To function properly there must be a depot
named "depot" in the Perforce Server targeted by P4Web. The client
workspace used by P4Web must also include the view
//depot/robots.txt."


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org
Paul S. Person
2013-11-27 17:44:32 UTC
On Mon, 25 Nov 2013 07:21:11 +0000 (UTC), Christof Meerwald
Post by Christof Meerwald
[snip]
Nonetheless:

1) Perforce was accessible yesterday at about 11 AM (PST).
2) The Usenet server is accessible this morning.

So, whether it is working as expected or not, it does appear to be
working in the most important sense.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Marty Stanquist
2013-11-27 22:52:41 UTC
"Paul S. Person" wrote in message news:***@4ax.com...

[snip]
Nonetheless

1) Perforce was accessible yesterday at about 11 AM (PST).
2) The Usenet server is accessible this morning.

So, whether it is working as expected or not, it does appear to be
working in the most important sense.
--
"Nature must be explained in
her own terms through
the experience of our senses."

Due credit goes to Perforce and their UK support group. They are also
researching the web crawler issue and will get back to us with a
recommendation.

Marty
Marty Stanquist
2013-12-13 05:34:52 UTC
"Christof Meerwald" wrote in message news:***@msgid.cmeerw.org...

[snip]

Perforce has confirmed this fix, but only for crawlers that conform to the
robots.txt protocol. Are there other crawlers that we should be concerned
about?
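
For what it's worth, this is roughly what a conforming crawler does with
the file (a sketch only; the user-agent names are just examples):

robots_check.py
------------
from urllib.robotparser import RobotFileParser

# Read the rules the same way a well-behaved crawler would.
rp = RobotFileParser("http://perforce.openwatcom.org:4000/robots.txt")
rp.read()

# With "User-agent: *" / "Disallow: /" every compliant bot is shut out.
for agent in ("Googlebot", "bingbot", "SomeOtherBot"):
    print(agent, rp.can_fetch(agent, "http://perforce.openwatcom.org:4000/"))

# Expected: False for each of them. Crawlers that ignore robots.txt
# would have to be blocked at the web server or firewall instead.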

Marty
Paul S. Person
2013-12-14 18:57:39 UTC
It's been doing this for a week. And I thought we did that.

Maybe it is time to recognize this as what it clearly is -- a DOS
attack.

And, if the server is in the USA, report it to the FBI.

Maybe Google will take its misbehavior seriously if the bosses are
facing jail-time.

On Thu, 12 Dec 2013 23:34:52 -0600, "Marty Stanquist"
[snip]
--
"Nature must be explained in
her own terms through
the experience of our senses."
Marty Stanquist
2013-12-14 21:13:19 UTC
"Paul S. Person" wrote in message news:***@4ax.com...

[snip]
--
"Nature must be explained in
her own terms through
the experience of our senses."

The server is in California, USA, and yes, we did implement robots.txt. So
it may be a different type of crawler. I'm looking into this.

Marty
Paul S. Person
2013-12-15 18:02:42 UTC
On Sat, 14 Dec 2013 15:13:19 -0600, "Marty Stanquist"
Post by Paul S. Person
It's been doing this for a week. And I thought we did that.
Maybe it is time to recognize this as what it clearly is -- a DOS
attack.
And, if the server is in the USA, report it to the FBI.
Maybe Google will take its misbehavior seriously if the bosses are
facing jail-time.
<snippo>

Note: by quoting my sig, you caused Agent to not copy your reply.

I recommend not copying sigs.
Post by Marty Stanquist
The server is in California, USA and yes, we did implement robots.txt. So,
it may be a different type of crawler. I'm looking into this.
Marty
I apologize if I got a little excited. However, any program that shuts
down a web site for a week is clearly malware, no matter who wrote it
or what their excuse may be.

And you are right -- it might not be Google. It might be the German
Intelligence service, checking to see if any terrorists are using our
code base to pass messages (what, you thought only the NSA did that
stuff?). Or some kids having fun.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Mike
2014-02-07 21:06:11 UTC
Paul S. Person wrote:

[...]
Post by Paul S. Person
However, any program that shuts
down a web site for a week is clearly malware, no matter who wrote it
or what their excuse may be.
While that could probably be argued, and this reply is quite late,
could a front-end cache help? I'm thinking of something like Varnish.
--
Mike