[LLVMdev] llvm.org robots.txt prevents crawling by Google code search?

Talin viridia at gmail.com
Mon Oct 18 14:10:57 PDT 2010


On Thu, Oct 14, 2010 at 10:28 AM, Talin <viridia at gmail.com> wrote:

> On Wed, Oct 13, 2010 at 11:10 PM, Anton Korobeynikov <
> anton at korobeynikov.info> wrote:
>
>> > indexing the llvm.org svn archive. This means that when you search for an
>> > LLVM-related symbol in code search, you get one of the many (possibly
>> > out-of-date) mirrors, rather than the up-to-date llvm.org version. This
>> > is sad.
>> This is intentional. The workload of the server was pretty huge w/o this.
>>
>
> Could we at least add a rule allowing the codesearch crawler, rather than
> opening it up to all crawlers? The user agent string is
> SVN/1.5.4/GoogleCodeSearch.
>

So what I am proposing is replacing the contents of the robots.txt with the
following:
----------------------------------------------------------
User-agent: GoogleCodeSearch
Allow: /svn
Disallow: /

User-agent: *
Disallow: /bugs
Disallow: /doxygen
Disallow: /cvsweb
Disallow: /stats
Disallow: /testresults/X86
Disallow: /nightlytest
Disallow: /viewvc
Disallow: /nightlytest2
Disallow: /devmtg/2008-08/*.m4v$
Disallow: /devmtg/2008-08/*.3gp$
Disallow: /svn
----------------------------------------------------------

(See also http://www.robotstxt.org/norobots-rfc.txt)
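As a quick sanity check of the proposed rules, they can be run through Python's stdlib `urllib.robotparser`, which applies first-match semantics per user-agent block (so the `Allow: /svn` line takes effect before the blanket `Disallow: /` for the code search crawler). This is just a sketch; the agent string `"GoogleCodeSearch"` is used directly here, since how a given crawler matches its full UA (e.g. `SVN/1.5.4/GoogleCodeSearch`) against the record name depends on the crawler:

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the proposed robots.txt for llvm.org
rules = """\
User-agent: GoogleCodeSearch
Allow: /svn
Disallow: /

User-agent: *
Disallow: /bugs
Disallow: /svn
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The code search crawler should be allowed into /svn but nothing else...
print(rp.can_fetch("GoogleCodeSearch", "/svn/llvm-project/"))   # True
print(rp.can_fetch("GoogleCodeSearch", "/doxygen/index.html"))  # False

# ...while generic crawlers remain blocked from /svn.
print(rp.can_fetch("SomeOtherBot", "/svn/llvm-project/"))       # False
```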

-- 
-- Talin
