If a crawler from MyJay just visited your site, this page explains what it is and how to control it.

It identifies itself as MyJaySearch/1.0 in its User-Agent string, with a link back to this exact page, and sends a custom X-Crawler-Info header pointing here too, so you don't have to guess what hit your server.
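
If you want to spot these requests in your own server code, a check along the following lines may be enough. It is only a sketch: the function name and the sample header values are made up for illustration, and the only thing taken from this page is the MyJaySearch product token in the User-Agent string.

```python
def is_myjay_crawler(headers: dict[str, str]) -> bool:
    """Return True if the request headers carry the MyJaySearch User-Agent token."""
    # HTTP header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    # Match on the product token only; the rest of the string is not assumed here.
    return "myjaysearch/" in user_agent.lower()


# Hypothetical request headers, for illustration only.
sample = {
    "User-Agent": "MyJaySearch/1.0 (+https://example.invalid/crawler-info)",
    "X-Crawler-Info": "https://example.invalid/crawler-info",
}
print(is_myjay_crawler(sample))
```

Keep in mind that any client can send any User-Agent string, so this identifies what a request claims to be rather than proving where it came from.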

Stop it from visiting your site

Three ways, in order of how fast they take effect:

Add a Disallow rule to your own robots.txt (for everyone, or specifically for User-agent: MyJaySearch). The rules are checked and respected before every page fetch, and the fetched copy of robots.txt is cached for at most 24 hours, so a change takes effect within a day.
Use the removal request link shown on any of your pages in search results, or submit a request directly with just your URL and an optional reason; no account is needed. Requests are typically reviewed within 24 hours, and an approved request blocks the domain and deletes it from the index.
If your site is on MyJay itself: turn off "Allow this site to be indexed by MyJay Search" in your dashboard's Settings tab. The change takes effect on the next crawl.
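
For the robots.txt option, you can sanity-check your rules locally before publishing them. The sketch below uses Python's standard urllib.robotparser, which is not necessarily the parser the crawler itself uses; the sample rules, the allowed helper, and the example.com URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Example rules: block MyJaySearch everywhere, block everyone else from /private/ only.
ROBOTS_TXT = """\
User-agent: MyJaySearch
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "MyJaySearch/1.0", "https://example.com/"))       # False
print(allowed(ROBOTS_TXT, "OtherBot", "https://example.com/about"))         # True
print(allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/page"))  # False
```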

What gets indexed, and how

Three platforms, three different discovery methods:

MyJay: every published site is indexed by default (see the opt-out above). Sites are discovered directly from MyJay's own records, so no crawling is needed for discovery.
Neocities: discovered from Neocities' own public /browse listing (sorted by newest and by most recently updated), the same page any search engine would use. The Neocities API is not used, because its terms explicitly disallow using it to enumerate all sites.
Nekoweb: Nekoweb has no bulk site listing at all. Sites arrive either through a manual submission (anyone can submit a URL) or by being linked to from a page on another site that's already indexed.

For a discovered site, the crawl follows links up to two clicks deep from its homepage, extracting the page title, meta description, and visible body text (scripts, navs, and footers are stripped first). A handful of simple heuristics (presence of <canvas>, image-heavy pages, frequent <article> tags, a few keyword checks) infer rough tags like blog, art, or portfolio, shown as filters in search.
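
The "two clicks deep" limit can be pictured as a breadth-first walk that stops expanding once it reaches depth 2. This is a minimal sketch, not the crawler's actual code: the link graph is a hypothetical in-memory dictionary of same-site links, and pages_within_depth is a name invented here.

```python
from collections import deque


def pages_within_depth(links: dict[str, list[str]], homepage: str,
                       max_depth: int = 2) -> dict[str, int]:
    """Breadth-first walk from homepage; maps each reachable page to its click depth."""
    depth_of = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # pages at the limit are kept, but their links are not followed
        for target in links.get(page, []):
            if target not in depth_of:  # first visit is the shortest path in BFS
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


# Hypothetical site: "/d" is three clicks from the homepage, so it is never reached.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/"], "/c": ["/d"]}
print(pages_within_depth(site, "/"))  # {'/': 0, '/a': 1, '/b': 1, '/c': 2}
```

In practice this means a page is only picked up if some path of at most two links leads to it from the homepage.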

Nothing more than that is stored: no personal data is pulled out of a page, and raw page content isn't kept beyond what's needed to show a title, excerpt, and tags in results.

robots.txt and rate limits

robots.txt is fetched and checked before a single page on a domain is touched, including its Crawl-delay directive if it sets one. The crawler never goes faster than one request per second to any single domain, and slows down further if a site's robots.txt asks for it. A page-level <meta name="robots" content="noindex"> tag or X-Robots-Tag header is honored too, even when robots.txt itself allows the path.
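
The pacing rule above amounts to taking the larger of the one-second floor and whatever Crawl-delay the site asks for. Here is a rough sketch using Python's standard urllib.robotparser, which only recognises whole-number Crawl-delay values; the effective_delay helper and the constant name are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

MIN_DELAY_SECONDS = 1.0  # floor of one request per second per domain, as described above


def effective_delay(robots_txt: str, user_agent: str = "MyJaySearch") -> float:
    """Seconds to wait between requests to one domain, given its robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    if requested is None:
        return MIN_DELAY_SECONDS
    # A site can ask for a slower pace, but never a faster one than the floor.
    return max(MIN_DELAY_SECONDS, float(requested))


print(effective_delay("User-agent: *\nCrawl-delay: 5\n"))            # 5.0
print(effective_delay("User-agent: *\nDisallow: /drafts/\n"))        # 1.0
print(effective_delay("User-agent: MyJaySearch\nCrawl-delay: 0\n"))  # 1.0
```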

If robots.txt can't be fetched at all (a timeout, a server error, anything other than a clean 404), the crawler skips that domain for the entire run rather than guessing it's safe to proceed. A 404 is treated as "no restrictions," which is the conventional interpretation among crawlers.
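
The fail-closed behavior described above can be summarised as a small decision function. This is a sketch under assumptions: the enum and function names are invented here, a status of None stands in for a timeout or network failure, and redirects are assumed to have been followed already so that only the final status is seen.

```python
from enum import Enum
from typing import Optional


class RobotsDecision(Enum):
    APPLY_RULES = "parse robots.txt and obey it"
    NO_RESTRICTIONS = "no robots.txt, crawl normally"
    SKIP_DOMAIN = "skip this domain for the run"


def decide(status: Optional[int]) -> RobotsDecision:
    """Map the outcome of fetching robots.txt to what the crawl does next."""
    if status == 404:
        return RobotsDecision.NO_RESTRICTIONS  # a clean 404 means no rules exist
    if status is not None and 200 <= status < 300:
        return RobotsDecision.APPLY_RULES
    # Timeouts (None), server errors, and any other status: do not guess.
    return RobotsDecision.SKIP_DOMAIN


for outcome in (200, 404, 500, None):
    print(outcome, "->", decide(outcome).name)
```

The point of the last branch is that an unreadable robots.txt is never treated as permission to crawl.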

Re-crawl schedule

An incremental pass runs daily (recently changed MyJay sites, recently updated Neocities sites); a full pass runs weekly across everything currently indexed. Re-crawling refreshes a page's title, excerpt, and tags, and can pick up new pages linked from existing ones.