commit: 60cb31cb1d804aa02efb9931e55c92c4d6840887
parent 1308b294e225cf149ccbbce5b9f132f11a37bbb2
Author: Drew DeVault <sir@cmpwn.com>
Date: Tue, 17 Nov 2020 13:31:24 -0500
Minor tweaks
Diffstat:
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/content/blog/Better-than-DuckDuckGo.gmi b/content/blog/Better-than-DuckDuckGo.gmi
@@ -18,7 +18,7 @@ So, a SourceHut-style approach is better. 100% of the software would be free sof
It would also need its own crawler, and probably its own indexer. I’m not convinced that any of the existing FOSS solutions in this space are quite right for this problem. Crucially, I would not have it crawling the entire web from the outset. Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be the limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results. Pages that these sites link to would be crawled as well, and given tier 2 status, recursively up to an arbitrary N tiers. Users who want to find, say, a blog post about a subject rather than the documentation on that subject, would have to be more specific: "$subject blog posts".
-An advantage of this design is that it would be easy for anyone to take the software stack and plop it on their own servers, with their own whitelist of tier 1 domains, to easily create a domain-specific search engine. Independent groups could create search engines which specialize in academia, open standards, specific fandoms, and so on.
+An advantage of this design is that it would be easy for anyone to take the software stack and plop it on their own servers, with their own whitelist of tier 1 domains, to easily create a domain-specific search engine. Independent groups could create search engines which specialize in academia, open standards, specific fandoms, and so on. They could tweak their precise approach to indexing, tokenization, and so on to better suit their domain.
We should also prepare the software to boldly lead the way on new internet standards. Crawling and indexing non-HTTP data sources (Gemini? Man pages? Linux distribution repositories?), supporting non-traditional network stacks (Tor? Yggdrasil? cjdns?) and third-party name systems (OpenNIC?), and anything else we could leverage our influence to give a leg up on.
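A minimal sketch of the tiering described in the post, assuming the crawl can be modelled as a breadth-first walk over a link graph: whitelisted sources are tier 1, anything they link to is tier 2, and so on up to N tiers. The names `assign_tiers`, `LINKS`, and the example hostnames are hypothetical, and the in-memory graph stands in for links a real crawler would discover over the network.

```python
from collections import deque


def assign_tiers(tier1, links, max_tier):
    """Assign a tier to every site reachable from the tier 1 whitelist.

    tier1    -- iterable of whitelisted sites (hypothetical input format)
    links    -- dict mapping a site to the sites it links to
    max_tier -- the "arbitrary N" from the post; nothing deeper is crawled
    """
    tiers = {site: 1 for site in tier1}
    queue = deque(tiers)
    while queue:
        site = queue.popleft()
        tier = tiers[site]
        if tier >= max_tier:
            # Sites at the last tier are indexed but not followed further.
            continue
        for target in links.get(site, ()):
            # Breadth-first order means the first tier seen is the lowest one.
            if target not in tiers:
                tiers[target] = tier + 1
                queue.append(target)
    return tiers


# Hypothetical link graph, used here instead of real crawling.
LINKS = {
    "docs.example.org": ["blog.example.net", "spec.example.com"],
    "blog.example.net": ["forum.example.io"],
    "forum.example.io": ["far.example.dev"],
}

TIERS = assign_tiers(["docs.example.org"], LINKS, max_tier=3)
for site, tier in sorted(TIERS.items(), key=lambda item: item[1]):
    print(tier, site)
```

With `max_tier=3`, `far.example.dev` is never assigned a tier because it is only reachable from a tier 3 site. Swapping in a different whitelist is all it takes to retarget the sketch at another domain, which is the property the post relies on for domain-specific search engines.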
diff --git a/content/blog/Better-than-DuckDuckGo.md b/content/blog/Better-than-DuckDuckGo.md
@@ -64,7 +64,7 @@ for this problem. Crucially, I would *not* have it crawling the entire web from
the outset. Instead, it should crawl a whitelist of domains, or "tier 1"
domains. These would be the limited mainly to authoritative or high-quality
sources for their respective specializations, and would be weighed upwards in
-search results. Pages that these sites link *to* would be crawled as well, and
+search results. Pages that these sites link to would be crawled as well, and
given tier 2 status, recursively up to an arbitrary N tiers. Users who want to
find, say, a blog post about a subject rather than the documentation on that
subject, would have to be more specific: "$subject blog posts".
@@ -73,7 +73,8 @@ An advantage of this design is that it would be easy for anyone to take the
software stack and plop it on their own servers, with their own whitelist of
tier 1 domains, to easily create a domain-specific search engine. Independent
groups could create search engines which specialize in academia, open standards,
-specific fandoms, and so on.
+specific fandoms, and so on. They could tweak their precise approach to
+indexing, tokenization, and so on to better suit their domain.
We should also prepare the software to boldly lead the way on new internet
standards. Crawling and indexing non-HTTP data sources (Gemini? Man pages?