---
title: We can do better than DuckDuckGo
date: 2020-11-17
outputs: [html, gemtext]
---

[DuckDuckGo](https://duckduckgo.com) is one of the long-time darlings of the
technophile's pro-privacy recommendations, and in fact the search engine that I
use myself on the daily. They certainly present a more compelling option than
many of the incumbents, like Google or Bing. Even so, DuckDuckGo is not good
enough, and we ought to do better.

I have three grievances with DuckDuckGo:

1. **It's not open source.** Almost all of DDG's software is proprietary, and
   they've demonstrated [gross incompetence][github] in privacy in what little
   software they have made open source. Who knows what else is going on in the
   proprietary code?
2. **DuckDuckGo is not a search engine.** It's more aptly described as a search
   engine frontend. They *do* handle features like bangs and instant answers
   internally, but their actual search results come from third parties like
   Bing. They don't operate a crawler for their search results, and are not
   independent.
3. **The search results suck!** The authoritative sources for anything I want to
   find are almost always buried beneath 2-5 results from content scrapers and
   blogspam. This is also true of other search engines like Google. Search
   engines are highly vulnerable to abuse, and they aren't doing enough to
   address it.

[github]: https://github.com/duckduckgo/Android/issues/527

There are some FOSS attempts to do better here, but they all fall flat.
[searX](https://github.com/bauruine/searx/) is also a false search engine
— that is, they serve someone else's results. [YaCy](https://yacy.net/)
has its own crawler, but the distributed design makes results intolerably
slow, poor quality, and vulnerable to abuse, and it's missing strong central
project leadership.

We need a real, working FOSS search engine, complete with its own crawler.
Here's how I would design it.

First, YaCy-style decentralization is *way* too hard to get right, especially
when a search engine project already has a lot of Very Hard problems to solve.
Federation is also very hard in this situation — queries will have to
consult *most* instances in order to get good quality results, or a novel
sharding algorithm will have to be designed, and either approach will have to be
tolerant of nodes appearing and disappearing at any time. Not to mention it'd be
slow! Several unsolved problems with federation and decentralization would have
to be addressed on top of building a search engine in the first place.

So, a SourceHut-style approach is better. 100% of the software would be free
software, and third parties would be encouraged to set up their own
installations. It would use standard protocols and formats where applicable, and
accept patches from the community. However, the database would still be
centralized, and even if programmable access were provided, it would not be with
an emphasis on decentralization or shared governance. It might be possible to
design tools which help third parties bootstrap their indexes, and create a
community of informal index sharing, but that's not the focus here.

It would also need its own crawler, and probably its own indexer. I'm not
convinced that any of the existing FOSS solutions in this space are quite right
for this problem. Crucially, I would *not* have it crawling the entire web from
the outset. Instead, it should crawl a whitelist of domains, or "tier 1"
domains. These would be limited mainly to authoritative or high-quality
sources for their respective specializations, and would be weighted upwards in
search results. Pages that these sites link to would be crawled as well, and
given tier 2 status, recursively up to an arbitrary N tiers. Users who want to
find, say, a blog post about a subject rather than the documentation on that
subject, would have to be more specific: "$subject blog posts".
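
To make the tiering concrete, here's a minimal sketch of the crawl scheduling
in Go. Everything here is hypothetical: the `fetchLinks` helper, the canned
link graph, and the exact boost curve are stand-ins for whatever the real
crawler and ranker would do.

```go
package main

import "fmt"

// fetchLinks is a hypothetical stand-in for the real crawler: fetch a page
// and return the URLs it links to. Here it consults a canned link graph so
// the sketch is self-contained.
func fetchLinks(url string, graph map[string][]string) []string {
	return graph[url]
}

// crawl walks outward from the tier 1 whitelist, assigning each page the
// lowest tier it is reachable at, and stops following links at maxTier.
func crawl(seeds []string, graph map[string][]string, maxTier int) map[string]int {
	tiers := make(map[string]int)
	var queue []string
	for _, url := range seeds {
		tiers[url] = 1
		queue = append(queue, url)
	}
	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		if tiers[url] >= maxTier {
			continue // links from here would exceed tier N
		}
		for _, link := range fetchLinks(url, graph) {
			if _, seen := tiers[link]; !seen {
				tiers[link] = tiers[url] + 1
				queue = append(queue, link)
			}
		}
	}
	return tiers
}

// tierBoost weights tier 1 sources upward at ranking time; the curve is a
// tuning knob, not part of the design.
func tierBoost(tier int) float64 {
	return 1.0 / float64(tier)
}

func main() {
	graph := map[string][]string{
		"docs.example.org": {"blog.example.net"},
		"blog.example.net": {"scraper.example.com"},
	}
	for url, tier := range crawl([]string{"docs.example.org"}, graph, 3) {
		fmt.Printf("%-22s tier %d  boost %.2f\n", url, tier, tierBoost(tier))
	}
}
```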

An advantage of this design is that it would be easy for anyone to take the
software stack and plop it on their own servers, with their own whitelist of
tier 1 domains, to easily create a domain-specific search engine. Independent
groups could create search engines which specialize in academia, open standards,
specific fandoms, and so on. They could tweak their precise approach to
indexing, tokenization, and so on to better suit their domain.
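
Concretely, standing up a specialized instance might come down to swapping out
the seed list and crawl depth. This sketch assumes the tier parameters from
above; the field names and domains are illustrative, not a real configuration
format.

```go
package config

// Academia is a hypothetical configuration for an academia-focused
// instance; only the tier 1 whitelist and the crawl depth differ from a
// general-purpose deployment.
var Academia = struct {
	Tier1   []string // authoritative seed domains, weighted upward
	MaxTier int      // how many hops of linked pages to index
}{
	Tier1: []string{
		"arxiv.org",
		"dblp.org",
		"ncbi.nlm.nih.gov",
	},
	MaxTier: 2, // stay close to the authoritative sources
}
```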

We should also prepare the software to boldly lead the way on new internet
standards. Crawling and indexing non-HTTP data sources (Gemini? Man pages?
Linux distribution repositories?), supporting non-traditional network stacks
(Tor? Yggdrasil? cjdns?) and third-party name systems (OpenNIC?), and anything
else we could leverage our influence to give a leg up on.
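
One way to leave the door open for those sources is to hide the transport
behind a small interface, so the crawler never cares which protocol a document
arrived over. This is a sketch under that assumption; the `Fetcher` name and
the stub below are mine, not a settled design.

```go
package crawler

import "errors"

// Fetcher abstracts the transport: the crawler asks for a document and the
// links it references, whether it lives on HTTP, Gemini, or elsewhere.
type Fetcher interface {
	Fetch(url string) (body []byte, links []string, err error)
}

// geminiFetcher is a placeholder; a real one would speak the Gemini
// protocol over TLS and parse text/gemini link lines.
type geminiFetcher struct{}

func (geminiFetcher) Fetch(url string) ([]byte, []string, error) {
	return nil, nil, errors.New("gemini: not implemented")
}

// fetchers maps URL schemes to transports; supporting a new protocol means
// adding an entry here, not touching the crawler or the indexer.
var fetchers = map[string]Fetcher{
	"gemini": geminiFetcher{},
}
```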

There's a *ton* of potential in this domain which is just sitting on the floor
right now. The main problem is: who's going to pay for it? Advertisements or
paid results are *not* going to fly — conflict of interest. Private, paid
access to search APIs or index internals is one opportunity, but it's kind of
shit and I think that preferring open data access and open APIs would be
exceptionally valuable for the community.

If SourceHut eventually grows in revenue — at least 5-10× its
[present revenue][financial report] — I intend to sponsor this as a public
benefit project, with no plans for generating revenue. I am not aware of any
monetization approach for a search engine which squares with my ethics and
doesn't fundamentally undermine the mission. So, if no one else has figured it
out by the time we have the resources to take it on, we'll do it.

[financial report]: https://sourcehut.org/blog/2020-11-11-sourcehut-q3-2020-financial-report/