---
title: We can do better than DuckDuckGo
date: 2020-11-17
---

[DuckDuckGo](https://duckduckgo.com) is one of the long-time darlings of the
technophile's pro-privacy recommendations, and in fact the search engine that I
use myself on the daily. They certainly present a more compelling option than
many of the incumbents, like Google or Bing. Even so, DuckDuckGo is not good
enough, and we ought to do better.

I have three grievances with DuckDuckGo:

1. **It's not open source.** Almost all of DDG's software is proprietary, and
   they've demonstrated [gross incompetence][github] in privacy in what little
   software they have made open source. Who knows what else is going on in the
   proprietary code?
2. **DuckDuckGo is not a search engine**. It's more aptly described as a search
   engine frontend. They *do* handle features like bangs and instant answers
   internally, but their actual search results come from third parties like
   Bing. They don't operate a crawler for their search results, and are not
   independent.
3. **The search results suck!** The authoritative sources for anything I want to
   find are almost always buried beneath 2-5 results from content scrapers and
   blogspam. This is also true of other search engines like Google. Search
   engines are highly vulnerable to abuse and they aren't doing enough to
   address it.

[github]: https://github.com/duckduckgo/Android/issues/527

There are some FOSS attempts to do better here, but they all fall flat.
[searX](https://github.com/bauruine/searx/) is also a false search engine
— that is, they serve someone else's results. [YaCy](https://yacy.net/)
has its own crawler, but the distributed design makes results intolerably
slow, poor quality, and vulnerable to abuse, and it's missing strong central
project leadership.

We need a real, working FOSS search engine, complete with its own crawler.
Here's how I would design it.

First, YaCy-style decentralization is *way* too hard to get right, especially
when a search engine project already has a lot of Very Hard problems to solve.
Federation is also very hard in this situation — queries will have to
consult *most* instances in order to get good quality results, or a novel
sharding algorithm will have to be designed, and either approach will have to be
tolerant of nodes appearing and disappearing at any time. Not to mention it'd be
slow! Several unsolved problems with federation and decentralization would have
to be addressed on top of building a search engine in the first place.

So, a SourceHut-style approach is better. 100% of the software would be free
software, and third parties would be encouraged to set up their own
installations. It would use standard protocols and formats where applicable, and
accept patches from the community. However, the database would still be
centralized, and even if programmatic access were provided, it would not be with
an emphasis on decentralization or shared governance. It might be possible to
design tools which help third parties bootstrap their indexes, and create a
community of informal index sharing, but that's not the focus here.

It would also need its own crawler, and probably its own indexer. I'm not
convinced that any of the existing FOSS solutions in this space are quite right
for this problem. Crucially, I would *not* have it crawling the entire web from
the outset. Instead, it should crawl a whitelist of domains, or "tier 1"
domains. These would be limited mainly to authoritative or high-quality
sources for their respective specializations, and would be weighted upwards in
search results. Pages that these sites link to would be crawled as well, and
given tier 2 status, recursively up to an arbitrary N tiers. Users who want to
find, say, a blog post about a subject rather than the documentation on that
subject, would have to be more specific: "$subject blog posts".

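To make the tiering concrete, here's a minimal sketch in Python of how the
crawler might assign tiers, assuming a hypothetical `fetch_links()` helper that
downloads a page and returns the URLs it links to. The example whitelist, the
choice of N=3, and the 1/tier ranking weight are all illustrative, not a spec.

```python
import urllib.parse
from collections import deque

# Hypothetical tier-1 whitelist; a real deployment would maintain a much
# longer, curated list of authoritative sources.
TIER1_DOMAINS = {"docs.python.org", "man7.org", "datatracker.ietf.org"}
MAX_TIER = 3  # the arbitrary N: stop following links beyond this tier

def crawl(fetch_links):
    """Breadth-first crawl outward from the tier-1 whitelist."""
    tiers = {}  # url -> best (lowest) tier seen so far
    queue = deque(("https://" + d + "/", 1) for d in sorted(TIER1_DOMAINS))
    while queue:
        url, tier = queue.popleft()
        if tier > MAX_TIER:
            continue
        if url in tiers and tiers[url] <= tier:
            continue  # already reached this page at an equal or better tier
        tiers[url] = tier
        for link in fetch_links(url):
            # Links from a tier-N page get tier N+1, unless the target is
            # itself on the whitelist, in which case it stays tier 1.
            domain = urllib.parse.urlsplit(link).netloc
            queue.append((link, 1 if domain in TIER1_DOMAINS else tier + 1))
    return tiers

def tier_weight(tier):
    # Illustrative ranking boost: a tier-1 page counts for twice as much
    # as a tier-2 page, three times as much as a tier-3 page, and so on.
    return 1.0 / tier
```

A real crawler would of course fetch asynchronously, respect robots.txt, and
persist its frontier; the point is that the tier logic is a small layer on top
of an ordinary crawl.
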
An advantage of this design is that it would be easy for anyone to take the
software stack and plop it on their own servers, with their own whitelist of
tier 1 domains, to easily create a domain-specific search engine. Independent
groups could create search engines which specialize in academia, open standards,
specific fandoms, and so on. They could tweak their precise approach to
indexing, tokenization, and so on to better suit their domain.

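As a sketch of how little a third party might have to change, an instance's
whole specialization could live in one small configuration. Every knob below is
hypothetical, invented for illustration; none of it is an existing project's
API.

```python
# Hypothetical configuration for a self-hosted, academia-focused instance.
CONFIG = {
    "tier1_domains": [
        "arxiv.org",
        "dl.acm.org",
        "link.springer.com",
    ],
    "max_tier": 2,  # stay close to the authoritative sources
    "tier_weight": lambda tier: 1.0 / tier,  # boost tier-1 results
    # A domain-specific tokenizer might keep strings like "O(n log n)"
    # or "p < 0.05" intact instead of splitting them into noise.
    "tokenizer": "scientific",
}
```
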
We should also prepare the software to boldly lead the way on new internet
standards: crawling and indexing non-HTTP data sources (Gemini? Man pages?
Linux distribution repositories?), supporting non-traditional network stacks
(Tor? Yggdrasil? cjdns?) and third-party name systems (OpenNIC?), and anything
else we could use our influence to give a leg up.

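As a rough sketch of what that could mean for the crawler, fetching might
dispatch on the URL scheme so that non-HTTP sources plug in alongside the web.
The gemini and man branches below are stubs; none of this reflects an existing
codebase.

```python
import urllib.parse
import urllib.request

def fetch(url):
    """Fetch a document without assuming the source speaks HTTP."""
    scheme = urllib.parse.urlsplit(url).scheme
    if scheme in ("http", "https"):
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    if scheme == "gemini":
        # Stub: a real implementation would speak the Gemini protocol
        # (TLS, a single request line, a status header) on port 1965.
        raise NotImplementedError("gemini crawling not implemented")
    if scheme == "man":
        # Stub: man pages could be indexed straight from the local
        # filesystem or from a distribution's package archives.
        raise NotImplementedError("man page indexing not implemented")
    raise ValueError("unsupported scheme: " + scheme)
```
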
There's a *ton* of potential in this domain which is just sitting on the floor
right now. The main problem is: who's going to pay for it? Advertisements or
paid results are *not* going to fly — conflict of interest. Private, paid
access to search APIs or index internals is one opportunity, but it's kind of
shit and I think that preferring open data access and open APIs would be
exceptionally valuable for the community.

If SourceHut eventually grows in revenue — at least 5-10× its
[present revenue][financial report] — I intend to sponsor this as a public
benefit project, with no plans for generating revenue. I am not aware of any
monetization approach for a search engine which squares with my ethics and
doesn't fundamentally undermine the mission. So, if no one else has figured it
out by the time we have the resources to take it on, we'll do it.

[financial report]: https://sourcehut.org/blog/2020-11-11-sourcehut-q3-2020-financial-report/