drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

---
date: 2017-08-13
layout: post
title: When not to use a regex
tags: [regex]
---

The other day, I saw [Learn regex the easy
way](https://github.com/zeeshanu/learn-regex). This is a great resource, but I
felt the need to pen a post explaining that regexes are usually not the right
approach.

Let's do a little exercise. I googled "URL regex" and here's the first Stack
Overflow result:

```
https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)
```

<p style="text-align: right">
<small><a href="https://stackoverflow.com/a/3809435/1191610">source</a></small>
</p>

This is a bad regex. Here are some valid URLs that this regex fails to match:

- http://x.org
- http://nic.science
- http://名がドメイン.com (warning: this is a parked domain)
- http://example.org/url,with,commas
- https://en.wikipedia.org/wiki/Harry_Potter_(film_series)
- http://127.0.0.1
- http://[::1] (IPv6 loopback)

Here are some invalid URLs the regex is fine with:

- http://exam..ple.org
- http://--example.org

This answer has been revised 9 times on Stack Overflow, and this is the best
they could come up with. Go back and read the regex. Can you tell where each of
these bugs is? How long did it take you? If you received a bug report in your
application because one of these URLs was handled incorrectly, would you
understand this regex well enough to fix it? If your application has a URL
regex, go find it and see how it fares with these tests.

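If you want to run these tests against a regex of your own, a small harness
helps. The sketch below feeds the URLs listed above to the Stack Overflow
pattern using Python's `re.fullmatch`. The names `URL_REGEX`, `VALID`,
`INVALID` and `matches` are mine, and it assumes you want the whole string to
match rather than some substring of it.

```python
import re

# The Stack Overflow pattern quoted above, unchanged (split over two lines).
URL_REGEX = re.compile(
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b"
    r"([-a-zA-Z0-9@:%_\+.~#?&//=]*)"
)

# Valid URLs from the list above.
VALID = [
    "http://x.org",
    "http://nic.science",
    "http://名がドメイン.com",
    "http://example.org/url,with,commas",
    "https://en.wikipedia.org/wiki/Harry_Potter_(film_series)",
    "http://127.0.0.1",
    "http://[::1]",
]

# Invalid URLs from the list above.
INVALID = [
    "http://exam..ple.org",
    "http://--example.org",
]


def matches(url):
    """Return True if the regex accepts the entire string."""
    return URL_REGEX.fullmatch(url) is not None


false_negatives = [u for u in VALID if not matches(u)]
false_positives = [u for u in INVALID if matches(u)]

print("valid but rejected:", false_negatives)
print("invalid but accepted:", false_positives)
```

Swap in your own pattern and extend the two lists with whatever your users
actually send you.
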
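Complicated regexes are opaque, unmaintainable, and often wrong. The correct
approach to validating a URL is to hand it to a URL parser:

```python
from urllib.parse import urlparse

def is_url_valid(url):
    try:
        parts = urlparse(url)
    except ValueError:
        # Raised for malformed input such as an unclosed IPv6 bracket.
        return False
    # urlparse is lenient, so also require a scheme and a host.
    return bool(parts.scheme and parts.netloc)
```
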
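To see what the parser buys you, here is a rough sketch that pulls a few of the
troublesome URLs from earlier apart with `urlparse`. The variable names are
mine. Keep in mind that `urlparse` is forgiving: it tells you what the pieces
of a URL are, not whether the host exists.

```python
from urllib.parse import urlparse

# URLs the regex rejected; the parser splits them without complaint.
ipv6 = urlparse("http://[::1]")
ipv4 = urlparse("http://127.0.0.1")
commas = urlparse("http://example.org/url,with,commas")
wiki = urlparse("https://en.wikipedia.org/wiki/Harry_Potter_(film_series)")

print(ipv6.hostname)   # host with the brackets stripped
print(ipv4.hostname)
print(commas.path)
print(wiki.path)

# An unclosed IPv6 bracket is one of the few inputs urlparse rejects outright.
try:
    urlparse("http://[::1")
    rejected = False
except ValueError:
    rejected = True

print("malformed IPv6 rejected:", rejected)
```
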
A regex is useful for validating *simple* patterns and for *finding* patterns in
text. For anything beyond that, it's almost certainly a terrible choice. Say you
want to...

**validate an email address**: try to send an email to it!

**validate password strength requirements**: estimate the complexity with
[zxcvbn](https://github.com/dropbox/zxcvbn)!

**validate a date**: use your standard library!
[datetime.datetime.strptime](https://docs.python.org/3.6/library/datetime.html#datetime.datetime.strptime)

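A minimal sketch of the date case, assuming ISO-style `YYYY-MM-DD` input (the
function name `is_date_valid` and the default format are my choices, not
anything the standard library mandates):

```python
from datetime import datetime


def is_date_valid(text, fmt="%Y-%m-%d"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # strptime raises ValueError for both bad syntax and impossible dates.
        return False


print(is_date_valid("2017-08-13"))
print(is_date_valid("2017-02-29"))  # 2017 was not a leap year
```

A pattern like `\d{4}-\d{2}-\d{2}` would happily accept February 29th, 2017.
The parser knows about month lengths and leap years so you don't have to.
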
**validate a credit card number**: run the [Luhn
algorithm](https://en.wikipedia.org/wiki/Luhn_algorithm) on it!

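The Luhn check is short enough to write by hand. Here is one way it might look;
the name `luhn_valid` and the decision to tolerate spaces and hyphens are my
assumptions. Passing the check only means the digits are self-consistent, not
that the card exists.

```python
def luhn_valid(number):
    """Return True if the digit string passes the Luhn checksum."""
    # Assumption: spaces and hyphens are allowed as separators.
    cleaned = number.replace(" ", "").replace("-", "")
    if not cleaned.isascii() or not cleaned.isdigit():
        return False

    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("79927398713"))          # example number from Wikipedia
print(luhn_valid("4111 1111 1111 1111"))  # a common test card number
```
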
**validate a social security number**: alright, use a regex. But don't expect
the number to be assigned to someone until you ask the Social Security
Administration about it!

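This is the kind of *simple* pattern a regex handles well. A sketch, assuming
the hyphenated `AAA-GG-SSSS` form (the names `SSN_REGEX` and `looks_like_ssn`
are mine):

```python
import re

# Three digits, two digits, four digits, separated by hyphens.
SSN_REGEX = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def looks_like_ssn(text):
    """Return True if text has the shape of an SSN; says nothing about validity."""
    return SSN_REGEX.fullmatch(text) is not None


print(looks_like_ssn("078-05-1120"))
print(looks_like_ssn("078051120"))
```

You can read that pattern at a glance, which is the point.
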
Get the picture?