Skip to content

Data.String.CodePoints.uncons probably isn't constant-time #120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hdgarrood opened this issue Oct 15, 2019 · 4 comments
Closed

Data.String.CodePoints.uncons probably isn't constant-time #120

hdgarrood opened this issue Oct 15, 2019 · 4 comments

Comments

@hdgarrood
Copy link
Contributor

Here's the documentation for uncons:

-- | Returns a record with the first code point and the remaining code points
-- | of the string. Returns Nothing if the string is empty. Operates in
-- | constant space and time.
-- |
-- | ```purescript
-- | >>> uncons "𝐀𝐀 c 𝐀"
-- | Just { head: CodePoint 0x1D400, tail: "𝐀 c 𝐀" }
-- | >>> uncons ""
-- | Nothing
-- | ```
-- |
uncons :: String -> Maybe { head :: CodePoint, tail :: String }

I would have expected this to be O(n), because you need to copy almost the entirety of the argument string into the tail field of the result. We should probably verify this before changing it.

@hdgarrood
Copy link
Contributor Author

cc @michaelficarra, I'd appreciate any input you might have here.

@michaelficarra
Copy link
Contributor

@hdgarrood I don't work on any of the big JavaScript engines, but I'm pretty sure any reasonably-performant JavaScript engine will implement .slice(...) on strings without copying. Remember, strings are immutable in JavaScript. I don't think it's unreasonable to impose these complexity constraints on other backends either. If their language's native/convenient strings don't have this property (likely because they're mutable), they should back PureScript strings by a library implementation that does.

@hdgarrood
Copy link
Contributor Author

Oh of course, that’s good to hear. Thanks!

@hdgarrood
Copy link
Contributor Author

I just tried to verify this with the following program:

module Main where

import Prelude
import Data.Array as Array
import Data.Foldable (for_, foldMap)
import Data.Int as Int
import Data.Maybe (fromMaybe, maybe, Maybe(..))
import Data.String as String
import Data.Unfoldable (replicateA)
import Effect (Effect)
import Effect.Console (log, logShow)
import Effect.Random (randomInt)
import Performance.Minibench (bench)

main :: Effect Unit
main = do
  log "testing (should print 15):"
  logShow (digitSum "lol 1234 hahaha 5")

  let inputLengths = (map (_ * 100_000) (Array.range 1 10))
  for_ inputLengths \l -> do
    log ""
    log ("string of length " <> show l)
    s <- genRandomString l
    bench \_ -> digitSum s

sampler :: String
sampler = "abcdefg1234567890"

genRandomString :: Int -> Effect String
genRandomString resultLength =
  let
    len = String.length sampler
  in
    map (foldMap (maybe "" String.singleton) :: Array _ -> _)
      $ replicateA resultLength do
          ix <- randomInt 0 (len - 1)
          pure $ Array.index (String.toCodePointArray sampler) ix

-- | Sum all digits appearing in the input string.
digitSum :: String -> Int
digitSum = go 0
  where
  go acc s =
    case String.uncons s of
      Just { head, tail } ->
        let
          value =
            fromMaybe 0
              $ Int.fromString
              $ String.singleton head
        in
          go (value + acc) tail
      Nothing ->
        acc

Here are the results on Node v12.4.0:

testing (should print 15):
15

string of length 100000
mean   = 11.67 ms
stddev = 495.70 μs
min    = 11.37 ms
max    = 25.32 ms

string of length 200000
mean   = 23.62 ms
stddev = 438.12 μs
min    = 23.00 ms
max    = 31.32 ms

string of length 300000
mean   = 35.22 ms
stddev = 594.21 μs
min    = 34.35 ms
max    = 50.18 ms

string of length 400000
mean   = 53.78 ms
stddev = 3.28 ms
min    = 47.85 ms
max    = 61.54 ms

string of length 500000
mean   = 75.39 ms
stddev = 779.25 μs
min    = 73.91 ms
max    = 82.14 ms

string of length 600000
mean   = 86.23 ms
stddev = 5.49 ms
min    = 76.61 ms
max    = 108.24 ms

string of length 700000
mean   = 113.51 ms
stddev = 1.19 ms
min    = 110.47 ms
max    = 123.14 ms

string of length 800000
mean   = 114.60 ms
stddev = 7.41 ms
min    = 100.95 ms
max    = 149.26 ms

string of length 900000
mean   = 137.33 ms
stddev = 8.73 ms
min    = 121.62 ms
max    = 156.14 ms

string of length 1000000
mean   = 155.63 ms
stddev = 10.15 ms
min    = 137.15 ms
max    = 175.06 ms

which seems close enough to linear, which is what we'd expect with a constant-time uncons.

Here's a graph:
Screenshot from 2020-02-23 02-33-05

So I'm happy to close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants