Skip to content

Offer Int32Map and Int64Map #355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
treeowl opened this issue Oct 24, 2016 · 17 comments
Open

Offer Int32Map and Int64Map #355

treeowl opened this issue Oct 24, 2016 · 17 comments

Comments

@treeowl
Copy link
Contributor

treeowl commented Oct 24, 2016

Someone suggested this weekend that we should really offer both 32-bit and 64-bit maps regardless of architecture, and I thoroughly agree. The same applies to IntSet, of course.

@AndreasPK
Copy link
Contributor

AndreasPK commented Sep 28, 2022

I don't think there is much benefit in 32 bit maps.

  • On 32bit they behave exactly the same as IntMap.
  • On 64bit they might have a tiny perf benefit through smaller instruction encodings. But that is unlikely to be worth the cost of having one more map type unless one uses backpack or similar to implement them. But even then it's possible the added code size makes up for any benefit in this area.

For Int64 it looks different though, since we can't use a IntMap on 32bit platforms.


If that approach is deemed acceptable I might add a Int64Map (by taking the IntMap implementation and changing the types). It looks like we will need one for GHC in the near future and it would be good to be able to use containers for this.

PS: If there is a strong demand for a Int32Map and it's shown to be worth the cost it might still be good. I just personally don't think it will have enough of a benefit to look into that myself.

@noughtmare
Copy link

noughtmare commented Jun 28, 2023

I've converted IntSet and IntMap to Word64Set and Word64Map as part of this GHC MR:

https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10568

It would require some polishing before upstreaming it. I might have time for it soon, but I'd gladly hand it over if there are other volunteers.

@treeowl
Copy link
Contributor Author

treeowl commented Jun 28, 2023

I would definitely want these, but how can we do it without increasing the maintenance burden too much?

@noughtmare
Copy link

noughtmare commented Jun 28, 2023

That depends on what you consider to be too much. The simplest way to upstream these is to just copy the files, but I get the impression you consider adding ~7000 lines of code as adding too much burden, which is a fair point.

But I don't see any easy solutions. The code is very similar in many places, but the types are pretty much all different.

I could imagine making a poor man's backpack by using something like #define SET IntSet or #define SET Word64Set to generate two modules from the same file. But I don't know if I would consider that much easier to maintain.

@treeowl
Copy link
Contributor Author

treeowl commented Jun 28, 2023

Just throwing an idea around: once our GHC lower bound rises high enough, we'll be able to have multiple "libraries" in the package. Will this give us enough power to have the necessary type definitions in a module that has different contents for the different libraries?

@treeowl
Copy link
Contributor Author

treeowl commented Jun 28, 2023

To be clear, my objection isn't the amount of code, per se, but the duplication. Having to make every change in triplicate is pretty darn annoying. The set/map split is bad enough.

@noughtmare
Copy link

noughtmare commented Jun 28, 2023

@Bodigrim suggested we use CPP following other boot libraries like filepath. That seems like a promising approach. Concretely filepath defines System.OsPath like this:

{-# LANGUAGE CPP #-}

#define FILEPATH_NAME OsPath
#define OSSTRING_NAME OsString
#define WORD_NAME OsChar

...

#include "OsPath/Common.hs"

And then in OsPath/Common.hs there is code like this:

...

splitSearchPath :: OSSTRING_NAME -> [FILEPATH_NAME]
splitSearchPath (OSSTRING_NAME x) = fmap OSSTRING_NAME . C.splitSearchPath $ x

...

@treeowl
Copy link
Contributor Author

treeowl commented Jun 28, 2023

How does that affect source links in the Haddocks? I would expect it to destroy them, which is ... not great.

@noughtmare
Copy link

noughtmare commented Jun 28, 2023

Yes, that seems like a disadvantage, e.g. this link doesn't point to anything useful: https://hackage.haskell.org/package/filepath-1.4.100.3/docs/src/System.OsPath.Posix.html#extSeparator

@treeowl
Copy link
Contributor Author

treeowl commented Jun 28, 2023

Ugh. Using the (long-standing) private library support is enough for what I was talking about (no need for the new multiple public libraries), but there's a big problem: the names of the modules, and more importantly the names of the types won't work out right without some CPP. I don't know just how bad that is in practice; probably better than what filepath does.

@meooow25
Copy link
Contributor

I've converted IntSet and IntMap to Word64Set and Word64Map as part of this GHC MR:

https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10568

Since it is unclear if this will make it into containers, are there plans to keep GHC's copy in sync with containers? Otherwise GHC is cut off from optimizations and bug fixes made here.


IMO the easiest way to get Word64Map on 32-bit is to nest IntMaps.

newtype Word64Map a = Word64Map (IntMap (IntMap a))

The top map is indexed by the high 32 bits of the Word64 and any inner map is indexed by the low 32 bits. Map operations should not be too hard to write.

high, low :: Word64 -> Int

lookup :: Word64 -> Word64Map a -> Maybe a
lookup k (Word64Map m) = IM.lookup (high k) m >>= IM.lookup (low k)

union :: Word64Map a -> Word64Map a -> Word64Map a
union (Word64Map m1) (Word64Map m2) = Word64Map (IM.unionWith IM.union m1 m2)

...

Admittedly this would be slightly slower compared to a directly implemented version, but I don't expect it to be too bad.

Not sure if this was considered for GHC, but it is simple and allows benefiting from development in containers.

(cc @noughtmare)

@noughtmare
Copy link

Whatever we decide to do, we should do it in containers so that everyone can benefit. I think we should only modify GHC's Word64Map if we discover critical issues. I had not considered the nested IntMap solution.

So, I think there are four options:

  1. Do nothing
  2. Copy Word64Map from GHC to containers (duplicates code, higher maintenance burden)
  3. Use internal libraries and/or CPP (this requires newer cabal version and/or obfuscates source links in Haddock)
  4. Use nested IntMaps (unknown performance hit)

I don't have time to champion any of these solutions in the near future, but I'm willing to help anyone who does.

@treeowl
Copy link
Contributor Author

treeowl commented Aug 17, 2024

Nested IntMaps are twice as deep. Not a bad hack by any means, but probably substantially slower.

@meooow25
Copy link
Contributor

meooow25 commented Aug 18, 2024

I think we should only modify GHC's Word64Map if we discover critical issues.

I see, that's unfortunate.

Whatever we decide to do, we should do it in containers so that everyone can benefit.

Are there parties other than GHC who would like to see this?

Nested IntMaps are twice as deep. Not a bad hack by any means, but probably substantially slower.

It's not twice as deep, it's deeper by exactly one. In other words, a path from root to leaf passes through only one extra step, the Tip of the root map which links to an inner map. That's why I expect it to be only "slightly slower".

@treeowl
Copy link
Contributor Author

treeowl commented Aug 18, 2024

It's not twice as deep, it's deeper by exactly one. In other words, a path from root to leaf passes through only one extra step, the Tip of the root map which links to an inner map. That's why I expect it to be only "slightly slower".

I think you're right; I didn't think that through enough. But can it be two deeper sometimes, depending on where the branches are?

@meooow25
Copy link
Contributor

meooow25 commented Aug 18, 2024

But can it be two deeper sometimes, depending on where the branches are?

I'm not seeing it, do you have a concrete example perhaps?

That's why I expect it to be only "slightly slower".

I'm beginning to realize that it may even be faster, since we are on a 32-bit system.

  • All comparisons and bitwise ops will be 32-bit, not 64-bit, except for the part where we split with 64-bit word.
  • The data stored at the Bins and Tips will be 32-bit, not 64-bit. Taking into account Optimize IntMap.Bin #995, GHC's Word64Map is currently storing 4 words of data (prefix, mask) in Bins and 2 words in Tips instead of just 1 in both, which is 4 excess words per element. Now there is a cost of the root map Tips. If the data is dense, which is likely the case for GHC since it was using an IntMap until recently, this is only 3 words per 2^32 elements.

It looks like a promising idea to test out on GHC with a 32-bit system, should someone volunteer.

@noughtmare
Copy link

I see, that's unfortunate.

Why? Ah, I see you've recently made some contributions that could potentially also benefit GHC. Maybe it would be worthwhile to port over some of them. Perhaps it would be useful to collect a number of those changes and port them over at the same time.

I should mention that this is my personal opinion and does not reflect the opinion of the GHC team. And I'm no longer a very active contributor to (that part of) GHC. Perhaps @mpickering could chime in.

Are there parties other than GHC who would like to see this?

This issue wasn't opened by GHC, but perhaps GHC is the first that actually ran into issues with IntMap on 32-bit platforms.

It looks like a promising idea to test out on GHC with a 32-bit system, should someone volunteer.

For people who do want to try this, you don't need a 32-bit system. You can instead use ghc-docker-jobs.sh with the i386-linux-deb12-validate job.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants