Skip to content

Generate better workspace ids #2947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from
Closed

Generate better workspace ids #2947

wants to merge 0 commits into from

Conversation

svenefftinge
Copy link
Member

@svenefftinge svenefftinge commented Jan 16, 2021

Generates workspace names such as "pink-panda-15x7r9b3".
I.e. [color]-[animal]-[0-9a-z]{8}

@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 17, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.2

@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 17, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.4

@svenefftinge svenefftinge force-pushed the se/workspace-id branch 4 times, most recently from 25a7485 to a17624c Compare January 17, 2021 11:53
@svenefftinge svenefftinge changed the title [draft] generate better workspace ids Generate better workspace ids Jan 17, 2021
@svenefftinge svenefftinge marked this pull request as ready for review January 17, 2021 12:05
@csweichel
Copy link
Contributor

csweichel commented Jan 18, 2021

First of: I love the idea of using friendlier workspace IDs. That said, a few remarks:

  • The assumption that workspace IDs are UUIDv4 is buried deep within the system and encoded in several regexp.
    Of the top of my head, at least dev-staging ingress, proxy and ws-proxy make that assumption.
  • What names does the library produce? Are they all clean or do we have to expect some workspaces with potentially harmful names, e.g. racist terms (even something that has a particular cultural meaning which the library produced arbitrarily)?
  • This library would be one of the most single important dependencies we have. How much do we trust it? For comparison: uuid has ~45e6 weekly downloads, unique-name-generator has ~20e3. Three orders of magnitude more.
  • Collision frequency: the adjectives (1400), colours (50) and animals (350) produce a total of 24500000 unique combinations which is considerably less than UUIDv4. E.g. If we start 1500000 workspaces over the course of one year, we'd expect a (1.5e6 - 1) / 24.5e6 = 0.061, a 6.1% chance of a collision. Such a collision would break things in pretty unforeseeable ways. For comparison: the collision probability of UUIDv4 with the same amount of workspaces is (1.5e6 - 1) / pow(2, 60) = 1.3e-12, a chance of 0.000000000013%.

@svenefftinge
Copy link
Member Author

The assumption that workspace IDs are UUIDv4 is buried deep within the system and encoded in several regexp. Of the top of my head, at least dev-staging ingress, proxy and ws-proxy make that assumption.

This PR already updates most of them, i.e. the dev staging previews work. But please help to spot places I missed.

What names does the library produce? Are they all clean or do we have to expect some workspaces with potentially harmful names, e.g. racist terms (even something that has a particular cultural meaning which the library produced arbitrarily)?
This library would be one of the most single important dependencies we have. How much do we trust it? For comparison: uuid has ~45e6 weekly downloads, unique-name-generator has ~20e3. Three orders of magnitude more.

So you are proposing to review and copy the dictionaries into our repo to have more control?

Collision frequency: the adjectives (1400), colours (50) and animals (350) produce a total of 24500000 unique combinations which is considerably less than UUIDv4. E.g. If we start 1500000 workspaces over the course of one year, we'd expect a (1.5e6 - 1) / 24.5e6 = 0.061, a 6.1% chance of a collision. Such a collision would break things in pretty unforeseeable ways. For comparison: the collision probability of UUIDv4 with the same amount of workspaces is (1.5e6 - 1) / pow(2, 60) = 1.3e-12, a chance of 0.000000000013%.

That's why this PR adds 8 characters from the uuid4. Maybe we should include generate our own sequence here, that even includes all lower case alpabetical letters.

@csweichel
Copy link
Contributor

This PR already updates most of them, i.e. the dev staging previews work. But please help to spot places I missed.

Places that come to mind are:

The proxy places seem generous already:

So you are proposing to review and copy the dictionaries into our repo to have more control?

That would certainly solve the problem. It would also allow us to generate those kinds of IDs not just from TypeScript. E.g. loadgen or the integration tests also need to generate workspace IDs. We could provide a port of this library in common-go and use the new Go 1.16 embed functionality or fall back to something like go.rice.

That's why this PR adds 8 characters from the uuid4. Maybe we should include generate our own sequence here, that even includes all lower case alpabetical letters.

Sorry, I did miss that. That solves the issue nicely and reduces the collision likelihood considerably: (1.5e6 - 1) / (50*1400*350*(36**8)) => 2.1702255845396105e-14

@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 22, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.9

Copy link
Contributor

@gtsiolis gtsiolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see this happening! 🐼

while (uuid.charAt(0).match("[0-9]") != null) // No numbers as first char, as we use this id as DNS name
return uuid
const randomName: string = uniqueNamesGenerator({
dictionaries: [adjectives, colors, animals],
Copy link
Contributor

@gtsiolis gtsiolis Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: I think using only colors and animals in conjunction with random hash in the end could suffice. This can make items more memorable for the user. 💭

CURRENT PROPOSED
ambitious-blue-hamster-155799b3 blue-hamster-155799b3

question: Even in the rare case of having the same pair being generated the hash can still be different, right?

Plus, given the ephemeral nature of environments that should not be a problem. In the end, we just need to introduce enough entropy to avoid as much as possible having the same color-animal pair being used.


idea: We could use later on these generated names for creating shorter URLs if we want, which could be useful or introduce friendlier URLs to share environments, snapshots, and more. See also #2905.

Environment URL (Example) (BEFORE) Environment URL (Example) (AFTER)
b9b5e28a-9198-4f4d-8e85-07c864d88026.ws-eu03.gitpod.io blue-hamster-155799b3.gitpod.io

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: I think using only colors and animals in conjunction with random hash in the end could suffice. This can make items more memorable for the user. 💭

Yes, let's do that.

Plus, given the ephemeral nature of environments that should not be a problem. In the end, we just need to introduce enough entropy to avoid as much as possible having the same color-animal pair being used.

Environments are not technically ephemeral and we don't want to test whether a generated ID is unique, because this would be expensive. But I think we can add sufficient characters to the suffix. So the first two names are for humans and the rest is to make them technically unique.

@svenefftinge
Copy link
Member Author

I'm also now implementing a name generator in typescript and go, which uses the same ver simple pattern:
[color]-[animal]-[a-z0-9]{8}

@svenefftinge svenefftinge force-pushed the se/workspace-id branch 4 times, most recently from 9499b68 to 77ce9c7 Compare January 22, 2021 18:23
@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 23, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.14

@svenefftinge svenefftinge force-pushed the se/workspace-id branch 6 times, most recently from 71ea69b to 3167c26 Compare January 23, 2021 11:10
@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 23, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.21

@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 23, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.22

@svenefftinge
Copy link
Member Author

Ready for review

Copy link
Contributor

@csweichel csweichel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few more questions/thoughts:

  • Say we brought this into production, what's the effect on existing workspaces? Would they still work/start considering that they'd have old UUID and the patterns changed?
  • Considering the entity type changes, I expected a DB migration.


func TestGenerateWorkspaceID(t *testing.T) {

t.Run(fmt.Sprintf("check names are valid"), func(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for the t.Run. The outer t is already a valid test. The t.Run would indicate a sub test,

// Licensed under the GNU Affero General Public License (AGPL).
// See License-AGPL.txt in the project root for license information.

package util
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

util a bit too generic as package name (see https://blog.golang.org/package-names).
How about:

  • uidgen
  • wsid
  • wsidgen
  • namegen

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like namegen best

"time"
)

// UnmarshalJSON parses the duration to a time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// UnmarshalJSON parses the duration to a time.Duration
// GenerateWorkspaceID generates a new workspace ID by randomly choosing
// a color, an animal and a number of characters.

import { Transformer } from "../transformer";

@Entity()
export class DBLayoutData implements LayoutData {

@PrimaryColumn(TypeORM.UUID_COLUMN_TYPE)
@PrimaryColumn("varchar")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original type hat a fixed length. Considering how important the workspace ID is, I think it would make sense to introduce a specific WORKSPACE_ID_TYPE type, akin to UUID_COLUMN_TYPE.

Otherwise we'll end up with new tables that use VARCHAR(255) for workspace IDs in the future.

Also, considering that workspace IDs have roughly the same char length, we would probably see better performance if we used CHAR columns (like we have in the past). See e.g. https://dba.stackexchange.com/questions/424/performance-implications-of-mysql-varchar-sizes/1915#1915

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We should keep the type (char 36). I'll just define a new column type TypeORM.WORKSPACE_ID_COLUMN_TYPE with the same technical type mapping.

Copy link
Member Author

@svenefftinge svenefftinge Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think migrations are needed then.


var characters = strings.Split("abcdefghijklmnopqrstuvwxyz0123456789", "")

var colors = []string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering we're maintaining this list in two places, I wonder if would make sense to keep this as txt file and use something like go generate/embed/go.rice in Go, load the text file on startup in TS.

This way we'd prevent those lists from drifting apart over time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect changes in those lists. So maybe this is something we should invest in once we start changing those lists (or drift becomes a burden).

@svenefftinge svenefftinge force-pushed the se/workspace-id branch 4 times, most recently from dfa4c7e to 92ea403 Compare January 25, 2021 08:42
@geropl
Copy link
Member

geropl commented Jan 25, 2021

Sorry for being late here, but:

  • Are we making sure the 8 characters do not contain the metadata bits of uuidv4? Not sure where those are located, but assume at the beginning.
  • Did you consider adding a nice name in addition to the technical id? IMO this would have made things drastically simpler (only the URL-> backend lookup(niceId -> UUID)) part needs to be updated, not the whole system incl. DB).

@svenefftinge
Copy link
Member Author

We discussed this in a video call, but just for transparency:

  • Are we making sure the 8 characters do not contain the metadata bits of uuidv4? Not sure where those are located, but assume at the beginning.

There was a missunderstanding that the new ID would use or even comply with uuids. They do not.

  • Did you consider adding a nice name in addition to the technical id? IMO this would have made things drastically simpler (only the URL-> backend lookup(niceId -> UUID)) part needs to be updated, not the whole system incl. DB).

We talked this through and decided that having two IDs (technical and human readable) wouldn't be very helpful, because the requirements of uniqueness would apply to both equally. Furthermore, all existing logic around the form of the ID would need to be updated because it sits around the URLs which would obviously contain the human-readable ID.

@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 26, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.28

@svenefftinge
Copy link
Member Author

svenefftinge commented Jan 26, 2021

/werft run

👍 started the job as gitpod-build-se-workspace-id.30

@geropl
Copy link
Member

geropl commented Jan 26, 2021

@csweichel Could you help check the likelihood of a collision in the situation of same seed used in multiple instances?

@@ -18,6 +18,10 @@ export class TypeORM {
type: 'char',
length: 36
};
static readonly WORKSPACE_ID_COLUMN_TYPE: ColumnOptions = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* @param hostname The hostname the request is headed to
*/
export const parseWorkspaceIdFromHostname = function(hostname: string) {
// We need to parse the workspace id precisely here to get the case '<some-str>-<port>-<wsid>.ws.' right
const wsIdExpression = /([a-z][0-9a-z]+\-([0-9a-z]+\-){3}[0-9a-z]+)\.ws/g;
const wsIdExpression = /([a-z]{3,12}-[a-z]{2,16}-[A-Za-z0-9]{8})\.ws/g;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expression does not support both cases (UUID and new one). Wanted?

package namegen

import (
"math/rand"
Copy link
Member

@geropl geropl Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuid.NewRandom (the one we used before) uses crypto/rand. I suggest we used the same to make sure we have the same level of "randomness per bit".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impl seems straight forward:

var rander = rand.Reader // random function

func NewRandom() (UUID, error) {
	return NewRandomFromReader(rander)
}

// NewRandomFromReader returns a UUID based on bytes read from a given io.Reader.
func NewRandomFromReader(r io.Reader) (UUID, error) {
	var uuid UUID
	_, err := io.ReadFull(r, uuid[:])
	if err != nil {
		return Nil, err
	}
	uuid[6] = (uuid[6] & 0x0f) | 0x40 // Version 4
	uuid[8] = (uuid[8] & 0x3f) | 0x80 // Variant is 10
	return uuid, nil
}

(from here)

@geropl
Copy link
Member

geropl commented Jan 26, 2021

dev/loadgen/loadgen (12MB) is included in one of the commits. should be .gitignored.

* See License-AGPL.txt in the project root for license information.
*/

export function generateWorkspaceID() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes for TypeScript version: uuid uses randomFillSync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants