Commit 1bdea21

Chinmay Deshpande, Chinmay, and danielsn authored and committed
Vector stubbing (rust-lang#377)
Vector Stubbing for the Rust Standard Library

* This PR includes stub implementations for the Vector module for the Standard Library.
* It includes 3 abstractions - RMCVec, CVec and NobackVec.
* It also includes an experimental CHashSet implementation.

Co-authored-by: Chinmay <[email protected]>
Co-authored-by: Daniel Schwartz-Narbonne <[email protected]>
1 parent 2a41fcd commit 1bdea21

13 files changed, +2071 -8 lines changed

library/rmc/stubs/C/hashset/hashset.c

+153
@@ -0,0 +1,153 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0 OR MIT

#include <stdint.h>
#include <assert.h>
#include <stdlib.h>

// This HashSet stub implementation is supposed to work with c_hashset.rs.
// Please refer to that file for an introduction to the idea of a HashSet and
// some other implementation details. Public methods defined in c_hashset.rs
// act as wrappers around methods implemented here.

// As noted before, this HashSet implementation is specifically for inputs which
// are u16. The details below can be extended to larger sets if necessary. The
// domain of the output is i16.
//
// The hash function that we choose is the identity function.
// For all input x, hasher(x) = x. For our case, this satisfies all the
// requirements of an ideal hash function: it is 1:1, and there exists only one
// element in the input domain for each value in the output domain.
//
// An important thing to note here is that the hash function can be
// appropriately modified depending on the type of the input value which is
// stored in the hashset. As an example, if the HashSet needs to store a tuple
// of integers, say <u32, u32>, the hash function can be modified to be:
//
// hash((x, y)) = prime * x + y;
//
// Although this value can be greater than the chosen output domain, the
// function is still sound if the value wraps around because it guarantees a
// unique output for a given pair of x and y.
//
// Another way to think about this problem could be through the lens of
// uninterpreted functions where: if x == y => f(x) == f(y). Exploring this can
// be future work. The idea would be to implement a HashSet similar to that seen
// in functional programming languages.
//
// For the purpose of a HashSet, we don't necessarily need a SENTINEL outside the
// range of the hashing function because of the way we design the HashSet
// operations.
const uint16_t SENTINEL = 1;

uint16_t hasher(uint16_t value) {
    return value;
}

// We initialize all values of the domain to be 0 by initializing it with
// calloc. This lets us get around the problem of looping through all elements
// to initialize them individually with a special value.
//
// The domain array is to be interpreted such that
// if domain[index] != 0, value such that hash(value) = index is present.
//
// However, this logic does not work for the value 0. For this, we choose the
// SENTINEL value to initialize that element.
typedef struct {
    uint16_t* domain;
} hashset;

// Ideally, this approach is much more suitable if we can work with arrays of
// arbitrary size, specifically infinity. This would allow us to define hash
// functions for any type because the output domain can be considered to be
// infinite. However, CBMC currently does not handle unbounded arrays correctly.
// Please see: https://github.com/diffblue/cbmc/issues/6261. Even in that case,
// we might run into theoretical limitations of how solvers handle uninterpreted
// symbols such as unbounded arrays. For the case of this API, the consumer can
// request an arbitrary number of HashSets which can be dynamically chosen.
// As a result, the solver cannot know a priori how many unbounded arrays it
// needs to initialize, which might lead to errors.
//
// Firecracker uses HashSet<u32> (src/devices/src/virtio/vsock/unix/muxer.rs).
// But for working with u32s, we run into the problem that the entire domain
// cannot be allocated through malloc. We run into the error "array too large
// for flattening". For that reason, we choose to work with u16 to demonstrate
// the feasibility of this approach. However, it should be extensible to other
// integer types.
//
// Returns: pointer to a hashset instance which tracks the domain memory. This
// pointer is used in later callbacks such as insert() and remove().
hashset* hashset_new() {
    hashset* set = (hashset *) malloc(sizeof(hashset));
    // Initializes all indexes to 0, indicating that those elements are
    // not present in the HashSet. The domain needs UINT16_MAX + 1 slots so
    // that every u16 value, including UINT16_MAX itself, has a valid index.
    set->domain = calloc((size_t) UINT16_MAX + 1, sizeof(uint16_t));
    // For 0, choose another value to achieve the same.
    set->domain[0] = SENTINEL;
    return set;
}

// For insert, we need to first check if the value exists in the HashSet. If it
// does, we immediately return a 0 (false) value back.
//
// If it doesn't, then we mark that element of the domain array with the value to
// indicate that this element has been inserted. For element 0, storing the
// value 0 overwrites the SENTINEL, which marks it as present.
//
// To check if a value exists, we simply check if domain[hash] != 0 and
// in the case of 0 if domain[0] != SENTINEL.
//
// Returns: an integer value 1 or 0. If the value is already present in the
// hashset, this function returns a 0. If the value is successfully inserted, we
// return a 1.
uint32_t hashset_insert(hashset* s, uint16_t value) {
    uint16_t hash = hasher(value);

    if ((hash == 0 && s->domain[hash] != SENTINEL) ||
        (hash != 0 && s->domain[hash] != 0)) {
        return 0;
    }

    s->domain[hash] = value;
    return 1;
}

// The same presence check as in hashset_insert() is repeated here rather than
// shared, so that the hash is not computed twice. This can be improved.
//
// Returns: an integer value 1 or 0. If the value is present in the hashset,
// this function returns a 1, otherwise 0.
uint32_t hashset_contains(hashset* s, uint16_t value) {
    uint16_t hash = hasher(value);

    if ((hash == 0 && s->domain[hash] != SENTINEL) ||
        (hash != 0 && s->domain[hash] != 0)) {
        return 1;
    }

    return 0;
}

// We check if the element exists in the array. If it does not, we return a 0
// (false) value back. If it does, we mark it with 0 and in the case of 0, we
// mark it with the SENTINEL and return 1.
//
// Returns: an integer value 1 or 0. If the value is not present in the hashset,
// this function returns a 0. If the value is successfully removed from the
// hashset, it returns a 1.
uint32_t hashset_remove(hashset* s, uint16_t value) {
    uint16_t hash = hasher(value);

    if ((hash == 0 && s->domain[hash] == SENTINEL) ||
        (hash != 0 && s->domain[hash] == 0)) {
        return 0;
    }

    if (hash == 0) {
        s->domain[hash] = SENTINEL;
    } else {
        s->domain[hash] = 0;
    }

    return 1;
}
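
The sentinel encoding described in the comments above can be sketched in isolation. This is a minimal illustration, not the stub itself: every `sketch_*` name is hypothetical, the domain is assumed to hold `UINT16_MAX + 1` slots, and allocation failure is turned into an assertion failure for simplicity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical names (sketch_*) used for this illustration only.
#define SKETCH_SENTINEL 1
#define SKETCH_DOMAIN_SIZE ((size_t) UINT16_MAX + 1)

typedef struct {
    uint16_t* domain;
} sketch_set;

// Slot 0 counts as empty while it holds the sentinel; every other slot
// counts as empty while it holds 0.
static uint16_t sketch_empty_marker(uint16_t index) {
    return index == 0 ? SKETCH_SENTINEL : 0;
}

sketch_set* sketch_new(void) {
    sketch_set* s = malloc(sizeof(*s));
    assert(s != NULL);
    // calloc zeroes every slot, so only slot 0 needs explicit setup.
    s->domain = calloc(SKETCH_DOMAIN_SIZE, sizeof(uint16_t));
    assert(s->domain != NULL);
    s->domain[0] = SKETCH_SENTINEL;
    return s;
}

// Identity hash: the value is its own index into the domain.
uint32_t sketch_contains(const sketch_set* s, uint16_t value) {
    return s->domain[value] != sketch_empty_marker(value);
}

// Returns 1 if the value was newly inserted, 0 if it was already present.
uint32_t sketch_insert(sketch_set* s, uint16_t value) {
    if (sketch_contains(s, value)) {
        return 0;
    }
    // Storing the value itself differs from the empty marker for every
    // index: 0 != SENTINEL for slot 0, and value != 0 for all other slots.
    s->domain[value] = value;
    return 1;
}

// Returns 1 if the value was removed, 0 if it was not present.
uint32_t sketch_remove(sketch_set* s, uint16_t value) {
    if (!sketch_contains(s, value)) {
        return 0;
    }
    s->domain[value] = sketch_empty_marker(value);
    return 1;
}

void sketch_free(sketch_set* s) {
    free(s->domain);
    free(s);
}
```

Routing all three operations through one "empty marker" helper is a design choice of this sketch; it keeps the special case for 0 in a single place instead of repeating the two-branch condition.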

library/rmc/stubs/C/vec/vec.c

+182
@@ -0,0 +1,182 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0 OR MIT

#include <stdint.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// This Vector stub implementation is supposed to work with c_vec.rs. Please
// refer to that file for a detailed explanation about the workings of this
// abstraction. Public methods implemented in c_vec.rs act as wrappers around
// methods implemented here.

// __CPROVER_max_malloc_size is dependent on the number of offset bits used to
// represent a pointer variable. By default, this is chosen to be 56, in which
// case the max_malloc_size is 2 ** (offset_bits - 1). We could go as far as to
// assign the default capacity to be the max_malloc_size but that would be overkill.
// Instead, we choose a high-enough value 2 ** 10. Another reason to do
// this is that it would be easier for the solver to reason about memory if multiple
// Vectors are initialized by the abstraction consumer.
//
// For larger array sizes such as 2 ** (31 - 1) we encounter "array size too large
// for flattening" error.
#define DEFAULT_CAPACITY 1024
#define MAX_MALLOC_SIZE 18014398509481984

// A Vector is a dynamically growing array type with contiguous memory. We track
// allocated memory, the length of the Vector and the capacity of the
// allocation.

// As can be seen from the pointer to mem (uint32_t*), we track memory in terms
// of words. The current implementation works only if the containing type is
// u32. This was specifically chosen due to a use case seen in the Firecracker
// codebase. This structure is used to communicate over the FFI boundary.
// Future work:
// Ideally, the pointer to memory would be uint8_t* - representing that we treat
// memory as an array of bytes. This would allow us to be generic over the type
// of the element contained in the Vector. In that case, we would have to treat
// every sizeof(T) bytes as an individual element and cast memory accordingly.
typedef struct {
    uint32_t* mem;
    size_t len;
    size_t capacity;
} vec;

// The grow operation resizes the vector and copies its original contents into a
// new allocation. This is one of the more expensive operations for the solver
// to reason about and one way to get around this problem is to use a large
// allocation size. We also implement vec_sized_grow which takes an argument
// defining the minimum number of additional elements that need to fit into
// the Vector memory. This aims to replicate behavior as seen in the Rust
// standard library where the size of the vector is decided based on the
// following equation:
// new_capacity = max(capacity * 2, capacity + additional).
// Please refer to method grow_amortized in raw_vec.rs in the Standard Library
// for additional information.
// The performance of the current implementation depends on how well CBMC
// reasons about realloc. If CBMC does better, so would this abstraction.
//
// One important callout to make here is that because we allocate a large enough
// buffer, we can't reason about buffer overflow bugs. This is because the
// allocated memory will (most likely) always have enough space allocated after
// the required vec capacity.
//
// Future work:
// Ideally, we would like to get around the issue of resizing altogether since
// CBMC supports unbounded arrays. In that case, we would allocate memory of
// size infinity and work with that. For program verification, this would
// optimize a lot of operations since the solver does not really have to worry
// about the bounds of memory. The appropriate constant for capacity would be
// __CPROVER_constant_infinity_uint but this is currently blocked due to
// incorrect translation of the constant: https://github.com/diffblue/cbmc/issues/6261.
//
// Another way to approach this problem would be to implement optimizations in
// the realloc operation of CBMC. Rather than allocating a new memory block and
// copying over elements, we can track only the end pointer of the memory and
// shift it to track the new length. Since this behavior is that of the
// allocator, the consumer of the API is blind to it.
void vec_grow_exact(vec* v, size_t new_cap) {
    uint32_t* new_mem = (uint32_t* ) realloc(v->mem, new_cap * sizeof(*v->mem));

    v->mem = new_mem;
    v->capacity = new_cap;
}

void vec_grow(vec* v) {
    size_t new_cap = v->capacity * 2;
    if (new_cap > MAX_MALLOC_SIZE) {
        // Panic if the new size requirement is greater than max size that can
        // be allocated through malloc.
        assert(0);
    }

    vec_grow_exact(v, new_cap);
}

void vec_sized_grow(vec* v, size_t additional) {
    size_t min_cap = v->capacity + additional;
    size_t grow_cap = v->capacity * 2;

    // This resembles the Rust Standard Library behavior - grow_amortized in
    // alloc/raw_vec.rs
    //
    // Reference: https://doc.rust-lang.org/src/alloc/raw_vec.rs.html#421
    size_t new_cap = min_cap > grow_cap ? min_cap : grow_cap;
    if (new_cap > MAX_MALLOC_SIZE) {
        // Panic if the new size requirement is greater than max size that can
        // be allocated through malloc.
        assert(0);
    }

    vec_grow_exact(v, new_cap);
}

vec* vec_new() {
    vec *v = (vec *) malloc(sizeof(vec));
    // Default size is DEFAULT_CAPACITY. We compute the maximum number of
    // elements to ensure that allocation size is aligned.
    size_t max_elements = DEFAULT_CAPACITY / sizeof(*v->mem);
    v->mem = (uint32_t *) malloc(max_elements * sizeof(*v->mem));
    v->len = 0;
    v->capacity = max_elements;
    // Return a pointer to the allocated vec structure, which is used in future
    // callbacks.
    return v;
}

vec* vec_with_capacity(size_t capacity) {
    vec *v = (vec *) malloc(sizeof(vec));
    if (capacity > MAX_MALLOC_SIZE) {
        // Panic if the new size requirement is greater than max size that can
        // be allocated through malloc.
        assert(0);
    }

    v->mem = (uint32_t *) malloc(capacity * sizeof(*v->mem));
    v->len = 0;
    v->capacity = capacity;
    return v;
}

void vec_push(vec* v, uint32_t elem) {
    // If we have already reached capacity, resize the Vector before
    // pushing in new elements.
    if (v->len == v->capacity) {
        // Ensure that we have capacity to hold at least one more element
        vec_sized_grow(v, 1);
    }

    v->mem[v->len] = elem;
    v->len += 1;
}

uint32_t vec_pop(vec* v) {
    assert(v->len > 0);
    v->len -= 1;

    return v->mem[v->len];
}

void vec_append(vec* v1, vec* v2) {
    // Reserve enough space before adding in new elements.
    vec_sized_grow(v1, v2->len);
    // Perform a memcpy of elements which is cheaper than pushing the elements
    // one at a time.
    memcpy(v1->mem + v1->len, v2->mem, v2->len * sizeof(*v2->mem));
    v1->len = v1->len + v2->len;
}

size_t vec_len(vec* v) {
    return v->len;
}

size_t vec_cap(vec* v) {
    return v->capacity;
}

void vec_free(vec* v) {
    free(v->mem);
    free(v);
}
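
The growth rule quoted in the comments, new_capacity = max(capacity * 2, capacity + additional), can be isolated into a pure function to make its behavior easy to check. This is a sketch under stated assumptions: `sketch_next_capacity` and its `limit` parameter are hypothetical names that do not exist in vec.c, and returning 0 on overflow or on exceeding the limit stands in for the `assert(0)` panic used by the stub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper for illustration only; not part of vec.c.
// Returns max(capacity * 2, capacity + additional), or 0 when either
// computation would overflow size_t or the result would exceed `limit`.
size_t sketch_next_capacity(size_t capacity, size_t additional, size_t limit) {
    if (capacity > SIZE_MAX / 2) {
        return 0; // capacity * 2 would wrap around
    }
    if (additional > SIZE_MAX - capacity) {
        return 0; // capacity + additional would wrap around
    }
    size_t grow_cap = capacity * 2;
    size_t min_cap = capacity + additional;
    size_t new_cap = min_cap > grow_cap ? min_cap : grow_cap;
    return new_cap > limit ? 0 : new_cap;
}

// Usage: spot-check the rule on two inputs.
void sketch_growth_demo(void) {
    // One extra element: doubling dominates.
    assert(sketch_next_capacity(256, 1, 1024) == 512);
    // Many extra elements: capacity + additional dominates.
    assert(sketch_next_capacity(256, 300, 1024) == 556);
}
```

The explicit overflow checks are an addition of this sketch; whether they are needed in the stub depends on how the verifier is configured to treat unsigned wraparound.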

library/rmc/stubs/README.md

+4
@@ -0,0 +1,4 @@
Verification-friendly Vector stubs
----------
