Major refactoring of array bounds code and some performance improvements #282

Merged: Machiry merged 6 commits into master from iss281 on Oct 27, 2020

Conversation

Machiry (Collaborator) commented Oct 3, 2020

  • Implemented a technique similar to Locksmith, where the bounds of a variable are the intersection of the bounds of all array variables that flow into it.

  • Added a flag to disable the array bounds heuristics: -disable-arr-hu (they are enabled by default, but this flag can be used to disable them).

  • Previously, we used the same node for all occurrences of a constant number; for example, for the code below:

    int n = 8;
    int l = 8;
    

    We would construct our graph as n <-> 8 <-> l, which wrongly leads us to believe that n is reachable into (i.e., flows into) l, which is not true.

    We changed the code to create a new node for every occurrence of a constant and thereby avoid this.

  • Added a cache to the breadth-first search: we call BFS on the same node many times, so it is better to cache the reachability information (provided that the graph didn't change); see the sketch after this list.

  • This also fixes: Should pick constant bound, but chooses parameter #281
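
For the BFS cache, here is a minimal self-contained C++ sketch of the idea (BFSCache and the graph type here are simplified stand-ins, not the actual 3C classes):

#include <cstdint>
#include <map>
#include <queue>
#include <set>

using BoundsKey = uint64_t;

// Simplified graph: adjacency map plus a reachability cache keyed by the BFS start node.
struct CachedGraphSketch {
  std::map<BoundsKey, std::set<BoundsKey>> Succs;
  std::map<BoundsKey, std::set<BoundsKey>> BFSCache;

  // Returns every node reachable from Start, computing the BFS at most once
  // per start node as long as the graph is not mutated in between.
  const std::set<BoundsKey> &getReachable(BoundsKey Start) {
    auto It = BFSCache.find(Start);
    if (It != BFSCache.end())
      return It->second;                      // cache hit: reuse the earlier BFS result
    std::set<BoundsKey> Reachable;
    std::queue<BoundsKey> Work;
    Work.push(Start);
    while (!Work.empty()) {
      BoundsKey Cur = Work.front();
      Work.pop();
      for (BoundsKey Next : Succs[Cur])
        if (Reachable.insert(Next).second)
          Work.push(Next);
    }
    return BFSCache[Start] = Reachable;       // insert into BFS cache
  }

  // Any mutation must invalidate the cache ("provided that the graph didn't change").
  void addEdge(BoundsKey From, BoundsKey To) {
    Succs[From].insert(To);
    BFSCache.clear();
  }
};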

Array bounds inference:

Every variable (both pointers and scalars) is identified by a unique number, i.e., a BoundsKey. There is a mapping from each BoundsKey to a corresponding ProgramVar*, which contains information about the program scope, variable name, etc.
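
For intuition, the mapping can be pictured roughly as follows (a sketch with assumed field names; the real ProgramVar and scope classes carry more information):

#include <cstdint>
#include <map>
#include <string>

using BoundsKey = uint64_t;          // unique id per variable (representation assumed)

// Assumed shape of the per-variable record.
struct ProgramVarSketch {
  std::string VarName;               // e.g. "n", "arr"
  std::string Scope;                 // e.g. "function foo", "global", "struct S"
  bool IsConstant;                   // constants also get keys (a fresh one per occurrence, per this PR)
};

// The mapping described above: each BoundsKey resolves to its ProgramVar*.
static std::map<BoundsKey, ProgramVarSketch *> ProgramVarMap;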

The main entry point in this class is bool AVarBoundsInfo::performFlowAnalysis(ProgramInfo *PI). As the name suggests, this performs flow-based (i.e., Locksmith-style) analysis to determine the bounds of all the array variables.

It has a fixed-point loop: at each iteration, we try to find the possible bounds for every array variable that does not yet have bounds from its predecessors, using the following function:

bool AvarBoundsInference::predictBounds(BoundsKey K,
                                        std::set<BoundsKey> &Neighbours,
                                        AVarGraph &BKGraph)

where K is the array variable whose bounds need to be computed, Neighbours is the set of its predecessors, and BKGraph is the graph that contains all the flow information. The possible bounds for K are stored in CurrIterInferBounds.
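
A minimal, runnable sketch of this fixed-point loop together with the intersection rule from the first bullet above; FlowGraphSketch and inferOnce are simplified stand-ins, not the real performFlowAnalysis/predictBounds:

#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <string>

using BoundsKey = uint64_t;
using Bound = std::string;                           // e.g. "count(c)" (sketch only)
using BoundsMap = std::map<BoundsKey, std::set<Bound>>;

struct FlowGraphSketch {
  std::map<BoundsKey, std::set<BoundsKey>> Preds;    // predecessors per node
};

// One iteration: for every array variable without bounds, intersect the bound
// sets of all predecessors that already have bounds.
bool inferOnce(const std::set<BoundsKey> &NeedBounds,
               const FlowGraphSketch &G, BoundsMap &Known) {
  bool Changed = false;
  for (BoundsKey K : NeedBounds) {
    if (Known.count(K))
      continue;                                      // already has bounds
    std::set<Bound> Candidate;
    bool First = true;
    for (BoundsKey P : G.Preds.at(K)) {
      auto It = Known.find(P);
      if (It == Known.end())
        continue;                                    // this predecessor has no bounds yet
      if (First) { Candidate = It->second; First = false; continue; }
      std::set<Bound> Inter;
      for (const Bound &B : Candidate)
        if (It->second.count(B))
          Inter.insert(B);
      Candidate = Inter;                             // intersection of all incoming bounds
    }
    if (!First && !Candidate.empty()) {
      Known[K] = Candidate;
      Changed = true;
    }
  }
  return Changed;
}

int main() {
  // p (key 1) and q (key 2) both flow into parameter x (key 3); both have count(c).
  FlowGraphSketch G;
  G.Preds[3] = {1, 2};
  BoundsMap Known = {{1, {"count(c)"}}, {2, {"count(c)"}}};
  std::set<BoundsKey> Need = {3};
  while (inferOnce(Need, G, Known)) {}               // iterate to a fixed point
  std::cout << "x gets " << *Known[3].begin() << "\n";  // prints: x gets count(c)
}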

Later we call convergeInferredBounds, which picks the best bounds from the possible bound values with the following preference: count bounds take priority over byte bounds, and variables take priority over constants. If there are still multiple choices (i.e., multiple variables or constants), we give up.
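
A sketch of that preference order (illustrative types; exactly how the two tie-breakers nest is an assumption of this sketch, not taken from the implementation):

#include <optional>
#include <set>
#include <string>

// Assumed shape of a candidate bound.
struct CandidateBound {
  enum Kind { Count, ByteCount };
  Kind BKind;
  bool LengthIsVariable;                             // true for count(n), false for count(8)
  std::string Text;                                  // printable form, e.g. "count(n)"
  bool operator<(const CandidateBound &O) const { return Text < O.Text; }
};

// Count bounds take priority over byte bounds, variables over constants; if the
// best preference level still has several choices, give up (return nullopt).
std::optional<CandidateBound>
pickBestBound(const std::set<CandidateBound> &Cands) {
  for (CandidateBound::Kind K : {CandidateBound::Count, CandidateBound::ByteCount})
    for (bool WantVariable : {true, false}) {
      std::set<CandidateBound> Matching;
      for (const CandidateBound &C : Cands)
        if (C.BKind == K && C.LengthIsVariable == WantVariable)
          Matching.insert(C);
      if (Matching.size() == 1)
        return *Matching.begin();                    // unique best choice
      if (Matching.size() > 1)
        return std::nullopt;                         // multiple choices: give up
    }
  return std::nullopt;                               // nothing usable
}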

We first propagate all the bounds information from explicit declarations and mallocs.
For other variables that do not have any choice of bounds, we use potential bounds choices (FromPB); these are the variables that serve as upper bounds on an index variable used in an array indexing operation.
For example:

if (i < n) {
   ...arr[i]...
}

In the above case, we use n as a potential count bound for arr.
Note: we only use potential bounds for a variable when none of its predecessors have bounds.
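
A small sketch of how such a potential ("FromPB") count bound could be recorded and looked up; the map and helper names here are hypothetical:

#include <cstdint>
#include <map>
#include <set>

using BoundsKey = uint64_t;

// For each array variable's key, the keys of variables observed as upper bounds
// of an index used to subscript that array.
static std::map<BoundsKey, std::set<BoundsKey>> PotentialCntBounds;

// Called when the analysis sees the pattern:  if (i < n) { ... arr[i] ... }
// ArrKey is arr's key and UpperKey is n's key.
void recordPotentialCountBound(BoundsKey ArrKey, BoundsKey UpperKey) {
  PotentialCntBounds[ArrKey].insert(UpperKey);
}

// Consulted only when no predecessor of ArrKey supplied bounds through the flow
// graph (cf. the performWorkListInference calls that pass the extra true flag).
const std::set<BoundsKey> *getPotentialCountBounds(BoundsKey ArrKey) {
  auto It = PotentialCntBounds.find(ArrKey);
  return It == PotentialCntBounds.end() ? nullptr : &It->second;
}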

Function of interest:

AvarBoundsInference::getReachableBoundKeys(const ProgramVarScope *DstScope,
                                                BoundsKey FromVarK,
                                                std::set<BoundsKey> &PotK,
                                                AVarGraph &BKGraph,
                                                bool CheckImmediate)

The function getReachableBoundKeys finds all the BoundsKeys (i.e., variables) in scope DstScope that are reachable from FromVarK in the graph BKGraph. All the reachable bounds keys are stored in PotK.
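
A rough sketch of that scope-filtered reachability, simplified to a plain scope-name match; the real function also deals with context-sensitive scopes and the CheckImmediate flag:

#include <cstdint>
#include <map>
#include <queue>
#include <set>
#include <string>

using BoundsKey = uint64_t;
using ScopeName = std::string;                       // stand-in for ProgramVarScope

struct ScopedGraphSketch {
  std::map<BoundsKey, std::set<BoundsKey>> Succs;    // flow edges
  std::map<BoundsKey, ScopeName> ScopeOf;            // scope of each variable
};

// Collect, into PotK, every key reachable from FromVarK whose variable is
// visible in DstScope (simplified here to an exact scope-name match).
void getReachableKeysInScope(const ScopedGraphSketch &G, BoundsKey FromVarK,
                             const ScopeName &DstScope,
                             std::set<BoundsKey> &PotK) {
  std::set<BoundsKey> Seen{FromVarK};
  std::queue<BoundsKey> Work;
  Work.push(FromVarK);
  while (!Work.empty()) {
    BoundsKey Cur = Work.front();
    Work.pop();
    auto S = G.ScopeOf.find(Cur);
    if (S != G.ScopeOf.end() && S->second == DstScope)
      PotK.insert(Cur);                              // reachable and in the target scope
    auto E = G.Succs.find(Cur);
    if (E == G.Succs.end())
      continue;
    for (BoundsKey Next : E->second)
      if (Seen.insert(Next).second)
        Work.push(Next);
  }
}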

// Insert into BFS cache.
if (BFSCache.find(Start) == BFSCache.end()) {
  std::set<Data> ReachableNodes;
  ReachableNodes.clear();

Member:

Why do you need to call clear() here? Isn't just creating the variable calling its constructor, which will ensure that it's empty?

Machiry (Collaborator, Author):

Agree. Will remove it.


// Here x bounds will be c
void foo(int *x, int c) {
//CHECK: void foo(_Array_ptr<int> x : count(c), int c) {

Member:

Seems odd that this should be c since the index of 3 is not at all related to c. I.e., the index is unconditional.

Member:

Maybe this is because of the heuristic (which we are not disabling)?

Machiry (Collaborator, Author):

This is not a heuristic. It comes from all the incoming array pointers, i.e., p and q: all the calls to this function pass an array to x and its bounds to c. We do not worry about the indexing of x, as Checked C dynamic checks will take care of that.

x[3] = c;
}

// Here x bounds is c but the violates bounds.

Member:

Huh?

//CHECK: _Array_ptr<int> q1 : count(8) = malloc<int>(sizeof(int)*8);

int n = 8;
int l;

Member:

This is pretty weird, since l is not initialized. Maybe make l the parameter to bar (this function)?

Machiry (Collaborator, Author):

Will initialize l to something. I just wanted to check Variation 3.


// Variation 3
foo3(q2,l);
foo3(q1,28);

Member:

What about also adding a test where you pass in n as the second parameter, i.e., as if it were q's length, even though it's associated with p? The result should be that it's not treated as the length.

Machiry (Collaborator, Author):

Sure. Will add it.

@@ -32,7 +32,8 @@ const StructScope *StructScope::getStructScope(std::string StName) {
   if (AllStScopes.find(TmpS) == AllStScopes.end()) {
     AllStScopes.insert(TmpS);
   }
-  return &(*AllStScopes.find(TmpS));
+  const auto &SS = *AllStScopes.find(TmpS);
+  return &SS;

Member:

Returning &x where x is a local variable is bad, isn't it? Same with what's below.

Machiry (Collaborator, Author):

It's a reference. So returning &SS, where SS is a reference, is as good as returning a pointer.
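
A tiny standalone illustration of that point (not the 3C code; internScope is a made-up name):

#include <cassert>
#include <set>
#include <string>

// Returning &X where X is a *reference* yields the address of the referenced
// object (here an element owned by the long-lived set), not of a local copy.
static std::set<std::string> AllScopes;

const std::string *internScope(const std::string &Name) {
  AllScopes.insert(Name);
  const auto &SS = *AllScopes.find(Name);            // SS refers into AllScopes
  return &SS;                                        // valid while the element stays in the set
}

int main() {
  const std::string *A = internScope("struct S");
  const std::string *B = internScope("struct S");
  assert(A == B);                                    // both point at the same stored element
}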

@@ -10,13 +10,21 @@
//

Member:

Seems like you completely changed this file. I'll look at it from scratch. Can you say at a high level what it does (I think it does the propagation/inference process) and what its entry point is? Then I will start looking at it from that entry point.

Machiry (Collaborator, Author):

Will do.

Machiry requested a review from mwhicks1 on October 5, 2020 23:41
// Flow inference from context sensitive keys to original keys.
performWorkListInference(ArrNeededBounds, this->CtxSensProgVarGraph, ABI);
// Now, by using potential bounds.
performWorkListInference(ArrNeededBounds, this->CtxSensProgVarGraph, ABI, true);

Member:

I am wondering: Why do we need different graphs? Why not always use the context-sensitive graph?

Machiry (Collaborator, Author):

@@ -1071,36 +1201,49 @@ bool AVarBoundsInfo::performFlowAnalysis(ProgramInfo *PI) {
removeBounds(TBK, FlowInferred);

Member:

This is part of keepHighestPriorityBounds -- what makes a possible bound high priority?

Machiry (Collaborator, Author):

This is the priority order we use:

// Regular flow inference.
performWorkListInference(ArrNeededBounds, this->ProgVarGraph, ABI);
// Flow inference using potential bounds.
performWorkListInference(ArrNeededBounds, this->ProgVarGraph, ABI, true);

Member:

What is a "potential bound" (FromPB)?

Machiry (Collaborator, Author):

As mentioned in the pull request:

> For other variables that do not have any choice of bounds, we use potential bounds choices (FromPB); these are the variables that serve as upper bounds on an index variable used in an array indexing operation.
> For example:
>
> if (i < n) {
>    ...arr[i]...
> }
>
> In the above case, we use n as a potential count bound for arr.
> Note: we only use potential bounds for a variable when none of its predecessors have bounds.

bool
AvarBoundsInference::convergeInferredBounds() {
  bool FoundSome = false;
  for (auto &InfABnds : CurrIterInferBounds) {

Member:

Where does InfABnds get set? I didn't notice it in the performXXX functions.

Machiry (Collaborator, Author):

It gets set in the predictBounds function.

mwhicks1 (Member) left a comment:

When I run the tests on my Mac, I get two unexpected failures.

********************
Failing Tests (2):
    Clang :: CheckedCRewriter/basic.c
    Clang :: CheckedCRewriter/multivardecls.c

  Expected Passes    : 349
  Expected Failures  : 7
  Unexpected Failures: 2

Are these unexpected failures to be expected? :)

mwhicks1 (Member):

I tried this example:

void foo(int *p) {
  p[2] = 1;
}
void bar(int n) {
  int *x = malloc(sizeof(int) * n);
  int y[10];
  foo(x);
  foo(y);
}

It incorrectly rewrites the p count to be 10 even though x, which is passed in, has a length that depends on an arbitrary n. It seems to me that you want to be looking at constraints inside of foo, to notice that count(p) > 2. Then, if you don't have a consistent set of flows in to foo (which you don't, since 10 and n both flow there as possible lengths), you can use the interior constraints to choose a size. Here, it would be count(3). Is this possible to do?

mwhicks1 (Member):

Another question. For this example:

void foo(void) {
  int x[10];
  int *y = x;
  int *z = y;
  z[1] = 1;
}

It produces this graph:

[graph image]

Why are the edges bidirectional? We have bidirectional edges in the ptyp graph to make sense of the kind of pointer it is, but now that we've determined that type, I'm wondering why we don't have single-direction edges for lengths?

mwhicks1 (Member):

As a consequence of the above, I notice that if we add the following line at the end of foo:

  z = malloc<int>(sizeof(int)*5);

then y no longer gets a count added on, and z gets 5. The latter is right, but I would want the former to be 10, still?

Maybe the reason they have to be bidirectional is somehow due to being able to mutate the variables?

mwhicks1 (Member):

A bunch of other random questions:

  • What is a BoundsKey doing? It seems like it’s the core of a constraint, like a VarAtom; is that right?
  • What is a ProgramVar? Is it literally always a variable, or can it be, for example, a constant? If the latter, perhaps it’s the wrong name?
  • What's a CtxFunctionArgScope exactly? I think of scopes as lexical blocks in which variable names are resolved. So a StructScope is the set of fields inside a struct definition; a FunctionParamScope is the set of parameters in a function definition; a FunctionScope is the set of variables defined within a function (including its parameters); etc. There is no particular scope for a context-sensitive call, is there? Maybe you are just looking at the variables that are used at a particular call site? E.g., if I make a call foo(a,b) at one place and foo(b,c) at another, there would be two scopes, with a and b in the first and b and c in the second?
  • What is a GlobalScope used for?

Machiry (Collaborator, Author) commented Oct 14, 2020

> As a consequence of the above, I notice that if we add the following line at the end of foo:
>
>   z = malloc<int>(sizeof(int)*5);
>
> then y no longer gets a count added on, and z gets 5. The latter is right, but I would want the former to be 10, still?
>
> Maybe the reason they have to be bidirectional is somehow due to being able to mutate the variables?

Yes, this is a consequence of having bidirectional edges and wanting the inferred bounds to be consistent.
So the bounds of y become conflicting (10 vs. 5) and we give up.
However, we treat malloc as seed bounds (i.e., ground truth), similar to declared bounds, and hence z gets its bounds.

Machiry (Collaborator, Author) commented Oct 14, 2020

> Another question. For this example:
>
> void foo(void) {
>   int x[10];
>   int *y = x;
>   int *z = y;
>   z[1] = 1;
> }
>
> It produces this graph:
>
> [graph image]
>
> Why are the edges bidirectional? We have bidirectional edges in the ptyp graph to make sense of the kind of pointer it is, but now that we've determined that type, I'm wondering why we don't have single-direction edges for lengths?

I wanted the inferred bounds to be consistent, i.e., for a given array pointer p, all other array pointers that have bounds and are related to it (flow into p or flow from p) should agree on the bounds.

We can do better once we have flow-sensitive bounds, which Checked C will support eventually!

Machiry closed this Oct 14, 2020
Machiry reopened this Oct 14, 2020

Machiry (Collaborator, Author) commented Oct 14, 2020

> I tried this example:
>
> void foo(int *p) {
>   p[2] = 1;
> }
> void bar(int n) {
>   int *x = malloc(sizeof(int) * n);
>   int y[10];
>   foo(x);
>   foo(y);
> }
>
> It incorrectly rewrites the p count to be 10 even though x, which is passed in, has a length that depends on an arbitrary n. It seems to me that you want to be looking at constraints inside of foo, to notice that count(p) > 2. Then, if you don't have a consistent set of flows in to foo (which you don't, since 10 and n both flow there as possible lengths), you can use the interior constraints to choose a size. Here, it would be count(3). Is this possible to do?

This is because, at the call site foo(x), the context-sensitive argument does not have bounds.
We see x(n) -> x_ctx_sens_p(unknown bounds) -> p and y(10) -> y_ctx_sens_p(10) -> p.

Our inference looks for pointers that have bounds; in the above case, we only see y_ctx_sens_p as an array that has bounds, and we try to infer bounds from it.

Machiry (Collaborator, Author) commented Oct 14, 2020

> When I run the tests on my Mac, I get two unexpected failures.
>
> ********************
> Failing Tests (2):
>     Clang :: CheckedCRewriter/basic.c
>     Clang :: CheckedCRewriter/multivardecls.c
>
>   Expected Passes    : 349
>   Expected Failures  : 7
>   Unexpected Failures: 2
>
> Are these unexpected failures to be expected? :)

Not expected. Will take a look.

Machiry (Collaborator, Author) commented Oct 16, 2020

> When I run the tests on my Mac, I get two unexpected failures.
>
> ********************
> Failing Tests (2):
>     Clang :: CheckedCRewriter/basic.c
>     Clang :: CheckedCRewriter/multivardecls.c
>
>   Expected Passes    : 349
>   Expected Failures  : 7
>   Unexpected Failures: 2
>
> Are these unexpected failures to be expected? :)
>
> Not expected. Will take a look.

Looks like it is missing header files. Can you see the exact error you are getting? You can use llvm-lit -v to get all the warnings.

mwhicks1 (Member):

@jackastner found and fixed the multivardecls.c bug. It's been merged as of yesterday.

For the basic.c one, I see this as the expected output in basic1:

char data _Nt_checked[17] = "abcdefghijklmnop";

but on my machine I get

char data _Checked[17] = "abcdefghijklmnop";

mwhicks1 (Member):

OK, merge when ready.

My last request: your comment at the top is very helpful in describing how things work, so it would be useful to add it as comments in the code. For example, you could add a comment near performFlowAnalysis noting that it is the entry point, and use the opportunity to describe how it works there.

Machiry merged commit c554433 into master Oct 27, 2020
Machiry added a commit that referenced this pull request Oct 28, 2020
This reverts commit c554433, reversing
changes made to 0f8d6dc.
Machiry added a commit that referenced this pull request Oct 29, 2020
Major refactoring of array bounds code and some performance improvements
mattmccutchen-cci added a commit that referenced this pull request Dec 5, 2020
* Bogus warning suppressed

* Fix to work on Mac

* Update generated tests for whitespace changes

* Fix #204

* Fix #205

FunctionDecl replacement now only replaces the return/parameters when
they have changed. This allows macros to appear in these locations as
long as they don't require rewriting.

* Handle defines used as basetypes of checked pointers

Just inline the define for now.

* Remove text file

* syncing with master

* Refactoring GatherTool.cpp

* Initial logging of root cause diagnostics

* Add some comments to PVConstraint constructor

* Nicer root cause reported for invalid cast

* Correctly differentiate void* and vararg root causes

* Undo modification to PersistentSourceLocation

* Remove commented lines

* Comment tweaks

* Fixing issue 239

* Fix

* Testing repeated runs

* Fix regression in lua and ptr_dist

* Actually fix lua this time

* Initial fix for #255

* Add more const qualifiers

* Move PersistentConstraint methods into ProgramInfo

* Store persistent expression constraints in new map

These were previously stored the Variables map that holds declaration
constraint variables.

* Variables map now associates declaration with a single constraint

* Return option type for getVariable instead of nullptr

* Remove commented out code that is now truly dead

* Context sensitive array bounds

* Fix RUN commands in test

* Handle edge cases for PSL collision

* Code cleanup and tweaks

* Update comment in PersistentSourceLocation

* Add a test case for parameter declarations in macro

* Warnings

* Another test case for WILDness of macros

* Use my own option type for nicer behavior with abstract base class

* Addressing comments

* Adding refs

* Move BaseType code into a static method

* Split single declaration and multi-decl rewriting into separate methods

* Initial implementation of new multi-var-decl rewriting

* Add tests for multi-declarations

* Update main.yml

Checking out correct commit hash of bear

* Fixing running unit tests (They never ran)

* Update main.yml

* Update main.yml

* Use doDeclRewrite for single declaration rewriting

* Cleanup unneeded templating

* Comments

* Insert initializer for arrays of pointers

* Update test for array initializer

* Add EOF check in getNextCommaOrSemicolon

* Remove dead method

* Don't treat global vars / struct fields on same line as multi-decl

* Comments and naming

* Fix bug caused by lack of space between consecutive decls

* Fixing issue #276

* Minor fixes

* Changed error message

* tidying

* Fixing use-after-free in ProgramVar

* Adding TODOs

* Changing branch names from BigRefactor to master

* Updating the links to correct computation org

* Extract duplicated code into a new method

* Major refactoring of array bounds code and some performance improvements

* More 3C changes prior to 5C itype removal

* Handling comments

* Add infrastructure to handle separate 5C specific code

* Add flags for 5C features guarded by ifdef

* Changed artifacts retention days

We will retain artifacts only for 3 days.

* Adding ctx sensitive keys

* Minor refactoring

* Remove directories that contain nothing but old duplicate 3C tests.

Part of #286.

* Avoid generating extra atom on allocator function calls

* constrainToWild only needs to constrain outermost pointer

The inner pointers are constrained by implication constraints. This
helps to improve the root cause analysis.

* Fix bug in multi-declaration rewriting

The set of declarations in a multi-decl was not explicitly ordered by
position in the multi-declaration. This meant declarations could sometimes
be rewritten out of order, potentially crashing 3c.

* Adding stats

* Include ExprConstraintVars in pointer source location map

* Add constraints to wild when PtrType solver fails

Previously the affected atoms were assigned to in the solution, without
updating the constraint graph. This change means the updated solution is
reflected in the constraint graph and root cause analysis.

* Output diagnostics for constraints not mapped to declaration

Constraints on expressions such as c-style casts are now output in root
cause of wildness diagnostics.

* Add source locations to constraints for better root cause reporting

* Improve root cause diagnostic for invalid casts

Invalid casts between pointers with different base types will now only
produce one warning.

* Add test file for root cause of wildness

* More root cause tests

* Add reason for unsafe allocator calls

* Add WILD counts to output

* Implement per-pointer statistic output

Old methods are renamed to reflect that they work with atoms instead of
pointers.

* Fix to propagate wildness cause between levels of pointers

* Cleanup computeIterimConstraintState

* Avoid using WILD as an atom inside constraint variables

This was messing up the BFS in root cause analysis

* Fix bug when root cause is on parameter of checked function pointer

* Comments; renaming.

* Update warning message

* Remove unused conflicts set in returns/params

* Remove some duplicated code pertaining to root cause source locations

* Add method for obtaining constrained VarAtom

* Avoid allocating empty sets in ImpMap

* Fix root cause reporting when cause is on function pointer parameter

* Add comments

* Rename method to getFreshGEQ

* Fix for issue #293

Instead of considering a pointer "originally checked" if any one of the
pointer levels in it is checked in the source, characterize each pointer
level individually.

* Add tests for converting partially checked pointers

* Test for partially checked function pointer

* Added comments

* FunctionVariableConstraint return and param vars changed to PVConstraints

* Handle partially checked extern functions

* Add tests for partially checked array pointers

* Revert "Merge pull request #282 from correctcomputation/iss281"

This reverts commit c554433, reversing
changes made to 0f8d6dc.

* Fixing array length inference performance issue

* Revert "Fixing array length inference performance issue"

* Merge pull request #282 from correctcomputation/iss281

Major refactoring of array bounds code and some performance improvements

* Fixing array length inference performance issue

* Temporarily add the rename scripts for archival.

* Fix typos and inconsistencies in old names to let automated replacements
work.

* git mv 'clang/docs/checkedc/CheckedCConvert.md' 'clang/docs/checkedc/3C.md'

* git mv 'clang/tools/cconv-standalone' 'clang/tools/3c'

* git mv 'clang/tools/3c/utils/cc_conv' 'clang/tools/3c/utils/port_tools'

* git mv 'clang/include/clang/CConv/CCGlobalOptions.h' 'clang/include/clang/CConv/3CGlobalOptions.h'

* git mv 'clang/test/CheckedCRewriter' 'clang/test/3C'

* repl_basename_pfx 'CConvert' '3C' '^clang'

* repl_basename_pfx 'CConv' '3C' '^clang'

* repl '\<cconv-standalone\>' '3c'

* repl '\<clang/tools/3c/utils/cc_conv\>' 'clang/tools/3c/utils/port_tools' '^\.github/workflows/main\.yml$'

* repl '\<CCGlobalOptions\>' '3CGlobalOptions'

* repl '\<CheckedCRewriter\>' '3C'

* repl '\<CConv(ert)?' '3C' '/CMakeLists\.txt$'

* repl '#include "clang/CConv/' '#include "clang/3C/' '\.(h|cpp)$'

* repl '(#include.*["/])CConv(ert)?' '\13C' '\.(h|cpp)$'

* repl '//=--CConvert([^-]*-)' '//=--3C\1------'

* repl '//=--CConv([^-]*-)' '//=--3C\1---'

* repl '\<LLVM_CLANG_TOOLS_EXTRA_CLANGD_CCONVERT' 'LLVM_CLANG_TOOLS_EXTRA_CLANGD_3C' '\.(h|cpp)$'

* repl '\<_CCGLOBALOPTIONS_H\>' '_3CGLOBALOPTIONS_H' '\.(h|cpp)$'

* repl '\<_CCONV([A-Z]*)_H\>' '_3C\1_H' '\.(h|cpp)$'

* repl '\<cconv(build|scripts)\>' '3c-\1' '^\.github/workflows/main\.yml$'

* repl '\<CCInterface\>' '_3CInterface' '\.(h|cpp)$'

* repl '\<cconv(CollectAndBuildInitialConstraints|CloseDocument)\>' '_3C\1' '\.(h|cpp)$'

* repl '\<CConv(Interface|Inter|Main|Sec|DiagInfo|LSPCallBack)\>' '_3C\1' '\.(h|cpp)$'

* repl '\<CCONVSOURCE\>' '_3CSOURCE' '\.(h|cpp)$'

* repl '\<(Command|ExecuteCommandParams)::CCONV_' '\1::_3C_' '\.(h|cpp)$'

* repl '\<CCONV_' '_3C_' '^clang-tools-extra/clangd/Protocol\.h$'

* repl '\<ConvertCategory\>' '_3CCategory' '\.(h|cpp)$'

* repl '\<CConvert(Options|ManualFix|Diagnostics)\>' '_3C\1' '\.(h|cpp)$'

* repl '\<ccConvertManualFix\>' '_3CManualFix' '\.(h|cpp)$'

* repl '\<ccConvResultsReady\>' '_3CResultsReady' '\.(h|cpp)$'

* repl '\<CheckedCConverter\>' '_3C'

* repl '\<executeCConvCommand\>' 'execute3CCommand' '\.(h|cpp)$'

* repl '\<sendCConvMessage\>' 'send3CMessage' '\.(h|cpp)$'

* repl '\<IsCConvCommand\>' 'Is3CCommand' '\.(h|cpp)$'

* repl '\<(clear|report)CConvDiagsForAllFiles\>' '\13CDiagsForAllFiles' '\.(h|cpp)$'

* repl '\<INTERACTIVECCCONV\>' 'INTERACTIVE3C'

* repl '"CConv_(RealWild|AffWild)"' '"3C_\1"' '\.(h|cpp)$'

* repl '"cconv\.(onlyThisPtr|applyAllPtr)"' '"3c.\1"' '\.(h|cpp)$'

* repl '\("CConv(:)? ' '("3C\1 ' '\.(h|cpp)$'

* repl '\<runCheckedCConvert\>' 'run3C'

* repl '\<icconv\>' 'clangd3c'

* repl '\<checked_c_convert_bin\>' '_3c_bin'

* repl '\<checked-c-convert\>' '3c'

* repl 'Checked C rewriter tool' '3C' '^clang/test/3C/'

* repl '\<cconvClangDaemon\>' '3cClangDaemon' '/CMakeLists\.txt$'

* repl '\<[Cc]?[Cc][Cc]onv\>' '3C' '(^|/)3[Cc]|^clang-tools-extra/clangd|^clang/include/clang'

* repl 'fcheckedc_convert_tool' 'f3c_tool'

* repl 'fcheckedc-convert-tool' 'f3c-tool'

* Adding more flags

* Text alignment fixes.

* Remaining edits that aren't worth automating.

* Delete the rename scripts again.

* Removing CheckedCRewriter added due to merging

* Delete the actions file from the correctcomputation/checkedc-clang repo.

We need to do this before sending the PR to Microsoft (see #302).  The
actions are currently broken for several reasons and we cannot afford to
wait to get them working in the new repository first.

* Disable clangd3c build target, which is unmaintained and broken.

* Rename the clang/lib/3C target to clang3C, following convention.

This neatly resolves the confusing near-collision between the executable
target (renamed from cconv-standalone to 3c in #299) and the library
target (renamed from CConv to 3C).

* Rename two identifiers missed in #299.

Specifically, AsCCCommands and ExecuteCCCommand.  These are used only in
clangd3c, which we're about to disable, but the intent of #299 was to
process the whole source tree.  These do not appear in the currently
existing 5C-specific code.

This change intentionally does not address:

- Local variables (except one as requested by Mike).  They aren't
  important and their naming varies too much to make it easy to find all
  of them.

- Name collisions between fields or local variables and their types.  In
  my tests so far, a field collision causes a compile error, while a
  local variable collision doesn't.  I saw the clangd3c build fail due
  to field collisions (we know there are none in 3c because it builds),
  and this led me to discover some local variable collisions in both
  clangd3c and 3c.  But we don't care about clangd3c build failures
  right now, and there isn't an easy way to find /all/ local variable
  collisions, so I don't see a strong motivation to fix the few in 3c
  that I happen to know about.

To make sure I didn't miss anything else, I conducted a more thorough
search than before for "cc" or "c.c" in the diff from LLVM master.  Some
common false positives could be excluded, but I thought that would be
more work than just reviewing them.

RANGE=$(git merge-base llvm/master origin/master)..origin/master
PAT='cc|c[^<(/.a-z]c'
git diff --name-only $RANGE >names
grep -Ei "$PAT" --color=always names | less -R
while read f; do git diff --line-prefix="$f:" $RANGE -- $f; done <names >diffs-with-filenames
GREP_COLORS=cx=37:mt= grep --color=always -B3 -A3 -Ei ":[-+].*($PAT)" diffs-with-filenames | less -R

(Then use the case-insensitive "less" search command on $PAT.)

* update-includes.py: Find the default header location relative to
update-includes.py itself.

Bonus: Make update-includes.py directly executable.

* Add new 3C readmes and a link from the root readme.

* Fix some typos and unintended non-ASCII characters in the new readmes.

* Delete more tests that are obsolete (according to Aravind).

This leaves us with the "regression tests" in clang/test/3C (which are
documented in clang/docs/checkedc/3C/CONTRIBUTING.md) as the only tests
of 3C in this source tree.  We have some external tests that we plan to
document when they are in good enough shape to be run by the public.

* New readmes: Replace two spaces at the end of a sentence with one.

I'm in the minority, and the inconsistency is bound to become
distracting.  Also, there's no good way for an automated rewrapping tool
to know whether a period at the end of a line should get one space or
two when it gets moved.  (Emacs solves this by not breaking after a
period followed by a single space, but the irregular breaking looks
terrible to me.)

* Wrap 3C Markdown text to 80 characters as requested by Microsoft.

* Add 3C .clang-format files.

* Remove a semicolon after the body of ~FunctionVariableConstraint that
confused clang-format.

* clang-format all files covered by the new 3C .clang-format files.

* clang-format 3C files in clangd.

The absence of "ReflowComments: false" (which we use for the 3C
directories) made no difference in these files.

* Format 3C code in non-3C clangd files.

In essence, I ran clang-format and kept only the changes to 3C code,
except that I did fix two problems in non-3C code that had clearly been
introduced by 3C development.

* Fix long lines in comments in 3C C++ code.

* Add a period at the end of a comment as requested by Microsoft.

* clang/lib/3C/*.cpp: Move the #include of the main module header above
other #includes of 3C headers.

This partially applies item 1 from
https://llvm.org/docs/CodingStandards.html#include-style in order to
catch missing dependencies among headers.  Full conformance to this
guideline will come later.

* Fix broken dependency of AVarGraph.h on a declaration of AVarBoundsInfo
as an incomplete type, exposed by the previous commit.

I chose to do so by including ABounds.h since that's what
AVarBoundsInfo.h does, though simply declaring AVarBoundsInfo might have
worked.

* Add missing final newlines to all 3C files.

* Fix nits in 3C .clang-format files.

* Add 3C .clang-tidy files and documentation.

* Add all readability-identifier-naming suppressions for the _3C prefix.

This and the upcoming bulk clang-tidy work will include the unmaintained
clangd3c code because it seems that a modest amount of additional work
can avoid adding many more breakages on top of the ones it already has
(mostly from the upcoming bulk rename).

* Finish applying clang-tidy to 3C (for now).

- Update clang/docs/checkedc/3C/clang-tidy.md with much more information
  about the process. Note that some warnings remain that we haven't yet
  decided what to do about.

- Disable misc-no-recursion rule for now because it looks like we make
  significant intentional use of recursion.

- Remove blank lines between #include lines so that the
  `llvm-include-order` check will sort all the lines. Minor tweaks to
  nearby formatting.

- Run `clang-tidy --fix` multiple times, then `clang-format`, then fix
  up formatting mistakes from `llvm-else-after-return` as described in
  clang-tidy.md.

- Manually rename some elements where we're not happy with what the
  default `readability-identifier-naming` fix did.

- Get the clangd3c target to build and enable it again! After clang-tidy
  fixed some broken references without being asked, there was only one
  real error left, which was straightforward to fix. We still think
  clangd3c doesn't work, but as long as it builds, it will be easier for
  us to include it in bulk code edits like this one if we want to.

* Remove unused local variables and private fields.

* 3C CONTRIBUTING.md: Add link to more information about 5C.

Co-authored-by: Michael Hicks <[email protected]>
Co-authored-by: John Kastner <[email protected]>
Co-authored-by: Shilpa Roy <[email protected]>
Co-authored-by: Aravind Machiry <[email protected]>
Co-authored-by: Aaron Eline <[email protected]>
Co-authored-by: John Kastner <[email protected]>
Co-authored-by: John Kastner <[email protected]>
Co-authored-by: Michael Hicks <[email protected]>
Co-authored-by: Machiry Aravind Kumar <[email protected]>
Machiry deleted the iss281 branch January 22, 2022 15:36
Successfully merging this pull request may close these issues:

  • Should pick constant bound, but chooses parameter (#281)

3 participants