Skip to content

[analyzer] Refine invalidation caused by fread #93408

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jun 13, 2024
Merged

[analyzer] Refine invalidation caused by fread #93408

merged 19 commits into from
Jun 13, 2024

Conversation

steakhal
Copy link
Contributor

@steakhal steakhal commented May 26, 2024

This change enables more accurate modeling of the write effects of fread. In particular, instead of invalidating the whole buffer, in a best-effort basis, we would try to invalidate the actually accesses elements of the buffer. This preserves the previous value of the buffer of the unaffected slots. As a result, diagnose more uninitialized buffer uses for example.

Currently, this refined invalidation only triggers for fread if and only if the count parameter and the buffer pointer's index component are concrete or perfectly-constrained symbols.
Additionally, if the fread would read more than 64 elements, the whole buffer is invalidated as before. This is to have safeguards against performance issues.

Refer to the comments of the assertions in the following example to see the changes in the diagnostics:

void demo() {
  FILE *fp = fopen("/home/test", "rb+");
  if (!fp) return;
  int buffer[10]; // uninitialized
  int read_items = fread(buffer+1, sizeof(int), 5, fp);
  if (5 == read_items) {
    int v1 = buffer[1]; // Unknown value but not garbage.
    clang_analyzer_isTainted(v1); // expected-warning {{YES}} <-- Would be "NO" without this patch.
    clang_analyzer_dump(v1); // expected-warning {{conj_}} <-- Not a "derived" symbol, so it's directly invalidated now.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} <-- Had no report here before.
    (void)(v1 + v0);
  } else {
    // If 'fread' had an error.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} <-- Had no report here before.
    (void)v0;
  }
  fclose(fp);
}

CPP-3247, CPP-3802

Patch by Marco Borgeaud (marco-antognini-sonarsource)

This change enables more accurate modeling of the write effects of `fread`.
In particular, instead of invalidating the whole buffer, in a best-effort
basis, we would try to invalidate the actually accesses elements of the buffer.
This preserves the previous value of the buffer of the unaffected slots.
As a result, diagnose more uninitialized buffer uses for example.

Currently, this refined invalidation only triggers for `fread` if and
only if the `count` parameter and the buffer pointer's index component
are concrete or perfectly-constrained symbols.
Additionally, if the `fread` would read more than 64 elements, the whole
buffer is invalidated as before. This is to have safeguards against
performance issues.

Refer to the comments of the assertions in the following example to see
the changes in the diagnostics:

```c++
void demo() {
  FILE *fp = fopen("/home/test", "rb+");
  if (!fp) return;
  int buffer[10]; // uninitialized
  int read_items = fread(buffer+1, sizeof(int), 5, fp);
  if (5 == read_items) {
    int v1 = buffer[1]; // Unknown value but not garbage.
    clang_analyzer_isTainted(v1); // expected-warning {{YES}} <-- Would be "NO" without this patch.
    clang_analyzer_dump(v1); // expected-warning {{conj_}} <-- Not a "derived" symbol, so it's directly invalidated now.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} <-- Had no report here before.
    (void)(v1 + v0);
  } else {
    // If 'fread' had an error.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} <-- Had no report here before.
    (void)v0;
  }
  fclose(fp);
}
```

[CPP-3247](https://sonarsource.atlassian.net/browse/CPP-3247)

Patch by Marco Borgeaud (marco-antognini-sonarsource)
@llvmbot llvmbot added the clang Clang issues not falling into any other category label May 26, 2024
@llvmbot
Copy link
Member

llvmbot commented May 26, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-static-analyzer-1

Author: Balazs Benics (steakhal)

Changes

This change enables more accurate modeling of the write effects of fread. In particular, instead of invalidating the whole buffer, in a best-effort basis, we would try to invalidate the actually accesses elements of the buffer. This preserves the previous value of the buffer of the unaffected slots. As a result, diagnose more uninitialized buffer uses for example.

Currently, this refined invalidation only triggers for fread if and only if the count parameter and the buffer pointer's index component are concrete or perfectly-constrained symbols.
Additionally, if the fread would read more than 64 elements, the whole buffer is invalidated as before. This is to have safeguards against performance issues.

Refer to the comments of the assertions in the following example to see the changes in the diagnostics:

void demo() {
  FILE *fp = fopen("/home/test", "rb+");
  if (!fp) return;
  int buffer[10]; // uninitialized
  int read_items = fread(buffer+1, sizeof(int), 5, fp);
  if (5 == read_items) {
    int v1 = buffer[1]; // Unknown value but not garbage.
    clang_analyzer_isTainted(v1); // expected-warning {{YES}} &lt;-- Would be "NO" without this patch.
    clang_analyzer_dump(v1); // expected-warning {{conj_}} &lt;-- Not a "derived" symbol, so it's directly invalidated now.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} &lt;-- Had no report here before.
    (void)(v1 + v0);
  } else {
    // If 'fread' had an error.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} &lt;-- Had no report here before.
    (void)v0;
  }
  fclose(fp);
}

CPP-3247

Patch by Marco Borgeaud (marco-antognini-sonarsource)


Full diff: https://github.com/llvm/llvm-project/pull/93408.diff

2 Files Affected:

  • (modified) clang/lib/StaticAnalyzer/Checkers/StreamChecker.cpp (+77-11)
  • (added) clang/test/Analysis/fread.cpp (+328)
diff --git a/clang/lib/StaticAnalyzer/Checkers/StreamChecker.cpp b/clang/lib/StaticAnalyzer/Checkers/StreamChecker.cpp
index d4e020f7a72a0..7b42c4f72b322 100644
--- a/clang/lib/StaticAnalyzer/Checkers/StreamChecker.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/StreamChecker.cpp
@@ -717,18 +717,71 @@ const ExplodedNode *StreamChecker::getAcquisitionSite(const ExplodedNode *N,
   return nullptr;
 }
 
+/// Invalidate only the requested elements instead of the whole buffer.
+/// This is basically a refinement of the more generic 'escapeArgs' or
+/// the plain old 'invalidateRegions'.
+/// This only works if the \p StartIndex and \p Count are concrete or
+/// perfectly-constrained.
+static ProgramStateRef
+escapeByStartIndexAndCount(ProgramStateRef State, CheckerContext &C,
+                           const CallEvent &Call, const MemRegion *Buffer,
+                           QualType ElemType, SVal StartIndex, SVal Count) {
+  if (!llvm::isa_and_nonnull<SubRegion>(Buffer))
+    return State;
+
+  auto UnboxAsInt = [&C, &State](SVal V) -> std::optional<int64_t> {
+    auto &SVB = C.getSValBuilder();
+    if (const llvm::APSInt *Int = SVB.getKnownValue(State, V))
+      return Int->tryExtValue();
+    return std::nullopt;
+  };
+
+  auto StartIndexVal = UnboxAsInt(StartIndex);
+  auto CountVal = UnboxAsInt(Count);
+
+  // FIXME: Maybe we could make this more generic, and expose this by the
+  // 'invalidateRegions' API. After doing so, it might make sense to make this
+  // limit configurable.
+  constexpr int MaxInvalidatedElementsLimit = 64;
+  if (!StartIndexVal || !CountVal || *CountVal > MaxInvalidatedElementsLimit) {
+    return State->invalidateRegions({loc::MemRegionVal{Buffer}},
+                                    Call.getOriginExpr(), C.blockCount(),
+                                    C.getLocationContext(),
+                                    /*CausesPointerEscape=*/false);
+  }
+
+  constexpr auto DoNotInvalidateSuperRegion =
+      RegionAndSymbolInvalidationTraits::InvalidationKinds::
+          TK_DoNotInvalidateSuperRegion;
+
+  auto &RegionManager = Buffer->getMemRegionManager();
+  SmallVector<SVal> EscapingVals;
+  EscapingVals.reserve(*CountVal);
+
+  RegionAndSymbolInvalidationTraits ITraits;
+  for (auto Idx : llvm::seq(*StartIndexVal, *StartIndexVal + *CountVal)) {
+    NonLoc Index = C.getSValBuilder().makeArrayIndex(Idx);
+    const auto *Element = RegionManager.getElementRegion(
+        ElemType, Index, cast<SubRegion>(Buffer), C.getASTContext());
+    EscapingVals.push_back(loc::MemRegionVal(Element));
+    ITraits.setTrait(Element, DoNotInvalidateSuperRegion);
+  }
+  return State->invalidateRegions(EscapingVals, Call.getOriginExpr(),
+                                  C.blockCount(), C.getLocationContext(),
+                                  /*CausesPointerEscape=*/false,
+                                  /*InvalidatedSymbols=*/nullptr, &Call,
+                                  &ITraits);
+}
+
 static ProgramStateRef escapeArgs(ProgramStateRef State, CheckerContext &C,
                                   const CallEvent &Call,
                                   ArrayRef<unsigned int> EscapingArgs) {
-  const auto *CE = Call.getOriginExpr();
-
-  SmallVector<SVal> EscapingVals;
-  EscapingVals.reserve(EscapingArgs.size());
-  for (auto EscArgIdx : EscapingArgs)
-    EscapingVals.push_back(Call.getArgSVal(EscArgIdx));
-  State = State->invalidateRegions(EscapingVals, CE, C.blockCount(),
-                                   C.getLocationContext(),
-                                   /*CausesPointerEscape=*/false);
+  auto GetArgSVal = [&Call](int Idx) { return Call.getArgSVal(Idx); };
+  auto EscapingVals = to_vector(map_range(EscapingArgs, GetArgSVal));
+  State = State->invalidateRegions(EscapingVals, Call.getOriginExpr(),
+                                   C.blockCount(), C.getLocationContext(),
+                                   /*CausesPointerEscape=*/false,
+                                   /*InvalidatedSymbols=*/nullptr);
   return State;
 }
 
@@ -937,8 +990,21 @@ void StreamChecker::evalFreadFwrite(const FnDescription *Desc,
 
   // At read, invalidate the buffer in any case of error or success,
   // except if EOF was already present.
-  if (IsFread && !E.isStreamEof())
-    State = escapeArgs(State, C, Call, {0});
+  if (IsFread && !E.isStreamEof()) {
+    // Try to invalidate the individual elements.
+    if (const auto *BufferFirstElem =
+            dyn_cast_or_null<ElementRegion>(Call.getArgSVal(0).getAsRegion())) {
+      const MemRegion *Buffer = BufferFirstElem->getSuperRegion();
+      QualType ElemTy = BufferFirstElem->getElementType();
+      SVal FirstAccessedItem = BufferFirstElem->getIndex();
+      SVal ItemCount = Call.getArgSVal(2);
+      State = escapeByStartIndexAndCount(State, C, Call, Buffer, ElemTy,
+                                         FirstAccessedItem, ItemCount);
+    } else {
+      // Otherwise just fall back to invalidating the whole buffer.
+      State = escapeArgs(State, C, Call, {0});
+    }
+  }
 
   // Generate a transition for the success state.
   // If we know the state to be FEOF at fread, do not add a success state.
diff --git a/clang/test/Analysis/fread.cpp b/clang/test/Analysis/fread.cpp
new file mode 100644
index 0000000000000..2bf9baefe1395
--- /dev/null
+++ b/clang/test/Analysis/fread.cpp
@@ -0,0 +1,328 @@
+// RUN: %clang_analyze_cc1 -verify %s \
+// RUN:   -analyzer-checker=core,unix.Stream,alpha.security.taint \
+// RUN:   -analyzer-checker=debug.ExprInspection
+
+#define EOF (-1)
+
+extern "C" {
+typedef __typeof(sizeof(int)) size_t;
+typedef struct _FILE FILE;
+
+FILE *fopen(const char *filename, const char *mode);
+int fclose(FILE *stream);
+size_t fread(void *buffer, size_t size, size_t count, FILE *stream);
+int fgetc(FILE *stream);
+void *malloc(size_t size);
+}
+
+void clang_analyzer_dump(int);
+void clang_analyzer_isTainted(int);
+void clang_analyzer_warnIfReached();
+
+// A stream is only tracked by StreamChecker if it results from a call to "fopen".
+// Otherwise, there is no specific modelling of "fread".
+void untracked_stream(FILE *fp) {
+  char c;
+  if (1 == fread(&c, 1, 1, fp)) {
+    char p = c; // Unknown value but not garbage and not modeled by checker.
+  } else {
+    char p = c; // Possibly indeterminate value but not modeled by checker.
+  }
+}
+
+void fgetc_props_taint() {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    int c = fgetc(fp); // c is tainted.
+    if (c != EOF) {
+      clang_analyzer_isTainted(c); // expected-warning{{YES}}
+    }
+    fclose(fp);
+  }
+}
+
+void fread_props_taint() {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    char buffer[10];
+    int c = fread(buffer, 1, 10, fp); // c is tainted.
+    if (c != 10) {
+      // If the read failed, then the number of bytes successfully read should be tainted.
+      clang_analyzer_isTainted(c); // expected-warning{{YES}}
+    }
+    fclose(fp);
+  }
+}
+
+void read_one_byte1() {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    char c;
+    if (1 == fread(&c, 1, 1, fp)) {
+      char p = c; // Unknown value but not garbage.
+      clang_analyzer_isTainted(p); // expected-warning{{YES}}
+    } else {
+      char p = c; // Possibly indeterminate value but not modeled by checker.
+      clang_analyzer_isTainted(p); // expected-warning{{YES}}
+    }
+    fclose(fp);
+  }
+}
+
+void read_one_byte2(char *buffer) {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    if (1 == fread(buffer, 1, 1, fp)) {
+      char p = buffer[0]; // Unknown value but not garbage.
+      clang_analyzer_isTainted(p); // expected-warning{{YES}}
+    } else {
+      char p = buffer[0]; // Possibly indeterminate value but not modeled by checker.
+      clang_analyzer_isTainted(p); // expected-warning{{YES}}
+    }
+    fclose(fp);
+  }
+}
+
+void read_one_byte3(char *buffer) {
+  buffer[1] = 10;
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    // buffer[1] is not mutated by fread and remains not tainted.
+    fread(buffer, 1, 1, fp);
+    char p = buffer[1];
+    clang_analyzer_isTainted(p); // expected-warning{{NO}}
+    clang_analyzer_dump(buffer[1]); // expected-warning{{derived_}} FIXME This should be 10.
+    fclose(fp);
+  }
+}
+
+void read_many_bytes(char *buffer) {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    if (42 == fread(buffer, 1, 42, fp)) {
+      char p = buffer[0]; // Unknown value but not garbage.
+      clang_analyzer_isTainted(p); // expected-warning{{YES}}
+    } else {
+      char p = buffer[0]; // Possibly indeterminate value but not modeled.
+      clang_analyzer_isTainted(p); // expected-warning{{YES}}
+    }
+    fclose(fp);
+  }
+}
+
+void random_access_write1(int index) {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    long c[4];
+    bool success = 2 == fread(c + 1, sizeof(long), 2, fp);
+
+    switch (index) {
+    case 0:
+      // c[0] is not mutated by fread.
+      if (success) {
+        char p = c[0]; // expected-warning {{Assigned value is garbage or undefined}} We kept the first byte intact.
+      } else {
+        char p = c[0]; // expected-warning {{Assigned value is garbage or undefined}} We kept the first byte intact.
+      }
+      break;
+
+    case 1:
+      if (success) {
+        // Unknown value but not garbage.
+        clang_analyzer_isTainted(c[1]); // expected-warning {{YES}}
+        clang_analyzer_dump(c[1]); // expected-warning {{conj_}}
+      } else {
+        // Possibly indeterminate value but not modeled.
+        clang_analyzer_isTainted(c[1]); // expected-warning {{YES}}
+        clang_analyzer_dump(c[1]); // expected-warning {{conj_}}
+      }
+      break;
+
+    case 2:
+      if (success) {
+        long p = c[2]; // Unknown value but not garbage.
+        // FIXME: Taint analysis only marks the first byte of a memory region. See getPointeeOf in GenericTaintChecker.cpp.
+        clang_analyzer_isTainted(c[2]); // expected-warning {{NO}}
+        clang_analyzer_dump(c[2]); // expected-warning {{conj_}}
+      } else {
+        // Possibly indeterminate value but not modeled.
+        clang_analyzer_isTainted(c[2]); // expected-warning {{NO}} // FIXME: See above.
+        clang_analyzer_dump(c[2]); // expected-warning {{conj_}}
+      }
+      break;
+
+    case 3:
+      // c[3] is not mutated by fread.
+      if (success) {
+        long p = c[3]; // expected-warning {{Assigned value is garbage or undefined}}
+      } else {
+        long p = c[3]; // expected-warning {{Assigned value is garbage or undefined}}
+      }
+      break;
+    }
+
+    fclose(fp);
+  }
+}
+
+void random_access_write2(bool b) {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    int buffer[10];
+    int *ptr = buffer + 2;
+    if (5 == fread(ptr - 1, sizeof(int), 5, fp)) {
+      if (b) {
+        int p = buffer[1]; // Unknown value but not garbage.
+        clang_analyzer_isTainted(p); // expected-warning {{YES}}
+        clang_analyzer_dump(p); // expected-warning {{conj_}}
+      } else {
+        int p = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}}
+      }
+    } else {
+      int p = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}}
+    }
+    fclose(fp);
+  }
+}
+
+void random_access_write_symbolic_count(size_t count) {
+  // Cover a case that used to crash (symbolic count).
+  if (count > 2)
+    return;
+
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    long c[4];
+    fread(c + 1, sizeof(long), count, fp);
+
+    // c[0] and c[3] are never mutated by fread, but because "count" is a symbolic value, the checker doesn't know that.
+    long p = c[0];
+    clang_analyzer_isTainted(p); // expected-warning {{NO}}
+    clang_analyzer_dump(p); // expected-warning {{derived_}}
+
+    p = c[3];
+    clang_analyzer_isTainted(p); // expected-warning {{NO}}
+    clang_analyzer_dump(p); // expected-warning {{derived_}}
+
+    p = c[1];
+    clang_analyzer_isTainted(p); // expected-warning {{YES}}
+    clang_analyzer_dump(p); // expected-warning {{derived_}}
+
+    fclose(fp);
+  }
+}
+
+void dynamic_random_access_write(int startIndex) {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    long buffer[10];
+    // Cannot reason about index.
+    size_t res = fread(buffer + startIndex, sizeof(long), 5, fp);
+    if (5 == res) {
+      long p = buffer[startIndex];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+    } else if (res == 4) {
+      long p = buffer[startIndex];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+      p = buffer[startIndex + 1];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+      p = buffer[startIndex + 2];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+      p = buffer[startIndex + 3];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+      p = buffer[startIndex + 4];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+      p = buffer[startIndex + 5];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+      p = buffer[0];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+    } else {
+      long p = buffer[startIndex];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+      p = buffer[0];
+      clang_analyzer_isTainted(p); // expected-warning {{NO}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+    }
+    fclose(fp);
+  }
+}
+
+struct S {
+  int a;
+  long b;
+};
+
+void comopund_write1() {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    S s; // s.a is not touched by fread.
+    if (1 == fread(&s.b, sizeof(s.b), 1, fp)) {
+      long p = s.b;
+      clang_analyzer_isTainted(p); // expected-warning {{YES}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+    } else {
+      long p = s.b;
+      clang_analyzer_isTainted(p); // expected-warning {{YES}}
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+    }
+    fclose(fp);
+  }
+}
+
+void comopund_write2() {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    S s; // s.a is not touched by fread.
+    if (1 == fread(&s.b, sizeof(s.b), 1, fp)) {
+      long p = s.a; // FIXME: This should raise an uninitialized read.
+      clang_analyzer_isTainted(p); // expected-warning {{NO}} FIXME: This should be YES.
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+    } else {
+      long p = s.a; // FIXME: This should raise an uninitialized read.
+      clang_analyzer_isTainted(p); // expected-warning {{NO}} FIXME: This should be YES.
+      clang_analyzer_dump(p); // expected-warning {{conj_}}
+    }
+    fclose(fp);
+  }
+}
+
+void var_write() {
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    int a, b; // 'a' is not touched by fread.
+    if (1 == fread(&b, sizeof(b), 1, fp)) {
+      long p = a; // expected-warning{{Assigned value is garbage or undefined}}
+    } else {
+      long p = a; // expected-warning{{Assigned value is garbage or undefined}}
+    }
+    fclose(fp);
+  }
+}
+
+// When reading a lot of data, invalidating all elements is too time-consuming.
+// Instead, the knowledge of the whole array is lost.
+#define MaxInvalidatedElementRegion 64 // See StreamChecker::evalFreadFwrite in StreamChecker.cpp.
+#define PastMaxComplexity MaxInvalidatedElementRegion + 1
+void test_large_read() {
+  int buffer[PastMaxComplexity + 1];
+  buffer[PastMaxComplexity] = 42;
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    if (buffer[PastMaxComplexity] != 42) {
+      clang_analyzer_warnIfReached(); // Unreachable.
+    }
+    if (1 == fread(buffer, sizeof(int), PastMaxComplexity, fp)) {
+      if (buffer[PastMaxComplexity] != 42) {
+        clang_analyzer_warnIfReached(); // expected-warning{{REACHABLE}}
+      }
+    }
+    fclose(fp);
+  }
+}
+
+void test_small_read() {
+  int buffer[10];
+  buffer[5] = 42;
+  if (FILE *fp = fopen("/home/test", "rb+")) {
+    clang_analyzer_dump(buffer[5]); // expected-warning{{42 S32b}}
+    if (1 == fread(buffer, sizeof(int), 5, fp)) {
+      clang_analyzer_dump(buffer[5]); // expected-warning{{42 S32b}}
+    }
+    fclose(fp);
+  }
+}

Copy link
Contributor

@NagyDonat NagyDonat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for publishing this commit, it's a nice refinement of the modelling capabilities, and mostly LGTM.

In the inline comments I added some very minor remarks and one question about the handling of a fread that reads into the beginning of an array region.

Comment on lines 729 to 730
if (!llvm::isa_and_nonnull<SubRegion>(Buffer))
return State;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd slightly prefer a dyn_cast_or_null instead of this isa check that's eventually followed by a bare cast.

Comment on lines 739 to 740
auto StartIndexVal = UnboxAsInt(StartIndex);
auto CountVal = UnboxAsInt(Count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto StartIndexVal = UnboxAsInt(StartIndex);
auto CountVal = UnboxAsInt(Count);
std::optional<int64_t> StartIndexVal = UnboxAsInt(StartIndex);
std::optional<int64_t> CountVal = UnboxAsInt(Count);

The explicitly specified type would make this code easier to read (without it, my first guess was that these are int variables because the lambda is named ...AsInt).

long b;
};

void comopund_write1() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void comopund_write1() {
void compound_write1() {

Just a typo.

}
}

void comopund_write2() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void comopund_write2() {
void compound_write2() {

Comment on lines 5 to 16
#define EOF (-1)

extern "C" {
typedef __typeof(sizeof(int)) size_t;
typedef struct _FILE FILE;

FILE *fopen(const char *filename, const char *mode);
int fclose(FILE *stream);
size_t fread(void *buffer, size_t size, size_t count, FILE *stream);
int fgetc(FILE *stream);
void *malloc(size_t size);
}
Copy link
Contributor

@NagyDonat NagyDonat May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file is written as .cpp file, but it's testing plain C functionality and it uses fread, malloc etc. as if it was plain C code (under C++, <cstdio> and <cstdlib> declares these as std::fread, std::malloc...).

This doesn't cause any problems right now, but it'd be better to turn this into a .c file and include one of the "standard" system header emulator files instead of re-declaring these functions at the top of this test file.

Comment on lines 995 to 996
if (const auto *BufferFirstElem =
dyn_cast_or_null<ElementRegion>(Call.getArgSVal(0).getAsRegion())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we see an ElementRegion in the case when we're freading into the beginning of an array variable?

I see that the element region layer should be there if we did pointer arithmetic or if this is a symbolic region converted to a type; but not sure that this covers the "simply read into an array" case as well. Could you add a simple testcase that validates that the individual element invalidation activates in a situation like

int arr[10];
fread(arr, sizeof(int), 5, <FILE pointer>);

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case fread reads to the beginning of a buffer, we won't have an ElementRegion, thus the heuristic for eagerly binding the invalidated elements won't trigger. This is unfortunate, but you can think of this as we keep the previous behavior.
To circumvent this, I'd need to know the type for for the pointee.
This would imply that I should special-case TypedValueRegion and SymbolicRegion.

I'll think about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented.

EscapingVals.reserve(*CountVal);

RegionAndSymbolInvalidationTraits ITraits;
for (auto Idx : llvm::seq(*StartIndexVal, *StartIndexVal + *CountVal)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop does not work if the type of the array is not the same as the "size" parameter passed to fread.:

int buffer[100];
fread(buffer + 1, 3, 5, file);

In this case 3*5 bytes should be read by fread into the array. If sizeof(int)==4 4 elements should be invalidated in the buffer starting from index 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a problem.
To properly invalidate element-wise, I must bind the correctly typed conjured symbols so that subsequent lookups in the store would hit the entry.
Consequently, I can't consider the buffer as a char array and invalidate each chars in the range.
I must use ElementRegions, thus I need properly aligned item-wise iteration and invalidation.

So, I could try to convert the element size and item count into a start byte offset and length in bytes. After this I could check if the start bytes offset would land on an item boundary and if the length would cover a whole element object.

This way we could cover the case like:

int buffer[100]; // assuming 4 byte ints.
fread(buffer + 1, 2, 10, file); // Reads 20 bytes of data, which is dividable by 4, thus we can think of the 'buffer[1:6]' elements as individually invalidated.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented.

const MemRegion *Buffer = BufferFirstElem->getSuperRegion();
QualType ElemTy = BufferFirstElem->getElementType();
SVal FirstAccessedItem = BufferFirstElem->getIndex();
SVal ItemCount = Call.getArgSVal(2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is some redundancy in getting the argument values, SizeVal and NMembVal are already available and could be used by the function (if it would take NonLoc).

@steakhal
Copy link
Contributor Author

steakhal commented Jun 3, 2024

Fixed most NFC typos and suggestions.
Let's continue the discussion.

@steakhal steakhal requested review from NagyDonat and balazske June 3, 2024 14:08
@steakhal
Copy link
Contributor Author

steakhal commented Jun 3, 2024

Checkout the new implementation and the added "weird" fread tests.
FYI unfortunately our store interferes a bit, as you will see in the last test (test_unaligned_start_read) when the store does not purge the previous binding when we have an overlapping write to the store.
Consequently, when reading, we can read the old value - which was partially overwritten by the escape/invalidation.

Copy link
Collaborator

@balazske balazske left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks relatively good, I found only smaller issues.

auto Zero = [&SVB] {
BasicValueFactory &BVF = SVB.getBasicValueFactory();
return nonloc::ConcreteInt(BVF.getIntValue(0, /*isUnsigned=*/false));
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not better to store the Zero value in a variable instead of using this lambda?

Copy link
Contributor Author

@steakhal steakhal Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way it's lazily evaluated. I wanted to avoid sideffects inside BVF and it seemed like a plausible way achieving it.

}

void test_partial_elements_read(void) {
clang_analyzer_dump(sizeof(int)); // expected-warning {{4 S32b}}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some (buildbot) platforms are different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pinned the target triple in the RUN line to account for this.

if (20 == fread(buffer + 1, 3, 20, fp)) {
clang_analyzer_dump(buffer[0]); // expected-warning{{1 S32b}}
clang_analyzer_dump(buffer[20]); // expected-warning{{conj_}}
clang_analyzer_dump(buffer[21]); // expected-warning{{1st function call argument is an uninitialized value}}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these index values correct? buffer[15] should be the last changed item, buffer[16] should be 3, and buffer[17] is the uninitialized value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. I made a bug when calling escapeByStartIndexAndCount. I should have passed NumElementsRead instead of CountVal. Thanks for noticing this!

long buffer[10];
// Cannot reason about index.
size_t res = fread(buffer + startIndex, sizeof(long), 5, fp);
if (5 == res) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to explain why these checks are here in this way. Is it not enough to have only the longest if branch for all cases if all are the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just upstreaming this hunk. I wasn't here when it was written.
My guess what these could test is the fact that we write to the buffer from a symbolic index, up to 5 long values.
The engine should deduce from this that at the buffer[startIndex + 5] we definitely have uninitialized values - However, as it's currently implemented right now, we can't. It is because I only handle the case when the index is concrete.

W.r.t. why we have different branches depending on the return value of read, my guess would be that if we modeled that it only read 4 elements, then the buffer[startIndex + 4] and subsequent elements should be uninitialized, etc.

I'll rework the test case to covey this interpretation.

if (4 == fread(asChar + 1, 1, 4, fp)) {
void clang_analyzer_printState(void);
clang_analyzer_printState();
clang_analyzer_dump(buffer[0]); // expected-warning{{3 S32b}} FIXME Reading a 'char' should not result in a 'S32b' value.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FIXME is not understandable to me, if this is about the problem seen in the PR remark (problem with store and overlapping read) it would be better to make this comment more accurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to highlight that reading a 8 bit wide char element should never result in a value of 32 bits.
And this is casued by the current Store implementation, as: it does a lookup and it returns whatever is associated with that read start bit offset. Given that the last write the Store saw was the 32 bit wide int 3, it will just return that - no matter if that bitwidth would make sense or not.

I've added the 8 bit assumption to the FIXME. I hope it's good now.

}
}

void compound_write2(void) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me why some of the test functions are called "write" and some "read". It would be better to have all of them read in the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the author thought that the read() function call will have a write effect to the buffer.
TBH I agree with you and I'll rename these.

bool DivisibleAccessSpan = (NumBytesRead % ElemSizeInChars) == 0;
int64_t NumElementsRead = NumBytesRead / ElemSizeInChars;
constexpr int MaxInvalidatedElementsLimit = 64;
if (DivisibleAccessSpan && NumElementsRead <= MaxInvalidatedElementsLimit) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to round the NumElementsRead upwards instead of skipping the escape, in the not divisible case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied a patch to this code such that the last partial element is considered as a full access.
This is indeed better than doing a fallback and invalidating the full buffer.
Also modified the test to demonstrate this, but there is yet another bug in the current Store:
If we read from a byte offset of an element, we still get unknown as a result: int buffer[2] = {2,3} then reading from ((char*)buffer)[1] results in unknown. Compiler explorer.

@steakhal
Copy link
Contributor Author

steakhal commented Jun 7, 2024

Let's do another round.

@steakhal steakhal requested a review from balazske June 7, 2024 12:34
Copy link
Collaborator

@balazske balazske left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks now OK at least to my knowledge.

@balazske
Copy link
Collaborator

It should be possible in StreamChecker to perform the invalidations even if the StreamState is not found (stream was not opened). Another possible solution is that a generic invalidation support is added to StdLibraryFunctionsChecker that can be used for stream related and other functions.

@steakhal
Copy link
Contributor Author

Thanks for the green light!

Another possible solution is that a generic invalidation support is added to StdLibraryFunctionsChecker that can be used for stream related and other functions.

Yes, that would be so nice. However, it's out of scope for me this time.

@steakhal steakhal merged commit 69bc159 into llvm:main Jun 13, 2024
7 checks passed
@steakhal steakhal deleted the refine-fread-escape branch June 13, 2024 14:13
@vabridgers
Copy link
Contributor

vabridgers commented Jun 25, 2024

Hi @steakhal , this change seems to have exposed a div/0 error in a very particular corner we came across after integrating this change. Could you try this case to see if can repro?

The div/0 is occurring at line 1093:
bool IncompleteLastElement = (NumBytesRead % ElemSizeInChars) != 0;

clang -cc1 -analyze -analyzer-checker=unix.Stream -analyzer-opt-analyze-headers test.c

This is the test source.

# 1 "" 3
typedef FILE;
void *b;
a() {
FILE f = fopen("", "");
fread(b, 1, 1, f);
}

The crash ...
$ clang -cc1 -analyze -analyzer-checker=unix.Stream -analyzer-opt-analyze-headers test.c
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0. Program arguments: clang -cc1 -analyze -analyzer-checker=unix.Stream -analyzer-opt-analyze-headers test.c
1. <eof> parser at end of file
2. While analyzing stack:
#0 Calling a
3. :5:3: Error evaluating statement
4. :5:3: Error evaluating statement
#0 0x000000000686046a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) llvm/lib/Support/Unix/Signals.inc:723:22
#1 0x00000000068608a4 PrintStackTraceSignalHandler(void*) llvm/lib/Support/Unix/Signals.inc:798:1
#2 0x000000000685e1bd llvm::sys::RunSignalHandlers() llvm/lib/Support/Signals.cpp:105:20
#3 0x000000000685fe54 SignalHandler(int) llvm/lib/Support/Unix/Signals.inc:413:1
#4 0x00007f931d2c1630 __restore_rt sigaction.c:0:0
#5 0x0000000009fc96ff tryToInvalidateFReadBufferByElements(llvm::IntrusiveRefCntPtr<clang::ento::ProgramState const>, clang::ento::CheckerContext&, clang::ento::CallEvent const&, clang::ento::NonLoc, clang::ento::NonLoc) clang/lib/StaticAnalyzer/Checkers/StreamChecker.cpp:1093:48

@steakhal
Copy link
Contributor Author

steakhal commented Jun 25, 2024

Hi @steakhal , this change seems to have exposed by div/0 error in a very particular corner we came across after integrating this change. Could you try this case to see if can repro?

Yea, it seems to crash. I'll fix it once I have some time. Probably early next week if I can't spare some time during the weekend.
Thanks for raising this.

EDIT: The fix is proposed in #97199.

steakhal added a commit that referenced this pull request Jul 1, 2024
We can get zero type size (thus div by zero crash) if the region is for a 'void*' pointer.
In this patch, let's just override the void type with a char type to avoid the crash.

Fixes
#93408 (comment)
lravenclaw pushed a commit to lravenclaw/llvm-project that referenced this pull request Jul 3, 2024
…#97199)

We can get zero type size (thus div by zero crash) if the region is for a 'void*' pointer.
In this patch, let's just override the void type with a char type to avoid the crash.

Fixes
llvm#93408 (comment)
kbluck pushed a commit to kbluck/llvm-project that referenced this pull request Jul 6, 2024
…#97199)

We can get zero type size (thus div by zero crash) if the region is for a 'void*' pointer.
In this patch, let's just override the void type with a char type to avoid the crash.

Fixes
llvm#93408 (comment)
EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024
This change enables more accurate modeling of the write effects of
`fread`. In particular, instead of invalidating the whole buffer, in a
best-effort basis, we would try to invalidate the actually accesses
elements of the buffer. This preserves the previous value of the buffer
of the unaffected slots. As a result, diagnose more uninitialized buffer
uses for example.

Currently, this refined invalidation only triggers for `fread` if and
only if the `count` parameter and the buffer pointer's index component
are concrete or perfectly-constrained symbols.
Additionally, if the `fread` would read more than 64 elements, the whole
buffer is invalidated as before. This is to have safeguards against
performance issues.

Refer to the comments of the assertions in the following example to see
the changes in the diagnostics:

```c++
void demo() {
  FILE *fp = fopen("/home/test", "rb+");
  if (!fp) return;
  int buffer[10]; // uninitialized
  int read_items = fread(buffer+1, sizeof(int), 5, fp);
  if (5 == read_items) {
    int v1 = buffer[1]; // Unknown value but not garbage.
    clang_analyzer_isTainted(v1); // expected-warning {{YES}} <-- Would be "NO" without this patch.
    clang_analyzer_dump(v1); // expected-warning {{conj_}} <-- Not a "derived" symbol, so it's directly invalidated now.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} <-- Had no report here before.
    (void)(v1 + v0);
  } else {
    // If 'fread' had an error.
    int v0 = buffer[0]; // expected-warning {{Assigned value is garbage or undefined}} <-- Had no report here before.
    (void)v0;
  }
  fclose(fp);
}
```

CPP-3247, CPP-3802

Co-authored by Marco Borgeaud (marco-antognini-sonarsource)
EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024
steakhal added a commit that referenced this pull request Sep 12, 2024
In #93408
69bc159
I refined how invalidation is done for `fread`. It can crash, if the
"size" or "count" parameters of "fread" is a perfectly constrained
negative value. In such cases, when it will try to allocate a
SmallVector with a negative size, which will cause a crash.

To mitigate this issue, let's just guard against negative values.

CPP-3247
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:static analyzer clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants