Skip to content

allow retry on netErrors in safe situations #871

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 29 additions & 13 deletions conn.go
Original file line number Diff line number Diff line change
Expand Up @@ -600,7 +600,7 @@ func (cn *conn) gname() string {
func (cn *conn) simpleExec(q string) (res driver.Result, commandTag string, err error) {
b := cn.writeBuf('Q')
b.string(q)
cn.send(b)
cn.mustSendRetryable(b)

for {
t, r := cn.recv1()
Expand Down Expand Up @@ -632,7 +632,7 @@ func (cn *conn) simpleQuery(q string) (res *rows, err error) {

b := cn.writeBuf('Q')
b.string(q)
cn.send(b)
cn.mustSendRetryable(b)

for {
t, r := cn.recv1()
Expand Down Expand Up @@ -765,7 +765,7 @@ func (cn *conn) prepareTo(q, stmtName string) *stmt {
b.string(st.name)

b.next('S')
cn.send(b)
cn.mustSendRetryable(b)

cn.readParseResponse()
st.paramTyps, st.colNames, st.colTyps = cn.readStatementDescribeResponse()
Expand Down Expand Up @@ -882,13 +882,29 @@ func (cn *conn) Exec(query string, args []driver.Value) (res driver.Result, err
return r, err
}

func (cn *conn) send(m *writeBuf) {
_, err := cn.c.Write(m.wrap())
func (cn *conn) send(m *writeBuf) (int, error) {
return cn.c.Write(m.wrap())
}

func (cn *conn) mustSend(m *writeBuf) {
_, err := cn.send(m)
if err != nil {
panic(err)
}
}

func (cn *conn) mustSendRetryable(m *writeBuf) {
sentBytes, err := cn.send(m)
if err != nil {
if _, ok := err.(*net.OpError); ok {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this makes something safe to retry. If you send something to a server and get a network error, it is still possible for the server to have received and processed the result, but the error happens during return to the client. In that case it's not safe to retry. Do I understand the circumstances where this can happen correctly? If so, this is closer to a "maybe it worked?" error, which means the client has to check if it is safe to retry or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review, @mjibson and @cbandy!

If you send something to a server and get a network error, it is still possible for the server to
have received and processed the result, but the error happens during return to the client.

My networking knowledge is a bit rusty but from my understanding this can not happen:

When send() is called the data is only written to the kernel's send buffer and then the call returns. The send() call does not block until the peer acknowledged receiving the data.
Therefore it can not happen that cn.send(m) returns an error because something went wrong during returning an ACK to the sender.

Also errors that are returned by send() only indicate local errors.

Quote from Linux manpage of the send() syscall:

No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.

http://man7.org/linux/man-pages/man2/send.2.html

(Golang uses internally the write() syscall to write the data to the fd of the socket. It's the same then using the send() syscall without additional flags.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of assumptions that care about how the linux kernel and go function and I'm not at all comfortable changing this until there are tests that clearly demonstrate these assumptions. Also, what about if the client and or server are on windows, mac, or some version of linux or go that doesn't happen to function this way? I'm still not convinced that it's always safe to retry if there's an error here. Distributed systems indeed have a "maybe this failed?" error because of this exact problem.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjibson ok, I can make the code more safe by narrowing the situation when a send() error is identified as retryable by:

  1. Also check if Conn.Write() returned that 0 bytes were transferred when it returned a net.OpError error.
  2. Narrow down the error types to:
  • syscall.EPIPE: "The local end has been shut down on a connection oriented socket.".
  • syscall.ENOTSOCK: "The file descriptor sockfd does not refer to a socket.".
  • syscall.EINVAL: "Invalid argument passed."
  • syscall.ECONNRESET: "Connection reset by peer."
  • syscall.EBADF: "sockfd is not a valid open file descriptor."
  1. Narrow down the error type to syscall.EPIPE only. The PR would then address specifically the problem described in db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870.

What do you think?
I would favor doing 1 and 2.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know what exactly the EPIPE error means actually happened and what didn't happen. Can it change between OS or version of linux? Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths? These questions make me very wary to do even 1 and 3.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know what exactly the EPIPE error means actually happened and what didn't happen.

You can look it up in the documentation for the related syscalls:

Posix: https://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
The documentation in the e.g. linux or freebsd manpages are more clear and they implement the Posix standard, see e.g.:
https://www.freebsd.org/cgi/man.cgi?write(2) or http://man7.org/linux/man-pages/man2/write.2.html

See for Windows: https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile

Google can also help with understanding what EPIPE means:
https://www.google.com/search?q=what+does+%22broken+pipe%22+mean&oq=what+does+%22broken+pipe%22+mean

Can it change between OS or version of linux?

The syscall API of operating systems is stable. It does not change in downwards incompatible manner. This would break tons of applications for the effected operation system.
Unix-like operating systems implement the POSIX API which specify this behavior. This includes OSX.

Windows Winsock API is not POSIX compliant.
But in this case it behaves the same, the WriteFile() returns ERROR_BROKEN_PIPE when the filedescriptor is closed. ERROR_BROKEN_PIPE is mapped to the same error then EPIPE by golang

See:
https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile
https://golang.org/src/syscall/zerrors_windows.go#L120

Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths?

Yes
See the implementation of Golangs conn.Write() and the parent methods that it calls.
https://golang.org/src/net/net.go?s=6460:6503#L175
Unix: https://golang.org/src/internal/poll/fd_unix.go#L254
Windows: https://golang.org/src/internal/poll/fd_windows.go#657

fd.Write() calls syscall.Write() in a loop until all bytes were written out.
If it fails in the first loop iteration, it only returns an error nothing was sent out in this case.
If it fails on subsequent loop iterations it returns with the error how many bytes were send out.

if sentBytes == 0 {
err = &netErrorNoWrite{err}
}
}
panic(err)
}
}

func (cn *conn) sendStartupPacket(m *writeBuf) error {
_, err := cn.c.Write((m.wrap())[1:])
return err
Expand Down Expand Up @@ -1109,7 +1125,7 @@ func (cn *conn) auth(r *readBuf, o values) {
case 3:
w := cn.writeBuf('p')
w.string(o["password"])
cn.send(w)
cn.mustSend(w)

t, r := cn.recv()
if t != 'R' {
Expand All @@ -1123,7 +1139,7 @@ func (cn *conn) auth(r *readBuf, o values) {
s := string(r.next(4))
w := cn.writeBuf('p')
w.string("md5" + md5s(md5s(o["password"]+o["user"])+s))
cn.send(w)
cn.mustSend(w)

t, r := cn.recv()
if t != 'R' {
Expand All @@ -1145,7 +1161,7 @@ func (cn *conn) auth(r *readBuf, o values) {
w.string("SCRAM-SHA-256")
w.int32(len(scOut))
w.bytes(scOut)
cn.send(w)
cn.mustSend(w)

t, r := cn.recv()
if t != 'R' {
Expand All @@ -1165,7 +1181,7 @@ func (cn *conn) auth(r *readBuf, o values) {
scOut = sc.Out()
w = cn.writeBuf('p')
w.bytes(scOut)
cn.send(w)
cn.mustSend(w)

t, r = cn.recv()
if t != 'R' {
Expand Down Expand Up @@ -1219,9 +1235,9 @@ func (st *stmt) Close() (err error) {
w := st.cn.writeBuf('C')
w.byte('S')
w.string(st.name)
st.cn.send(w)
st.cn.mustSend(w)

st.cn.send(st.cn.writeBuf('S'))
st.cn.mustSend(st.cn.writeBuf('S'))

t, _ := st.cn.recv1()
if t != '3' {
Expand Down Expand Up @@ -1299,7 +1315,7 @@ func (st *stmt) exec(v []driver.Value) {
w.int32(0)

w.next('S')
cn.send(w)
cn.mustSend(w)

cn.readBindResponse()
cn.postExecuteWorkaround()
Expand Down Expand Up @@ -1601,7 +1617,7 @@ func (cn *conn) sendBinaryModeQuery(query string, args []driver.Value) {
b.int32(0)

b.next('S')
cn.send(b)
cn.mustSendRetryable(b)
}

func (cn *conn) processParameterStatus(r *readBuf) {
Expand Down
15 changes: 15 additions & 0 deletions error.go
Original file line number Diff line number Diff line change
Expand Up @@ -460,6 +460,18 @@ func errorf(s string, args ...interface{}) {
panic(fmt.Errorf("pq: %s", fmt.Sprintf(s, args...)))
}

// NetErrorNoWrite is a network error that occured before a message that
// indicates the operation to execute was transfered to the server.
// These operations are safe to retry. This error should be replaced with
// driver.ErrBadConn before it's passed to the caller.
type netErrorNoWrite struct {
Err error
}

func (e *netErrorNoWrite) Error() string {
return "netErrorNoWrite: " + e.Err.Error()
}

// TODO(ainar-g) Rename to errorf after removing panics.
func fmterrorf(s string, args ...interface{}) error {
return fmt.Errorf("pq: %s", fmt.Sprintf(s, args...))
Expand Down Expand Up @@ -492,6 +504,9 @@ func (c *conn) errRecover(err *error) {
} else {
*err = v
}
case *netErrorNoWrite:
c.bad = true
*err = driver.ErrBadConn
case *net.OpError:
c.bad = true
*err = v
Expand Down