docs_bulk issue #75
When doing uploads with docs_bulk, it takes a long time, then I get 400 errors and R crashes. I can see that about 90% of my documents are getting indexed before it crashes. Using a PUT with httr works correctly, so the files are formatted correctly. POST did the same thing as docs_bulk; maybe it's using POST instead of PUT? I don't see why it should matter which is used, but apparently it does.
thanks for your message @paulmoto - looking at this now. I don't think PUT vs. POST should be a problem, but I'll check. Where is your Elasticsearch instance running? How big is the data you're putting in?
Running ES 1.5.2. I've got about 3 GB total, but I've broken it into ~10 MB chunks. Digging deeper, I see character encoding issues in the files that do upload without crashing; maybe this is related? Sending ü comes back garbled even though it's coming from a UTF-8 encoded source.
The character encoding problems are probably making it take a long time, and I can (hopefully) figure those out. The real issue here is that R crashes when the response finally comes back.
@paulmoto regarding the encoding error: can you try uploading via curl on the command line and see if you get the same error? Trying to figure out whether this is a problem on the R side or in Elasticsearch itself.
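A minimal way to run that check from an R session, shelling out to curl so R's HTTP stack is bypassed entirely (host and file name are placeholders; adjust to your setup):

```r
# --data-binary preserves the newlines the bulk API requires between
# action and document lines; plain -d would strip them.
system('curl -XPOST "http://localhost:9200/_bulk" --data-binary @chunk.json')
```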
Weird that R crashes. I haven't had that problem.
Using curl I get my response faster, and the character encoding issues are still there. For the files that crashed R, the response is gigantic: a giant list of 2-3 digit numbers followed by the standard response saying which documents were created. I imagine the giant list of numbers filled my memory and crashed R.
Oh, I wonder if the output from the bulk request is what's overwhelming R.
Are you loading in via a data.frame or list in R, or via a file?
I'm using .txt files. The character encoding was due to R not saving the .txt as UTF-8 encoded; not a big deal. I'm looking into what is causing this extremely long response.
Hmm, we don't rewrite your file if you pass in a file path (see https://github.com/ropensci/elastic/blob/master/R/docs_bulk.r#L98-L112), so we're not changing the encoding.
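For reference, a sketch of the file-path route (assuming a local cluster and a pre-built newline-delimited bulk file; recent versions of elastic take a connection object first, so check ?docs_bulk for your version):

```r
library(elastic)
conn <- connect(host = "localhost", port = 9200)
# The file is streamed to Elasticsearch as-is, so its encoding and its
# action/document lines stay entirely under your control.
docs_bulk(conn, "chunk.json")
```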
The file was generated from another Elasticsearch query, which I parsed and wrote to a text file using write instead of writeLines; that's what caused the character issue. This doesn't look like a problem with docs_bulk. There's gotta be something in my text files causing the extremely long response, which is the real problem here.
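A sketch of the encoding-safe way to build such a file (file name and documents are placeholders): write through a connection opened explicitly as UTF-8, one line per action or document:

```r
# An explicit encoding stops R from re-encoding to the local locale;
# writeLines() emits one element per line, as the bulk API expects.
con <- file("chunk.json", open = "w", encoding = "UTF-8")
writeLines(c('{"index": {"_index": "myindex", "_type": "mytype"}}',
             '{"msg": "ü"}'), con)
close(con)
```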
The only difference between using PUT in httr and docs_bulk is that if I use PUT, I can view my response without crashing.
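A minimal sketch of that workaround (endpoint and file name are placeholders): send the bulk body with httr, then filter the parsed response for error items instead of printing the whole thing:

```r
library(httr)
library(jsonlite)
res <- PUT("http://localhost:9200/_bulk", body = upload_file("chunk.json"))
parsed <- fromJSON(content(res, "text"), simplifyVector = FALSE)
# Each item is keyed by its action type; this assumes "create" actions,
# so swap in it$index for "index" actions.
errs <- Filter(function(it) !is.null(it$create$error), parsed$items)
length(errs)  # count failures without dumping the full response
```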
Hmm, okay, I'll keep looking into that, though the documentation I've seen always uses POST. Have you seen the docs for the bulk API (https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html), especially the section at the bottom on size of data? Have you played around with the size of each chunk of documents, e.g., how many documents you send per request?
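When the input is an R object rather than a pre-built file, the chunking is tunable; a sketch assuming a data.frame and the chunk_size argument found in recent versions of the package (mydata and myindex are placeholders):

```r
library(elastic)
conn <- connect()
# Smaller chunks mean smaller requests and, just as important here,
# smaller responses to parse; 500 is only a starting point to tune.
docs_bulk(conn, mydata, index = "myindex", chunk_size = 500)
```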
I figured out my issue. In jsonlite, the toJSON function drops the names of named vectors (I had to switch to using lists). This was happening for around 10% of my documents, generating lots of parsing errors, which is what made the response so long. I was able to parse the response I got out of PUT to find the errors, whereas using docs_bulk I crashed when trying to view the response.
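The jsonlite behavior described here is easy to reproduce: names on an atomic vector are silently dropped, while a list keeps them as JSON object keys:

```r
library(jsonlite)
toJSON(c(a = 1, b = 2))     # [1,2]             -- names dropped
toJSON(list(a = 1, b = 2))  # {"a":[1],"b":[2]} -- names kept
```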
Can you share one of the documents that produced a problem, so I can see if I can fix it in the package?
Unfortunately, I can't share my data because it contains user-specific information. The files were about 10 MB with about 5000-7000 documents in them (approx. 10% of documents didn't match my mapping). I would imagine that trying to upload any data that doesn't match the mapping would do the same thing.
@paulmoto okay, I understand about the personal data. So when you changed to inputting R lists, did everything work as expected?
Yes, when everything is correct, docs_bulk() works fine.
bump 99 ver and remove date from description file
Can you try reinstalling? There was a hidden parameter; instead, I'm now just checking for a vector and converting it into a list.
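A sketch of that kind of guard (not necessarily the package's exact code): coerce named atomic vectors to lists before serializing so jsonlite keeps the names:

```r
library(jsonlite)
# Only atomic vectors need the coercion; lists already serialize to
# JSON objects with their names intact.
as_doc <- function(x) {
  if (is.atomic(x) && !is.null(names(x))) as.list(x) else x
}
toJSON(as_doc(c(a = 1, b = 2)))  # {"a":[1],"b":[2]}
```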
@paulmoto forgot to mention you; let me know if that works.
I wasn't using docs_bulk to upload an R object, so I doubt this would affect me at all. I had written my own (bad) code to write the text files. If you've made changes that would affect me, I can test this tomorrow.
@paulmoto okay, so then I don't understand how you're using docs_bulk. Were you passing it the files you had written yourself?
Yes. The toJSON vector issue was why I had incorrectly formatted files, which I then passed to docs_bulk. The long response Elasticsearch generated because of those incorrectly formatted files somehow caused R to crash when I passed the file through docs_bulk. Passing the same file with a PUT received (presumably) the same response and did not crash.
@paulmoto Can you update to the latest version and see if the crash is fixed?