docs_bulk issue #75

Closed
paulmoto opened this issue Jun 16, 2015 · 24 comments

@paulmoto

When doing uploads with docs_bulk, the call takes a long time, then I get 400 errors and R crashes. I can see that about 90% of my documents get indexed before the crash. Using a PUT with httr works correctly, so the files are formatted correctly. POST behaved the same way as docs_bulk; maybe docs_bulk is using POST instead of PUT? I don't see why the verb should matter, but it apparently does.
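For reference, a minimal sketch of the two code paths being compared here, assuming a local Elasticsearch on localhost:9200 and a file that is already in bulk (NDJSON) format; the file name is a placeholder:

```r
library(elastic)
library(httr)

connect()  # the package default is http://localhost:9200

# upload through the package
docs_bulk("chunk_01.json")

# the same file sent directly with httr; this is the variant that returned
# a response I could inspect
res <- PUT("http://localhost:9200/_bulk",
           body = upload_file("chunk_01.json"))
status_code(res)
```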

@sckott
Contributor

sckott commented Jun 16, 2015

Thanks for your message @paulmoto. Looking at this now; I don't think PUT vs. POST should be a problem, but I'll check.

Where is your Elasticsearch instance running?

How big is the data you're putting in?

@paulmoto
Author

Running ES 1.5.2. I've got about 3 GB total, but I've broken it into ~10 MB chunks. Digging deeper, I see character encoding issues in the files that do upload without crashing; maybe this is related? Sending ü gives me
"MapperParsingException[failed to parse [concept]]; nested: JsonParseException[Invalid UTF-8 start byte 0x96\n at [Source: [B@4601be2d; line: 1, column: 26]]; "

even though it's coming from a UTF-8 encoded source.
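A quick way to find which lines carry bytes that are not valid UTF-8 before uploading (a sketch; the file name is a placeholder, and validUTF8() needs R >= 3.3.0):

```r
lines <- readLines("chunk_01.json", warn = FALSE)
bad <- which(!validUTF8(lines))  # indices of lines with invalid UTF-8 byte sequences
lines[bad]
```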

@paulmoto
Author

The character encoding problems are probably what make it take so long, and I can (hopefully) sort the encoding out myself. The real issue here is that R crashes when the response finally comes back.

@sckott
Contributor

sckott commented Jun 16, 2015

@paulmoto regarding the encoding error: can you try uploading via curl on the command line and see if you get the same error? I'm trying to figure out whether this is a problem with the elastic package or not.

@sckott
Contributor

sckott commented Jun 16, 2015

Weird that R crashes. I haven't had that problem.

@paulmoto
Author

Using curl I get my response faster, and the character encoding issues are still there. For the files that crashed R, the response is gigantic: a giant list of 2-3 digit numbers followed by the standard response saying which documents were created. I imagine the giant list of numbers filled my memory and crashed R.

@sckott
Contributor

sckott commented Jun 16, 2015

Oh, I wonder if the output from docs_bulk() printing to the R console is what you're talking about, and whether that's the problem here: overflowing what the R console can handle, or something like that.

@sckott
Contributor

sckott commented Jun 16, 2015

Are you loading in via a data.frame or list in R, or via a file?
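For context, docs_bulk() accepts a data.frame, a list of documents, or a path to a file that is already in bulk format. A minimal sketch of the three forms; the index and type names are placeholders:

```r
library(elastic)
connect()

docs_bulk(mtcars, index = "cars", type = "records")                      # data.frame
docs_bulk(apply(mtcars, 1, as.list), index = "cars", type = "records")   # list of documents
docs_bulk("chunk_01.json")                                               # file already in bulk format
```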

@paulmoto
Author

I'm using .txt files. The character encoding issue was due to R not saving the .txt as UTF-8 encoded, so not a big deal. I'm looking into what is causing this extremely long response.

@sckott
Contributor

sckott commented Jun 16, 2015

Hmm, we don't rewrite your file if you pass in a file path (see https://github.com/ropensci/elastic/blob/master/R/docs_bulk.r#L98-L112), so we're not changing the encoding.

@paulmoto
Author

The file was generated from another Elasticsearch query, which I parsed and wrote to a text file using write() instead of writeLines(); that caused the character issue. This doesn't look like a problem with docs_bulk. There must be something in my text files causing the extremely long response, which is the real problem here.
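For anyone hitting the same thing, writing through a connection opened with an explicit encoding avoids this. A sketch, assuming `lines` holds the bulk-format lines and the file names are placeholders:

```r
# write new output as UTF-8
con <- file("chunk_01.json", open = "w", encoding = "UTF-8")
writeLines(lines, con)   # the connection re-encodes the strings to UTF-8 on write
close(con)

# or convert a file that was already written in latin1/CP1252
txt <- readLines("chunk_01.json", warn = FALSE)
writeLines(iconv(txt, from = "latin1", to = "UTF-8"),
           "chunk_01_utf8.json", useBytes = TRUE)
```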

@paulmoto
Author

The only difference between using PUT in httr and docs_bulk is that if I use PUT, I can view my response without crashing.

@sckott
Contributor

sckott commented Jun 16, 2015

Hmm, okay, I'll keep looking into that, though the documentation I've seen always uses POST.

Have you seen the docs for the bulk API (https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html), especially the section at the bottom on the size of the data? Have you played around with the size of each chunk of documents? E.g., in the docs_bulk() function you can change the chunk size with the chunk_size parameter; perhaps your chunk size is too big?
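E.g., a sketch with an arbitrary smaller chunk size; the data, index, and type names are placeholders, and chunk_size applies when the input is a data.frame or list rather than a pre-built bulk file:

```r
docs_bulk(mydata, index = "myindex", type = "mytype", chunk_size = 500)
```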

@paulmoto
Author

I figured out my issue. In jsonlite, the toJSON function drops the names of named vectors (I had to switch to using lists). This was happening for around 10% of my documents, generating lots of parsing errors, which is what made the response so long. I was able to parse the response I got out of PUT to find the errors, whereas with docs_bulk R crashed when I tried to view the response.
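A minimal illustration of the jsonlite behaviour described here:

```r
library(jsonlite)

x <- c(concept = "foo", score = "1")   # named character vector
toJSON(x)                              # ["foo","1"]  -- the names are dropped
toJSON(as.list(x))                     # {"concept":["foo"],"score":["1"]}  -- a list keeps them
toJSON(as.list(x), auto_unbox = TRUE)  # {"concept":"foo","score":"1"}
```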

@sckott
Contributor

sckott commented Jun 17, 2015

Can you share one of the documents that produced a problem, so I can see if I can fix it in the package?

@paulmoto
Author

Unfortunately, I can't share my data because it contains user-specific information. The files were about 10 MB with about 5000-7000 documents in them (approx 10% of documents didn't match my mapping). I would imagine that trying to upload any data that doesn't match the mapping would do the same thing.

@sckott
Contributor

sckott commented Jun 17, 2015

@paulmoto okay, understood about the personal data.

So when you changed to inputting R lists, docs_bulk() worked for you? Or were there still problems?

@paulmoto
Author

Yes, when everything is correct, docs_bulk() works fine.

sckott added a commit that referenced this issue Jun 17, 2015
bump 99 ver and remove date from description file
@sckott
Contributor

sckott commented Jun 17, 2015

Can you try reinstalling elastic from GitHub, then try docs_bulk() again with your data?

There is a hidden parameter in jsonlite::toJSON that lets us convert named vectors to lists, though the maintainer says he will drop it soon (jeroen/jsonlite#82), so I don't want to rely on that.

Instead, I'm just checking for a vector and converting it into a list with as.list() before we get to jsonlite::toJSON.
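Something along these lines; a sketch, not the package's exact code, and the helper name is made up:

```r
as_list_if_named_vector <- function(x) {
  if (is.vector(x) && !is.list(x) && !is.null(names(x))) as.list(x) else x
}

as_list_if_named_vector(c(a = 1, b = 2))  # becomes list(a = 1, b = 2)
as_list_if_named_vector(list(a = 1))      # returned unchanged
```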

@sckott
Contributor

sckott commented Jun 17, 2015

@paulmoto forgot to mention you; let me know if that works.

@paulmoto
Author

I wasn't using docs_bulk to upload an R object, so I doubt this would affect me at all. I had written my own (bad) code to write the text files. If you've made changes that would affect me, I can test this tomorrow.

@sckott
Contributor

sckott commented Jun 17, 2015

@paulmoto okay, so then I don't understand how you're using docs_bulk() - are you passing in file paths to the function?

@paulmoto
Author

Yes. The toJSON vector issue was why I had incorrectly formatted files, which I then passed to docs_bulk. The long response Elasticsearch generated because of those incorrectly formatted files somehow caused R to crash when I went through docs_bulk. Passing the same file with a PUT received (presumably) the same response and did not crash.

@sckott sckott self-assigned this Jun 29, 2015
@sckott
Contributor

sckott commented Jul 1, 2015

@paulmoto Can you update to httr v1 (now on CRAN) and let me know if you still get the same problem? Just checking to see if the new httr fixes this.

@sckott sckott closed this as completed Aug 15, 2015