Caching files #36
base: master
Conversation
This still needs to be added to the documentation and exposed to the roslaunch system
Codecov Report
@@            Coverage Diff             @@
##           master      #36      +/-   ##
==========================================
+ Coverage   77.45%   82.41%   +4.96%
==========================================
  Files           3        4       +1
  Lines         275      364      +89
==========================================
+ Hits          213      300      +87
- Misses         62       64       +2
I can't get the code coverage up because of #37. When those tests do start running, I expect it will be a good idea to add random strings to the end of any test text that needs to reach the server; otherwise repeated test runs will just hit the cache. There also need to be some tests that intentionally hit the cache. A sketch of the random-suffix idea follows.
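As a rough illustration of the random-suffix idea (the helper name and the use of uuid here are mine, not code from this PR):

import uuid

def unique_text(base_text):
    # Hypothetical test helper: append a random suffix so repeated test runs
    # miss the cache and actually reach the Polly service.
    return '{} {}'.format(base_text, uuid.uuid4().hex)

# a test would then pass unique_text('hello world') as the text to synthesize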
@AAlon is this something that you all are interested in? Should I keep working on it?
Definitely! Thank you for contributing. We'll be reviewing this soon.
Thanks for submitting this. I'm a fan of the approach. Just a few things that I think could show up as bugs and some suggestions.
I'll get back to you on how we can deal with test coverage after I understand #37 a bit better.
@cevans87 Thanks for the feedback, and sorry for being slow to get back on it. I have gotten rid of the size table per your recommendation and pulled some of the logic into the db class to improve readability. I have tested under both normal operation and deleting cached files from the disk without telling the db.
I just found a bug with this: if there is no network connection, then the error utterance ('the polly service cannot be reached ...') gets cached as whatever is being said. The system will then always play that error audio whenever that text is requested.
Are you actively working on that? If so, we should probably wait before re-reviewing. It'd be great if you could add some unit tests too. Thanks!
@AAlon I took a look at it a while back, but haven't had time to come up with a good fix. It seems like the way that errors are handled is a little odd. I will take a look at fixing it eventually, but it will require the errors to propagate up rather than just being handled by passing a different sound file. I actually think that would be a good thing: in many situations the current error handling would be poor for human-robot interaction. Allowing integrators (like me) to customize how errors are handled would be better. I should get to that by the middle of February at the latest; I'm pretty busy until then. I'm not sure how to add unit tests, see #37.
Want to check back in on this. |
Thank you very much for the significant contribution!
I have some comments and questions.
class DummyEngine:
    """A dummy engine which exists to facilitate testing. Can either
    be set to act as if it is connected or disconnected. Will create files where
    they are expected, but they will not be actual audio files."""

    def __init__(self):
        self.connected = True
        self.file_size = 50000

    def __call__(self, **kwargs):
        """Put a file at the specified location and return reasonable dummy
        values. If not connected, fills in the Exception fields.

        Args:
            **kwargs: dictionary with fields: output_format, voice_id, sample_rate,
                text_type, text, output_path

        Returns: A json version of a string with fields: Audio File, Audio Type,
            Exception (if there is an exception), Traceback (if there is an exception),
            and if successful Amazon Polly Response Metadata
        """
        if self.connected:
            with open(kwargs['output_path'], 'wb') as f:
                f.write(os.urandom(self.file_size))
            output_format = kwargs['OutputFormat'] if 'OutputFormat' in kwargs else 'ogg_vorbis'
            resp = json.dumps({
                'Audio File': kwargs['output_path'],
                'Audio Type': output_format,
                'Amazon Polly Response Metadata': {'some header': 'some data'}
            })
            return SynthesizerResponse(resp)
        else:
            current_dir = os.path.dirname(os.path.abspath(__file__))
            error_ogg_filename = 'connerror.ogg'
            error_details = {
                'Audio File': os.path.join(current_dir, '../src/tts/data', error_ogg_filename),
                'Audio Type': 'ogg',
                'Exception': {
                    'dummy head': 'dummy val'
                    # 'Type': str(exc_type),
                    # 'Module': exc_type.__module__,
                    # 'Name': exc_type.__name__,
                    # 'Value': str(e),
                },
                'Traceback': 'some traceback'
            }
            return SynthesizerResponse(json.dumps(error_details))

    def set_connection(self, connected):
        """Set the connection state.

        Args:
            connected: boolean, whether to act connected or not
        """
        self.connected = connected

    def set_file_sizes(self, size):
        """Set the target file size for future files in bytes.

        Args:
            size: the number of bytes to make the next files
        """
        self.file_size = size
I'm kind of on the fence about this class being here instead of being under test/, such as in test_unit_synthesizer.py.
Yep, this belongs under test/
import uuid

db = DB()
init_num_files = db.get_num_files()
Should you have an assert that init_num_files == 0? So later on, your test should be db.get_num_files() == 1 instead of db.get_num_files() == init_num_files + 1.
This may need a deletion of /tmp/polly.db at the end of each test case.
What is the right way to delete a database and all the audio files associated with it?
    self.assertEqual(db.get_num_files(), init_num_files + 1)

def test_multiple_novel(self):
It's probably better named test_multiple_unique.
    self.assertEqual(db.get_num_files(), init_num_files)

def test_no_connection_existing(self):
    from tts.db import DB
Unused import.
import json
import os

db = DB()
Unused variable.
Though, it would be better to have some tests for db.get_num_files().
    self.assertEqual(audio_file1, audio_file2)
    self.assertTrue(os.path.exists(audio_file2))

def test_no_connection_novel(self):
test_no_connection_nonexisting?
    self.assertFalse(os.path.exists(audio_file1))

def test_big_db(self):
I think test_db_size_limit would be a better name.
Thanks again for taking the time and effort to iterate on this @mjsobrep! Most of my comments are very minor, but I do have one concern around performance and SQLite I'd like to have addressed before finalizing on this implementation.
# need to look at the hash itself.
db = DB()
db_search_result = db.ex(
    'SELECT file, audio_type FROM cache WHERE hash=?', tmp_filename).fetchone()
The synthesizer shouldn't need to know SQL; ex should be private and the DB class needs to abstract queries away, via more methods like db.find_record(key), db.update_last_accessed(key), db.add(record). Maybe also rename DB to SQLiteBackedCache or something?
Alternatively, keep DB as is and create a new SQLiteBackedCache to abstract the functionality of accessing a database.
:param kw: what AmazonPolly needs to synthesize
:return: response from AmazonPolly
"""
if 'output_path' not in kw:
    tmp_filename = hashlib.md5(kw['text']).hexdigest()
    tmp_filepath = os.path.join(os.sep, 'tmp', 'voice_{}_{}'.format(tmp_filename, str(time.time())))
    tmp_filename = hashlib.md5(
Needs a better name; filename_hash maybe.
not given, the utterance is added to the cache. If a
filename is specified, then we will assume that the
file is being managed by the user and it will not
be added to the cache.

:param kw: what AmazonPolly needs to synthesize
:return: response from AmazonPolly
"""
if 'output_path' not in kw:
Would be nice to break this function into 2-3 smaller ones (perhaps at the if/else blocks)
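Something like this split is what I mean; the helper names below are hypothetical and only the top-level structure is sketched:

def _call_engine(self, **kw):
    # Dispatch at the existing check on 'output_path'.
    if 'output_path' not in kw:
        return self._synthesize_with_cache(**kw)
    return self._synthesize_to_user_path(**kw)

def _synthesize_with_cache(self, **kw):
    # hash the request, look it up in the cache, fall back to self.engine(**kw)
    ...

def _synthesize_to_user_path(self, **kw):
    # the user manages the output file, so skip the cache entirely
    return self.engine(**kw)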
tmp_filepath = os.path.join(os.sep, 'tmp', 'voice_{}_{}'.format(tmp_filename, str(time.time())))
tmp_filename = hashlib.md5(
    json.dumps(kw, sort_keys=True)).hexdigest()
tmp_filepath = os.path.join(
I know it was like this before, but the prefix tmp_ doesn't make much sense to me, especially now that we're caching things.
tmp_filename = hashlib.md5(
    json.dumps(kw, sort_keys=True)).hexdigest()
tmp_filepath = os.path.join(
    os.sep, 'tmp', 'voice_{}'.format(tmp_filename))
Use tempfile.gettempdir() instead of hardcoding.
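For example, keeping the same filename pattern:

import os
import tempfile

tmp_filepath = os.path.join(tempfile.gettempdir(), 'voice_{}'.format(tmp_filename))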
while db.get_size() > self.max_cache_bytes and db.get_num_files() > 1:
    remove_res = db.ex(
        'select file, min(last_accessed), size from cache'
    ).fetchone()
    db.remove_file(remove_res['file'])
    rospy.loginfo('removing %s to maintain cache size, new size: %i',
                  remove_res['file'], db.get_size())
This logic should be part of the DB class' add method (or some intermediary Cache class).
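As a sketch, assuming the size limit is handed to the DB/cache class rather than living on the synthesizer:

def add(self, key, file_path, audio_type, timestamp, size):
    # Insert the new record, then evict least recently used entries until the
    # cache is back under the limit (max_cache_bytes kept on this class here).
    self.ex('''insert into cache(hash, file, audio_type, last_accessed, size)
               values (?,?,?,?,?)''', key, file_path, audio_type, timestamp, size)
    while self.get_size() > self.max_cache_bytes and self.get_num_files() > 1:
        oldest = self.ex(
            'select file, min(last_accessed) from cache').fetchone()
        self.remove_file(oldest['file'])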
# TODO: add a test that deletes a file without telling the db and tries to synthesize it
if os.path.exists(db_search_result['file']):
    file_found = True
    db.ex('update cache set last_accessed=? where hash=?',
          current_time, tmp_filename)
    synth_result = PollyResponse(json.dumps({
        'Audio File': db_search_result['file'],
        'Audio Type': db_search_result['audio_type'],
        'Amazon Polly Response Metadata': ''
    }))
    rospy.loginfo('audio file was already cached at: %s',
                  db_search_result['file'])
else:
    rospy.logwarn(
        'A file in the database did not exist on the disk, removing from db')
    db.remove_file(db_search_result['file'])
if not file_found:  # haven't cached this yet
    rospy.loginfo('Caching file')
    synth_result = self.engine(**kw)
    res_dict = json.loads(synth_result.result)
    if 'Exception' not in res_dict:
        file_name = res_dict['Audio File']
        if file_name:
            file_size = os.path.getsize(file_name)
            db.ex('''insert into cache(
                hash, file, audio_type, last_accessed, size)
                values (?,?,?,?,?)''', tmp_filename, file_name,
                  res_dict['Audio Type'], current_time, file_size)
            rospy.loginfo(
                'generated new file, saved to %s and cached', file_name)
# make sure the cache hasn't grown too big
Performance concerns
- Potentially doing many more transactions than needed; only need to do at most one transaction per _call_engine call. For example, if the new file caused us to go beyond the size limit by 1000 bytes, and we need to remove the 10 oldest files in order to clear that much space, we'd be doing more than 10 transactions.
- Seems like everything is synchronous, so even if we only have one transaction (say, to insert a new entry), returning synth_result is still blocked on completion of that transaction. Using WAL and performing checkpointing in a separate thread is one way to overcome that (a rough sketch follows this list).
Performance Impact discussion
In part, the value of this node is in enabling human-robot interaction via Polly, and that heavily depends on latency; if I'm asking the robot a question and it answers 5 seconds later, that's not very useful.
We know that with this implementation there will be increased latency; the question is how much. Would it add 1ms, or 500ms? Maybe it's even less than 1ms, I really don't know. Do you?
It would feel a bit premature to me to merge this without doing a simple benchmark in which we mock out line 228 (the self.engine() call) and measure the added latency (in lieu of any other data). A rough sketch of such a benchmark follows.
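For illustration only, reusing the DummyEngine from this PR as the mock; the exact keyword arguments and the way the synthesizer instance is obtained are assumptions about the surrounding code, so adjust as needed:

import timeit

# Swap in the DummyEngine so we measure only the caching/db overhead,
# not the network call to Polly.
synthesizer.engine = DummyEngine()  # 'synthesizer' stands for the instance under test

per_call = timeit.timeit(
    lambda: synthesizer._call_engine(text='hello world', text_type='text'),
    number=100) / 100
# Note: after the first call this exercises the cache-hit path, which is the
# latency we care most about.
print('mean added latency per call: %.3f ms' % (per_call * 1000.0))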
Path forward
Generally, it seems like we're reinventing the wheel here by trying to use SQLite directly. There are dozens of libraries that implement all the convenience functionality we'd ever need (see DiskCache, sketched below) as well as performance / transaction optimizations. So to summarize:
- Why not use some of the battle-tested disk-backed caches out there? Did you choose sqlite because of lack of integration with ROS (i.e. not available via rosdep)?
- If we have to continue using SQLite directly, we need to test the effect on latency. Again, maybe there's no issue at all, but we need to make sure.
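For reference, the DiskCache route could look roughly like this; the directory, size limit, and key derivation below are placeholders, not a proposal for exact values:

import os
import tempfile

from diskcache import Cache

# Disk-backed, size-limited cache; eviction, transactions, and file handling
# are managed by the library.
cache = Cache(os.path.join(tempfile.gettempdir(), 'polly_cache'),
              size_limit=64 * 1024 * 1024,           # 64 MiB, arbitrary here
              eviction_policy='least-recently-used')

key = 'md5-of-the-synthesis-request'  # placeholder for the existing request hash
audio = cache.get(key)
if audio is None:
    audio = b'...ogg bytes from Polly...'  # placeholder for the synthesized audio
    cache.set(key, audio)                  # stored on disk, counted against size_limit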
Sorry I have been so slow to respond. I'm going to start here before going anywhere else, since the answers on this thread may change what is being done elsewhere.
I totally agree that latency is a major issue in these types of systems. SQLite is fast and I would imagine that it is at least an order of magnitude faster than making a web call that downloads a file, probably multiple orders of magnitude better. But you are correct, I don't know that for a fact.
I chose SQLite because it is built into python, it is stable, and it is well known/documented. I could certainly see another tool being used, but I don't know which other tools would check those boxes. DiskCache that you link to looks great. Is that a dependency you all are ok with?
Hey @AAlon, want to check back in on this. If you would like me to refactor this to DiskCache or make some timing tests for the current implementation, I can do either of those. DiskCache would make for cleaner more maintainable code, but not sure if that is a dependency that y'all want in here. Just let me know, I'm hoping to get some code done in a few places over the next few weeks.
I think that's fine, but we should probably take a look at the benchmarking mentioned previously.
what is "that" in "that's fine"?
Sorry, I'll be specific. @mjsobrep if you would like to implement some timing tests to provide a ground truth and test your changes here (along with possibly refactoring to DiskCache), I think that's a good start.
I'm not sure if you've seen it, but the DiskCache timing info seems promising.
class DB(object):
    """DB a class to manage the database for tracking cached files"""

    def __init__(self, db_location='/tmp/polly.db'):
tempfile.gettempdir() instead of /tmp?
def make_db(self):
    self.ex('''CREATE TABLE IF NOT EXISTS cache (
        hash text PRIMARY KEY,
        file text NOT NULL,
file is actually a filepath, correct?
Issue: #19
Description of changes:
Added support for caching files by saving them to disk and keeping track of them in an SQLite db. The cache size is tracked, and when the maximum is exceeded, the least recently used file is deleted.
TODO
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.