
Error when augmenting dataset (single machine) #2

@gnarayan

Hi Kyle,

I'm running your avocado code as part of putting together the PLAsTiCC validation paper, and I'm hitting an error while generating the augmented dataset: pandas claims the HDF5 file isn't open.

(avocado) gnarayan@dhcp194|~/work/plasticc> avocado_augment plasticc_train plasticc_augment
Loading augmentor...
Processing the dataset in 100 chunks...
Chunk:   0%|          | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/gnarayan/miniconda3/envs/avocado/bin/avocado_augment", line 4, in <module>
    __import__('pkg_resources').run_script('avocado==0.1', 'avocado_augment')
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(code, namespace, namespace)
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/avocado-0.1-py3.7.egg/EGG-INFO/scripts/avocado_augment", line 84, in <module>
    process_chunk(augmentor, chunk, args, verbose=False)
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/avocado-0.1-py3.7.egg/EGG-INFO/scripts/avocado_augment", line 16, in process_chunk
    num_chunks=args.num_chunks,
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/avocado-0.1-py3.7.egg/avocado/dataset.py", line 182, in load
    num_chunks=num_chunks, **kwargs)
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/avocado-0.1-py3.7.egg/avocado/utils.py", line 167, in read_dataframes
    key_store = store.get_storer(key)
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/pandas/io/pytables.py", line 1249, in get_storer
    group = self.get_node(key)
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/pandas/io/pytables.py", line 1239, in get_node
    self._check_if_open()
  File "/Users/gnarayan/miniconda3/envs/avocado/lib/python3.7/site-packages/pandas/io/pytables.py", line 1360, in _check_if_open
    raise ClosedFileError("{0} file is not open!".format(self._path))
pandas.io.pytables.ClosedFileError: ./data/plasticc_train.h5 file is not open!
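
For reference, pandas raises this exact exception whenever get_storer() is called on an HDFStore handle that has already been closed, so presumably the store is being closed (or never successfully opened) before read_dataframes calls get_storer(). A minimal sketch of the failure mode, with hypothetical file/key names:

import pandas as pd

# Write a tiny table-format HDF5 file to demonstrate with.
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_hdf("demo.h5", key="demo", format="table")

# Calling get_storer() on a handle that has already been closed raises
# the same pandas.io.pytables.ClosedFileError seen in the traceback above.
store = pd.HDFStore("demo.h5", mode="r")
store.close()
store.get_storer("demo")  # ClosedFileError: demo.h5 file is not open!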

Now the data very much does exist, and the files are about the right size:

(avocado) gnarayan@dhcp194|~/work/plasticc> ls -lh data/
total 57899032
drwxr-xr-x  19 gnarayan  staff   608B Sep  5 11:49 plasticc_raw
-rw-r--r--   1 gnarayan  staff    28G Sep  5 12:17 plasticc_test.h5
-rw-r--r--   1 gnarayan  staff    88M Sep  5 11:50 plasticc_train.h5
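
A quick way to double-check the file independently of avocado would be something like the following (the key names shown are just illustrative; I haven't checked exactly what avocado stores):

import pandas as pd

# Open the same file from the traceback read-only and list its contents.
with pd.HDFStore("./data/plasticc_train.h5", mode="r") as store:
    print(store.keys())  # e.g. ['/metadata', '/observations'] (illustrative)
    for key in store.keys():
        print(key, store.get_storer(key))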

I don't believe this is an environment issue, but I'm attaching the conda environment below anyway.

avocado.txt

Have you run into this before? Alternatively, if you're willing to provide a copy of the augmented dataset you used, that would work just as well. We're just trying to get a sense of classification performance on the original training sample vs. the augmented sample vs. an effectively infinite simulated sample.

Best,
-Gautham
