Releases: etiennebacher/tidypolars

tidypolars 0.18.0

31 Mar 13:44

tidypolars requires polars >= 1.10.0.

New features

  • Added support for the following functions:

  • Added support for the decreasing and na.last arguments in sort() (@Yousa-Mirage, #328).

  • Added support for the na.last and ties.method arguments in rank() (@Yousa-Mirage, #329).

  • Better error message in filter() when a condition uses = instead of == (#341).

  • count() and add_count() now work with expressions, e.g. count(mtcars, mpg + 1)
    (#346).

  • as_tibble() on grouped Polars DataFrames or LazyFrames now returns a grouped
    tibble (#348).
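
As a rough illustration of the last two items, the sketch below counts rows by an expression and converts grouped Polars data to a tibble. The object names (cars_pl, counts, grouped) are made up for this example, and it assumes polars, tidypolars and dplyr are installed.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

cars_pl <- as_polars_df(mtcars)

# count() with an expression instead of a bare column name.
counts <- cars_pl |>
  count(cyl * 2) |>
  as_tibble()

# as_tibble() on grouped Polars data keeps the grouping.
grouped <- cars_pl |>
  group_by(cyl) |>
  as_tibble()

group_vars(grouped)
```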

Bug fixes

  • Fix NA handling in cummin(), cumprod(), cumsum() (@Yousa-Mirage, #326).

  • Fix NA handling in is.finite(), is.infinite(), and is.nan() (#331).

  • In arrange(), if the data was grouped, the order was never maintained even if
    maintain_order = TRUE was passed in group_by(). This is now fixed (#332).

  • When exporting to CSV, passing null_values alone had no effect, and it could
    override an explicitly provided null_value. This is now fixed (@Yousa-Mirage, #334).

  • Fix sample() to make it work correctly (@Yousa-Mirage, #338).

  • Fix unite() behavior when na.rm = TRUE (#344).

  • Fix a bug in fill() where groups set in .by would be preserved after the
    operation (hence returning a grouped output) (#348).

tidypolars 0.17.0

12 Feb 11:03

tidypolars requires polars >= 1.9.0 and dplyr >= 1.2.0.

Breaking changes and deprecations

  • The following functions (deprecated since 0.10.0, August 2024) are now removed
    (#303):

    • describe(), use summary() instead.
    • describe_plan() and describe_optimized_plan(), use
      explain(optimized = TRUE/FALSE) instead.
  • make_unique_id() is deprecated and will be removed in a future version. This
    is because the underlying Polars function isn't guaranteed to give the same
    results across different versions. This function doesn't have a replacement in
    tidypolars (#304).

  • In partition_by_key() and partition_by_max_size() (both already deprecated
    in 0.16.0), the argument per_partition_sort_by has been removed (#322).
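
For code that still calls the removed functions, a migration could look like the sketch below. The query itself is a made-up example; only the replacement functions come from the notes above, and what explain() prints is not shown here.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

query <- as_polars_lf(mtcars) |>
  filter(cyl == 4) |>
  select(mpg, cyl)

# Instead of describe_plan(query):
explain(query, optimized = FALSE)

# Instead of describe_optimized_plan(query):
explain(query, optimized = TRUE)

# Instead of describe() on a Polars DataFrame:
summary(as_polars_df(mtcars))

res <- collect(query)
```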

New features

  • Added support for dplyr::near() (#311).

  • pivot_wider() now works with Polars LazyFrames (#318).

  • Added support for several functions implemented in dplyr 1.2.0:

    • filter_out() (#280)
    • recode_values() (#308)
    • replace_values() (#308)
    • replace_when() (#307)
    • when_any() (#306)
    • when_all() (#306)
  • separate() now supports regex in the sep argument (#320).
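
A small sketch of dplyr::near() inside filter(), assuming polars, tidypolars and dplyr are installed. The data and object names are illustrative only. It contrasts an exact comparison with a tolerance-based one on a value affected by floating-point rounding.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

# 0.1 + 0.2 is not exactly 0.3 in floating-point arithmetic.
values <- pl$DataFrame(x = c(0.1 + 0.2, 0.5))

# Exact comparison misses the first row.
exact <- values |>
  filter(x == 0.3) |>
  as_tibble()

# near() compares within a small tolerance.
close_enough <- values |>
  filter(near(x, 0.3)) |>
  as_tibble()
```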

Other changes

  • Several changes to make tidypolars more aligned with the tidyverse output
    in general (#316):

    • in count(), if sort = TRUE and there are some ties, then other variables
      are sorted in increasing order.
    • coalesce() no longer has a default argument. This was an implementation
      mistake since dplyr::coalesce() never had this argument.
    • ungroup() used to remove the group-specific attributes in the original
      grouped data, even if the result of the operation was not assigned. This is
      fixed.
    • replace_na() on a Polars DataFrame or LazyFrame now errors if replacement
      is not a list.
    • slice_*() functions on grouped data return columns in the same order as in
      the input.
    • summarize() with only NULL expressions now returns one row per unique
      group instead of the entire data.
    • unite() now returns columns in the correct order, and doesn't duplicate the
      sep in the output if some values are NA.

Bug fixes

  • bind_rows_polars() now uses input names in .id if not all inputs are named,
    for example bind_rows_polars(x1 = x1, x2, .id = "id") (#317).

tidypolars 0.16.0

21 Jan 21:39
d29609f

tidypolars requires polars >= 1.8.0.

New features

  • New function unnest_longer_polars() to unnest list-columns into rows,
    equivalent to tidyr::unnest_longer(). It supports the parameters values_to,
    indices_to, keep_empty, as well as the {col} templates for column
    naming. (#212, #281, @Yousa-Mirage)

  • New functions separate_longer_delim_polars() and separate_longer_position_polars()
    to split string columns into rows by delimiter or fixed width, equivalent to
    tidyr::separate_longer_delim() and tidyr::separate_longer_position().
    (#57, #285, @Yousa-Mirage)

  • New argument .by in fill() (this was introduced in tidyr 1.3.2). (#283)

  • wday() now supports arbitrary week_start values (1 to 7), allowing for
    custom week start days. (#292, @Yousa-Mirage)

  • Add support for argument type in nchar() (#288).

  • It is now possible to use translated functions without loading the package
    they come from. For example, the following code can run without loading
    stringr in the session:

    data |>
      mutate(y = .tp$str_extract_stringr(x, "\\d+"))

    This makes it possible to benefit from polars speed while using the
    interface of tidyverse functions, without installing additional tidyverse
    dependencies. However, it is not the recommended usage because it makes it
    harder to convert tidypolars code to run with other tidyverse-based
    backends. More information with ?.tp (#293).

  • New argument mkdir in write_parquet_polars() (this already existed in
    sink_parquet()). (#298)

  • New (experimental) function partition_by() to write partitioned output in
    sink_*() and write_*_polars(). The following functions are deprecated and
    will be removed in a future release (#299):

    • partition_by_key() can be replaced with partition_by(key =)
    • partition_by_max_size() can be replaced with partition_by(max_rows_per_file =)

Changes

  • collect() now returns a tibble instead of a data.frame, for consistency
    with other collect() methods (#273).
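
A quick way to check the new return type, as a sketch with a made-up query (assuming polars, tidypolars and dplyr are installed):

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

res <- as_polars_lf(mtcars) |>
  filter(cyl == 4) |>
  collect()

# collect() now gives a tibble rather than a plain data.frame.
inherits(res, "tbl_df")
```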

Bug fixes

  • arrange() now works with literal values, such as arrange(x, 1:2) (#296).

Documentation

  • Removed the "FAQ" vignette, which was outdated and wasn't particularly helpful.

tidypolars 0.15.1

16 Nov 12:54

tidypolars requires polars >= 1.6.0.

tidypolars 0.15.0

03 Nov 14:03
1aa2173

Breaking changes

  • For consistency with dplyr, distinct() now only keeps the selected columns.
    To keep all columns, use .keep_all = TRUE (#227, @ppanko).
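
The difference can be seen in a short sketch like the following (object names are illustrative; it assumes polars, tidypolars and dplyr are installed):

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

cars_pl <- as_polars_df(mtcars)

# Only the selected column is kept.
only_cyl <- cars_pl |>
  distinct(cyl) |>
  as_tibble()

# All columns are kept, with one row per distinct value of cyl.
all_cols <- cars_pl |>
  distinct(cyl, .keep_all = TRUE) |>
  as_tibble()
```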

New features

  • New argument mkdir in all sink_*() functions to recursively create the
    folder(s) specified in the path(s) to files (#236).

  • New functions partition_by_key() and partition_by_max_size() that can be
    used in the path argument of sink_*() functions. Those enable writing a
    LazyFrame to several files as partitioned output. See more details in
    ?sink_parquet() (#237).

  • bind_cols_polars() now works with more than two LazyFrames (#244).

  • Add support for gsub() (#250).

  • Add partial support for stringr::str_equal() (#228).

  • Add support for lubridate functions rollbackward(), rollback(), and rollforward() (#252).

  • Support stringr::fixed() in more stringr functions (#250).

  • Add support for argument ignore.case in grepl() (#251).

  • Add support for argument .keep_all in distinct() (#227, @ppanko).

Bug fixes

  • Better error message in group_by() for unsupported argument .drop (#230).

  • Better error message in group_by() when passing named expressions in ....
    dplyr supports those, but it is increasingly recommended to use the .by /
    by argument in individual functions rather than using group_by() and
    ungroup() (#238).

  • Better error message in count() when passing named expressions in ... (#239).

  • Fix bug in join_where() when all common column names between two DataFrames
    are used in the join conditions (#254).

  • Using %in% with NA now retains the NA in the data. Using %in% NA will
    error (#256).

  • Remove occasional deprecation message coming from Polars when using %in%
    (#259, @ppanko).

  • Better handling of functions prefixed with <pkg>:: (#261).

  • Fix wrong behavior of paste() and paste0() with collapse (#263).

Documentation

  • New vignette "How to benchmark tidypolars" (#232).

  • Better documentation for all read_*() and scan_*() functions (#241).

tidypolars 0.14.1

06 Aug 08:27

  • tidypolars requires polars >= 1.1.0 (#222).

Bug fixes

  • Fix a corner case when filter() was used in a custom function with missing
    arguments (#220).

  • In grepl(), the argument fixed is now used correctly (thanks @gernophil
    for the report, #223).

  • if_else() and ifelse() now work when using named arguments (#224).

tidypolars 0.14.0

22 Jul 15:33
754622c

  • tidypolars requires polars >= 1.0.0. This release of polars contains
    many breaking changes. Those should be invisible to tidypolars users, with
    the exception of deprecation messages (see below). However, if your code
    contains user-defined functions that use polars syntax, you may need to
    revise those (#194).

Deprecations and breaking changes

  • The following arguments are deprecated and will be removed in a future
    version. The recommended replacement is indicated on the right of the arrow
    (#194):

    • in compute() and collect(): streaming -> engine;
    • in read_csv_polars() and scan_csv_polars():
      • dtypes -> schema_overrides
      • reuse_downloaded -> no replacement
    • in read_ndjson_polars() and scan_ndjson_polars():
      • reuse_downloaded -> no replacement
    • in read_ipc_polars() and scan_ipc_polars():
      • memory_map -> no replacement
    • in write_csv_polars() and sink_csv():
      • null_values -> null_value
      • quote -> quote_char
    • in write_ndjson_polars():
      • pretty -> no replacement
      • row_oriented -> no replacement
    • in write_ipc_polars():
      • future -> compat_level
  • fetch() is deprecated, use head() before collect() instead (#194).

  • group_keys() now returns a tibble instead of a data.frame (#194).

  • lubridate::make_date(), lubridate::make_datetime(), and ISOdatetime()
    now error if some components go over their expected range, e.g. month = 20
    or hour = 25. Before, those functions were returning NA in this situation
    (#194).

  • summary() returns an additional row for the 50th percentile (#194).
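
For the fetch() deprecation, the suggested replacement can be sketched as below. The object names lf and preview are made up for this example, which assumes polars, tidypolars and dplyr are installed.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

lf <- as_polars_lf(mtcars)

# Instead of fetch(lf, 5): limit the query first, then collect it.
preview <- lf |>
  head(5) |>
  collect()
```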

New features

  • Added support for various lubridate functions:

    • force_tz() and with_tz() (@atsyplenkov, #170);
    • date() (@atsyplenkov, #181);
    • today() and now() (#183);
    • weeks(), days(), hours(), minutes(), seconds(), milliseconds(),
      microseconds(), nanoseconds() (#184).
  • tidypolars can now use expressions that contain non-translated functions
    if those expressions do not use columns from the data.

    Example:

    dat <- pl$DataFrame(foo = c(2, 1, 2))
    a <- c("d", "e", "f")
    dat |>
      filter(foo >= agrep("a", a))

    agrep() is not a translated function so this used to error:

    Error in `filter()`:
    ! `tidypolars` doesn't know how to translate this function: `agrep()`.
    

    However, we see that agrep("a", a) doesn't use any column but instead an
    object in the environment so it can be evaluated without caring whether
    tidypolars knows this function or not:

    shape: (1, 1)
    ┌─────┐
    │ foo │
    │ --- │
    │ f64 │
    ╞═════╡
    │ 2.0 │
    └─────┘
    

    Note that this is evaluated before running polars in the background so this
    expression can't benefit from polars parallel evaluation for instance.
    Thanks @mgacc0 for the suggestion.

  • Add support for as.Date() for character columns (#190).

  • Error messages due to untranslated functions now suggest opening an issue to
    ask for their translation (#197).

  • Add support for %>% in expressions (#200).

  • Add support for dplyr::tally() (#203).

  • count() and add_count() now warn or error when argument wt is used
    since it is not supported. The behavior depends on the global option
    tidypolars_unknown_args (#204).

  • tidypolars has experimental support for fallback to R when a function is not
    internally translated to polars syntax. The default behavior is still to
    error, but the user can now set options(tidypolars_fallback_to_r = TRUE)
    to handle those unknown functions. See ?tidypolars_options for
    details on the drawbacks of this approach (#205).

  • Large performance improvement when using selection helpers (such as
    contains()) on data with many columns (#211).

  • tidypolars now exports rules to be used with flir for detecting deprecated
    functions describe_plan() and describe_optimized_plan(). Those can be
    used in your project by following this article.
    Note that this requires flir 0.5.0.9000 or higher (#214).

Bug fixes

  • Fix behavior of mutate() and summarize() when they don't contain any
    expression (#191).

  • Fix error in count() when it includes grouping variables (#193).

  • Passing . in an anonymous function in across() now works (#216).

tidypolars 0.13.0

10 Mar 18:01

New features

  • Added support for stringr::str_replace_na() (#153).

  • Better checks for unknown and unsupported arguments in compute(),
    collect(), *_join(), pivot_*(), sink_*(), slice_sample() and
    uncount() (#158, thanks @fkohrt for the report). Now, when those
    functions receive:

    • an argument that exists in the tidyverse implementation but is not
      supported by tidypolars, they warn the user. This default behaviour can be
      changed to error instead with options(tidypolars_unknown_args = "error").
    • an argument that doesn't exist at all, they error.
  • Add support for argument explicit in tidyr::complete().

  • Add option to keep track of filenames in scan_csv_polars() (#171, @ginolhac).

  • Add partial support for seq() (argument length.out is not supported) and
    seq_len().

  • complete() now accepts named elements, e.g. complete(df, group, value = 1:4)
    (#176).

  • Add support for several lubridate functions:

    • am(), pm(), leap_year(), days_in_month() (#178).

Bug fixes

  • Fix edge cases in the tidypolars implementation of stringr::str_sub()
    and substr() compared to their original implementation (#159).

  • arrange() now places NA values last, like dplyr.

tidypolars 0.12.0

19 Nov 16:03

tidypolars requires polars >= 0.21.0.

Breaking changes

  • summarize() now drops the last group of the output by default (for
    consistency with dplyr). Previously it kept the same groups as in the input
    data (#149).

New features

  • Add support for argument .groups in summarize(). Value "rowwise" is not
    supported for now (#149).

  • Added support for dplyr::lead(). In dplyr::lead() and dplyr::lag(), the
    arguments default and order_by are now supported (#151).
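
The grouping behaviour of summarize() described above can be checked with a sketch like this one. The object names are illustrative, and it assumes polars, tidypolars and dplyr are installed.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

grouped <- as_polars_df(mtcars) |>
  group_by(cyl, am)

# Default: the last grouping variable (am) is dropped from the groups.
by_default <- grouped |>
  summarize(mean_mpg = mean(mpg))

# .groups = "drop" returns ungrouped output.
dropped <- grouped |>
  summarize(mean_mpg = mean(mpg), .groups = "drop")

# .groups = "keep" keeps the same groups as the input.
kept <- grouped |>
  summarize(mean_mpg = mean(mpg), .groups = "keep")

group_vars(by_default)
group_vars(dropped)
group_vars(kept)
```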

tidypolars 0.11.0

17 Oct 10:20

tidypolars requires polars >= 0.20.0.

Breaking changes

  • arrange() now errors with unknown variable names (like dplyr::arrange()).
    Previously, unknown variables were silently ignored. Using expressions (like
    a + b) is now accepted (#144).

  • The parameter inherit_optimization is removed from all sink_*() functions.
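
A sketch of the new arrange() behaviour, using made-up data and a deliberately non-existent column name (not_a_column). It assumes polars, tidypolars and dplyr are installed.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

df <- pl$DataFrame(a = c(3, 1, 2), b = c(0, 5, 0))

# Sorting by an expression: a + b is 3, 6, 2 for the three rows.
sorted <- df |>
  arrange(a + b) |>
  as_tibble()

# An unknown variable now raises an error instead of being ignored.
unknown <- tryCatch(
  arrange(df, not_a_column),
  error = function(e) e
)
```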

New features

  • The power operators ^ and ** now work.

  • New function sink_ndjson() to write the results of a lazy query to a NDJSON
    file without collecting it in memory.

  • inner_join() now accepts inequality joins in the by argument, including
    the following helpers: between(), overlaps(), within() (#148).
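
The power operator and an inequality join could be used as in the sketch below. The sales and tiers data are invented for illustration; the example assumes polars, tidypolars and dplyr are installed, and that the inequality is expressed with dplyr::join_by().

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

sales <- pl$DataFrame(id = 1:3, amount = c(5, 15, 25))
tiers <- pl$DataFrame(tier = c("low", "high"), min_amount = c(0, 10))

# Power operator inside mutate().
squared <- sales |>
  mutate(amount_sq = amount^2) |>
  as_tibble()

# Inequality join: each sale matches every tier whose threshold it reaches.
joined <- sales |>
  inner_join(tiers, by = join_by(amount >= min_amount)) |>
  as_tibble()
```

Here the first sale only reaches the "low" tier while the other two reach both tiers, so each of those appears twice in the joined result.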

Bug fixes

  • Using an external object in case_when(), ifelse() and if_else() now works.

  • str_sub() doesn't error anymore when start is positive and end is negative.

  • read_*_polars() functions used to return a standard data.frame by mistake.
    They now return a Polars DataFrame.

  • Using [ for subsetting in expressions now works. Thanks @ginolhac for the
    report (#141).

  • bind_cols_polars() and bind_rows_polars() now error, as originally intended,
    if elements are a mix of Polars DataFrames and LazyFrames.