Binary files /tmp/tmpRUv9Ck/NkvYvL1qqI/r-cran-readr-1.3.1/build/readr.pdf and /tmp/tmpRUv9Ck/seRWpQ5Yin/r-cran-readr-1.4.0/build/readr.pdf differ
Binary files /tmp/tmpRUv9Ck/NkvYvL1qqI/r-cran-readr-1.3.1/build/vignette.rds and /tmp/tmpRUv9Ck/seRWpQ5Yin/r-cran-readr-1.4.0/build/vignette.rds differ
diff -Nru r-cran-readr-1.3.1/debian/changelog r-cran-readr-1.4.0/debian/changelog
--- r-cran-readr-1.3.1/debian/changelog	2020-05-31 05:54:11.000000000 +0000
+++ r-cran-readr-1.4.0/debian/changelog	2020-10-08 22:55:14.000000000 +0000
@@ -1,14 +1,13 @@
-r-cran-readr (1.3.1-1build2) groovy; urgency=medium
+r-cran-readr (1.4.0-1) unstable; urgency=medium
 
- * No-change rebuild against r-api-4.0
+ * New upstream release
 
- -- Graham Inggs  Sun, 31 May 2020 05:54:11 +0000
+ * debian/control: Set Build-Depends: to current R version
+ * debian/control: Switch to virtual debhelper-compat (= 11)
+ * debian/compat: Removed
+ * debian/control: Add new Build-Depends: on r-cran-cpp11
 
-r-cran-readr (1.3.1-1build1) focal; urgency=medium
-
- * No-change rebuild for libgcc-s1 package name change.
-
- -- Matthias Klose  Mon, 23 Mar 2020 07:24:37 +0100
+ -- Dirk Eddelbuettel  Thu, 08 Oct 2020 17:55:14 -0500
 
 r-cran-readr (1.3.1-1) unstable; urgency=medium
diff -Nru r-cran-readr-1.3.1/debian/compat r-cran-readr-1.4.0/debian/compat
--- r-cran-readr-1.3.1/debian/compat	2018-11-24 23:43:03.000000000 +0000
+++ r-cran-readr-1.4.0/debian/compat	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-9
diff -Nru r-cran-readr-1.3.1/debian/control r-cran-readr-1.4.0/debian/control
--- r-cran-readr-1.3.1/debian/control	2018-12-25 21:33:08.000000000 +0000
+++ r-cran-readr-1.4.0/debian/control	2020-10-08 22:55:14.000000000 +0000
@@ -2,8 +2,8 @@
 Section: gnu-r
 Priority: optional
 Maintainer: Dirk Eddelbuettel
-Build-Depends: debhelper (>= 9), r-base-dev (>= 3.5.2), dh-r, r-cran-rcpp, r-cran-tibble, r-cran-hms, r-cran-r6, r-cran-bh, r-cran-clipr
-Standards-Version: 4.2.1
+Build-Depends: debhelper-compat (= 11), r-base-dev (>= 4.0.2), dh-r, r-cran-rcpp, r-cran-tibble, r-cran-hms, r-cran-r6, r-cran-bh, r-cran-clipr, r-cran-cpp11
+Standards-Version: 4.5.0
 Vcs-Browser: https://salsa.debian.org/edd/r-cran-readr
 Vcs-Git: https://salsa.debian.org/edd/r-cran-readr.git
 Homepage: https://github.com/tidyverse/readr
diff -Nru r-cran-readr-1.3.1/DESCRIPTION r-cran-readr-1.4.0/DESCRIPTION
--- r-cran-readr-1.3.1/DESCRIPTION	2018-12-21 09:40:02.000000000 +0000
+++ r-cran-readr-1.4.0/DESCRIPTION	2020-10-05 08:50:03.000000000 +0000
@@ -1,40 +1,62 @@
 Package: readr
-Version: 1.3.1
 Title: Read Rectangular Text Data
-Description: The goal of 'readr' is to provide a fast and friendly way to read
-    rectangular data (like 'csv', 'tsv', and 'fwf'). It is designed to flexibly
-    parse many types of data found in the wild, while still cleanly failing when
-    data unexpectedly changes.
-Authors@R: c(
-    person("Hadley", "Wickham", , "hadley@rstudio.com", "aut"),
-    person("Jim", "Hester", , "james.hester@rstudio.com", c("aut", "cre")),
-    person("Romain", "Francois", role = "aut"),
-    person("R Core Team", role = "ctb", comment = "Date time code adapted from R"),
-    person("RStudio", role = c("cph", "fnd")),
-    person("Jukka", "Jylänki", role = c("ctb", "cph"), comment = "grisu3 implementation"),
-    person("Mikkel", "Jørgensen", role = c("ctb", "cph"), comment = "grisu3 implementation"))
-Encoding: UTF-8
-Depends: R (>= 3.1)
-LinkingTo: Rcpp, BH
-Imports: Rcpp (>= 0.12.0.5), tibble, hms (>= 0.4.1), R6, clipr, crayon,
-    methods
-Suggests: curl, testthat, knitr, rmarkdown, stringi, covr, spelling
+Version: 1.4.0
+Authors@R:
+    c(person(given = "Hadley",
+             family = "Wickham",
+             role = "aut",
+             email = "hadley@rstudio.com"),
+      person(given = "Jim",
+             family = "Hester",
+             role = c("aut", "cre"),
+             email = "james.hester@rstudio.com"),
+      person(given = "Romain",
+             family = "Francois",
+             role = "ctb"),
+      person(given = "R Core Team",
+             role = "ctb",
+             comment = "Date time code adapted from R"),
+      person(given = "RStudio",
+             role = c("cph", "fnd")),
+      person(given = "Jukka",
+             family = "Jylänki",
+             role = c("ctb", "cph"),
+             comment = "grisu3 implementation"),
+      person(given = "Mikkel",
+             family = "Jørgensen",
+             role = c("ctb", "cph"),
+             comment = "grisu3 implementation"))
+Description: The goal of 'readr' is to provide a fast and
+    friendly way to read rectangular data (like 'csv', 'tsv', and 'fwf').
+    It is designed to flexibly parse many types of data found in the wild,
+    while still cleanly failing when data unexpectedly changes.
 License: GPL (>= 2) | file LICENSE
+URL: https://readr.tidyverse.org, https://github.com/tidyverse/readr
 BugReports: https://github.com/tidyverse/readr/issues
-URL: http://readr.tidyverse.org, https://github.com/tidyverse/readr
+Depends: R (>= 3.1)
+Imports: cli, clipr, crayon, hms (>= 0.4.1), methods, rlang, R6,
+    tibble, utils, lifecycle
+Suggests: covr, curl, dplyr, knitr, rmarkdown, spelling, stringi,
+    testthat, xml2
+LinkingTo: BH, cpp11
 VignetteBuilder: knitr
-RoxygenNote: 6.1.1
-SystemRequirements: GNU make
+Config/Needs/website: pkgdown, tidyverse/tidytemplate
+Config/testthat/edition: 3
+Config/testthat/parallel: false
+Encoding: UTF-8
 Language: en-US
+RoxygenNote: 7.1.1
+SystemRequirements: C++11
+RdMacros: lifecycle
 NeedsCompilation: yes
-Packaged: 2018-12-20 16:06:40 UTC; jhester
+Packaged: 2020-10-01 15:34:22 UTC; jhester
 Author: Hadley Wickham [aut],
     Jim Hester [aut, cre],
-    Romain Francois [aut],
+    Romain Francois [ctb],
     R Core Team [ctb] (Date time code adapted from R),
     RStudio [cph, fnd],
     Jukka Jylänki [ctb, cph] (grisu3 implementation),
     Mikkel Jørgensen [ctb, cph] (grisu3 implementation)
 Maintainer: Jim Hester
 Repository: CRAN
-Date/Publication: 2018-12-21 09:40:02 UTC
+Date/Publication: 2020-10-05 08:50:03 UTC
diff -Nru r-cran-readr-1.3.1/inst/doc/locales.html r-cran-readr-1.4.0/inst/doc/locales.html
--- r-cran-readr-1.3.1/inst/doc/locales.html	2018-12-20 16:06:32.000000000 +0000
+++ r-cran-readr-1.4.0/inst/doc/locales.html	2020-10-01 15:34:19.000000000 +0000

(Strictly speaking these are not locales in the usual technical sense of the word because they also contain information about time zones and encoding.)

To create a new locale, you use the locale() function:

- +
locale()
+#> <locale>
+#> Numbers:  123,456.78
+#> Formats:  %AD / %AT
+#> Timezone: UTC
+#> Encoding: UTF-8
+#> <date_names>
+#> Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
+#>         (Thu), Friday (Fri), Saturday (Sat)
+#> Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
+#>         June (Jun), July (Jul), August (Aug), September (Sep), October
+#>         (Oct), November (Nov), December (Dec)
+#> AM/PM:  AM/PM

The rest of this vignette explains what each of the options does.

All of the parsing functions in readr take a locale argument. You’ll most often use it with read_csv(), read_fwf() or read_table(). Readr is designed to work the same way across systems, so the default locale is English-centric, like R. If you’re not in an English-speaking country, this makes initial import a little harder, because you have to override the defaults. But the payoff is big: you can share your code and know that it will work on any other system. Base R takes a different philosophy: it uses system defaults, so typical data import is a little easier, but sharing code is harder.
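For instance, here is a small sketch (not taken from the vignette; the inline data is invented) of handing a European-style locale to one of the read_*() functions:

```r
library(readr)

# Illustrative only: a semicolon-separated "file" with decimal commas,
# passed inline as a literal string (readr treats a string containing
# a newline as literal data rather than a path)
df <- read_delim("x;y\n1,5;2,7\n", delim = ";",
  locale = locale(decimal_mark = ","))

df$x
#> [1] 1.5
```

Without the locale argument, x and y would be read as character columns, because "1,5" is not a number under the default decimal point.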

Rather than demonstrating the use of locales with read_csv() and fields, in this vignette I’m going to use the parse_*() functions. These work with a character vector instead of a file on disk, so they’re easier to use in examples. They’re also useful in their own right if you need to do custom parsing. See type_convert() if you need to apply multiple parsers to a data frame.
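As a sketch of that last point, type_convert() re-runs the column-type heuristics over the character columns of an existing data frame (the data frame here is invented for the example):

```r
library(readr)

# Hypothetical all-character data, as you might get from scraping
df <- data.frame(
  x = c("1", "2", "3"),
  y = c("2010-01-01", "2011-01-01", "2012-01-01"),
  stringsAsFactors = FALSE
)

# type_convert() applies the parse_*() heuristics column by column
df2 <- type_convert(df)
# x is now numeric and y is now a Date
```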


Names of months and days

The first argument to locale() is date_names, and it controls what values are used for month and day names. The easiest way to specify it is with an ISO 639 language code:

- +
locale("ko") # Korean
+#> <locale>
+#> Numbers:  123,456.78
+#> Formats:  %AD / %AT
+#> Timezone: UTC
+#> Encoding: UTF-8
+#> <date_names>
+#> Days:   일요일 (일), 월요일 (월), 화요일 (화), 수요일 (수), 목요일 (목), 금요일
+#>         (금), 토요일 (토)
+#> Months: 1월, 2월, 3월, 4월, 5월, 6월, 7월, 8월, 9월, 10월, 11월, 12월
+#> AM/PM:  오전/오후
+locale("fr") # French
+#> <locale>
+#> Numbers:  123,456.78
+#> Formats:  %AD / %AT
+#> Timezone: UTC
+#> Encoding: UTF-8
+#> <date_names>
+#> Days:   dimanche (dim.), lundi (lun.), mardi (mar.), mercredi (mer.), jeudi
+#>         (jeu.), vendredi (ven.), samedi (sam.)
+#> Months: janvier (janv.), février (févr.), mars (mars), avril (avr.), mai (mai),
+#>         juin (juin), juillet (juil.), août (août), septembre (sept.),
+#>         octobre (oct.), novembre (nov.), décembre (déc.)
+#> AM/PM:  AM/PM

If you don’t already know the code for your language, Wikipedia has a good list. Currently readr has 185 languages available. You can list them all with date_names_langs().
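An illustrative sketch of listing them (the exact count depends on your readr version):

```r
library(readr)

# date_names_langs() returns the ISO 639 codes readr has date names for
langs <- date_names_langs()
head(langs)

"fr" %in% langs
#> [1] TRUE
"ko" %in% langs
#> [1] TRUE
```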

Specifying a locale allows you to parse dates in other languages:

- +
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
+#> [1] "2015-01-01"
+parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
+#> [1] "1979-10-14"

For many languages, it’s common to find that diacritics have been stripped so they can be stored as ASCII. You can tell the locale that with the asciify option:

- +
parse_date("1 août 2015", "%d %B %Y", locale = locale("fr"))
+#> [1] "2015-08-01"
+parse_date("1 aout 2015", "%d %B %Y", locale = locale("fr", asciify = TRUE))
+#> [1] "2015-08-01"

Note that the quality of the translations is variable, especially for the rarer languages. If you discover that they’re not quite right for your data, you can create your own with date_names(). The following example creates a locale with Māori date names:

- +
maori <- locale(date_names(
+  day = c("Rātapu", "Rāhina", "Rātū", "Rāapa", "Rāpare", "Rāmere", "Rāhoroi"),
+  mon = c("Kohi-tātea", "Hui-tanguru", "Poutū-te-rangi", "Paenga-whāwhā",
+    "Haratua", "Pipiri", "Hōngongoi", "Here-turi-kōkā", "Mahuru",
+    "Whiringa-ā-nuku", "Whiringa-ā-rangi", "Hakihea")
+))
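That locale can then be used like any other. A sketch, repeating the definition above so it is self-contained (and assuming Kohi-tātea, the first month supplied, corresponds to January):

```r
library(readr)

maori <- locale(date_names(
  day = c("Rātapu", "Rāhina", "Rātū", "Rāapa", "Rāpare", "Rāmere", "Rāhoroi"),
  mon = c("Kohi-tātea", "Hui-tanguru", "Poutū-te-rangi", "Paenga-whāwhā",
    "Haratua", "Pipiri", "Hōngongoi", "Here-turi-kōkā", "Mahuru",
    "Whiringa-ā-nuku", "Whiringa-ā-rangi", "Hakihea")
))

# %B matches the full month names supplied above
parse_date("1 Kohi-tātea 2015", "%d %B %Y", locale = maori)
#> [1] "2015-01-01"
```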

Timezones

Unless otherwise specified, readr assumes that times are in UTC, Coordinated Universal Time (the successor to GMT, and for almost all intents and purposes identical). UTC is most suitable for data because it doesn’t have daylight saving time, which avoids a whole class of potential problems. If your data isn’t already in UTC, you’ll need to supply a tz in the locale:

- +
parse_datetime("2001-10-10 20:10")
+#> [1] "2001-10-10 20:10:00 UTC"
+parse_datetime("2001-10-10 20:10", locale = locale(tz = "Pacific/Auckland"))
+#> [1] "2001-10-10 20:10:00 NZDT"
+parse_datetime("2001-10-10 20:10", locale = locale(tz = "Europe/Dublin"))
+#> [1] "2001-10-10 20:10:00 IST"

You can see a complete list of time zones with OlsonNames().
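OlsonNames() is base R, so this can be checked without readr; a quick sketch:

```r
# OlsonNames() lists the IANA/Olson time zone database entries known to R
zones <- OlsonNames()

"Pacific/Auckland" %in% zones
#> [1] TRUE
"UTC" %in% zones
#> [1] TRUE

head(sort(zones))  # a few entries, alphabetically
```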

If you’re American, note that “EST” is a Canadian time zone that does not have DST. It’s not Eastern Standard Time! Instead use:

  • PST/PDT = “US/Pacific”
  • MST/MDT = “US/Mountain”
  • CST/CDT = “US/Central”
  • EST/EDT = “US/Eastern”

(Note that there are more specific time zones for smaller areas that don’t follow the same rules. For example, “US/Arizona” mostly follows Mountain Time but doesn’t have daylight saving time. If you’re dealing with historical data, you might need an even more specific zone like “America/North_Dakota/New_Salem” - that will get you the most accurate time zones.)

Note that these are only used as defaults. If individual times have timezones and you’re using “%Z” (as name, e.g. “America/Chicago”) or “%z” (as offset from UTC, e.g. “+0800”), they’ll override the defaults. There’s currently no good way to parse times that use US abbreviations.

Note that once you have the date in R, changing the time zone just changes its printed representation - it still represents the same instants of time. If you’ve loaded non-UTC data, and want to display it as UTC, try this snippet of code:

- +
is_datetime <- sapply(df, inherits, "POSIXct")
+df[is_datetime] <- lapply(df[is_datetime], function(x) {
+  attr(x, "tzone") <- "UTC"
+  x
+})
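That claim, that changing the time zone changes only the printed representation, can be verified in base R alone; this sketch is not part of the vignette:

```r
# The same instant, carried in two POSIXct objects with different tzone
t1 <- as.POSIXct("2001-10-10 20:10", tz = "Pacific/Auckland")
t2 <- t1
attr(t2, "tzone") <- "UTC"

# Printed representations differ...
format(t1)  # New Zealand local time
format(t2)  # the same instant, shown in UTC

# ...but the underlying seconds-since-epoch value is unchanged
identical(as.numeric(t1), as.numeric(t2))
#> [1] TRUE
```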

Default formats


Locales also provide default date and time formats. The date format is used when guessing column types. The default date format is %AD, a flexible YMD parser (see ?parse_date):

+
str(parse_guess("2010-10-10"))
+#>  Date[1:1], format: "2010-10-10"
+str(parse_guess("2010/10/10"))
+#>  Date[1:1], format: "2010-10-10"
+

If you’re an American, you might want to use your illogical date system:

+
str(parse_guess("01/31/2013"))
+#>  chr "01/31/2013"
+str(parse_guess("01/31/2013", locale = locale(date_format = "%m/%d/%Y")))
+#>  Date[1:1], format: "2013-01-31"
+

The time format is also used when guessing column types. The default time format is %AT, a flexible HMS parser (see ?parse_time):

+
str(parse_guess("17:55:14"))
+#>  'hms' num 17:55:14
+#>  - attr(*, "units")= chr "secs"
+str(parse_guess("5:55:14 PM"))
+#>  'hms' num 17:55:14
+#>  - attr(*, "units")= chr "secs"
+# Example of a non-standard time
+str(parse_guess("h5m55s14 PM"))
+#>  chr "h5m55s14 PM"
+str(parse_guess("h5m55s14 PM", locale = locale(time_format = "h%Hm%Ms%S %p")))
+#>  'hms' num 17:55:14
+#>  - attr(*, "units")= chr "secs"

Character

All readr functions yield strings encoded in UTF-8. This encoding is the most likely to give good results in the widest variety of settings. By default, readr assumes that your input is also in UTF-8. This is less likely to be the case, especially when you’re working with older datasets.

The following code illustrates the problems with encodings:

- +
library(stringi)
+x <- "Émigré cause célèbre déjà vu.\n"
+y <- stri_conv(x, "UTF-8", "latin1")
+
+# These strings look like they're identical:
+x
+#> [1] "Émigré cause célèbre déjà vu.\n"
+y
+#> [1] "Émigré cause célèbre déjà vu.\n"
+identical(x, y)
+#> [1] TRUE
+
+# But they have different encodings:
+Encoding(x)
+#> [1] "UTF-8"
+Encoding(y)
+#> [1] "latin1"
+
+# That means while they print the same, their raw (binary)
+# representation is actually quite different:
+charToRaw(x)
+#>  [1] c3 89 6d 69 67 72 c3 a9 20 63 61 75 73 65 20 63 c3 a9 6c c3 a8 62 72 65 20
+#> [26] 64 c3 a9 6a c3 a0 20 76 75 2e 0a
+charToRaw(y)
+#>  [1] c9 6d 69 67 72 e9 20 63 61 75 73 65 20 63 e9 6c e8 62 72 65 20 64 e9 6a e0
+#> [26] 20 76 75 2e 0a
+
+# readr expects strings to be encoded as UTF-8. If they're
+# not, you'll get weird characters
+parse_character(x)
+#> [1] "Émigré cause célèbre déjà vu.\n"
+parse_character(y)
+#> [1] "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu.\n"
+
+# If you know the encoding, supply it:
+parse_character(y, locale = locale(encoding = "latin1"))
+#> [1] "Émigré cause célèbre déjà vu.\n"

If you don’t know what encoding the file uses, try guess_encoding(). It’s not perfect (it’s fundamentally a heuristic), but it should at least get you pointed in the right direction:

- +
guess_encoding(x)
+#> # A tibble: 3 x 2
+#>   encoding     confidence
+#>   <chr>             <dbl>
+#> 1 UTF-8              1   
+#> 2 windows-1250       0.34
+#> 3 windows-1252       0.26
+guess_encoding(y)
+#> # A tibble: 2 x 2
+#>   encoding   confidence
+#>   <chr>           <dbl>
+#> 1 ISO-8859-2        0.4
+#> 2 ISO-8859-1        0.3
+
+# Note that the first guess produces a valid string, but isn't correct:
+parse_character(y, locale = locale(encoding = "ISO-8859-2"))
+#> [1] "Émigré cause célčbre déjŕ vu.\n"
+# But ISO-8859-1 is another name for latin1
+parse_character(y, locale = locale(encoding = "ISO-8859-1"))
+#> [1] "Émigré cause célèbre déjà vu.\n"

Numbers

Some countries use the decimal point, while others use the decimal comma. The decimal_mark option controls which readr uses when parsing doubles:

parse_double("1,23", locale = locale(decimal_mark = ","))
+#> [1] 1.23
+

Additionally, when writing out big numbers, you might have 1,000,000, 1.000.000, 1 000 000, or 1'000'000. The grouping mark is ignored by the more flexible number parser:

+
parse_number("$1,234.56")
+#> [1] 1234.56
+parse_number("$1.234,56", 
+  locale = locale(decimal_mark = ",", grouping_mark = ".")
+)
+#> [1] 1234.56
+
+# readr is smart enough to guess that if you're using , for decimals then
+# you're probably using . for grouping:
+parse_number("$1.234,56", locale = locale(decimal_mark = ","))
+#> [1] 1234.56
+ + + + - + @@ -96,9 +146,6 @@ font-size: 14px; line-height: 1.35; } -#header { -text-align: center; -} #TOC { clear: both; margin: 0 0 10px 10px; @@ -266,9 +313,13 @@ code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } -code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } +code > span.fu { color: #900; font-weight: bold; } +code > span.er { color: #a61717; background-color: #e3d2d2; } + + + @@ -299,21 +350,21 @@

Atomic vectors

parse_logical(), parse_integer(), parse_double(), and parse_character() are straightforward parsers that produce the corresponding atomic vector.

- +
parse_integer(c("1", "2", "3"))
+#> [1] 1 2 3
+parse_double(c("1.56", "2.34", "3.56"))
+#> [1] 1.56 2.34 3.56
+parse_logical(c("true", "false"))
+#> [1]  TRUE FALSE

By default, readr expects . as the decimal mark and , as the grouping mark. You can override this default using locale(), as described in vignette("locales").
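A sketch of such an override (the same examples appear in vignette("locales")):

```r
library(readr)

# Decimal comma, as used in much of Europe
parse_double("1,23", locale = locale(decimal_mark = ","))
#> [1] 1.23

# When "," is the decimal mark, readr guesses "." as the grouping mark
parse_number("1.234,56", locale = locale(decimal_mark = ","))
#> [1] 1234.56
```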

Flexible numeric parser

parse_integer() and parse_double() are strict: the input string must be a single number with no leading or trailing characters. parse_number() is more flexible: it ignores non-numeric prefixes and suffixes, and knows how to deal with grouping marks. This makes it suitable for reading currencies and percentages:

- +
parse_number(c("0%", "10%", "150%"))
+#> [1]   0  10 150
+parse_number(c("$1,234.5", "$12.45"))
+#> [1] 1234.50   12.45

Date/times

  • dates: number of days since 1970-01-01.
  • times: number of seconds since midnight.
  • datetimes: number of seconds since midnight 1970-01-01.
  • - +
    parse_datetime("2010-10-01 21:45")
    +#> [1] "2010-10-01 21:45:00 UTC"
    +parse_date("2010-10-01")
    +#> [1] "2010-10-01"
    +parse_time("1:00pm")
    +#> 13:00:00

    Each function takes a format argument which describes the format of the string. If not specified, it uses a default value:

    • parse_datetime() recognises ISO8601 datetimes.

    • parse_date() uses the date_format specified by the locale(). The default value is %AD which uses an automatic date parser that recognises dates of the format Y-m-d or Y/m/d.
    • parse_time() uses the time_format specified by the locale(). The default value is %AT which uses an automatic time parser that recognises times of the form H:M optionally followed by seconds and am/pm.

    In most cases, you will need to supply a format, as documented in parse_datetime():

    - +
    parse_datetime("1 January, 2010", "%d %B, %Y")
    +#> [1] "2010-01-01 UTC"
    +parse_datetime("02/02/15", "%m/%d/%y")
    +#> [1] "2015-02-02 UTC"

    Factors

    When reading a column that has a known set of values, you can read directly into a factor. parse_factor() will generate a warning if a value is not in the supplied levels.

    +
    parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
    +#> [1] a b a
    +#> Levels: a b c
    +parse_factor(c("a", "b", "d"), levels = c("a", "b", "c"))
    +#> Warning: 1 parsing failure.
    +#> row col           expected actual
    +#>   3  -- value in level set      d
    +#> [1] a    b    <NA>
    +#> attr(,"problems")
    +#> # A tibble: 1 x 4
    +#>     row   col expected           actual
    +#>   <int> <int> <chr>              <chr> 
    +#> 1     3    NA value in level set d     
    +#> Levels: a b c

    Column specification

    It would be tedious if you had to specify the type of every column when reading a file. Instead, readr uses some heuristics to guess the type of each column. You can access these results yourself using guess_parser():

    - +
    guess_parser(c("a", "b", "c"))
    +#> [1] "character"
    +guess_parser(c("1", "2", "3"))
    +#> [1] "double"
    +guess_parser(c("1,000", "2,000", "3,000"))
    +#> [1] "number"
    +guess_parser(c("2001/10/10"))
    +#> [1] "date"

    The guessing policies are described in the documentation for the individual functions. Guesses are fairly strict. For example, we don’t guess that currencies are numbers, even though we can parse them:

    +
    guess_parser("$1,234")
    +#> [1] "character"
    +parse_number("$1,234")
    +#> [1] 1234
    +

    There are two parsers that will never be guessed: col_skip() and col_factor(). You will always need to supply these explicitly.

    You can see the specification that readr would generate for a given file by using spec_csv(), spec_tsv() and so on:

    - +
    x <- spec_csv(readr_example("challenge.csv"))
    +#> 
    +#> ── Column specification ────────────────────────────────────────────────────────
    +#> cols(
    +#>   x = col_double(),
    +#>   y = col_logical()
    +#> )

    For bigger files, you can often make the specification simpler by changing the default column type using cols_condense()

    - +
    mtcars_spec <- spec_csv(readr_example("mtcars.csv"))
    +#> 
    +#> ── Column specification ────────────────────────────────────────────────────────
    +#> cols(
    +#>   mpg = col_double(),
    +#>   cyl = col_double(),
    +#>   disp = col_double(),
    +#>   hp = col_double(),
    +#>   drat = col_double(),
    +#>   wt = col_double(),
    +#>   qsec = col_double(),
    +#>   vs = col_double(),
    +#>   am = col_double(),
    +#>   gear = col_double(),
    +#>   carb = col_double()
    +#> )
    +mtcars_spec
    +#> cols(
    +#>   mpg = col_double(),
    +#>   cyl = col_double(),
    +#>   disp = col_double(),
    +#>   hp = col_double(),
    +#>   drat = col_double(),
    +#>   wt = col_double(),
    +#>   qsec = col_double(),
    +#>   vs = col_double(),
    +#>   am = col_double(),
    +#>   gear = col_double(),
    +#>   carb = col_double()
    +#> )
    +
    +cols_condense(mtcars_spec)
    +#> cols(
    +#>   .default = col_double()
    +#> )

    By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in challenge.csv the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:

    - +
    x <- spec_csv(readr_example("challenge.csv"), guess_max = 1001)
    +#> 
    +#> ── Column specification ────────────────────────────────────────────────────────
    +#> cols(
    +#>   x = col_double(),
    +#>   y = col_date(format = "")
    +#> )

    Another way is to manually specify the col_type, as described below.

  • read_log() for web log files
    Each of these functions first calls spec_xxx() (as described above), and then parses the file according to that column specification:

    - +
    df1 <- read_csv(readr_example("challenge.csv"))
    +#> 
    +#> ── Column specification ────────────────────────────────────────────────────────
    +#> cols(
    +#>   x = col_double(),
    +#>   y = col_logical()
    +#> )
    +#> Warning: 1000 parsing failures.
    +#>  row col           expected     actual                                                                                                                file
    +#> 1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/private/var/folders/9x/_8jnmxwj3rq1t90mlr6_0k1w0000gn/T/RtmpQzf4Ag/Rinstcd9f584b28dd/readr/extdata/challenge.csv'
    +#> 1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/private/var/folders/9x/_8jnmxwj3rq1t90mlr6_0k1w0000gn/T/RtmpQzf4Ag/Rinstcd9f584b28dd/readr/extdata/challenge.csv'
    +#> 1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/private/var/folders/9x/_8jnmxwj3rq1t90mlr6_0k1w0000gn/T/RtmpQzf4Ag/Rinstcd9f584b28dd/readr/extdata/challenge.csv'
    +#> 1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/private/var/folders/9x/_8jnmxwj3rq1t90mlr6_0k1w0000gn/T/RtmpQzf4Ag/Rinstcd9f584b28dd/readr/extdata/challenge.csv'
    +#> 1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/private/var/folders/9x/_8jnmxwj3rq1t90mlr6_0k1w0000gn/T/RtmpQzf4Ag/Rinstcd9f584b28dd/readr/extdata/challenge.csv'
    +#> .... ... .................. .......... ...................................................................................................................
    +#> See problems(...) for more details.

    The rectangular parsing functions almost always succeed; they’ll only fail if the format is severely messed up. Instead, readr will generate a data frame of problems. The first few will be printed out, and you can access them all with problems():

    - +
    problems(df1)
    +#> # A tibble: 1,000 x 5
    +#>      row col   expected       actual   file                                     
    +#>    <int> <chr> <chr>          <chr>    <chr>                                    
    +#>  1  1001 y     1/0/T/F/TRUE/… 2015-01… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  2  1002 y     1/0/T/F/TRUE/… 2018-05… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  3  1003 y     1/0/T/F/TRUE/… 2015-09… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  4  1004 y     1/0/T/F/TRUE/… 2012-11… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  5  1005 y     1/0/T/F/TRUE/… 2020-01… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  6  1006 y     1/0/T/F/TRUE/… 2016-04… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  7  1007 y     1/0/T/F/TRUE/… 2011-05… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  8  1008 y     1/0/T/F/TRUE/… 2020-07… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#>  9  1009 y     1/0/T/F/TRUE/… 2011-04… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#> 10  1010 y     1/0/T/F/TRUE/… 2010-05… '/private/var/folders/9x/_8jnmxwj3rq1t90…
    +#> # … with 990 more rows

    You’ve already seen one way of handling bad guesses: increasing the number of rows used to guess the type of each column.

    - +
    df2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
    +#> 
    +#> ── Column specification ────────────────────────────────────────────────────────
    +#> cols(
    +#>   x = col_double(),
    +#>   y = col_date(format = "")
    +#> )

    Another approach is to manually supply the column specification.

    Overriding the defaults

    In the previous examples, you may have noticed that readr printed the column specification that it used to parse the file:

    - +
    #> Parsed with column specification:
    +#> cols(
    +#>   x = col_integer(),
    +#>   y = col_character()
    +#> )

    You can also access it after the fact using spec():

    - +
    spec(df1)
    +#> cols(
    +#>   x = col_double(),
    +#>   y = col_logical()
    +#> )
    +spec(df2)
    +#> cols(
    +#>   x = col_double(),
    +#>   y = col_date(format = "")
    +#> )

    (This also allows you to access the full column specification if you’re reading a very wide file. By default, readr will only print the specification of the first 20 columns.)

    If you want to manually specify the column types, you can start by copying and pasting this code, and then tweaking it to fix the parsing problems.

    - +
    df3 <- read_csv(
    +  readr_example("challenge.csv"), 
    +  col_types = cols(
    +    x = col_double(),
    +    y = col_date(format = "")
    +  )
    +)

    In general, it’s good practice to supply an explicit column specification. It is more work, but it ensures that you get warnings if the data changes in unexpected ways. To be really strict, you can use stop_for_problems(df3). This will throw an error if there are any parsing problems, forcing you to fix those problems before proceeding with the analysis.
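A minimal sketch of that strict workflow, assuming the challenge.csv example file that ships with readr:

```r
library(readr)

# Explicit spec plus a hard failure if anything fails to parse
df <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date(format = "")
  )
)

stop_for_problems(df)  # silent when parsing is clean; errors otherwise
```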

    Available column specifications

    +

    The available specifications are: (with string abbreviations in brackets)

    • col_logical() [l], containing only T, F, TRUE or FALSE.
    • col_integer() [i], integers.
    • col_double() [d], doubles.
    • col_number() [n], numbers containing the grouping_mark.
    • col_character() [c], everything else.
    • col_factor(levels, ordered) [f], a fixed set of values.
    • col_date(format = "") [D]: with the locale’s date_format.
    • col_time(format = "") [t]: with the locale’s time_format.
    • col_datetime(format = "") [T]: ISO8601 date times.
    • col_skip() [_, -], don’t import this column.
    • col_guess() [?], parse using the “best” type based on the input.
    +

    Use the col_types argument to override the default choices. There are two ways to use it:

    +
      +
    • With a string: "dc__d": read first column as double, second as character, skip the next two and read the last column as a double. (There’s no way to use this form with types that take additional parameters.)

    • +
    • With a (named) list of col objects:

      +
      read_csv("iris.csv", col_types = cols(
      +  Sepal.Length = col_double(),
      +  Sepal.Width = col_double(),
      +  Petal.Length = col_double(),
      +  Petal.Width = col_double(),
      +  Species = col_factor(c("setosa", "versicolor", "virginica"))
      +))
      +

      Or, with their abbreviations:

      +
        read_csv("iris.csv", col_types = cols(
      +  Sepal.Length = "d",
      +  Sepal.Width = "d",
      +  Petal.Length = "d",
      +  Petal.Width = "d",
      +  Species = col_factor(c("setosa", "versicolor", "virginica"))
      +))
    • +
    +

    Any omitted columns will be parsed automatically, so the previous call will lead to the same result as:

    +
    read_csv("iris.csv", col_types = cols(
    +  Species = col_factor(c("setosa", "versicolor", "virginica")))
    +)
    +

    You can also set a default type that will be used instead of relying on the automatic detection for columns you don’t specify:

    +
    read_csv("iris.csv", col_types = cols(
    +  Species = col_factor(c("setosa", "versicolor", "virginica")),
    +  .default = col_double())
    +)
    +

    If you only want to read specified columns, use cols_only():

    +
    read_csv("iris.csv", col_types = cols_only(
    +  Species = col_factor(c("setosa", "versicolor", "virginica")))
    +)

    Output
