class: center, middle, inverse, title-slide

.title[
# Stat 585 - String Manipulation
]
.author[
### Heike Hofmann and Susan Vanderplas
]

---

## Working with strings

#### Basic operations:

- separate a string into pieces
- paste strings together
- locate a search expression in a string
  - does it exist
  - at what position?
- remove or replace parts of a string
- change format: uppercase, lowercase, etc.

???

Basic things you might want to do with strings - chop it up, paste it together, search for something, change something

---

## Tidyverse packages

.right-column[
`stringr`

- Tidyverse package for string manipulation
- basic function: `str_xxx(input_string, other_args)`
- `str_replace`, `str_detect`, `str_extract`, `str_locate`, `str_trim`, `str_to_upper`...

Good set of resources: http://stringr.tidyverse.org

Cheat sheet: https://posit.co/wp-content/uploads/2022/10/strings-1.pdf
]

<img src="stringr.png" width="50%" class="left-column move-down" />

???

stringr is a tidyverse package for working with strings. Base R has lots of string functionality, but the nicest thing about stringr is that the arguments are in a predictable order, unlike in base R functions. The input string comes first; other arguments follow. So for str_replace, you'd pass in the input string, the "pattern" - what you want to replace - and the replacement. For str_trim, you pass in the input string and it will get rid of whitespace at the beginning and end. Many of these functions use similar arguments, so in general you can use help and/or tab completion to get a good idea of what arguments you need - the functions are more or less named what you'd expect them to be.

We're going to start out with basic find/replace/count tasks you might do in a text editor.

---
class:inverse
## Your Turn

https://bit.ly/stat585-passwords contains a 100K sample from a database of leaked passwords.

```r
library(stringr)
passwords <- readLines("https://bit.ly/stat585-passwords")
```

Using functions from `stringr`, answer the following questions:

- How many of the passwords have at least one space?
- What is the most common character in a password?
- What proportion of the passwords have `.`, `?`, and `!` characters?

Hint: Use "\\\\" before the character to escape "special" characters - we'll talk about those next.

???

Play around with stringr functions and see if you can complete the following tasks. To do this, I've uploaded a relatively small sample from a database of passwords exposed during security breaches. See if you can have some fun using stringr functions to examine people's passwords.

With string patterns, you do have to escape special characters like period and question mark. We'll talk about why those are special in a minute - for now, use two backslashes to escape them.
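
If it helps, here is a minimal sketch of that consistent argument order on a couple of made-up strings (the toy vector below is just for illustration, not part of the exercise):


```r
library(stringr)

toy <- c("  Walk the dog ", "feed the cat!")

str_trim(toy)                 # strip leading/trailing whitespace
str_replace(toy, "the", "a")  # replace the first match of the pattern
str_count(toy, " ")           # count how many spaces each string has
str_detect(toy, "cat")        # TRUE/FALSE: does the pattern occur?
```
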
--- ## Your Turn Solutions ```r str_detect(passwords, " ") %>% sum() ``` ``` ## [1] 444 ``` ```r str_subset(passwords, " ") ``` ``` ## [1] "(SAMEH FAHD)5750554" ## [2] "{ B" ## [3] "* FELY *" ## [4] "1 ALEF" ## [5] "1 applewhite" ## [6] "1 archambeau" ## [7] "1 BACKES" ## [8] "1 bailleul" ## [9] "1 barnabas" ## [10] "1 BARRENECHEA" ## [11] "1 belina" ## [12] "1 bomber" ## [13] "1 BYRA" ## [14] "1 CARENA" ## [15] "1 CELLAURO" ## [16] "1 CHARLETTE" ## [17] "1 CHELONIS" ## [18] "1 CIAN" ## [19] "1 CRONKHITE" ## [20] "1 CUMPSTEN" ## [21] "1 DIEURENE" ## [22] "1 DILLER" ## [23] "1 DOVEY" ## [24] "1 duer" ## [25] "1 EDILBERTO" ## [26] "1 elenbaas" ## [27] "1 ENYEART" ## [28] "1 FACCHIN" ## [29] "1 fadlan" ## [30] "1 FEENAN" ## [31] "1 fladung" ## [32] "1 FRATICELLI" ## [33] "1 GIARRATANO" ## [34] "1 gilliom" ## [35] "1 GLADE" ## [36] "1 graniero" ## [37] "1 GU" ## [38] "1 GURPINAR" ## [39] "1 hackborn" ## [40] "1 henryk" ## [41] "1 HIFNER" ## [42] "1 HORVITZ" ## [43] "1 HOSICK" ## [44] "1 hospital" ## [45] "1 hviid" ## [46] "1 illya" ## [47] "1 IRISARRI" ## [48] "1 jackye" ## [49] "1 KARLICKI" ## [50] "1 kazuteru" ## [51] "1 KIRPALANI" ## [52] "1 kolner" ## [53] "1 LAGAE" ## [54] "1 LAYKO" ## [55] "1 LOVAT" ## [56] "1 LYONNAIS" ## [57] "1 marchesoni" ## [58] "1 mariadass" ## [59] "1 MARYSTELLA" ## [60] "1 mastrodimitrio" ## [61] "1 MESAROS" ## [62] "1 minozzi" ## [63] "1 mitton" ## [64] "1 MODRELL" ## [65] "1 mokuno" ## [66] "1 moray" ## [67] "1 mozo" ## [68] "1 MRAK" ## [69] "1 naftel" ## [70] "1 navone" ## [71] "1 NERVAIZ" ## [72] "1 nilam" ## [73] "1 oblad" ## [74] "1 OCCENA" ## [75] "1 PALIZZI" ## [76] "1 PASOTTI" ## [77] "1 PED" ## [78] "1 PEGRAM" ## [79] "1 PITOTTI" ## [80] "1 potchapornkul" ## [81] "1 radziyah" ## [82] "1 RAEJEAN" ## [83] "1 RAUCH" ## [84] "1 raybould" ## [85] "1 REEDS" ## [86] "1 RODLUN" ## [87] "1 ROLOFF" ## [88] "1 ROSSETTI" ## [89] "1 rottenburger" ## [90] "1 ROWLANDS" ## [91] "1 sakuta" ## [92] "1 SALDIVAR" ## [93] "1 SAPONARO" ## [94] "1 SCALLEY" ## [95] "1 SCAMPORRINO" ## [96] "1 SCHAUSS" ## [97] "1 sentovich" ## [98] "1 shain" ## [99] "1 SHIHAR" ## [100] "1 SHOMI" ## [101] "1 SIPPY" ## [102] "1 SISCHKA" ## [103] "1 SLAWIKOWSKI" ## [104] "1 SMOLKA" ## [105] "1 SMRDU" ## [106] "1 spender" ## [107] "1 stoker" ## [108] "1 studler" ## [109] "1 STURROCK" ## [110] "1 suitter" ## [111] "1 surdez" ## [112] "1 TAMIO" ## [113] "1 TERR" ## [114] "1 thuestad" ## [115] "1 TOGNETTI" ## [116] "1 tositti" ## [117] "1 uhler" ## [118] "1 VALERO" ## [119] "1 werkiser" ## [120] "1 wisnosky" ## [121] "1 WOLLEB" ## [122] "1 WOS" ## [123] "1 yusnani" ## [124] "1 ZAMEROSKI" ## [125] "10 ALVARADO" ## [126] "11 weller" ## [127] "12 SAWYER" ## [128] "16 elsie" ## [129] "16 herrera" ## [130] "165 Light" ## [131] "18 moi" ## [132] "18 sep 2007" ## [133] "19 melody" ## [134] "1St Empires" ## [135] "1st Reliance" ## [136] "2 barcelos" ## [137] "2 borger" ## [138] "2 DOWDELL" ## [139] "2 falconer" ## [140] "2 FARRARO" ## [141] "2 galaviz" ## [142] "2 KAMIYA" ## [143] "2 kawata" ## [144] "2 MIYATA" ## [145] "2 MOULKHEIR" ## [146] "2 NYQUIST" ## [147] "2 onulak" ## [148] "2 randell" ## [149] "2 SENO" ## [150] "2 sharma" ## [151] "2 VINCI" ## [152] "2 WEISE" ## [153] "20 SNOW" ## [154] "24 de abril del 2004" ## [155] "3 bogue" ## [156] "3 FIRTZLAFF" ## [157] "3 MCVEY" ## [158] "3 MOCK" ## [159] "3 nutter" ## [160] "3 REDDY" ## [161] "3 rule" ## [162] "359 pamela" ## [163] "4 APPLEBY" ## [164] "4 GEORGINA" ## [165] "4 gustavo" ## [166] "4 justin" ## [167] "4 theroux" ## [168] "4 wayman" ## [169] "5 
BAY" ## [170] "6 GILLIAN" ## [171] "61 LEONG" ## [172] "7 AINI" ## [173] "7 thiam" ## [174] "7 winkler" ## [175] "9TH GRADE" ## [176] "A Bit Of" ## [177] "A G K" ## [178] "AA_MY BF" ## [179] "academy award songs" ## [180] "alea jacta est" ## [181] "alous 05" ## [182] "amarte siempre" ## [183] "Ann-Mari Max" ## [184] "Anna Mae" ## [185] "AQ SAYANG U" ## [186] "arkansas blues" ## [187] "ARTIE \"BLUES BOY\" WHITE" ## [188] "AUDIO ASSAULT SQUAD" ## [189] "axezah gurl" ## [190] "BABII GIRL" ## [191] "barrio fino 2" ## [192] "barry farms" ## [193] "BE MINE ONLY" ## [194] "BEN_BUSS _5" ## [195] "BETO TE AMO" ## [196] "bhe bhe ko" ## [197] "BHEZ JEN" ## [198] "big brother & holding company" ## [199] "BIG J" ## [200] "big n" ## [201] "bIG rED" ## [202] "billy vaughn" ## [203] "BLACK BOI" ## [204] "BLUE ALERT" ## [205] "BLUE OYSTER CULT" ## [206] "bonner sc" ## [207] "boy is suck" ## [208] "BRADLEY WOODS" ## [209] "briair nelson" ## [210] "bribot online" ## [211] "BUILT TO SPILL" ## [212] "cam cam1" ## [213] "Camp Favorite" ## [214] "CATS B" ## [215] "celtic 1888 rules" ## [216] "chelsey luvs u" ## [217] "cheret jelek" ## [218] "CHIN AI" ## [219] "chupete bronco" ## [220] "CIC CIC" ## [221] "clay d." ## [222] "CLINT HOLMES" ## [223] "color guard" ## [224] "crack baby" ## [225] "crazy chick26" ## [226] "CURLY CANDY BUBBLE" ## [227] "CUTE Q" ## [228] "CUTIE VINCE" ## [229] "cyrine cute08" ## [230] "D.I.Y.: BLANK GENERATION" ## [231] "Dallas Cowboys" ## [232] "DAN STEER" ## [233] "dannyboikicks ass" ## [234] "david moss" ## [235] "DAYTONA S " ## [236] "De Biase" ## [237] "dead velvet" ## [238] "demi lou" ## [239] "die thorn" ## [240] "dina sk" ## [241] "dj druok" ## [242] "dj hot" ## [243] "doggy ganster" ## [244] "DONT HATE" ## [245] "DONT AHET" ## [246] "DONT FORGET" ## [247] "dont know" ## [248] "dont you wish you was me" ## [249] "DRAGON FORCE" ## [250] "DUKE LEVINE" ## [251] "dx johncena" ## [252] "ECANOMICAL THUGS" ## [253] "EDDY E" ## [254] "eddy jr" ## [255] "eEm jHay121" ## [256] "entei adam" ## [257] "eric is cute" ## [258] "ERIC JOHNSON" ## [259] "ESTEFANY PEñA PEREZ" ## [260] "F TORRES" ## [261] "fly tye" ## [262] "FOLK 4 LIFE" ## [263] "france durand" ## [264] "FUCKING LADY" ## [265] "G AND H" ## [266] "GABRIELA GEORGIANA" ## [267] "gaby acosta" ## [268] "GARY GLITTER" ## [269] "girl rocks" ## [270] "go dreamer" ## [271] "good for you" ## [272] "GREEN NATS PEE" ## [273] "H HARGROVE" ## [274] "H I1985LJFLJF" ## [275] "hanah montana" ## [276] "hans rudi" ## [277] "HAVW A GREAT TIME O OUR DITE" ## [278] "hello daddy" ## [279] "Hello, world!" 
## [280] "here icy" ## [281] "HI HOE" ## [282] "hi joy" ## [283] "Hidden Germ" ## [284] "hing sokrim2008-1986" ## [285] "how about no" ## [286] "hsm hsm" ## [287] "I G Gold" ## [288] "I LOVE JARROD" ## [289] "i love marilyn" ## [290] "I LOVE STARS" ## [291] "i loveash" ## [292] "I LOVEYOUR" ## [293] "i lv bebo" ## [294] "ice lemonade" ## [295] "if loves1" ## [296] "ilove password" ## [297] "im goin aroun" ## [298] "IM HOT AND YOUR NOT" ## [299] "IRINA K" ## [300] "IZZY KHAN" ## [301] "J C Enterprises" ## [302] "J S Serivces" ## [303] "JACOB'S MOUSE" ## [304] "jalisa roberson" ## [305] "jan alfred" ## [306] "jasiel robinson" ## [307] "JE-AN HALLEY" ## [308] "JERALD RAY" ## [309] "JIM BEAM" ## [310] "JNNB JH" ## [311] "jong hi" ## [312] "JOrdan J" ## [313] "juicy and poko" ## [314] "kill gangster" ## [315] "KILL HIM" ## [316] "king kobra" ## [317] "KIRSTY MACCOLL" ## [318] "kitty claws" ## [319] "kona shred" ## [320] "L B BRIDAL" ## [321] "LA DISTURBI" ## [322] "LA FLOW" ## [323] "Lana Schneider" ## [324] "laura and john" ## [325] "liana del" ## [326] "LIEBE DICH" ## [327] "light hope" ## [328] "LIIL CMC" ## [329] "lil freak" ## [330] "LIL JOKER" ## [331] "lil maki b" ## [332] "LIL RARA" ## [333] "lil thang" ## [334] "LITTLE M" ## [335] "liverpool lfc" ## [336] "lonesome sundown" ## [337] "love babe" ## [338] "love dinyal" ## [339] "love is life" ## [340] "love kass" ## [341] "love ly" ## [342] "LUPH AA" ## [343] "m m cleaning" ## [344] "MA LIFE WITHOUT ME" ## [345] "macho men" ## [346] "Mafija Corleone" ## [347] "maggie moo" ## [348] "MAHA CUTE" ## [349] "manhattan on the rocks" ## [350] "mario alberto" ## [351] "MARK CURRY" ## [352] "Mc Ginn" ## [353] "ME THE BEST" ## [354] "MEGAN 3" ## [355] "MICHAEL PENN" ## [356] "MIGHTY LEMON DROPS" ## [357] "mind nun" ## [358] "mo anam cara" ## [359] "mr warning alexis" ## [360] "ms horton" ## [361] "my gawdd" ## [362] "MYHUM P5" ## [363] "Naughty but nice12" ## [364] "nicco j" ## [365] "NICE NICOLE" ## [366] "niña angelical" ## [367] "NO DATAZUG_TRAIN1991" ## [368] "no love life" ## [369] "NOEL AN" ## [370] "NON-STOP DANCE HITS" ## [371] "NORMAN KATZ" ## [372] "not you 6ut me" ## [373] "nota loka" ## [374] "ODD SOCKS" ## [375] "Omegaj 76" ## [376] "OTIS GRIM" ## [377] "P A CLIENT" ## [378] "P C S" ## [379] "PANFILA PANFIS" ## [380] "panic atthedisco" ## [381] "parker lessig" ## [382] "PERE UBU" ## [383] "PETE MONIKA" ## [384] "peti 345" ## [385] "phleg camp" ## [386] "PHUNKE ASSFALT" ## [387] "PK AFONSO" ## [388] "poop head" ## [389] "praja kelana" ## [390] "pretty tasha" ## [391] "red 33" ## [392] "red hot chilis" ## [393] "red is cool" ## [394] "remote part" ## [395] "rip duke" ## [396] "ritchie blackmore" ## [397] "Rock On" ## [398] "rockyou account is required for voicemail" ## [399] "RODEO LIFE" ## [400] "saraisab el123" ## [401] "Scram Gravy" ## [402] "SEAN PAUL" ## [403] "sexy hava" ## [404] "sexy moma" ## [405] "SEXY RYAN" ## [406] "SHIK FEXA" ## [407] "SHOPPING1 4" ## [408] "SINISTER STREET" ## [409] "sj ab" ## [410] "SOB DTA" ## [411] "Sonya Emery" ## [412] "SPEEDY ANDY" ## [413] "STAV BA" ## [414] "STONE LEE" ## [415] "studio 3" ## [416] "SWEET WAFA" ## [417] "sweety girl" ## [418] "TAMA IS CORKAZ" ## [419] "te amo cesar" ## [420] "te amo luna" ## [421] "this email address is registered already" ## [422] "THIS IS" ## [423] "Thomas Frey" ## [424] "TIKA SETIAWAN" ## [425] "TOMY LEE JONES" ## [426] "toto bino" ## [427] "tracy camilla" ## [428] "TRIBE 8" ## [429] "troy so fit" ## [430] "U S Tire" ## [431] "van de stadt" ## [432] "van den Hurk" 
## [433] "van hulst" ## [434] "Vicky Carrier" ## [435] "Viking Power" ## [436] "virus password" ## [437] "VITAL REMAINS" ## [438] "voodoo child" ## [439] "WAY GIRL" ## [440] "welcome love " ## [441] "wtrh th" ## [442] "yaniv i" ## [443] "YOU CANT SEE ME" ## [444] "YUYUT QUIERO" ``` --- ## Your Turn Solutions ```r # Get single characters str_split(passwords, '', simplify = F) %>% # get rid of lists for each character unlist() %>% table() %>% sort(decreasing = T) ``` ``` ## . ## a A e E i I o n r O ## 45061 42957 40995 39594 31610 30628 28687 28538 28258 27748 ## N R s S t l T - 1 L ## 27328 27015 26296 25915 23497 23295 23087 22545 22484 22414 ## 0 2 m C c M d D u U ## 17141 16861 15921 15847 15795 15226 14603 14429 14269 13760 ## h H 3 9 b 8 g B 4 G ## 13739 13702 12229 12148 10945 10869 10856 10843 10776 10528 ## P 5 p 7 K 6 y k Y f ## 10332 10310 10310 10069 9942 9865 9830 9743 9632 7113 ## F v V W w J j z Z x ## 6919 6338 6123 5887 5607 5161 5044 3592 3415 3033 ## X ' Q q _ ! * @ . ## 2951 2868 1842 1730 1310 544 525 457 305 283 ## \\ # / $ & + , ) ? ( ## 264 188 181 156 150 100 98 73 67 65 ## = < ; ] ` : [ % ^ " ## 58 49 45 39 34 33 31 31 21 16 ## ~ > จ ñ ค £ ถ { | ๅ ## 14 12 9 8 8 7 6 5 5 5 ## \xf6 } ç \xfc à Ñ ป \xdf ♣ ๘ ## 5 4 4 4 3 3 3 3 2 2 ## ó ฟ ภ ภึ ย ร อ เ \xe9 \xf4 ## 2 2 2 2 2 2 2 2 2 2 ## \xe2 \xe1 \xe4 \b ¡ ´ á å Å ä ## 2 2 2 1 1 1 1 1 1 1 ## í İ ô ő ø ü กำ คุ ง ต ## 1 1 1 1 1 1 1 1 1 1 ## ถุึ ท นื ผ ฟื ม ส า ุ แ ## 1 1 1 1 1 1 1 1 1 1 ## ๅุ 䬳 \xcd \xde \xfb \xf3 \xdc \xfe \xf0 \xe6 ## 1 1 1 1 1 1 1 1 1 1 ## \xf1 ## 1 ``` --- ## Your Turn Solutions `.` and `?` are special characters in strings and should be escaped. Normally, we'd escape things using `\`, but in R, that's a special character too... so it also has to be escaped. `\\.` and `\\?` will recognize `.` and `?` respectively. ```r pds <- str_detect(passwords, "\\.") qmk <- str_detect(passwords, "\\?") exc <- str_detect(passwords, "!") sum(pds | qmk | exc) ``` ``` ## [1] 704 ``` --- ## Regular Expressions .center[![:scale 70%](https://imgs.xkcd.com/comics/regular_expressions.png)] ??? In the tasks I just asked you to do, you didn't really need wildcards or other complicated search patterns. The Anyone else relate to this? Regular expressions have made my life easier many times, but I don't know that they've ever been superhero-worthy. --- ## Regular Expressions - __Regular expressions__ are special patterns that describe text sequences - [Quick Start Guide](https://www.regular-expressions.info/quickstart.html) - [Regex Pal](https://www.regexpal.com/) - test your regular expressions - [RegExplain](https://www.garrickadenbuie.com/project/regexplain/) package and RStudio addin ```r devtools::install_github("gadenbuie/regexplain") ``` -- - When all else fails, google "regular expression for xxx" and chances are pretty good you'll find something --- ## Regular Expressions - Basic rules: - Use `[]` to enclose a set of valid characters or a range of characters e.g. `[A-z]` matches all letters, `[ATCG]` matches DNA bases - `^` negates a selection (inside of the square brackets) e.g. `[^A-z]` matches anything that isn't a letter - `.` matches anything - To match `-`, put it first or last inside `[]` e.g. `[-abcde0-9]` will match `-`, a-e, and 0-9. - Special characters: `. 
---
## Regular Expressions

- Repetition operators:
  - `[xxx]{n,m}` will match a sequence of n to m characters in the set
  - `[xxx]{n}` matches exactly n characters
  - `[xxx]{n,}` matches at least n characters
  - `[xxx]+` will match 1 or more characters
  - `[xxx]?` will match 0 or 1 optional characters
  - `[xxx]*` will match 0 or more characters in the set (greedy)
  - `[xxx]*?` will match 0 or more characters in the set (lazy/not greedy)

---
## Regular Expressions

- `^` (outside of `[]`) matches the beginning of a string, e.g. `^Hello` matches `Hello World` but not `I just called to say 'Hello!'`
- `$` matches the end of a string, e.g. `end$` will match `The end` but not `The end is nigh`
- `()` are used for grouping and information capture (more on this in a bit)
- `|` can be used as "or" outside of `[]`, e.g. `abc|xyz` will match `abcdefg` and `wxyz` but not `ab` or `stuv`

---
## Regular Expressions

```r
test_str <- "Hello World!"
str_extract_all(test_str, "[A-z]{1,}")
```

```
## [[1]]
## [1] "Hello" "World"
```

```r
str_extract_all(test_str, "[^A-z]{1,}")
```

```
## [[1]]
## [1] " " "!"
```

```r
str_remove_all(test_str, "[^A-z]")
```

```
## [1] "HelloWorld"
```

```r
str_extract_all(test_str, ".")
```

```
## [[1]]
## [1] "H" "e" "l" "l" "o" " " "W" "o" "r" "l" "d" "!"
```

---
## Regular Expression Problems

![:scale 100%](https://imgs.xkcd.com/comics/perl_problems.png)

Perl is a language with extensive regular expression capabilities. You can use Perl-style regular expressions in base R functions (`perl = TRUE`), but `stringr` uses ICU regular expressions instead.

---
class:inverse
## Your Turn

```r
library(stringr)
download.file("https://bit.ly/stat585-passwords", "passwords.txt", mode = "wb")
passwords <- readLines("passwords.txt")
```

Using functions from `stringr` and regular expressions, answer the following questions:

- How many of the passwords have at least one space, `-`, or `_`?
- What proportion of the passwords have `.`, `?`, and `!` characters?
- What proportion of the passwords have only lowercase letters?
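
???

If a reminder of anchors, alternation, and greedy vs. lazy quantifiers from the previous slides helps before starting, here is a minimal sketch on a toy string (the string and patterns are illustrative only, not part of the solution):


```r
library(stringr)

y <- "The end is nigh. The end."

str_detect(y, "^The")        # anchored at the start of the string
str_detect(y, "end\\.$")     # anchored at the end of the string
str_extract(y, "end|nigh")   # first match of either alternative
str_extract(y, "The.*end")   # greedy: runs to the last "end"
str_extract(y, "The.*?end")  # lazy: stops at the first "end"
```
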
---
## Your Turn Solutions

```r
library(stringr)
passwords <- readLines("passwords.txt")

str_detect(passwords, "[ _-]") %>% sum()
```

```
## [1] 13420
```

```r
str_detect(passwords, "[\\.\\?!]") %>% sum()
```

```
## [1] 704
```

```r
# detect passwords with any non-lowercase character, then invert
str_detect(passwords, "[^a-z]") %>%
  magrittr::not() %>% # invert
  sum()
```

```
## [1] 22871
```

---
## Chaining Regular Expressions

Patterns can be combined:

```r
test_str <- "She sells sea shells by the seashore"
str_extract_all(test_str, "[aeiou].")
```

```
## [[1]]
## [1] "e " "el" "ea" "el" "e " "ea" "or"
```

```r
str_extract_all(test_str, "[^aeiou]{1,}[a-z]")
```

```
## [[1]]
## [1] "She" " se" "lls se" " she"
## [5] "lls by the" " se" "sho" "re"
```

```r
str_extract_all(test_str, "[A-z]{3}[^A-z]")
```

```
## [[1]]
## [1] "She " "lls " "sea " "lls " "the "
```

---
## Extended Regular Expressions

- `\w`: `[A-Za-z0-9_]` (alphanumeric characters and underscore), `\W` for the negation
- `\d`: `[0-9]` (digits), `\D` matches non-digits
- `\xhh`: the character with hexadecimal code `hh` (e.g. `\x41` matches `A`)
- `\s`: white space (tab, space, endline), `\S` for non-whitespace
- `\b`: empty string at the beginning or end of a word (`\B` for the negation)

Remember, in R, you have to escape `\`, so any of these are `\\w`, `\\d`, `\\s` in R

---
## POSIX Regular Expressions

Another way to match multiple characters:

- `[[:alnum:]]` alphanumeric characters (`[[:alpha:]]` and `[[:digit:]]`)
- `[[:blank:]]` blank characters - space, tab, non-breaking space
- `[[:graph:]]` graphical characters (`[[:alnum:]]` and `[[:punct:]]`)
- `[[:lower:]]` and `[[:upper:]]` for letters
- `[[:space:]]` whitespace (tab, newline, vertical tab, carriage return, space, etc.)
- `[[:xdigit:]]` hexadecimal characters (`[0-9A-Fa-f]`)

---
class:inverse
## Your Turn

- Write a regular expression for a valid phone number
- Write a regular expression for a valid email address
- Write a regular expression for an HTML image tag. Can you use your regular expression to pull all image tags from a wikipedia page? (try https://en.wikipedia.org/wiki/Emu_War if you need inspiration)

.center[
<img src="https://upload.wikimedia.org/wikipedia/commons/5/57/Dromaius_novaehollandiae_%28head%29_Battersea_Park_Children%27s_Zoo.jpg" width="25%"/>
]

---
## Your Turn Solutions

- Valid phone number: `\D?\d{3}\D?\d{3}\D?\d{4}`

```r
phone_regex <- "\\D?\\d{3}\\D?\\d{3}\\D?\\d{4}"
str_detect("515-867-5309", phone_regex)
```

```
## [1] TRUE
```

```r
str_detect("(515)867-5309", phone_regex)
```

```
## [1] TRUE
```

```r
str_detect("5158675309", phone_regex)
```

```
## [1] TRUE
```

```r
str_detect("515-use-regex", phone_regex)
```

```
## [1] FALSE
```

---
## Your Turn Solutions

- Validating an [email address](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression) with regular expressions is harder than it looks!
```r naive_email_regex <- "\\w{1,}@\\w{1,}\\.[A-z]{1,}" str_detect("hofmann@iastate.edu", naive_email_regex) ``` ``` ## [1] TRUE ``` ```r str_detect("super_squirrel@netscape.net", naive_email_regex) ``` ``` ## [1] TRUE ``` ```r str_detect("@*&!#@gmail.com", naive_email_regex) ``` ``` ## [1] FALSE ``` --- ## Your Turn Solutions - Image regex ```r url <- "https://en.wikipedia.org/wiki/Emu_War" page_html <- readLines(url) ``` ``` ## Warning in readLines(url): incomplete final line found on ## 'https://en.wikipedia.org/wiki/Emu_War' ``` ```r img_regex <- "< ?img.*?/>" res <- str_extract_all(page_html, img_regex, simplify = T) img_res <- res[nchar(res) > 0] img_res ``` ``` ## [1] "<img alt=\"Page semi-protected\" src=\"//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png\" decoding=\"async\" width=\"20\" height=\"20\" srcset=\"//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x\" data-file-width=\"512\" data-file-height=\"512\" />" ## [2] "<img alt=\"Deceased emu during Emu War.jpg\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deceased_emu_during_Emu_War.jpg/240px-Deceased_emu_during_Emu_War.jpg\" decoding=\"async\" width=\"240\" height=\"289\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deceased_emu_during_Emu_War.jpg/360px-Deceased_emu_during_Emu_War.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deceased_emu_during_Emu_War.jpg/480px-Deceased_emu_during_Emu_War.jpg 2x\" data-file-width=\"712\" data-file-height=\"856\" />" ## [3] "<img src=\"//upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Fallow_caused_by_emus.jpg/220px-Fallow_caused_by_emus.jpg\" decoding=\"async\" width=\"220\" height=\"146\" class=\"thumbimage\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Fallow_caused_by_emus.jpg/330px-Fallow_caused_by_emus.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Fallow_caused_by_emus.jpg/440px-Fallow_caused_by_emus.jpg 2x\" data-file-width=\"450\" data-file-height=\"299\" />" ## [4] "<img src=\"//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Sir_George_Pearce.jpg/170px-Sir_George_Pearce.jpg\" decoding=\"async\" width=\"170\" height=\"238\" class=\"thumbimage\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Sir_George_Pearce.jpg/255px-Sir_George_Pearce.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Sir_George_Pearce.jpg/340px-Sir_George_Pearce.jpg 2x\" data-file-width=\"653\" data-file-height=\"914\" />" ## [5] "<img src=\"//upload.wikimedia.org/wikipedia/commons/thumb/4/41/Joseph_Lyons_seated.jpg/170px-Joseph_Lyons_seated.jpg\" decoding=\"async\" width=\"170\" height=\"238\" class=\"thumbimage\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/4/41/Joseph_Lyons_seated.jpg/255px-Joseph_Lyons_seated.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/41/Joseph_Lyons_seated.jpg/340px-Joseph_Lyons_seated.jpg 2x\" data-file-width=\"1864\" data-file-height=\"2610\" />" ## [6] "<img src=\"//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1\" alt=\"\" title=\"\" width=\"1\" height=\"1\" style=\"border: none; position: absolute;\" />" ## [7] "<img src=\"/static/images/footer/wikimedia-button.png\" srcset=\"/static/images/footer/wikimedia-button-1.5x.png 1.5x, /static/images/footer/wikimedia-button-2x.png 2x\" width=\"88\" 
height=\"31\" alt=\"Wikimedia Foundation\" loading=\"lazy\" />" ## [8] "<img src=\"/static/images/footer/poweredby_mediawiki_88x31.png\" alt=\"Powered by MediaWiki\" srcset=\"/static/images/footer/poweredby_mediawiki_132x47.png 1.5x, /static/images/footer/poweredby_mediawiki_176x62.png 2x\" width=\"88\" height=\"31\" loading=\"lazy\"/>" ``` --- ## Your Turn Solutions - Image regex ```r # Use regex to clean things up! res_links <- str_extract_all(img_res, "src=[\\S]{1,}") %>% str_remove("src=") %>% str_remove_all('\\"') res_links ``` ``` ## [1] "//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" ## [2] "//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deceased_emu_during_Emu_War.jpg/240px-Deceased_emu_during_Emu_War.jpg" ## [3] "//upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Fallow_caused_by_emus.jpg/220px-Fallow_caused_by_emus.jpg" ## [4] "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Sir_George_Pearce.jpg/170px-Sir_George_Pearce.jpg" ## [5] "//upload.wikimedia.org/wikipedia/commons/thumb/4/41/Joseph_Lyons_seated.jpg/170px-Joseph_Lyons_seated.jpg" ## [6] "//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" ## [7] "/static/images/footer/wikimedia-button.png" ## [8] "/static/images/footer/poweredby_mediawiki_88x31.png" ``` --- ## Your Turn Solutions ```r knitr::include_graphics(paste0("http://", res_links[2:4])) ``` <img src="http:////upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deceased_emu_during_Emu_War.jpg/240px-Deceased_emu_during_Emu_War.jpg" width="33%" /><img src="http:////upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Fallow_caused_by_emus.jpg/220px-Fallow_caused_by_emus.jpg" width="33%" /><img src="http:////upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Sir_George_Pearce.jpg/170px-Sir_George_Pearce.jpg" width="33%" /> --- ## Capture Groups - To "save" an expression (e.g. to use find/replace), enclose something in `()`. - Reference those expressions using `\\1`, `\\2`, and so on... corresponding to each pair of parentheses. (up to 9) ```r name <- "John Jacob Jingelheimer Schmidt" str_replace(name, "^(\\S{1,}).*(\\S{1,})$", "\\1 \\2") ``` ``` ## [1] "John t" ``` ```r # We need lazy evaluation str_replace(name, "^(\\S{1,}).*?(\\S{1,})$", "\\1 \\2") ``` ``` ## [1] "John Schmidt" ``` ```r #str_replace(name, "^(\\S{1,}).*\\s(\\S{1,})$", "\\1 \\2") ``` --- ## Capture Groups and `tidyr` `tidyr` has a function `separate_wider_regex`, which allows you to turn a single column into multiple columns `separate_wider_delim(data, cols, patterns, ..., names_sep = NULL, cols_remove = TRUE)` - `cols` is the column the data is currently in - `names` is a vector of characters with names of the new variables - `pattern` named character vector, where values are regular expressions. Names become column names. 
---
class:inverse
## Your Turn

```r
library(xml2)
library(tidyverse)

url <- "https://commons.wikimedia.org/wiki/Category:Panoramic_photographs"
html <- readLines(url) %>% paste(collapse = " ")

panoramic_pics <- tibble(
  orightml = str_extract_all(html, "<div class=.gallerytext.>.*?</div>") %>% unlist()
)
panoramic_pics$orightml[1]
```

```
## [1] "<div class=\"gallerytext\"> <a href=\"/wiki/File:360_Old_town_square_(14626242520).jpg\" class=\"galleryfilename galleryfilename-truncate\" title=\"File:360 Old town square (14626242520).jpg\">360 Old town square (14626242520).jpg</a> 17,378 × 4,058; 31.15 MB<br /> \t\t\t</div>"
```

Use `tidyr::separate_wider_regex()` to create separate columns for the file link\*, the image dimensions, and the image size. Can you find the widest panoramic image? Get the image url and plot it using magick or `knitr::include_graphics()` if you're working in Rmarkdown.

* Hint: Paste "https://commons.wikimedia.org" onto the front to make the link work

---
## Your Turn Solutions

```r
base_url <- "https://commons.wikimedia.org"

panoramic_pics <- panoramic_pics %>%
  separate_wider_regex(
    orightml,
    patterns = c(
      ".*href=.",
      link = '[^"]+',
      ".*?", ".*?",
      im_width = "[\\d,]{1,} . [\\d,]{1,}",
      ";.*?",
      imgsize = "[\\d\\.]{1,} [MKGmkg]?[Bb]",
      ".*?"
    ),
    cols_remove = FALSE
  )
```

---
## Your Turn Solutions

```r
page_url <- panoramic_pics %>%
  # im_width is a character column, so this sorts the dimension strings alphabetically
  arrange(desc(im_width)) %>%
  slice(1) %>%
  magrittr::extract2("link")

img_url <- paste0(base_url, page_url) %>%
  readLines() %>%
  paste(collapse = " ") %>%
  # Get the correct tag
  str_extract("<a [^>]{1,}>Original file</a>") %>%
  # Get the link out of the tag
  str_extract("https\\S{1,}") %>%
  # Remove the last quote
  str_remove("\\\"$")
```

```
## Warning in readLines(.): incomplete final line found on
## 'https://commons.wikimedia.org/wiki/File:Olympia_(CURTIS_1662).jpeg'
```

```r
knitr::include_graphics(img_url)
```

![](https://upload.wikimedia.org/wikipedia/commons/c/c1/Olympia_%28CURTIS_1662%29.jpeg)<!-- -->