Ancestors

Toot

Written by John MacKintosh on 2024-12-20 at 08:04

#rstats

Strategies for dealing with tidying multiple large CSV files, each of varying dimensions, which are in a list.

They all have the first 4 rows of useless text. Varying column widths.

The next several rows (could be one, could be four) are what should be column headers. No way of knowing how many there are without painstakingly going through each.

The last 6 rows are useless, and can be discarded.

I have a hacky solution, but I'm interested to hear how others would start to tackle this.


Descendants

Written by John MacKintosh on 2024-12-20 at 18:02

Think I've cracked it lads.

Edge cases have been addressed, and 70 tidy (and massive) .tsv files are now in place.

Next stop, duckdb and / or parquet


Written by Michael McCarthy on 2024-12-20 at 08:54

@johnmackintosh I think we need a reprex for this one 💀


Written by Tamás Stirling on 2024-12-20 at 09:02

@johnmackintosh Many functions that import tables have an argument for skipping the first N rows; I would use this built-in functionality. I prefer read.csv() for importing tables, as it usually does the job.

If you can generalise your operations such that you do the same thing with each csv, then I think lapply() will also help. I would probably start with a vector of csv paths and then do each processing step with lapply().
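A minimal sketch of that approach, assuming the 4 junk rows at the top and 6 at the bottom described in the original toot. The helper name read_trimmed() and the demo files are made up for illustration. The header rows are left in the data here, since their number is not known yet.

```r
# Hypothetical helper: read one file, skipping the junk rows at the top
# and dropping the junk rows at the bottom.
read_trimmed <- function(path, skip_top = 4, drop_bottom = 6) {
  lines <- readLines(path)
  keep <- head(lines, -drop_bottom)   # all but the last `drop_bottom` lines
  read.csv(text = keep, skip = skip_top, header = FALSE,
           stringsAsFactors = FALSE)
}

# Made-up demo files: 4 junk rows, 1 header row, some data, 6 footer rows.
make_demo <- function(n_rows) {
  path <- tempfile(fileext = ".csv")
  writeLines(c(rep("junk", 4),
               "id,value",
               paste(seq_len(n_rows), seq_len(n_rows) * 10, sep = ","),
               rep("footer", 6)),
             path)
  path
}

# Start from a vector of paths and apply the same step to each file.
paths <- c(make_demo(3), make_demo(5))
tables <- lapply(paths, read_trimmed)
```

Each element of tables is then a data frame with the header row(s) still sitting on top of the data, ready for the next processing step.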


Written by Tamás Stirling on 2024-12-20 at 09:10

@johnmackintosh Column header rows: Try to find a pattern which you can generalise to find the number of rows to extract as headers. Blind guess: the number of columns in the data set should be the same as the number of column header rows. Once you know these numbers, you can use them when skipping the first N rows during import. Since these might be different for each data set, I'd use mapply() instead of lapply().
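A rough sketch of the mapply() idea. The detection rule is an assumption and not something from the thread: it treats a row as data once its first field looks numeric, and everything between the 4 junk rows and that point as header rows. The real files may need a different pattern. The function names and demo files are hypothetical.

```r
# ASSUMPTION: data rows start with a numeric first field, header rows do not.
count_header_rows <- function(path, skip_top = 4) {
  lines <- readLines(path)
  body <- lines[seq_along(lines) > skip_top]
  first_field <- sub(",.*$", "", body)
  is_data <- grepl("^-?[0-9.]+$", first_field)
  which(is_data)[1] - 1L              # rows before the first data row
}

# Read one file, collapsing its header rows into a single row of names.
read_with_headers <- function(path, n_header, skip_top = 4, drop_bottom = 6) {
  lines <- head(readLines(path), -drop_bottom)
  hdr <- read.csv(text = lines[skip_top + seq_len(n_header)],
                  header = FALSE, colClasses = "character")
  col_names <- vapply(hdr, function(col) {
    paste(col[!is.na(col) & nzchar(col)], collapse = "_")
  }, character(1))
  read.csv(text = lines, skip = skip_top + n_header, header = FALSE,
           col.names = unname(col_names))
}

# Made-up demo files with one and two header rows respectively.
make_demo <- function(header_lines, n_rows) {
  path <- tempfile(fileext = ".csv")
  writeLines(c(rep("junk", 4), header_lines,
               paste(seq_len(n_rows), seq_len(n_rows) * 10, sep = ","),
               rep("footer", 6)),
             path)
  path
}
paths <- c(make_demo("id,value", 3), make_demo(c("id,value", ",kg"), 4))

# The header count differs per file, so pass paths and counts in parallel.
n_header <- vapply(paths, count_header_rows, integer(1))
tables <- mapply(read_with_headers, paths, n_header, SIMPLIFY = FALSE)
```

Pasting the stacked header rows together column by column is only one way to build names; files with merged or repeated header cells would need more care.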


Written by Dave Mason on 2024-12-20 at 18:52

@johnmackintosh

Glad you got it worked out.

I didn't have any good #rstats advice. But you made me think about how I'd handle it via T-SQL...


Written by danwwilson :rstats: on 2024-12-20 at 19:08

@johnmackintosh If you use parquet, stick with arrow for now. Last time I tried, duckdb would not recognise and keep factors correctly.


Proxy Information

Original URL: gemini://mastogem.picasoft.net/thread/113684110409159637