Free Resources for Generating Realistic Fake Data: (1) Faker (2) Mockaroo (3) GenerateData (4) JSON Schema Faker (5) FakeStoreAPI (6) Mock Turtle

There is a plethora of interesting public data out there. The DuckDB community regularly uses the NYC Taxi Data to demonstrate and test features, as it’s a reasonably large set of data (billions of records) and it’s data the public understands. We’re very lucky to have this dataset, but like many data sources, the data is in need of cleaning.

You can see here that some taxi trips were taken seriously far in the future. Based on the fare_amount for the following 5 person trip in 2098, I’d say we can safely conclude that inflation will be on a downward or lateral trend over the next 60 years.

│ tpep_pickup_datetime │ VendorID │ passenger_count │ fare_amount │

Interestingly, all trips with dates in the future are posted from a single vendor (see data dictionary). Others have documented additional issues with dirty data. Of course, I could clean these up, but using these records as-is makes me frequently question my SQL skills. I’d rather use generated data where analysts can focus on how ducking awesome DuckDB is instead of how unclean the data is.

As a bonus, using generated data allows us to create data that’s better aligned with real-world use cases for the average analyst, as Anna Geller requests in a recent tweet.

"As a user, I would appreciate some randomly generated datasets where folks can analyze real world things like costs and revenue rather than petal lengths" - Anna Geller

Using Python Faker

Faker is a Python package for generating fake data, with a large number of providers for generating different types of data, such as people, credit cards, dates/times, cars, phone numbers, etc. Many of the included and community providers are even localized for different regions.

Keep in mind that the data we generate won’t be perfect unless we tune the out-of-the-box code. But oftentimes you just need something that looks and quacks like a duck, but is not an actual duck.

Here’s a simple example of using Python Faker to generate a person record, with a name, email, company, etc.:

import random

You’ll notice I commented out generating the ID as a US Social Security Number (SSN), because that’s just scary and bad practice. Instead, I generated a random number in a specified range using random. This value is not guaranteed to be unique, so you might want to check for uniqueness in your Python code. Alternatively, you could add a UNIQUE or PRIMARY KEY constraint in DuckDB (here are the internals and examples), but that could generate too much work when loading large amounts of data.