Alternative approach

This is not another copy/paste script. It requires a period of learning and understanding; however, once that understanding is attained, data will never look the same.

Six weeks ago I explained this technique to my boss. He googled for similar techniques, or anything closely resembling it, and came up empty. There was no documentation describing it as ‘best practice’.

I built something my boss described as ‘better than he’d ever seen anywhere ever’ (insert quote from boss here). My entire team agreed. Nobody had seen anything that performs so well; in fact, unless they saw it for themselves, they wouldn’t believe it.

If you just want to see the code or a video of this working, it’s at the bottom of the article; however, to replicate it yourself, reading this article should help.

The problem was I knew I was trying to speed things up, but it shouldn’t have sped up this much.

Curious about the speed, I investigated further, trialled things, refined concepts, and learnt a new technique. Here’s what I’ve learnt.

Columnar databases work differently: each column has its own file, whereas non-columnar databases have a file per table.

Your new model will have just one table: many files, but just one table.

This is a concept most can’t initially grasp.

So here’s a little mind trick that may help. There’s still the same bunch of files; traditionally we grouped them by table name, now we group them by column name.

Cool? If so carry on.

If not read again.

Ten billion rows with a single value take almost no disk in a columnar database. Create the same table with 2 distinct values and there’s a small increase in disk used. Randomly choose a number between 0 and 9 and the disk grows a little more. Only when each and every one of the 10 billion values is unique do we use the same space as a non-columnar database.

Attributes such as the number of rows or the datatypes become far less important than one thing.

Uniqueness.

Don’t look at your data in terms of rows. See only unique values.

The reason for this is compression — similar to a zip file or codec used to stream media.

Seeing and doing is believing!

Before continuing, please trust no one, especially me: test this out yourself. It will speed up your learning.

Your database needs to be columnar. My favourites are Snowflake, which runs on each of the major cloud platforms, and Google’s BigQuery.

Here’s a Snowflake code snippet you should be able to copy and paste into a worksheet in your browser and see for yourself. Go on, do it.
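
I can’t reproduce the exact snippet that was embedded in the original post, but based on the description (a billion rows holding only ten distinct integer values), it would have looked something like this; the table name is just illustrative:

-- Generate 1 billion rows containing a random integer from 0 to 9,
-- then check how much disk the table actually uses.
CREATE OR REPLACE TABLE billion_ints AS
SELECT UNIFORM(0, 9, RANDOM()) AS val
FROM TABLE(GENERATOR(ROWCOUNT => 1000000000));

-- The BYTES column here should be tiny relative to a billion rows.
SHOW TABLES LIKE 'BILLION_INTS';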

32 KB for 1 Billion Rows

Remember I mentioned datatypes no longer matter … here’s the test: let’s now generate one of ten dates at random and again populate 1 billion rows. You’ll hopefully see the size on disk is identical to the billion integers.
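
A sketch of that second test, matching the style of the first (again, the names are mine):

-- Same row count, but now the value is one of ten dates chosen at random.
CREATE OR REPLACE TABLE billion_dates AS
SELECT DATEADD(day, UNIFORM(0, 9, RANDOM()), '2022-01-01'::DATE) AS dt
FROM TABLE(GENERATOR(ROWCOUNT => 1000000000));

-- Compare the two tables' sizes side by side.
SHOW TABLES LIKE 'BILLION_%';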

Hopefully you proved to yourself that disk space is now defined by unique values.

Try to visualise the unique datapoints, post-compression, neatly queued in a line, for this is how the columnar database sees data.

This path is a whole lot easier if you stay true to first principles and focus on answering business questions.

Our goal is to answer any question as quickly as possible from a single information source.

First of all, create a table or view of your data model at the lowest level of granularity, with a sample of 1,000 rows.

If you have an existing traditional star schema model, you’re joining it all together and creating one super-wide table. Include every piece of data, including surrogate keys and stuff you don’t think is important. This way you can prove to yourself the lossless nature of this model. Later you can speed the model up even further, but for now it’s like for like.
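
As a rough sketch, assuming a hypothetical star schema of fact_sales plus customer, product and date dimensions, the like-for-like wide table might be built something like this:

-- Join the star schema into one super-wide table at the lowest grain,
-- keeping the surrogate keys so nothing is lost.
CREATE OR REPLACE TABLE wide_sales_sample AS
SELECT
    f.*,                                       -- every fact column, surrogates included
    c.customer_name, c.customer_segment,
    p.product_name,  p.category,
    d.calendar_date, d.calendar_month, d.calendar_quarter
FROM fact_sales   f
JOIN dim_customer c ON c.customer_sk = f.customer_sk
JOIN dim_product  p ON p.product_sk  = f.product_sk
JOIN dim_date     d ON d.date_sk     = f.date_sk
LIMIT 1000;                                    -- start with the 1,000-row sample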

OK, cool. Why? Given the option of joining tables or not joining on a columnar database, you want the latter.

Why? Because every join is work the database has to repeat at query time; if the data is already stored joined, that work simply disappears.

So how should we model data now?

Most columnar databases provide some sort of aggregation function and a datatype for storing nested data.

Using an appropriate aggregate function, we’re going to gather together a multi-layered object within an object within an object. Snowflake and GCP provide somewhere in the vicinity of 16 layers of depth, in conjunction with a size limit; Snowflake’s is 16 MB compressed.

16 MB compressed is massive when we’re storing a row for every day and every customer. Every day, every customer: that sounds crazy, right? Wrong.

We’re able to create huge tables both wide and deep by effectively creating an index for the date or customer dimensions.
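
In Snowflake terms, a minimal sketch of that aggregation over the hypothetical wide table above might be:

-- One row per customer per day; the line-level detail is packed into a single
-- column of objects using OBJECT_CONSTRUCT inside ARRAY_AGG.
CREATE OR REPLACE TABLE sales_by_customer_day AS
SELECT
    customer_sk                                   AS customer_id,
    calendar_date                                 AS sale_date,
    ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'product_sk', product_sk,
            'quantity',   quantity,
            'price',      price
        )
    )                                             AS detail
FROM wide_sales_sample
GROUP BY customer_sk, calendar_date;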

In fact, creating identical versions of the same table can be super beneficial. The first table is ordered by date and the second is ordered by whatever other dimension you desire; in my case, customer.

If you ever predicate on customer, the fastest would be a table laid out in customer order. However, if your business users commonly want products by some time dimension (say month or quarter), it’s beneficial to also provide a table laid out in date order.
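
Continuing the sketch, the two physical layouts are just two CREATE TABLE ... AS SELECT statements that differ only in their ORDER BY:

-- Identical content, different physical order: one copy for customer-led
-- queries, one for date-led queries.
CREATE OR REPLACE TABLE sales_by_customer AS
SELECT * FROM sales_by_customer_day ORDER BY customer_id, sale_date;

CREATE OR REPLACE TABLE sales_by_date AS
SELECT * FROM sales_by_customer_day ORDER BY sale_date, customer_id;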

Doing this provides optimal results; however, once you play around with different table orders and look at the number of partitions that need to be accessed by said queries, you get a pretty good idea of what works.

In my experience, ordering by customer got a 20–50-fold reduction in partitions scanned on 3.3 million customers over 3.2 billion rows (400 billion rows if resolved to the lowest granularity).

That’s 20–50 times faster and cheaper when predicating on any set of customer IDs.

This can be determined by looking at the statistics. Here’s a table ordered by customer, predicating on customer ID: we only had to search 1 partition out of 2,732.

This is the secret: in most cases the number of partitions that have to be searched can be reduced to 1. This has resulted in some of my queries literally beating Snowflake’s cached times, which should not be possible.

An identical table laid out in date order has to search 252 out of 4,429 partitions.
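
The partition counts quoted above come from the query profile; if you want to check before running anything heavy, a couple of options (using the illustrative table from earlier) are:

-- EXPLAIN reports partitionsAssigned vs partitionsTotal for the pruned scan.
EXPLAIN
SELECT detail
FROM sales_by_customer
WHERE customer_id = 123456;      -- any customer id you like

-- And this shows how well the physical order lines up with the predicate column.
SELECT SYSTEM$CLUSTERING_INFORMATION('SALES_BY_CUSTOMER', '(customer_id)');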

The less we have to search, the more efficiently everything works: a faster query that consumes less data and compute, which lowers the cost of the query itself.

OK, I know some of you were just about to say that’s a lot of work to do, right? Extra tables … yuck.

No worries: this can all be automated once you’ve chosen the key dimensions that essentially provide you with super-index functionality on the columnar database.

Cool, eh? It gets cooler; read on.

Once you’ve decided which attributes you’re most often going to predicate on, stack the objects accordingly.

You can actually stack multiple objects (each of depth up to 16). This is similar to what we used to do for drill-downs in MicroStrategy, if you ever played with that tool.

The key here is understanding then exploiting a columnar database to solve our problem.

By overloading all your columns into an object (Snowflake) or struct (GCP), you essentially drop the workload by 15.

We have no control over the micro-partitions. However, they do adhere to a couple of rules: before any command runs, the number of micro-partitions that need to be searched is determined by metadata such as min/max values. By forcing all our columns into a single micro-partition we can bypass this black box and take more control back. This will only work if you’re 100% aware of exactly how your query works.

Will the vendors change this under the covers? I’m not sure, as 99.99999999% of the queries running on their platforms rely on this behaviour and any change here would be quite hard to test, so I’m guessing the status quo remains.

Let me know in the comments if you’d like a more in-depth article on this with more proofs.

When creating the objects inside each customer for each day, you also want to order by something. Why?

Your instinct is not to order, because ordering for no reason is expensive, right?

By ordering the objects inside the objects consistently, guess what happens: you lower the distinct count of the objects, thereby allowing the compression to work better and reducing the size used on disk.
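
In Snowflake this is just a WITHIN GROUP clause on the aggregate; a sketch using the earlier illustrative names:

-- Same aggregation as before, but with a consistent order inside each array,
-- so identical baskets produce identical objects and compress better.
SELECT
    customer_sk                                   AS customer_id,
    calendar_date                                 AS sale_date,
    ARRAY_AGG(
        OBJECT_CONSTRUCT('product_sk', product_sk, 'quantity', quantity, 'price', price)
    ) WITHIN GROUP (ORDER BY product_sk, price)   AS detail
FROM wide_sales_sample
GROUP BY customer_sk, calendar_date;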

You’re using a new set of functionality that you might not be used to or familiar with. Don’t panic. Both GCP and Snowflake provide extensive and great documentation on how to use aggregate functions to both create and extract data from within the objects.

Once you get over the learning curve, you’re going to have speed you never thought possible. Learn it. I use regex quite often to pull out prices I’ve stored inside objects, and then I can do all sorts of cool maths on them.
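
For example, getting prices back out of the objects might look like this (a sketch over the illustrative tables above; the column names are assumptions):

-- Flatten the array back to rows, then read fields either by path or,
-- if you prefer, by regex over the JSON text.
SELECT
    s.customer_id,
    s.sale_date,
    d.value:product_sk::NUMBER                                          AS product_sk,
    d.value:price::NUMBER(10,2)                                         AS price,
    REGEXP_SUBSTR(TO_JSON(d.value), '"price":([0-9.]+)', 1, 1, 'e', 1)  AS price_via_regex
FROM sales_by_customer s,
     LATERAL FLATTEN(input => s.detail) d
WHERE s.customer_id = 123456;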

Remember, because you’re predicating the massive table down to maybe a year or a cohort of customers, the actual volume of data you need to compute over is dramatically reduced, so even expensive functions like count distinct become fast.

I’ll be providing scripts shortly so you can create the table you saw in the demo from scratch for yourself. I highly recommend grabbing them and running them yourself. You can freely create a Snowflake account, allowing you to try this all out.

Play with the resulting table; then, once convinced, come back, re-read the article above, and you can easily apply this to your own dataset.

If you get stuck there are contact details below; I’m more than willing to help anyone willing to try this themselves. If you’re in Auckland I’m happy to pop by and help you hands-on, or via a Zoom/Teams call if you’re somewhere else.

You can become the second person to trial this technique, and I’d love to hear from you about your experiences.

For the first person or company that reaches out to me personally, here’s the challenge. Give me one week and a sample of your data model. If I can produce a lossless dataset that works faster and cheaper than your current system, you donate 15% of your theoretical one month’s savings to a mutually agreed Ukraine charity. No savings, no charge. No cost for any of my time.

As soon as I get the first request I’ll remove the above paragraph, so if you’ve read it, nobody has taken up my offer.

The bigger the data set the better; the more complex the problem, better still. I really hope we can take a new direction with data, as this really is just the start.

Kind Regards,

Adrian White

Mid March: I’ve repeatedly reached out to a couple of the Snowflake team (Regan Murphy and Felipe Hoffa in particular) asking for feedback, opinions and comments on this article prior to publishing. As soon as they respond (glass half full) I will update the comments here.

Update 24th March, 5pm: Regan Murphy joined me for a live demo and supplied this quote

Regan also explained to me that it is in fact possible to beat the cached speed if the data is simply located in a cluster that doesn’t require remote I/O. Very stoked to hear this.

So hopefully you’re excited to try out this technique, which I’ve seen both increase speed 100-fold and decrease costs 100-fold without losing any dimensionality. This will hopefully delight your users and management alike.

Please reach out if you have any questions or improvements; I’d love to hear your thoughts and experiences.

Also, I did quite a bit of this alone; however, on a regular basis I talked to an awesome colleague called V, who through curiosity and questions also seeks to go further. He’s just connected GPT-3 for us, so super fast marries super smart.

I’d also love it if you’d help me name this technique.

I was thinking ‘Data Laying’, as we’re essentially re-laying the data with an understanding of how the DBMS will access it.

Here’s the script you can run to replicate the video you just saw for yourself.

I’ve not tuned this; after tuning I do get quite a lift in performance. However, I had to write this last night, so you just get the basics. Things only get better if you tune.
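
As a stand-in for the full script, a minimal end-to-end sketch over synthetic data would look something like this (all names, row counts and value ranges are illustrative, not the originals):

-- 1. Synthesise a fact at customer/product/day grain.
CREATE OR REPLACE TABLE wide_sales_sample AS
SELECT
    UNIFORM(1, 3300000, RANDOM())                                AS customer_sk,
    DATEADD(day, UNIFORM(0, 364, RANDOM()), '2021-01-01'::DATE)  AS calendar_date,
    UNIFORM(1, 10000, RANDOM())                                  AS product_sk,
    UNIFORM(1, 5, RANDOM())                                      AS quantity,
    UNIFORM(100, 99999, RANDOM()) / 100                          AS price
FROM TABLE(GENERATOR(ROWCOUNT => 100000000));

-- 2. Pack it into one object row per customer per day, laid out in customer order,
--    as in the earlier snippets.
CREATE OR REPLACE TABLE sales_by_customer AS
SELECT
    customer_sk                                                  AS customer_id,
    calendar_date                                                AS sale_date,
    ARRAY_AGG(OBJECT_CONSTRUCT('product_sk', product_sk, 'quantity', quantity, 'price', price))
        WITHIN GROUP (ORDER BY product_sk)                       AS detail
FROM wide_sales_sample
GROUP BY customer_sk, calendar_date
ORDER BY customer_id, sale_date;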

Here’s the view you can place over the top so you can revert to the old SQL ways.
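
The idea of the view is simply to flatten the object table back out so existing row-based SQL keeps working. A sketch, assuming the illustrative tables above:

-- Present the packed table as ordinary rows again.
CREATE OR REPLACE VIEW sales_classic AS
SELECT
    s.customer_id,
    s.sale_date,
    d.value:product_sk::NUMBER     AS product_sk,
    d.value:quantity::NUMBER       AS quantity,
    d.value:price::NUMBER(10,2)    AS price
FROM sales_by_customer s,
     LATERAL FLATTEN(input => s.detail) d;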
