PropertyBrain helps real estate professionals work smarter through tools and education. Our goal is to help you grow your real estate business and increase your income.
When you're analyzing real estate data, there's a constant battle between data quality and data quantity.
If you focus on finding extremely similar properties, you likely won't have very many.
If you focus on finding lots of properties, they likely won't be very similar.
In this post, we're going to break down each of these scenarios, look at the pros and cons of each, discuss which methods of data analysis are affected, and show you how to find a happy middle ground between the two.
Are you ready to get started?
Let’s jump in and look at the first scenario…
The first scenario is high quality, low quantity: you've got very similar properties, but you probably don't have very many of them.
For example, let's say you're looking for 2-story homes with 3,500 SqFt, 3 bedrooms, and 2 bathrooms, and you only include properties that meet those exact criteria.
In this case, you're likely not going to have very many results.
However, the results that you do have are going to be very high quality since they are so similar.
This may be similar to the scenario you find yourself in when you're picking comparables for a home you are going to list, or maybe even one your client wants to make an offer on.
A lot of times you end up finding maybe one or two properties and basing your entire analysis around those, right?
While it might be great that those two are very similar, when you take a step back to look at the overall data analysis of what's happening in that market, one or two data points don't really give you enough support to accurately analyze it.
So what are the pros and cons of a high-quality, low-quantity data analysis?
First, it can give you a really good idea of what is happening with price over time.
One of the biggest issues when looking at a changing market is that you're sometimes actually looking at different styles or sizes of homes that have sold, rather than homes similar to the subject that sold for a different price.
Let's say you look at a market and initially think that home prices have gone up 5% or 10%.
The question you really need to ask yourself is, “Did prices actually go up 5% or 10%, or did different types of homes (unrelated to your subject) sell for 5% or 10% more?”
When you're looking at a very low quantity of data, but with high quality, you can usually get a clearer picture of price changes over time.
Especially if they're spread out pretty evenly over the course of the year.
Let's say you've got 12 months of data and you have 5 sales that happened every couple of months. You can easily see what happened with the price changes to those homes over the course of that 12 months.
Did they go up in price?
Did they go down?
Or, maybe they stayed relatively stable.
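If you want to put a number on that, here's a minimal sketch in Python - the five sale dates and prices are made up for illustration, and the straight-line fit is just one simple way to estimate the trend:

```python
import numpy as np

# Five hypothetical sales of nearly identical homes across a 12-month window.
# "month" is months since the start of the window (higher = more recent).
months = np.array([1, 3, 6, 9, 11])
prices = np.array([484_000, 490_000, 498_000, 505_000, 512_000])

# Fit a straight-line trend of sold price against time.
slope, intercept = np.polyfit(months, prices, 1)

print(f"Approximate price change per month: ${slope:,.0f}")
print(f"Approximate change over 12 months:  {12 * slope / prices.mean():.1%}")
```

Because the homes are nearly identical, the slope here really does reflect market movement rather than a change in what happened to sell.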
It also allows you to calculate adjustments for the property pretty well, because if you base adjustments on statistics that come from nearly identical homes, those adjustments are likely to be accurate for that property.
So now that we know the pros, what are the cons of this type of approach?
The biggest con is that you simply don't have much data. From a statistical standpoint, there isn't a lot to back up whatever conclusions you reach.
If you're looking at price changes over time and you only have 4-5 data points, it's harder to be confident that the statistical results are actually accurate.
The same is true when you're looking at making adjustments.
If you're breaking down those couple of similar homes and deriving adjustments from them, the adjustments aren't really statistically supportable because there are just so few data points behind them.
The biggest problem comes with potential outliers.
Let's say you have 5 homes and one of them sold for significantly more or less. That one outlier could skew your time adjustments, or any other adjustments being calculated from the data, because of a unique scenario.
Maybe somebody was really tied to that house or there was something about it that made them overpay or underpay.
In that case, that one piece of data could drastically change your read on the market, because it carries much more weight in such a small dataset.
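To see just how much weight a single sale carries in a dataset this small, here's a quick sketch with made-up numbers that compares the fitted trend with and without one overpaid sale:

```python
import numpy as np

# Five hypothetical comps; the last sale is an outlier where the buyer
# overpaid by roughly $60,000 for personal reasons.
months = np.array([1, 3, 6, 9, 10])
prices = np.array([500_000, 502_000, 505_000, 507_000, 570_000])

slope_with, _ = np.polyfit(months, prices, 1)
slope_without, _ = np.polyfit(months[:-1], prices[:-1], 1)  # refit without the outlier

print(f"Implied monthly trend with the outlier:    ${slope_with:,.0f}")
print(f"Implied monthly trend without the outlier: ${slope_without:,.0f}")
```

In a set this small, one odd transaction can multiply the implied time adjustment several times over.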
Alright, so let's talk about the other scenario. First, we looked at high quality and low quantity. The next option is the exact opposite.
This is when you look at a lot of properties in an area, but you're not really narrowing it down to make sure that they're similar.
This is similar to the approach a lot of AVMs (automated valuation models) take. They look at a really broad dataset and rely on algorithms to narrow it down and arrive at a price for a property that falls within it.
But the problem with this type of dataset is that you're basing it on a ton of different types of homes. When you're looking at this quantity of data, you generally lose quality.
For example, say you did a home search within one mile of your subject property and included all design types, square footages, bedroom and bathroom counts, garage counts, etc. You might get 500 results instead of the 5 in the previous scenario.
So, what are the pros of this type of approach?
A larger dataset can make it easier to run statistics. If you're using some type of data analysis, like a regression, it becomes a little more supportable because you have more data points going into the analysis and therefore, theoretically, you should get a better result.
However, because we're looking at low quality, sometimes that can be skewed a bit.
The second pro is that outliers have a lot less impact. If you have one or two properties that sell way above or below where everything else sold, but you have 500 total properties, those couple of data points have almost no impact.
So you don't have to worry about those skewing the data in a way that would be confusing.
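For contrast, here's a rough sketch with simulated data - the trend, noise level, and outlier prices are all invented for illustration - showing how little a couple of extreme sales move the trend when there are hundreds of points:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate roughly 500 sales over 12 months with a true trend of about
# $2,000/month plus normal market noise (all numbers are made up).
months = rng.uniform(0, 12, size=500)
prices = 480_000 + 2_000 * months + rng.normal(0, 25_000, size=500)

slope_clean, _ = np.polyfit(months, prices, 1)

# Add two extreme sales - one big overpay, one fire sale.
months_out = np.append(months, [5.0, 7.5])
prices_out = np.append(prices, [680_000, 340_000])
slope_outliers, _ = np.polyfit(months_out, prices_out, 1)

print(f"Trend without the outliers: ${slope_clean:,.0f}/month")
print(f"Trend with two outliers:    ${slope_outliers:,.0f}/month")
```

With a few hundred sales behind it, the fitted trend barely moves - the rest of the data outweighs those two transactions.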
Okay so, those are the pros of high quantity and low quality, but what are the cons?
We already hit on one a little bit earlier in this post: when you're looking at this amount of data without a lot of quality, anything you calculate, while it might be a little more statistically supportable, may or may not actually represent your subject property.
For example, let's circle back to the time adjustment topic we were talking about earlier when you had 5 results in the high quality, low quantity segment.
We said that if they were spaced out evenly, you could get a pretty good picture as to whether or not the prices had increased or decreased over time.
In this scenario, you're going to look at 500 data points, for example. You might be able to get a trend line that shows that the market is increasing or decreasing.
But what we don't know is whether the homes driving that increase or decrease were similar homes that ultimately sold for more or less, or whether different types of homes simply sold within that market area.
In our 3,500 sqft, 3 bed, 2 bath example, we know that every property in the first dataset was a similar style and size of home.
So, if we saw that prices went up over time, we could be confident that the market had actually increased and people were willing to pay more for that same house.
However, in the second scenario, the 5% increase we see could just reflect a trend of bigger homes selling (unrelated to our subject). Remember, we didn't narrow down the square footage, bedroom count, bathroom count, etc., so it's possible the increase has more to do with the type or size of home that sold over time than with homes actually appreciating or depreciating.
In that scenario, having too many data points can actually confuse the analysis and make it seem like prices are changing when in reality it was just different homes that were selling.
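Here's a contrived sketch of exactly that trap, with invented numbers: raw prices appear to climb simply because larger homes happened to sell later in the year, even though price per square foot never moved.

```python
import numpy as np

rng = np.random.default_rng(7)

# Contrived example: 200 sales where price per square foot is flat all year,
# but the homes that sold later in the year happen to be larger.
months = rng.uniform(0, 12, size=200)
sqft = 2_000 + 150 * months + rng.normal(0, 300, size=200)   # bigger homes sell later
ppsf = rng.normal(200, 10, size=200)                         # $/sqft does NOT trend
prices = sqft * ppsf

price_slope, _ = np.polyfit(months, prices, 1)
ppsf_slope, _ = np.polyfit(months, ppsf, 1)

print(f"Raw price trend:      ${price_slope:,.0f}/month (looks like appreciation)")
print(f"Price-per-sqft trend: ${ppsf_slope:,.2f}/month (essentially flat)")
```

The raw trend is telling you which homes sold, not that any given home is worth more.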
Now that we've looked at both of those, let's talk about how to find the perfect mix between quantity and quality.
You have to figure out the best market criteria to be able to analyze the data correctly, and this battle is something you're likely always going to struggle with.
So how do you handle it?
When you sit down and do a comp search, you're generally going to fall one way or the other.
We'd recommend that you fall towards high quality & low quantity rather than high quantity & low quality.
However, the goal would be to find a good middle ground.
I think everyone in real estate knows that when you look at Zillow or other resources that estimate the value of a property, they are looking at massive datasets and running complex algorithms, and most of the time the results are not extremely accurate.
Again, the reason is that they're looking at so many unique scenarios that it's hard to know whether those inputs actually have any significance for the subject property itself.
How do you find a good middle ground?
When you're doing a comp search, the best thing to do is to start really narrow.
So let's take the example of the 3,500 sqft, 3 bed, 2 bath home. Start by drawing in boundaries for your neighborhood. Next, look for 2-story homes that have 3 or 4 bedrooms, 2 or 3 bathrooms, and between 3,300 and 3,700 sqft.
See what that search gives you - it's likely going to be a pretty small dataset.
From there, what you want to do is branch out to make sure that you're including everything the typical buyer for that house would consider.
Your job is to ask yourself, “What else would they consider?”
Would they consider a 3,100 sqft, 2-story with 2 bedrooms, 2 baths?
If they would, add that to the dataset.
If they'd consider a 4,000 sqft house with 4 or 5 bedrooms, add it.
A rancher?
Maybe they'd go 3 bed, 2 bath around that same sqft, but they would consider a rancher or some other design style.
Maybe there's a neighborhood that's nearby that they would also consider, with similar overall property values, similar quality homes, etc. So you could expand your search gradually to get a little bit more data while still trying to keep the quality as high as possible.
What you're trying to look for is ideally somewhere between 50 and 200 results.
That is going to give you enough data to run these statistics on, while not giving you so much that it's going to be confusing.
As a rule of thumb, we like to shoot for around 120 properties, which would give you approximately 10 per month.
This generally gives you a pretty good dataset to analyze: outliers won't dominate it, but you still have enough properties for statistical analysis to produce solid calculations from the inputs.
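If you script your comp pulls, a rough sketch of that widening process might look like this - the comps list, field names, and staged criteria are all hypothetical, and in practice your stages should reflect what the typical buyer for your subject would actually consider:

```python
# Hypothetical comp records - in practice these would come from your MLS export.
# Each record looks like: {"design": "2-story", "beds": 3, "baths": 2, "sqft": 3450}

def matches(comp, designs, beds, baths, sqft_range):
    return (comp["design"] in designs
            and comp["beds"] in beds
            and comp["baths"] in baths
            and sqft_range[0] <= comp["sqft"] <= sqft_range[1])

def comp_search(comps):
    # Start narrow, then widen in stages a typical buyer might actually consider.
    stages = [
        (("2-story",),           (3, 4),       (2, 3), (3_300, 3_700)),
        (("2-story",),           (2, 3, 4),    (2, 3), (3_100, 4_000)),
        (("2-story", "rancher"), (2, 3, 4, 5), (2, 3), (3_100, 4_000)),
    ]
    results = []
    for designs, beds, baths, sqft_range in stages:
        results = [c for c in comps if matches(c, designs, beds, baths, sqft_range)]
        if 50 <= len(results) <= 200:   # aim for roughly 120 (~10 sales per month)
            break
    return results
```

Expanding the boundary to a comparable nearby neighborhood would simply be another stage in that list.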
That's not always going to happen, but ultimately, you want to make sure that you've got good data. So how do you check that?
The first thing you can look at is the range of sold prices.
If the typical house is selling for $500,000 and the top of your range is something like $550,000 to $575,000 and the bottom is around $450,000 to $475,000, you probably have a good range.
You have a high-quality dataset, and now you need to assess whether or not you have enough quantity.
One way to know if you're on track with high-quality data is to plot the data points on a chart - when they sold versus what they sold for - and then look at how those sales are distributed.
You would ideally want to see all the sales concentrated pretty tightly around the trend line.
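Here's a rough sketch of both checks - the overall price range and the spread around the trend line - using a hypothetical list of (month sold, sold price) pairs:

```python
import numpy as np

# Hypothetical (month_sold, sold_price) pairs from a comp search.
sales = [(1, 492_000), (2, 478_000), (4, 505_000), (5, 498_000),
         (7, 511_000), (8, 520_000), (10, 515_000), (11, 531_000)]
months = np.array([m for m, _ in sales], dtype=float)
prices = np.array([p for _, p in sales], dtype=float)

# 1. Sanity-check the overall price range against the typical sale.
print(f"Typical price: ${np.median(prices):,.0f}, "
      f"range: ${prices.min():,.0f} to ${prices.max():,.0f}")

# 2. Fit a trend line and see how tightly the sales cluster around it.
slope, intercept = np.polyfit(months, prices, 1)
residuals = prices - (slope * months + intercept)
print(f"Typical distance from the trend line: ${residuals.std():,.0f} "
      f"({residuals.std() / np.median(prices):.1%} of the typical price)")
```

The smaller that typical distance is relative to the typical price, the more tightly your sales hug the trend line.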
The second thing you want to look at is whether there are any outliers.
It's one thing to have a couple of outliers - most of the time it's really hard to avoid them in your dataset. Ultimately, if you have enough data, a few outliers won't meaningfully impact the trend analysis.
But if you have a lot of outliers, they will impact it.
The thing you want to look for with outliers is whether they fall at the beginning or end of the period. Let's say you're looking at a 12-month timeframe. If you have an outlier, or multiple outliers, at the beginning or the end of those 12 months, it's likely going to affect the outcome at least somewhat.
Mathematically, outliers at either end of the timeframe pull the trend line toward them more than outliers in the middle do.
If there are a couple of outliers at the beginning or end of your timeframe, you would ideally remove them from your dataset, either by changing your search criteria or by removing them manually, depending on what you're using to run the analysis. That way, those outliers won't have a significant influence.
If they fall somewhere in the middle, they're unlikely to have any significant impact, so you can be pretty confident the quality of the analysis will be good - you have a solid dataset overall, even with some of those outliers in it.
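As a rough illustration (made-up numbers, and the 2.5-standard-deviation cutoff is just one reasonable rule of thumb rather than a fixed standard), you can quantify how much an outlier at the end of the window moves the trend by flagging it and refitting without it:

```python
import numpy as np

# Twelve made-up monthly sales with one clear outlier in the final month.
months = np.arange(12, dtype=float)
prices = np.array([490_000, 494_000, 491_000, 497_000, 499_000, 502_000,
                   505_000, 503_000, 509_000, 512_000, 514_000, 585_000], dtype=float)

slope_all, intercept_all = np.polyfit(months, prices, 1)

# Flag sales sitting far from the initial trend line (beyond 2.5 standard deviations).
residuals = prices - (slope_all * months + intercept_all)
keep = np.abs(residuals) < 2.5 * residuals.std()

slope_trimmed, _ = np.polyfit(months[keep], prices[keep], 1)

print(f"Monthly trend with the end-of-period outlier: ${slope_all:,.0f}")
print(f"Monthly trend after removing it:              ${slope_trimmed:,.0f}")
```

In this made-up example, the single late outlier roughly doubles the apparent monthly trend, which is exactly why it's worth checking the ends of your timeframe.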
I hope you have a little bit better understanding of the quality-versus-quantity issue after reading this post. Anytime you do a comp search you're going to face it - it's just the nature of real estate data.
People spend different amounts of money on homes for different reasons, and some markets seem to make more sense than others, especially where prices are rapidly increasing or decreasing.
Sometimes it can get really confusing as to why people made the decisions that they made.
However, if you can analyze the properties and build a market dataset that has enough quantity while still focusing on quality, you're going to have the best overall analysis possible.