Getting N random rows with a SQL query, proportional to the number of rows in each section
I have a table that stores a lot of questions; each question belongs to a section:
Id  Question     SectionId
1   What is ...  3
2   Who is...    3
3   When is...   2
4   Why is...    1
5   How is...    3
There are about 1000 questions and around 50 sections. My query is simple: I select a given number of questions from specific sections, for example:
SELECT TOP 10 [Id], [Question] FROM [Questions] WHERE [SectionId] IN (1,2) ORDER BY NEWID()
This is simple and works fine, except that sometimes, out of the 10 requested questions, I get 5 from a section that has only 6 questions, 2 from a section that has 100 questions, and 3 from a section that has 20 questions.
How can I make the result "proportional" to the number of questions in each section? For example, if I request 10 questions, I should get more questions from the sections that have more questions and fewer from the sections that have fewer.
The only approach I can think of is to run multiple queries: first one to get the number of questions in each section, then do some math to decide how many questions to take from each section, and then a few more queries to fetch that many questions per section. This sounds expensive, and I hope there is a more practical way.
Note: Either a SQL query or an EF LINQ query would work.
For a stratified sample, do an nth sample on the ordering. This is a little tricky, but this should work:
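SELECT TOP (10) q.*
FROM (SELECT q.*,
             -- number the rows section by section, in random order within each section
             ROW_NUMBER() OVER (ORDER BY SectionId, NEWID()) AS seqnum,
             -- total number of matching rows (an ORDER BY here would give a running count)
             COUNT(*) OVER () AS cnt
      FROM [Questions] q
      WHERE [SectionId] IN (1, 2)
     ) q
ORDER BY seqnum % (cnt / 10);
There may be some boundary conditions on this logic (for example, with fewer than 10 matching rows the integer division cnt / 10 yields 0 and the modulo fails), but as the number of questions grows and the sample is large enough, it should do what you want.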
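To see why taking every nth row of a section-ordered list comes out roughly proportional, here is a minimal Python sketch of the same idea outside the database. The function name `systematic_sample` and the `(question_id, section_id)` row shape are assumptions made for illustration; they are not part of the answer's query.

```python
import random


def systematic_sample(rows, n, rng=random):
    """Take roughly every (len(rows) // n)-th row after ordering by section.

    rows: list of (question_id, section_id) tuples (hypothetical shape).
    """
    # Group rows by section, then shuffle inside each section,
    # mirroring ORDER BY SectionId, NEWID().
    by_section = {}
    for row in rows:
        by_section.setdefault(row[1], []).append(row)
    ordered = []
    for section in sorted(by_section):
        group = by_section[section][:]
        rng.shuffle(group)
        ordered.extend(group)

    step = len(ordered) // n
    if step == 0:
        # Fewer than n rows in total: return everything instead of dividing by zero.
        return ordered
    # Keep the rows whose 1-based sequence number is a multiple of the step.
    picked = [row for seqnum, row in enumerate(ordered, start=1) if seqnum % step == 0]
    return picked[:n]
```

With 60 rows in one section and 40 in another, asking for 10 rows gives a step of 10, so 6 of the picked positions fall in the first section and 4 in the second. When the total is only slightly larger than n, the step becomes 1 and the cap at n favours the first sections, which is one of the boundary conditions mentioned above.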
I can't think of a way to do this in a single step, unless you know in advance the number of sections and the proportions of each.
If these values have to be calculated at query time, you will need to run a query to get the sections and proportions and use that to build a Dynamic SQL query.
Use a GROUP BY query to get the SectionIDs and the number of questions in each Section, filtered by the Sections you want to include.
Iterate through that result to build a dynamic UNION ALL query that gets a TOP n (calculate n based on the percentage of the Section's Count / Total Count) of questions for each Section (one query per section), so that you end up dynamically building something that looks something like this:
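-- Each branch is wrapped in a derived table because T-SQL allows ORDER BY
-- only at the very end of a UNION, not inside its individual branches.
SELECT * FROM (
    SELECT TOP (5) ID, Question   -- because SectionID 1 is 50% of the questions
    FROM Questions
    WHERE SectionID = 1
    ORDER BY NEWID()
) AS s1
UNION ALL
SELECT * FROM (
    SELECT TOP (3) ID, Question   -- because SectionID 2 is 30% of the questions
    FROM Questions
    WHERE SectionID = 2
    ORDER BY NEWID()
) AS s2
UNION ALL
SELECT * FROM (
    SELECT TOP (2) ID, Question   -- because SectionID 3 is 20% of the questions
    FROM Questions
    WHERE SectionID = 3
    ORDER BY NEWID()
) AS s3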
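The "calculate n" step is where rounding can make the per-section counts add up to more or fewer than the requested total. The following is a rough Python sketch of one way to do it, using largest-remainder rounding, which is a technique chosen here for illustration and not something the answer prescribes. The helper names `allocate` and `build_query` are hypothetical, and the section counts are assumed to come from the GROUP BY query described above.

```python
def allocate(section_counts, n):
    """Split n picks across sections in proportion to their row counts.

    section_counts: dict of SectionID -> number of questions (from the GROUP BY).
    Uses integer arithmetic so the quotas always add up to min(n, total).
    """
    total = sum(section_counts.values())
    if total == 0:
        return {}
    n = min(n, total)
    # Floor of each section's exact share, plus the remainder that was cut off.
    quotas = {s: n * c // total for s, c in section_counts.items()}
    remainders = {s: n * c % total for s, c in section_counts.items()}
    leftover = n - sum(quotas.values())
    # Hand the leftover picks to the sections with the largest remainders.
    for s in sorted(section_counts, key=lambda s: remainders[s], reverse=True)[:leftover]:
        quotas[s] += 1
    return quotas


def build_query(quotas):
    """Build the UNION ALL text; table and column names follow the example above."""
    parts = []
    for section, k in sorted(quotas.items()):
        if k == 0:
            continue  # a section that rounds down to zero contributes no branch
        # int() keeps only numbers in the generated SQL text.
        parts.append(
            f"SELECT * FROM (SELECT TOP ({int(k)}) ID, Question FROM Questions "
            f"WHERE SectionID = {int(section)} ORDER BY NEWID()) AS s{int(section)}"
        )
    return "\nUNION ALL\n".join(parts)
```

For the 50/30/20 example, `allocate({1: 50, 2: 30, 3: 20}, 10)` gives 5, 3 and 2. Note that a very small section can legitimately round down to zero picks; whether that is acceptable depends on the application.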
Another approach you could think about is to create an artificial ranking column that is factored by the relative density of the section.
What I mean, for example (super simplifying it) is suppose Section 1 was 75% of the questions, and Section 2 was 25%.
You'd use ROW_NUMBER(), partitioned by SectionID, ordered by NEWID(), and factored so that:
Section 1 would have values like 1, 2, 3, 5, 6, 7, etc. (3 out of every 4 cardinal values)
Section 2 would have values like 4, 8, 12, 16, etc. (1 out of every 4)
Then order your query result by this artificial column and take the top N rows.
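One way to realize such a density-weighted ranking is to divide each row's within-section rank by the size of its section, so that every section's rows are spread evenly over the same 0 to 1 range. This is a Python sketch of that variant, not necessarily the exact numbering the answer has in mind; the function name `proportional_sample` and the `(question_id, section_id)` row shape are assumptions.

```python
import random


def proportional_sample(rows, n, rng=random):
    """Pick n rows so that each section contributes in proportion to its size.

    rows: list of (question_id, section_id) tuples (hypothetical shape).
    """
    by_section = {}
    for row in rows:
        by_section.setdefault(row[1], []).append(row)

    keyed = []
    for section, group in by_section.items():
        group = group[:]
        rng.shuffle(group)  # random order inside the section, like ORDER BY NEWID()
        size = len(group)
        for rank, row in enumerate(group, start=1):
            # Rank scaled by section size: a section with 3x the rows
            # gets 3x as many keys below any given threshold.
            keyed.append(((rank - 0.5) / size, section, row))

    # Smallest keys first; ties are broken by section id.
    keyed.sort(key=lambda item: item[:2])
    return [row for _, _, row in keyed[:n]]
```

In T-SQL the same sort key could presumably be written as `(ROW_NUMBER() OVER (PARTITION BY SectionID ORDER BY NEWID()) - 0.5) / COUNT(*) OVER (PARTITION BY SectionID)`, followed by `TOP (N)` on the outer query; that form is untested here.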
This is untested in the absence of sample data, however, something like this might work:
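WITH CTE AS (
    SELECT ID,
           Question,
           SectionID,
           -- random position within each section, so it can be compared to the section's quota
           ROW_NUMBER() OVER (PARTITION BY SectionID ORDER BY NEWID()) AS RN,
           -- the section's share of all rows, scaled to the 10 rows requested
           (COUNT(ID) OVER (PARTITION BY SectionID) / (COUNT(ID) OVER () * 1.0)) * 10 AS Perc
    FROM YourTable
)
SELECT TOP 10 ID, Question, SectionID
FROM CTE
WHERE RN <= CEILING(Perc)
ORDER BY RN ASC;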
Another alternative is to return a fixed percentage of the rows in each section, for example 20%:
DECLARE @percentage numeric(10,2)
SET @percentage = 0.20 -- 20% of the total questions in each section
SELECT [SectionID], [ID], [Question]
FROM (
    SELECT [ID],
           [Question],
           [SectionID],
           ROW_NUMBER() OVER (PARTITION BY SectionID ORDER BY NEWID()) [idx],
           COUNT(1) OVER (PARTITION BY SectionID) * @percentage AS [Proportional]
    FROM [Questions]
) tbl
WHERE (tbl.[SectionID] = 1 AND tbl.[idx] <= [Proportional])
   OR (tbl.[SectionID] = 2 AND tbl.[idx] <= [Proportional])
   OR (tbl.[SectionID] = 3 AND tbl.[idx] <= [Proportional])
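You can use the NTILE(100) function with an OVER clause partitioned by section to take the same percentage of rows from every section. Instead of
SELECT TOP 10 [Id], [Question] FROM [Questions] WHERE [SectionId] IN (1,2) ORDER BY NEWID()
try:
declare @limit int = 10;
;with data as (
    SELECT NTILE(100) over (partition by sectionid ORDER BY NEWID()) as Centile,
           [Id],
           [Question]
    FROM [Questions]
    WHERE [SectionId] IN (1,2)
)
select *
from data
where centile <= @limit
Here @limit is a percentage of each section, not a row count. Note that NTILE(100) only approximates percentiles for sections with at least 100 rows; a smaller section gets one row per tile, so the filter returns up to @limit rows from it.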