SQL query mixing aggregated results and single values
- by Paul Flowerdew
I have a table with transactions. Each transaction has a transaction ID, and accounting period (AP), and a posting value (PV), as well as other fields. Some of the IDs are duplicated, usually because the transaction was done in error. To give an example, part of the table might look like:
ID PV AP
123 100 2
123 -100 5
In this case the transaction was added in AP2 then removed in AP5.
Another example would be:
ID PV AP
456 100 2
456 -100 5
456 100 8
In the first example, the problem is that if I am analyzing what was spent in AP2, there is a transaction in there which actually shouldn't be taken into account because it was taken out again in AP5. In the second example, the second two transactions shouldn't be taken into account because they cancel each other out.
I want to label as many transactions as possible which shouldn't be taken into account as erroneous. To identify these transactions, I want to find the ones with duplicate IDs whose PVs sum to zero (like ID 123 above) or transactions where the PV of the earliest one is equal to sum(PV), as in the second example. This second condition is what is causing me grief.
So far I have
SELECT *
FROM table
WHERE table.ID IN (SELECT table.ID
FROM table
GROUP BY table.ID
HAVING COUNT(*) > 1
AND (SUM(table.PV) = 0
OR SUM(table.PV) = <PV of first transaction in each group>))
ORDER BY table.ID;
The bit in chevrons is what I'm trying to do and I'm stuck. Can I do it like this or is there some other method I can use in SQL to do this?
Edit 1: Btw I forgot to say that I'm using SQL Compact 3.5, in case it matters.
Edit 2: I think the code snippet above is a bit misleading. I still want to mark out transactions with duplicate IDs where sum(PV) = 0, as in the first example. But where the PV of the earliest transaction = sum(PV), as in the second example, what I actually want is to keep the earliest transaction and mark out all the others with the same ID. Sorry if that caused confusion.
Edit 3: I've been playing with Clodoaldo's solution and have made some progress, but still can't get quite what I want. I'm trying to get the transactions I know for certain to be erroneous. Suppose the following transactions are also in the table:
ID PV AP
789 100 2
789 200 5
789 -100 8
In this example sum(PV) < 0 and the earliest PV < sum(PV) so I don't want to mark any of these out.
If I modify Clodoaldo's query as follows:
select t.*
from
t
left join (
select id, min(ap) as ap, sum(pv) as sum_pv
from t
group by id
having sum(pv) <> 0
) s on t.id = s.id and t.ap = s.ap and t.pv = s.sum_pv
where s.id is null
This gives the result
ID PV AP
123 100 2
123 -100 5
456 -100 5
456 100 8
789 100 3
789 200 5
789 -100 8
Whilst the first 4 transactions are ok (they would be marked out), the 789 transactions are also there, and I don't want them. But I can't figure out how to modify the query so that they're not included. Any ideas?