Home › Forums › The XBRL API › Lists of similar concepts
- This topic has 121 replies, 15 voices, and was last updated 12 months ago by Peter W Reed.
-
AuthorPosts
-
-
Friday, January 4, 2019 at 5:14 PM #119805Tim BuiParticipant
Hello, as companies have the freedom to create unique names of their tags (concepts), does XBRLUS keep a list of all similar concepts on the edgar_db database so that the users can save time in doing standardization?
For example, to get a simple Total Revenue, so far I found 5 different concepts:
- Revenues
- SalesRevenueNet
- SalesRevenueGoodsNet
- TotalRevenuesAndOtherIncome
- RevenueFromContractWithCustomerIncludingAssessedTax
To find Nonrecurring charges (such as impairment and restructuring), so far I found 13 concepts:
- Assetacquisitioncharge
- AssetImpairmentCharges
- ImpairmentOfLongLivedAssetsToBeDisposedOf
- RestructuringCostsAndAssetImpairmentCharges
- ImpairmentOfIntangibleAssetsIndefinitelivedExcludingGoodwill
- Impairmentandrestructuringexpenses
- RestructuringSettlementAndImpairmentProvisions
- GoodwillImpairmentLoss
- RestructuringCharges
- RestructuringCosts
- RestructuringChargesAndAcquisitionRelatedCosts
- GoodwillAndIntangibleAssetImpairment
- ImpairmentOfLongLivedAssetsHeldForUse
The above 13 non-recurring items are exclusive of the additional 7 gain/loss concepts below:
- GainsLossesOnExtinguishmentOfDebt
- DerivativeGainLossOnDerivativeNet
- GainLossOnSaleOfBusiness
- GainLossOnSaleOfNonstrategicBusinessesAndAssets
- GainLossRelatedToLitigationSettlement
- GainLossOnDispositionOfAssets1
- GainLossOnDispositionOfIntangibleAssets
Thank you!
-
Monday, January 7, 2019 at 2:18 PM #119925David TaurielloKeymaster
Tim – we don’t maintain lists like the ones posted here; you’ve done a great job to create a few groups for standardizing elements and tagged it so others can find it easily and post additional groups or additions to those listed above.
-
Monday, January 7, 2019 at 9:46 PM #119958Tim BuiParticipant
For trend analysis, we need to look at historical numbers. However, some companies changed the tag names so it is very difficult to pull a consistent list of numbers using the API. If anyone has a solution, please share with me.
For example, Microsoft (CIK: 0000789019) used 4 tags for its total revenue over the last few years:
– For FY 6/18 (which has data for 6/18, 6/17,and 6/16) it used “revenuefromcontractwithcustomerexcludingassessedtax”
– For FY 6/17, it used “SalesRevenueGoodsNet”
– For FY 6/15, it used “SalesRevenueNet”
– For FY 6/10, it used “Revenues”
Using only one of the above tags will get only 3 years of data. If we use all 4 tags, then we have duplications such as the list below. Even with 4 tags, I am not able to get Total Rev for FY2018.entity.cik period.fiscal-year period.end concept.local-name fact.value
0000789019 2009 2009-07-01 Revenues 58437000000
0000789019 2010 2010-07-01 Revenues 62484000000
0000789019 2009 2009-07-01 SalesRevenueNet 58437000000
0000789019 2010 2010-07-01 SalesRevenueNet 62484000000
0000789019 2011 2011-07-01 SalesRevenueNet 69943000000
0000789019 2012 2012-07-01 SalesRevenueNet 73723000000
0000789019 2013 2013-07-01 SalesRevenueNet 77849000000
0000789019 2014 2014-07-01 SalesRevenueNet 86833000000
0000789019 2014 2014-07-01 SalesRevenueGoodsNet 72948000000
0000789019 2017 2017-07-01 SalesRevenueGoodsNet 57190000000
0000789019 2017 2017-07-01 SalesRevenueNet 89950000000
0000789019 2016 2016-07-01 SalesRevenueGoodsNet 61502000000
0000789019 2016 2016-07-01 SalesRevenueNet 85320000000
0000789019 2015 2015-07-01 SalesRevenueGoodsNet 75956000000
0000789019 2015 2015-07-01 SalesRevenueNet 93580000000-
Monday, August 9, 2021 at 3:24 PM #193159Peter W ReedParticipant
Tim I have the same experience as you. I have found label mapping in the “XBRL TAXONOMY EXTENSION LABEL LINKBASE” file to be the best source for mapping common names with XBRL names. Below is an example from Microsoft’s latest 10-Q filing. It maps “us-gaap:NetIncomeLoss” to the common name “Net income”. The Python package “xmlschema” is the best library I’ve found to parse Edgar xml files. BeautifulSoup does a great job too. I prefer BeautifulSoup due to familiarity and the time it takes xmlSchema to execute (15 seconds) is too long.
To get the history you mention parsing needs to be done on each individual filing, yuck. Even then additional data cleansing is required. I know a better way is out there.
<link:label xml:lang=”en-US” xlink:label=”us-gaap_NetIncomeLoss_lbl” xlink:type=”resource” xlink:role=”http://www.xbrl.org/2003/role/totalLabel”>Net income</link:label>
-
Tuesday, August 10, 2021 at 11:02 AM #193173Tim BuiParticipant
Thank you for sharing the info on label mapping, Peter! I am reading this topic now. I also need to learn BeautifulSoup so that I can parse the xml files.
-
Tuesday, August 10, 2021 at 1:35 PM #193175Peter W ReedParticipant
I can upload snips from my program to github, if you think there is value. The code is sloppy and meant for personal use only. But it would be a good start with BeautifulSoup. The xmlschema library is very easy to use. I will include code snips for xml parsing too. My assumption is you are familiar with Python data structures like Pandas and default dictionary.
Below is a sample output of a default dictionary using a list factory. This is from Walmart’s Q1 2016 10-Q filing. There are a total of 216 keys. Below is a small sample of us-gaap mapping to common names (list):
‘stockholdersequityincludingportionattributabletononcontrollinginterestabstract’: [‘Equity:’,
“Stockholders’ Equity, Including Portion Attributable to Noncontrolling Interest [Abstract]”],
‘commonstockvalue’: [‘Common stock’,
‘Common Stock, Value, Issued’],
‘additionalpaidincapital’: [‘Capital in excess of par value’,
‘Additional Paid in Capital’],
‘retainedearningsaccumulateddeficit’: [‘Retained earnings’,
‘Retained Earnings (Accumulated Deficit)’],
‘stockholdersequity’: [“Total Walmart shareholders’ equity”,
“Stockholders’ Equity Attributable to Parent”],
‘minorityinterest’: [‘Nonredeemable noncontrolling interest’,
“Stockholders’ Equity Attributable to Noncontrolling Interest”],
‘liabilitiesandstockholdersequity’: [‘Total liabilities and equity’,
‘Liabilities and Equity’],
‘comprehensiveincomenotetextblock’: [‘Accumulated Other Comprehensive Income (Loss)’,
‘Comprehensive Income (Loss) Note [Text Block]’],
‘incomestatementabstract’: [‘Income Statement [Abstract]’],
‘revenuesabstract’: [‘Revenues:’, ‘Revenues [Abstract]’],
‘salesrevenuenet’: [‘Net sales’, ‘Revenue, Net’], -
Tuesday, August 10, 2021 at 1:52 PM #193176Tim BuiParticipant
Please upload your program to Github, Peter. I would learn a lot from it. I started to learn Python and Pandas and can get around Python comfortably. Would you also please send out your Github link? Thank you!
-
Tuesday, August 10, 2021 at 5:18 PM #193181Peter W ReedParticipant
I placed the code snippets in a readme for a new project. Creating a .py file didn’t make sense since it wasn’t a full script or algorithm. Please find a link to the readme file below.
-
Tuesday, August 10, 2021 at 5:51 PM #193182Tim BuiParticipant
Thank you, Peter! I am studying the codes now
-
Wednesday, August 11, 2021 at 6:25 PM #193201Peter W ReedParticipant
The xmlschema code snip will fail as written. It worked well for me because I had already defined the header in the BeautifulSoup program. I’ll edit the github to include the header definition. The SEC has an acceptable use policy that forbids (error code 403) the use of automated web scraping programs. The default header for request used in BeautifulSoup and urllib used by xmlschema identifies itself as a Python header. I believe they did not intent to prohibit programs using Python under user control like Arelle, or the programs I posted on github.
Use the header definition provided below. Note there are Python modules that perform more sophisticated header rotation. Edgar doesn’t seem to need that level of spoofing to work.
headers = {
“Host”: “www.sec.gov”,
“User-Agent”: “Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0”,
“Accept”: “text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8”,
“Accept-Language”: “en-US,en;q=0.5”,
“Accept-Encoding”: “gzip, deflate, br”,
“DNT”: “1”,
“Connection”: “keep-alive”,
“Upgrade-Insecure-Requests”: “1”,
“Sec-GPC”: “1”,
“Cache-Control”: “max-age=0”
}- This reply was modified 3 years, 4 months ago by Peter W Reed. Reason: typographical errors
-
Wednesday, August 11, 2021 at 6:51 PM #193203Peter W ReedParticipant
Be careful with the xmlschema output. The xbrl parsing is correct, but printing the output is precarious. Multiple styles are used in creating the 10-Q/K. The output “xmlTmp_xbrl” is a dictionary but the keys and other data structure elements change year to year and company to company. The number of variations is small but a test for which variation is in use needs to be determined.
-
Tuesday, May 31, 2022 at 7:32 AM #201042Timur MirzaevParticipant
Hi Tim,
Have you been able to resolve your issue with duplicate values once there is a tag change? I am faced with exactly the same issue.
Regards,
Timur
-
Tuesday, May 31, 2022 at 9:52 AM #201043Tim BuiParticipant
Hi Timur,
I am still struggling to make the vast amount of data provided by the SEC be useful for my analysis. So far I have no solutions to standardize the financial items to make them comparable inter-industry (financial items have to be compared to others in the same industry) or overtime (when the tag names changed). Your question brings up an idea for me to ask the SEC staff if they provide a code to alert users if there are changes in the tag names or reporting format. So far, the SEC only provides a code if a company changes its name (Submission file, code ‘former’ and ‘changed’ in the Financial Statement Data Sets and Financial Statement and Notes Data Sets)Regards,
Tim
-
Tuesday, May 31, 2022 at 11:36 AM #201055Peter W ReedParticipant
I hope this helps – In the “US_GAAP_Taxonomy_YYYY.xlsx” spreadsheet the tab ‘Elements’ has deprecation description. Columns ‘N’ and ‘O’ provide the information on a tags status.
I gave up scraping the 10-K/Q for just the reason mentioned. The XBRL tag names change over time. So I went back to Yahoo! and Alpha Vantage.
One reverse engineering activity I started doing was to map the Yahoo! element name to the XBRL tag name. The correlation id is the value. For example Alpha Vantage – Total Asset 10,000 and XBRL Asset 10,000. Good hunting!
-
Tuesday, May 31, 2022 at 12:00 PM #201057Tim BuiParticipant
Thank you for pointing this out, Peter!
As you pointed out col N and col O alerts the tags that were deprecated (referenced to col A which has the actual tag names that were deprecated).
I think it will take some programming skills (way beyond my current ability) to match these deprecated tags and their years to the point-in-time data to line up the numbers properly.
It is so frustrating to see there are so much data there but we individuals (at least those with low technical skills like me) can’t take advantage of them yet.Thanks again, Peter!
-
Wednesday, June 1, 2022 at 12:58 PM #201082Timur MirzaevParticipant
Hi, Tim and Peter!
Thank you both for your replies. Could you please guide me to where I can get the Taxonomy xlsx file mentioned above? I went to XBRL.us Taxonomy page but the xlsx links (for 2020-2022 periods) were broken (“BlobServer.txt” was all I was getting).
Thank you in advance!
Regards,
Timur -
Wednesday, June 1, 2022 at 2:24 PM #201086Peter W ReedParticipant
Something on the SEC site changed, or I’m not remembering correctly. Below is the link I remember using to get the Excel files for taxonomies. All I’m seeing now are xml schema (.xsd) files.
https://www.sec.gov/info/edgar/edgartaxonomies.shtml#IFRS2018
-
Wednesday, June 1, 2022 at 2:40 PM #201089Tim BuiParticipant
Timur, I tried to reply 3 times with the URLs but this text app probably does not allow URL.
You can google “US GAAP Taxonomies 2022” and it will lead you to the FASB site to download the Excel sheet.
You can also get the Excel sheet from this XBRL US website if you search for US Taxonomies
-
Wednesday, June 1, 2022 at 3:14 PM #201091Peter W ReedParticipant
Tim,
I didn’t have any success with the FASB website earlier today. That is where I remember getting it before, too.
So I tried a different track and went to the IFRS site. The link below is for 2019. The information is presented in a different manner – it may even be better for your application. I tried using the “Link” function for this link. I hope it works,
-
Tuesday, July 12, 2022 at 7:25 AM #201088Tim BuiParticipant
Timur, I had replied twice but some how my notes were not posted.
You can get the Excel files of US Taxonomy from these 2 sources:
– https://fasb.org/Page/PageContent?PageId=/xbrl/2022financial.html
– https://xbrl.us/xbrl-taxonomy/2022-us-gaap/ -
Tuesday, July 12, 2022 at 7:28 AM #201960David TaurielloKeymaster
Hi Tim – thanks for posting and apologies to you and Timur – posts with URLs are pushed to moderation and I didn’t get a notification for this.
-
Tuesday, July 12, 2022 at 7:29 AM #201085Tim BuiParticipant
Hi Timur, You can google “US GAAP Taxonomy 2022” then it will take you to this site: https://fasb.org/Page/PageContent?PageId=/xbrl/2022financial.html
In the section “Guidance and Supporting Documents”, you will see 2022 GAAP Taxonony(Excel Version). You can download and extract this large Excel file. The fist tab is “Elements” that contains Col N & O that Peter referred to.XBRL.US also has the same Excel (xls format) at this site: https://xbrl.us/xbrl-taxonomy/2022-us-gaap/
Tim
-
-
Monday, January 7, 2019 at 9:58 PM #119959Tim BuiParticipant
The trend in Gross Margin (GrossProfit / Total Revenue) says a lot about the business condition of a company. However not every company reports Gross Profit. For those who do not, we have to subtract Cost of Goods Sold from Total Revenue. I do not have a solution to get consistent numbers for Total Revenue yet, however, for Cost of Goods Sold, I have found a few different tags below:
– CostOfGoodsAndServicesSold
– CostOfRevenue
– CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization
– CostsAndExpenses -
Thursday, February 14, 2019 at 7:30 PM #123764David TaurielloKeymaster
Tim – to get lists of concepts in the base (standard) US GAAP and IFRS Taxonomies, query:
/dts/search?dts.taxonomy=US%20GAAP,IFRS&fields=dts.id,dts.taxonomy-name
With each dts.id from above that you want concept information for, query like this:
/concept/search?dts.id=257590&fields=concept.local-name.sort(ASC),concept.*,label.*
Every company references a base taxonomy in its filing as a starting point. The company filing is essentially a taxonomy that is inheriting from the base and adding concepts as necessary according to the company’s policies and practices with respect to its financial statements.
NOTE: you could include concept.local-name with a comma-delimited list before the fields that are returned to get only specific concepts.
If you pull an entire taxonomy, understand that there’s a great deal of information and a significant number of records involved, so this may take a while. Fortunately, these taxonomies don’t change – new releases annually – so the details only need to be pulled one time.
-
Thursday, February 21, 2019 at 11:19 AM #124615Tim BuiParticipant
Hi David,
I finally got around to apply the codes that you described above. It’s just amazing how much information returned from the call.A couple questions please:
1. Using =CONCATENATE(A1&”/concept/search?dts.id=257590&fields=concept.local-name.sort(ASC),concept.*,label.*”) only return about 2000 concept local names. How can I change the code so that I can get a complete list so that I can pick and choose what I need?2. I am still trying to find ways to shorten my code to get more factual data for more companies per call (to be under the length of the code that Google restricts). Can I use the concept.id in place of the concept.local-name in the call?
As always, thank you for help!
Tim
-
Thursday, February 21, 2019 at 1:49 PM #124655David TaurielloKeymaster
Use
ENDPOINT.offset(integer)
in the fields= portion of your query to get additional concepts. Something likeconcept.offset(2001)
will show concepts after 2,000 (for Power User and Sole Practitioner Individuals, as well as all Organizational XBRL US Members). This offset works for all endpoints (fact, dts, etc.)See the documentation and this thread for more information: https://xbrl.us/forums/topic/how-to-get-a-sample-of-records-via-xbrl-api-for-evaluation/#post-116618.
Again – because you’re querying for the base taxonomy concepts, the details you return will not change (these are published once), so your best/most effective approach will be to copy the details to a tab (or file) and use that as your reference.
Like other “id” parameters,
concept.id
returns a unique integer that corresponds withconcept.local-name
so it can be used as a reliable substitute. -
Thursday, February 21, 2019 at 4:53 PM #124690Tim BuiParticipant
Thank you, David. I was able to download 17,035 concepts. The next part is to understand these data and select which one to use.
Thank you again for your help
Tim -
Friday, April 19, 2019 at 6:38 PM #133617Tommy CarstensenParticipant
Tim, did you ever manager to create a list of identical concepts? It’s a bit of a mess, when it’s not standardized. I’m surprised the SEC chose to allow companies to use random names as they see fit. It completely defeats the purpose of XBRL.
-
Saturday, April 20, 2019 at 6:35 AM #133655Tim BuiParticipant
Hi Tommy,
I started on the standardization but have not finished it yet. At the beginning, I planned to use data from XBRL US but David Tauriello pointed me to the SEC website where I can get all of the data from all filers more efficiently (https://www.sec.gov/dera/data/financial-statement-data-sets.html). I am importing these data into a SQL Server to do standardization. I am learning the SEC data structure and think I have a way to do better standardization, but I need to test this method further. I would be happy to share my methodology with you if it works. In the meantime, if you want, I can give you what I have done so far, but it is incomplete. I use financial information for investing so I only standardize the items that I think relevant to my work.
Yes, I agree that whichever entity (SEC, AICPA, CFA,…) that allows companies to name their tags (concepts) at-will really weaken the case for XBRL and disadvantage “smaller” financial data users like me. It’s illegal to fudge the numbers but accountants can name the numbers whatever they want and can change these names when it’s convenient–making peer comparison or historical comparison extremely difficult. Data providers such as Bloomberg, Factset, CapitalIQ, Thomson Reuters will be in business for a very long time. -
Saturday, April 20, 2019 at 1:18 PM #133680Tommy CarstensenParticipant
Aye Tim, I think Morningstar and the rest of them will continue to thrive as long as the data is not standardised. I wanted to plot the data over time and across companies within an industry, but that turned out not to be trivial, because there is no requirement for the data to be homogenous over time and across industries. I hope the law and XBRL specifications are changed and data is homogenised going forward to enable small fintech companies to compete.
Here my first attempt to plot data for General Mills over time:
http://midgardinvest.com/10q10k/G/GIS/I shall be watching this thread to learn more about your attempts to standardise the data. I appreciate your efforts on behalf of the community. Thanks!
-
Friday, May 17, 2019 at 6:30 AM #139650D QParticipant
I too am interested in standardizing concepts for investing purposes.
I don’t know SQL but DM me if I can help.
-
-
Friday, May 17, 2019 at 7:03 AM #139657Tim BuiParticipant
Hello DQ, I am still working on standardizing the tags. I think I am on the right track, however this is not a trivial task. I was able do download 87MM (yes million) lines of data from the SEC. And now I am parsing them out using SQL Server. It is slow moving because I have to check and recheck to make sure the data match with the 10Ks and 10Qs. I am happy to give the results to whoever wants them because they are not proprietary data and I am too a beneficiary of communities like XBRL US (David Tauriello at XBRL US has spent a lot of time bringing me up to speed.)
To use this massive amount of data, one will need to use some database for sorting and screening. Microsoft Excel or Access cannot handle this much of data. Maybe we all can put in a request to XBRL US to allow members to contribute to this standardization efforts by creating a depository area on XBRL US pgAdmin.
Commercial data providers do provide their own standardization but depending one’s need, the standardization has to be customized somewhat. Sorry for this oxymoron word of customizing the standardization. But for example, companies reports several types of Account Receivables. There are 948 distinct tags on just AccRec. Most data providers have just 1 line for AccRec. I try to break them down to 4 subcategories: AccRec_Trade_Short_Term, Acc_Rec_Finance_Short_Term, Acc_Rec_Trade_Long_Term and Acc_Rec_Trade_Long_Term. Acc_Rec_Trades are the receivables from regular customers. Acc_Rec_Financing are the receivables from financing activities such as a promissory note coming due or GM financing the dealers’ floorplans. ST or LT determines whether they are in current assets or long term assets. The change of each of these subcategories provides different type of information to the financial readers.
In the meantime, I would highly recommend you to check out the XBRL XL (https://xbrlxl.com/) website created by Jim Truscott. Jim also tries to do standardization. Jim had a demonstration of his Excel API hosted by XBRL US a few months back.
Let’s hope we here from XBRL US on this matter.
Tim-
Wednesday, May 22, 2019 at 4:17 PM #140245David TaurielloKeymaster
To use this massive amount of data, one will need to use some database for sorting and screening. Microsoft Excel or Access cannot handle this much of data. Maybe we all can put in a request to XBRL US to allow members to contribute to this standardization efforts by creating a depository area on XBRL US pgAdmin.
I’ve raised this idea internally; another possibility might be creating some sort of common Google Sheet that holds these standardized terms (taking off from Peter Guldberg’s template for balance sheet)
-
-
Friday, May 17, 2019 at 7:44 AM #139658Tim BuiParticipant
By the way, DQ, There is a company named Intrinio (https://intrinio.com/) that provides standardization on Excel. The prices seem to be very reasonable. I I tested their system out and found that their intereface is very to use. I think their data is suitable for most purposes. I do I own parsing raw data because I wanted to do further subsegments for my own use.
-
Thursday, January 2, 2020 at 11:44 PM #162239Nathan SudermanParticipant
I have been working in python to load all company XBRL filings into a SQL server. Now that I have the data available, I too am running into standardization issues with the tags(concepts). I don’t want to reinvent the wheel if there is already a mapping table that has been created.
Has anyone already standardized/mapped the key tags?
-
Friday, January 3, 2020 at 1:35 PM #162295Tim BuiParticipant
Hi Nathan, I too was able to upload the Financial Data Set into SQL Server, but even after spending a tremendous amount of time trying to standardize the tags, I have failed to come up with a workable solution. There are just way too many inconsistencies and variations in the way companies naming their tags. They name the tags to best describe their particular financial items at that moment in time and change the tag if their situations change. It is difficult to find consistency for the same company over time. It becomes more difficult when comparing tags of companies within the same economic sector or across sectors.
Being an investor, I try to standardize the tags to get the granularity for financial analysis. Not sure what is your goal for standardization, but if you wanted, we can collaborate and share ideas on this issue. Tim
-
Thursday, May 14, 2020 at 10:12 AM #179790Chinmay LaddhaParticipant
Hello Tim and Nathan, I am working on a project where anyone can populate the balance sheet, Income Statement, Cash Flow statement and some financial ratios for 6-7 years data into a google sheet in format/template and parameters of my choice. A python program where I could create a GUI, a user can select the company, parameters he/she want to see, then after selecting these parameters will be visible on the google sheet
-
Thursday, May 14, 2020 at 10:21 AM #179795D QParticipant
Hi Chinmay, how do you solve the problem of standardizing the different tags?
-
-
Thursday, May 14, 2020 at 10:18 AM #179792Tim BuiParticipant
Hi Chinmay, I would love to see your project. While I have just begun to study Python so I am not sure if I can help in coding, but I do understand finance and financial analysis, so if I can assist in anyway, I would want to do that. Thank you!
-
Thursday, May 14, 2020 at 10:48 AM #179799Tim BuiParticipant
Hi DQ, if you have your method to standardize, maybe you can share with us. Here is how I do it, but I am not there yet.
I download the 2020 US GAAP Taxonomy (https://xbrl.us/xbrl-taxonomy/2020-us-gaap/) using the xls format. The tab “Presentation” inside this spreadsheet lists out all of the US GAAP tags (“name” column) and their sources (“definition” column) in hierarchical order as listed by the numbers on columns “depth” and “order”. I exact similar tags and group them the way I want to see them on a the financial statements. However, since there are so many extensions in addition to the standard tags, I am still missing a lot of tags. I do all of the grouping in SQL Server.I get additional extensions by using the data downloaded from XBRL.US xsheet. David Tauriello said there is no source to get all of the extensions in one place as they are company specific.
For grouping, I use names like ca_101_cash, ca_102_mrkt_sec, ca_103_restricted_cash,….
So far I get about 26 tags for the cash group.
CashAndCashEquivalents
CashAndCashEquivalentsAbstract
CashAndCashEquivalentsAtCarryingValueNetOfReclassificationAdjustments
CashAndCashEquivalentsExcludingCashWithBroker
CashAndCashEquivalentsExcludingCashWithBrokerTradingAccounts
CashAndCashEquivalentsRelatingToVIEs
CashCashEquivalents
CashCashEquivalentsFederalFundsSoldAndOtherShortTermInvestments
CarryingValueOfCashAndCashEquivalents
CashAndEquivalentsExcludingAssetsHeldForSale
CashAndCashEquivalentsAtCarryingValueExcludingRestrictedCash
Cash
CashAndCashEquivalentExcludingFinancialServicesCashAndCashEquivalents
cashandcashequivalentsatcarryingvalue
CashAndCashEquivalentsAtCarryingValueIncludingDiscontinuedOperations
CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents
CashEquivalentsAtCarryingValue
ShorttermBankDeposits
CashAndCashEquivalentsFairValueDisclosure
CashAndDueFromBanks
CooperativeCentralBankDeposit
CashAndCashEquivalentsAtCarryingValueExcludingCertificatesOfDeposit
CashandCashEquivalentsExcludingTimeDepositsatCarryingValue
InterestBearingDepositsInBanks
TimeDepositsAtCarryingValue
CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalentsIncludingDisposalGroupAndDiscontinuedOperations-
Tuesday, August 10, 2021 at 3:44 AM #193166Mikko OlkkonenParticipant
Tim,
I am developing my own special purpose standardization for few tags that are relevant for my specific use case. I uploaded my mapping into my github repo. I am interested in similar mappings that I could add into my version. -
Saturday, September 4, 2021 at 7:50 PM #193851Peter W ReedParticipant
I searched github “olkkonen in:name” for your repository. There were no public repositories for me to view.
I decided to have another go at the Excel downloads for 10-K and 10-Q. The approach to obtaining commonality/comparability between filings over time and between companies is context. Context being the sheet and sub-sections within sheets. I’ll let you know if I have success.
In the meantime can you invite me into your private github? Or did I just not find the correct repo? Thank you.
-
Sunday, September 5, 2021 at 7:31 AM #193861Mikko OlkkonenParticipant
Peter, My repo should be public at
https://github.com/molkko/deramongo
-
-
Thursday, May 14, 2020 at 10:51 AM #179800Tim BuiParticipant
Sorry for the typo on my note above. It’s not XBRL.US xSheet. It’s XBRLXL xSheet (https://xbrlxl.com/)
-
Saturday, October 24, 2020 at 7:18 PM #186667Husein KirefuParticipant
I read through the xbrl.org site, and was left with the impression they had created, in junction with FASB & SEC, a standardized way of taging lines of financial data with what they call “concept”. However I could not find a xblr based General Ledger giving these “concept”.
After reading this thread, am I correct in concluding there is no standard IFRS/GAAP/ect standard for pulling data across periods in a company and across companies?
If so, what value is there in xblr for an investor if a company can change/create “concept” to reclassify a ledger item requiring a manual audit of the data for integrity?
-
Monday, October 26, 2020 at 10:35 AM #186684David TaurielloKeymaster
Hi Husein – thanks for taking a look at the information on our site and at XBRL International. It’s not clear from your post what specific data you’re looking for – our Public Filings Database is designed to return all data filed under the SEC’s requirement for public companies since 2009, as well as the base taxonomies published by the FASB and the IFRS. You can use the XBRL API to get started with the data (https://xbrl.us/xbrl-api-community) – we’ve posted documentation and several templates and tools that can help familiarize you with the data set. We also link to taxonomy viewers for the base taxonomies – see the links on the right side of https://xbrl.us/2020-us-gaap.
The extensibility of the business reporting standard for reporting is one of its great strengths. The US SEC’s implementation for public company financials is an ‘open reporting’ environment (see the glossary of our Taxonomy Development Handbook – https://xbrl.us/tdh). In adopting the use of XBRL for public companies, the US SEC acknowledged its responsibility to ‘limit the use of extensions to circumstances where the appropriate financial statement element does not exist in the standard list of tags’ (https://www.federalregister.gov/d/E9-2334/p-520).
We continue to work with both the FASB and the SEC to recommend taxonomy modelling for the US GAAP accounting standard and identify issues related to filing patterns. Limiting the use of extensions has been an on-going part of our discussions.
-
-
Saturday, October 24, 2020 at 10:27 PM #186668Tim BuiParticipant
Hi Husein, XBRL has done a tremendous job in bringing financial reporting data into the 21st century. However, from my perspective, this data reporting mechanism is helpful only to the filers and to the data vendors at this present time. Unless one knows a lot about programming, individual investors like me still do not have an easy way to use this information. Companies have so much leeway in using tags and extensions to describe their particular situations. Without standardization, it is impossible to properly do trend or comparative analysis.
-
Tuesday, April 27, 2021 at 7:36 AM #191252Mikko OlkkonenParticipant
I plan to import https://www.sec.gov/dera/data/financial-statement-data-sets.html data to a MySQL database.
Are there any ready-to-use table definition SQL scripts for forming a suitable SQL database?
-
-
Tuesday, April 27, 2021 at 8:02 AM #191253Tim BuiParticipant
Hi Mikko, I am not a real programmer but here is how I get the Fin Data Set into my Postgresql (I do not use MySQL).
– Within each quarterly release, there is a readme.html file that lists the fields of each of the 4 txt files. I create 4 tables in my Postgresql with these fields using the required datatypes. I use the fields of each table as listed on Section 3 (Organization) as primary keys.
– I then use Python to clean up the data of each of the txt file and save as csv files with ‘~’ as delimiter (not comma because some fields have commas).
– Finally I use the Copy command in Postgresql to import the csv files into Postgres. -
Wednesday, July 7, 2021 at 11:44 AM #192501Matthew BeveridgeParticipant
Tim, Mikko, and others — I have a primitive (so far) python package I am working on to simplify manipulating this data by hosting the dera financials database remotely and standardizing query results into pandas dataframes. If you are interested in collaborating on this, don’t hesitate to reach out. I plan for an alpha release in the next few weeks.
Functionality will be along the lines of:
# import the package from mypackage import fundamentals # get the data form = fundamentals.ten_k('aapl', [2018, 2019]) # manipulate the data debt_cap = form.debt_capitalization() margin = form.gross_margin() roe = form.return_on_equity()
and so on. Results will look like:
print(form)
fy 2018 2019
tag uom
AccountsPayableCurrent USD 5.588800e+10 4.623600e+10
AccountsReceivableNetCurrent USD 2.318600e+10 2.292600e+10
AccruedIncomeTaxesNoncurrent USD 3.358900e+10 2.954500e+10
AccumulatedDepreciationDepletionAndAmortization… USD 4.909900e+10 5.857900e+10
AccumulatedOtherComprehensiveIncomeLossNetOfTax USD -3.454000e+09 -5.840000e+08
… … …
UnrecordedUnconditionalPurchaseObligationBalanc… USD 9.328000e+09 8.211000e+09
UnrecordedUnconditionalPurchaseObligationDueAft… USD 6.600000e+07 1.100000e+08
WeightedAverageNumberDilutedSharesOutstandingAd… shares 4.473200e+07 3.107900e+07
WeightedAverageNumberOfDilutedSharesOutstanding shares 5.000109e+09 4.648913e+09
WeightedAverageNumberOfSharesOutstandingBasic shares 4.955377e+09 4.617834e+09‘print(debt_cap)`
[251 rows x 2 columns]
fy 2018 2019
tag uom
DebtCapitalization ratio 0.707029 0.732692- This reply was modified 3 years, 5 months ago by Matthew Beveridge. Reason: code formatting
- This reply was modified 3 years, 5 months ago by Matthew Beveridge. Reason: code formatting pt 2
- This reply was modified 3 years, 5 months ago by Matthew Beveridge. Reason: include example results
-
Thursday, July 8, 2021 at 1:09 PM #192521Mikko OlkkonenParticipant
Hi Tim, Matthew and others,
I am now reading dera data (and European style ESEF XBRL) to mongoDB databases (i.e. I abandoned SQL). I have developed server functionality for querying the data with somebody. Our server code is mainly typescript, javascript/node.js. I guess that we have produced samewhat similar solution as Matthew has. Loose standardisation/use of the taxonomy/tags is our key problem even if we only need limited number of key items including ProfitLoss, Revenue, Cash, …
Tim, Matthew: where do you collaborate? In Github? -
Thursday, July 8, 2021 at 9:45 PM #192528Matthew BeveridgeParticipant
Tim and Mikko, I generally use github. I can share the repo with you all if you PM me (it’s currently private). Otherwise, I’ll get things in order to make it public and post the link within the next week or so.
-
Sunday, August 22, 2021 at 4:30 AM #193495Peter MillerParticipant
Hi Maththew,
may I ask what is your current state of the Python package? Did you upload it on GitHub?
Kind Regards
-
Monday, July 18, 2022 at 4:25 PM #202180Matthew BeveridgeParticipant
Hi Peter and apologies for the ridiculously late reply — you can find my most recent (yet very out-of-date) version here: https://github.com/mattbev/finpandas.
This is by no means a complete remedy and like Mikko said is very similar in principle to their solution. Additionally, the cloud bucket that this repository relies on has not been updated with new DERA data in many months. If there is sufficient interest I can revisit this, but otherwise, I am inclined to table/archive my version of this project until a later date.
-
Monday, July 18, 2022 at 10:52 PM #202184Peter W ReedParticipant
Thanks Matthew for taking the time to reply. I’m using Alpha Advantage for now. Let me know when you get back to your project on github? You don’t want me to write code – I’m a hack. But I will take the time to test or contribute in other ways. I’m fairly competent at documentation.
-
Wednesday, July 7, 2021 at 12:45 PM #192505Tim BuiParticipant
Hi Matthew, I would love to pitch in and help in anyway I can. After spending months on it, I almost gave up on the project because I couldn’t think through the proper way to standardize the financial items. I am a new student of Python, so hopefully I can learn from your Python program. Please let me know what steps I can do to help.
-
Thursday, July 8, 2021 at 1:17 PM #192522Tim BuiParticipant
Hi Mikko, thank you for reaching out. I would love to see your program. I created this github account.
https://github.com/TimdBui?tab=repositories
Is this what you are asking?Thank you!
-
Sunday, July 11, 2021 at 3:50 AM #192541Mikko OlkkonenParticipant
Tim, I am now following you on github. I have uploaded to my github account a bash script (deramongo.sh) I use for creating/maintaining my mongodb dera database. I have not yet uploaded the node.js server code (the node.js server offers http API for accessing the financials data stored in the mongoDB database).
However, some version of server may be up and running at:
vps-09403655.vps.ovh.net:8198/list returns company names and cik codes
vps-09403655.vps.ovh.net:8198/firm?cik=1503518 returns financials data corresponding to a cik
and so forth
-
-
Sunday, July 11, 2021 at 4:41 PM #192544Tim BuiParticipant
Thank you, Mikko! I just uploaded the files into my Postgres to check it out. I am also reading your DERA Financials data on Github right now. Thanks again
-
Saturday, August 21, 2021 at 11:05 AM #193473Peter MillerParticipant
Hello guys,
I am a Python enthusiast and was trying in the past to clean XBRL data from the official SEC database. What is the current state? I can’t follow the SQL code on Tim’s GitHub. I would appreciate a short explanation.
Kind Regards
-
Saturday, August 21, 2021 at 12:14 PM #193475Tim BuiParticipant
Could you elaborate what you are looking for or trying do to Peter?
-
Saturday, August 21, 2021 at 12:44 PM #193476Peter MillerParticipant
Thank you, Tim, for your replay.
I am trying to clean/standardize the financial statements available on the SEC website. To my knowledge, these financial statements are in XBRL format and as the original question pointed out (to my understanding), the financial statements are not standardized and for several data points there are different key words (Revenues;SalesRevenueNet and so on).
Have you guys found a way to standardize the data (with Python) to make it more useful?
-
Sunday, August 22, 2021 at 3:30 PM #193501Peter W ReedParticipant
I’m using the “XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT” file that is part of the Edgar filing. It has the mapping from gaap tag to common name. The following comes from an Apple 10-Q filing. I use the Python module xmlschema to create a dictionary. The real work comes later cleaning the results for usability and extending this back five years.
us-gaap_SalesRevenueNet_lbl = Revenue, Net
us-gaap_SalesRevenueNet_lbl = Net sales
us-gaap_Liabilities_lbl = Liabilities
us-gaap_Liabilities_lbl = Total liabilities
us-gaap_LiabilitiesAndStockholdersEquity_lbl = Liabilities and Equity
us-gaap_LiabilitiesAndStockholdersEquity_lbl = Total liabilities and shareholders’ equity
us-gaap_LiabilitiesAndStockholdersEquityAbstract_lbl = LIABILITIES AND SHAREHOLDERS’ EQUITY:
us-gaap_LiabilitiesCurrent_lbl = Liabilities, Current
us-gaap_LiabilitiesCurrent_lbl = Total current liabilities -
Sunday, August 22, 2021 at 3:45 PM #193502Peter MillerParticipant
Hi,
thanks a lot.
Is it from your experience possible, that the same “us gaap tag” is assigned to different “common names” when it comes to different companies in different time periods?
-
Sunday, August 22, 2021 at 3:58 PM #193504Peter W ReedParticipant
Yes, even with the same company in different fiscal reporting periods for a given year. I was hoping doing the 10-K mappings would be sufficient for that year. It didn’t work that way for me.
I’m still hoping the XBRL taxonomy files for a given year can provide a universal solution. I don’t have a background in accounting/finance, which may be required to solve this.
-
-
Saturday, August 21, 2021 at 1:09 PM #193482Tim BuiParticipant
Peter, I use 2 steps to do standardization, but I am no where close to the end.
1. I use the US GAAP Classified Financial Statements created by Mr. Charles Hoffman (The Father of XBRL) to create standardized buckets.
2. I get all of the tags from the SEC Financial Statement and Notes Data Sets and manually group the extensions (non GAAP standards) into the above buckets.
XBRL US reply feature does not seems to allow me to paste the website URL, but you can send me a message on Linkedin and I can send you more info.-
Sunday, August 22, 2021 at 3:50 PM #193503Peter W ReedParticipant
Thanks for the tip. I read everything I could find by Mr. Charles Hoffman. The “problem” I have with his terminology tables is they stopped being updated around 2015 (or earlier). Also, using the “XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT” gives me the 10-Q/K mapping for the specific 10-Q/K for the company of interest.
I got the xmlschema library working well. The software is in a pseudo code state – i.e. a hack. When I get it cleaned up and universal I’ll put it on github.
By “Universal” I mean there are at least two variations on the Python dictionary keys used in the Edgar LABEL LINKBASE files.
-
-
Saturday, August 21, 2021 at 1:24 PM #193483Peter W ReedParticipant
There are a couple of contributors like Peter Guldberg who have accounting concept to GAAP name mapping. What I’m creating is accounting concept to GAAP name on a per company basis for their last five years of filings. I’ve found that concepts like revenue can have a different GAAP tag name over time for any given company. This entails creating a concept mapping to multiple GAAP tags. The link label file in the Edgar filing provides all the concept-to-gaap mappings for that company’s 10-Q/K.
-
Saturday, August 21, 2021 at 1:53 PM #193484Peter MillerParticipant
Hello Peter,
< The link label file in the Edgar filing provides all the concept-to-gaap mappings for that company’s 10-Q/K.>
where can i find the “link label file” on the Edgar website?
Kind Regards
-
Sunday, August 22, 2021 at 4:15 PM #193506Peter W ReedParticipant
I use the Edgar search page to get to the list of a company’s 10-Q/K. Hope you recognize this task. On that page is a link to the company’s filings by period. If you click that link you get to the detailed Edgar filing. Below is an example for Apple. That page has the complete AAPL filing for that period. You’ll see the Linkbase file on that website page.
-
Sunday, August 22, 2021 at 4:25 PM #193507Tim BuiParticipant
Peter, The Linkbase file sounds so promising. I use the new Edgar filing page all the time, but I can’t find the Linkbase file any where. Would you please give more detail steps on how to get this file? Thanks!
-
Sunday, August 22, 2021 at 5:52 PM #193513Peter W ReedParticipant
It appears the Edgar website has methods in place to hinder automated programs. Also, I just tried to paste an image into my response. It failed.
The best I can do is provide a url to Apple’s Edgar first page. Open up the 10-K hyperlink. You’ll see a “View all…” button, click there. Finally, click on an individual “Filing” button and you are there. I created a text cheat sheet that makes the clicks easier/faster. The files in the bottom half of the final page contains the linkbase, schema (WDSL), and the xml file version of the 10-K/Q. The XBRL! file is there too (click on it if you’ve not seen this before).
-
Sunday, August 22, 2021 at 5:37 PM #193512Peter W ReedParticipant
I just discovered the link I pasted in my response gets you to the Edgar page for 10-K/Q. Sorry, it is most likely done on purpose to foil bots. Use the link and then drill down to the 10-K/Q list. From that page you’ll click on individual filings. “Filing” is the name of the button to click.
https://www.sec.gov/edgar/browse/?CIK=320193&owner=exclude Edgar landing page for AAPL
-
Sunday, August 22, 2021 at 6:03 PM #193515Tim BuiParticipant
Got it, Peter. Thank you for your instruction. I see several XML files in the lower half named XBRL TAXONOMY EXTENSION CALCULATION LINKBASE DOCUMENT, XBRL TAXONOMY EXTENSION DEFINITION LINKBASE DOCUMENT, XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT. Thank you again!
-
Sunday, August 22, 2021 at 6:22 PM #193516Peter W ReedParticipant
import pandas as pd
import xmlschema
import requestsr = requests.get(xml_url,headers=headers)
xs = xmlschema.XMLSchema11(c)
xmlTmp_xbrl = xs.to_dict(d)The Python module I’ve had the best success with is “xmlschema”. The Schema (.xsd) file “c” is the r.text from browsing to that file. “d” is the r.text for the Linkbase, or XML Extract file. The result is a hierarchical dictionary.
I had BeautifulSoup working a few months back, but it was messy due to all of the XBRL variations I found. The dictionary is easier/cleaner to use.
-
Sunday, August 22, 2021 at 6:33 PM #193517Tim BuiParticipant
Thanks for sending the Python instruction, Peter. It’s wonderful. This will help me a lot.
-
Monday, August 23, 2021 at 11:29 AM #193531Peter W ReedParticipant
Glad it was helpful. Below is a cut and paste from a DataFrame I created using BeautifulSoup. It didn’t transfer well and not all columns are shown. The index is the gaap tag “us-gaap:treasurystockacquiredaveragecostpershare”. For this tag there are two common names used in the COST 10-Q/K from 2016 to 2021.
There is a total of 140 rows in the DataFrame for COST. This is the intersection of all common names used throughout the study period, 2016-2021. My intention is to make each record complete and self-describing (eleven columns).
Index Value Duration EndDate Ticker Type Period CmmNam0 CmnNam1
us-gaap:treasurystock
acquiredaveragecostpershare 151 D 2016-11-20 COST 10Q Q1 Average price per share Treasury Stock Acquired, Average Cost Per Share -
Monday, August 23, 2021 at 1:17 PM #193537Peter MillerParticipant
Hello Peter,
From my understanding, the created dictionary shows the US-GAAP tag but not the common name (Verbose Label). To link these, it would be necessary to have both.
Kind Regards
-
Monday, August 23, 2021 at 2:00 PM #193539Peter W ReedParticipant
The code below extracts the common name and the associated gaap tag name. The dictionary is “xmlTmp_xbrl” from the linkbase file. I drop two record types I’m not interested in. By creating two lists I’m able to easily create two dictionaries. One with gaap tag as the index (df2) and the second with the common name as the index (df1).
Note, this code is for a COST 10-Q filing. The keys could be different for different stocks.
gaapTag = []
commonName = []
for one in xmlTmp_xbrl[‘labelLink’][0][‘label’]:
if re.search(“us-gaap_”, one[‘@xlink:label’]):
if one[‘$’].endswith(“]”):
continue
elif one[‘@xlink:label’].endswith(“Abstract_lbl”) or one[‘@xlink:label’].endswith(“TextBlock_lbl”):
continue
else:
gaapTag.append(one[‘@xlink:label’])
commonName.append(one[‘$’])
—–
df1 = pd.DataFrame(gaapTag, index=commonName, columns=[‘GAAP_Tag’])
df2 = pd.DataFrame(commonName, index=gaapTag, columns=[‘Common_Name’]) -
Monday, August 23, 2021 at 5:05 PM #193542Peter W ReedParticipant
I hope my code snip is clear (the dictionary is for AAPL, not COST). Let me know if you have any questions about the code?
I went back to the standards document – “http://www.xbrl.org/specification/xbrl-2.1/rec-2003-12-31/xbrl-2.1-rec-2003-12-31+corrected-errata-2013-02-20.html”. Section 5.2.2.2 addresses the labels. The first paragraph for that section is cut and pasted below:
“Although each taxonomy defines a single set of elements representing a set of business reporting Concepts, the human-readable XBRL documentation for those concepts, including labels (strings used as human-readable names for each concept) and other explanatory documentation, is contained in a resource element in the label Linkbase. The resource uses the @xml:lang attribute to specify the language used (via the XML standard lang attribute) and an optional classification of the purpose of the documentation (via a role attribute).”
-
Tuesday, August 24, 2021 at 9:38 AM #193556Tim BuiParticipant
I am grateful for your sending the codes, Peter! I am out of town, but will study it when I get back. Thanks again!
-
Tuesday, August 24, 2021 at 8:15 PM #193567Peter W ReedParticipant
Be forewarned the snip I provided was only tested for “AAPL” 2014. I’m now testing for “AAPL” 2015 I found the labels have changed. I provide below the salient change that needs to be made to the program. I ran into this when I was using BeautifulSoup. I found the number of variations is small but it caused my program to look like spaghetti. This is why I moved over to xmlschema and a well formed dictionary.
2014: xmlTmp_xbrl[‘labelLink’][0][‘label’]
2015: xmlTmp_xbrl[‘link:labelLink’][0][‘link:label’]
-
-
Saturday, August 21, 2021 at 2:04 PM #193485Tim BuiParticipant
Peter, Every book has an ISBN number, but financial tags are in text form. Not having numbers and the names of tags changing over time and not comparable across companies making standardization very difficult–at least for people who lack programming skills like me.
Peter Gulberg is great. Another person who might be able to help you is Jim Truscott (xbrlxl.com). Both Peter Gulberg and Jim Truscott helped me a lot on standardization. I recommend you to check out Jim’s XBRLXL website on this topic.
On standardization, the issue for me is how detail one wants to get. For example, one might look at Total Account Receivables, but I want to break AR further down to AccRec-Trade (from customers), Acc_Rec_NonTrade (from other sources such as tax refund), Acc_Rec_Finance_Oper.
-
Sunday, August 22, 2021 at 4:08 PM #193505Peter W ReedParticipant
Thanks for the tip on Jim Truscott. I started writing some scripts to populate Peter Gulberg spread sheets. But quickly saw I’d run into the same common name, XBRL tag name problem. I think I have resolved my LinkBase scripts to create. I’m anxious to begin constructing Machine Learning programs which is my primary goal.
Have you looked at the excel sheet in the XBRL taxonomy Excel workbook calling out depth for each gaap tag? I hope to use that sheet to ensure features (gaap tags) are independent in a financial sense of the word.
-
Wednesday, August 25, 2021 at 1:15 PM #193597Mikko OlkkonenParticipant
Peter, Is the goal of your machine learning program to find tags corresponding to certain higher level concept such as Revenues or Free Cash Flow. That is basically what I am trying to do as follows:
1) I am turning financial statement data (https://www.sec.gov/dera/data/financial-statement-data-sets.html) into a local mongo database . For this I am using my deramongo.sh script that can be found in my github.
2) I output all tag-commonname pairs from my local mongo database by using my script mapping.js (sample output map3.txt of about one million tag-common name pairs is on my github)
3) from that map3.txt I can query any tag. My algorithm searches all common names corresponding to that given tag. Next it searches for new tags corresponding to each of the common names found in the previous stage. Next new common names are searched for all found tags etc. For example, if I query Revenues tag, this “fuzzy logic” algorithm converges to synonym tags like Revenue, SalesRevenueNet, SalesRevenueGoodsNet, RevenueFromSaleOfGoods and many others some of which are unfortunately clearly of poor quality i.e. not really synonyms for Revenues.
Anyway, I think that this way I can produce adequate buckets for some tens of concepts that are of interest to me. I am very interested in hearing if your machine learning program can do something like this. Maybe my (almost complete) inventory of tag-common name pairs (map3.txt) is relevant input to your machine learning program?
-
-
Wednesday, August 25, 2021 at 10:06 PM #193606Peter W ReedParticipant
Your approach is more sophisticated than mine. I’ll look at your github this weekend. On the Edgar page for a company’s 10-Q/K filing there is a Schema file (.xsd) and five xml files. I use the Python module ‘xmlschema’ to parse the linkbase file for common name to gaap tag mapping. The result is a dictionary with about 500 records, most of little interest that need to be cleaned. Below is one record of 500 in the dictionary. I use tuples for file storage.
(‘total net sales’, ‘lab_us_gaap_revenuefromcontractwithcustomerexcludingassessedtax’)My ML algorithm interest isn’t in the creation of the mapping. InsteadI’ve been using Clustering to analyze large groups of stocks, like the EFT QQQ. I need a method to gather a large number of stocks’ common attributes (“Net Sales”) over a long period. The investment strategy I use and want to improve upon is selecting the “best” twelve stocks that make up an ETF like QQQ.
-
Wednesday, September 8, 2021 at 7:13 AM #193908Peter MillerParticipant
Hi guys,
since our goal is to create standardized financial statements, I wonder if it wouldn’t be advantageous to use the “Bulk data” zip-file which can be downloaded on the SEC website. It might contain in json format all the financial data which is available for all companies.
Isn’t it easier to map these common names to standardized names instead of linking the xbrl tags to standardized names? What are your thoughts?
-
Wednesday, September 8, 2021 at 12:47 PM #193918Peter W ReedParticipant
I don’t understand the distinction between Common names and standardized names. I use the “XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT” that comes with each 10-Q/K submittals. A reason I use this xml file is due to custom labels. For Example, “lab_us-gaap_ComprehensiveIncomeNetOfTax” is mapped to “COMPREHENSIVE INCOME ATTRIBUTABLE TO COSTCO” in their latest 10-K filing.
The work Mikko Olkkonen is doing on github looks promising for standardization. Note: Mikko there is a broken url in your readme. I’ll submit a comment on github later.
What I’ve found with the bulk data and API material is timeliness. Its seems about a quarter out of date.
Back to your question on standardization of common name to gaap tag. Below is a snip from and API call I submitted. The mapping is here, but it is the same mapping found in the LABEL LINKBASE DOCUMENT file.
{“cik”:909832,”taxonomy”:”us-gaap”,”tag”:”GrossProfit”,”label”:”Gross Profit”,”description”:”Aggregate revenue less cost of goods and services sold or operating expenses directly attributable to the revenue generation activity.”
x-
Tuesday, September 14, 2021 at 1:36 PM #194068Tim BuiParticipant
Hello Peter Reed and Peter Miller, I am not a programmer (I am learning Python and XML so I can understand Peter Reed’s codes) so I don’t really use the JSON files that the SEC provided in its CompanyFacts zip folder. The way I understand it, we have to define or create our own standardization, depending on the level of detail we want. The SEC only provides the “common name” as shown on the financial statements. Due to my limited programming skills, rather than using the JSON files, currently I download the monthly Financial Statement and Notes Data Set (https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html) and import those files into Postgresql. However, eventually it seems that I need to learn JSON so I can use those daily JSON files to update data.
-
-
Tuesday, September 14, 2021 at 2:27 PM #194070Peter W ReedParticipant
My experience is that any json downloads like the one in your posting can be treated like a dictionary. I looked at the page you provided and that’s what I saw was a hierarchical Python dictionary.
I apologize for this next statement – “please don’t take any of the code I’ve provided as anything more than pseudo-code. It is not a final product, and I may choose to drop that line of inquiry and go in a different direction.”
I hit a real dilemma yesterday. I found that knowledge of context (“contextRef”) is required. A 10-K I was parsing I associated common name with a us-gaap tag name. The problem was I found that common name to tag pairing occurred twice for two entries. This can be seen in the Edgar supplied excel file. A human reader can understand the double entries but not a simple software program. The “contextRef” in the XBRL file shows the second pair is part of a segment of the 10-K. The first pair is in the main body (not a segment) of the filing.
-
Tuesday, September 14, 2021 at 3:03 PM #194071Tim BuiParticipant
Hello Peter Reed (sorry I have to use your last name as well because there are 2 Peters on this discussion),
Could you please elaborate the point ” I found that knowledge of context (“contextRef”) is required”? I am not familiar with “contextRef”. Are you thinking about the debit/credit notation of each of the tag?
-
Friday, September 24, 2021 at 3:33 PM #194308Peter W ReedParticipant
I believe I found the solution. There are tables for the 10-K/Q reports on Edgar that are not immediately seen using a browser. Below is my description for retrieving the reports. I assume knowledge of web scraping via BeautifulSoup, request.urlopen(), and regex (re). Paste the URL into your browser to see the table. You can use the “inspect” function of the browser to see the parameters to scrape: Here is a summary –
URL for reports:
f”https://www.sec.gov/Archives/edgar/data/{cik}/{accessionNoHyphen}/
{rptNum}.htm”
where:
cik: 0000909832 # With leading zeros removed, e.g. 909832
SEC Accession No.: 0000909832-18-000022
accessionNoHyphen: 000090983218000022
Report Number- rptNum: “R{1-9+}”Scraping this report will yield: (1) Common Name, (2) XBRL tag name, (3) Numerical value for the common name entry, (4) Credit type (debt/credit, (5) Period type – duration or instant, (6) Definition for the XBRL tag.
When I great solid code to share I’ll put it on github. This may take a few weeks.
- This reply was modified 3 years, 2 months ago by Peter W Reed. Reason: URL too long for a single line
-
Saturday, September 25, 2021 at 6:04 AM #194325Peter MillerParticipant
Hello Peter Reed,
First, I would like to thank you, that you haven’t given up yet. I was investing my last two weeks of my holidays into the bulk data, which contain the .json files for all the companies.
Before I comment on your last posting, I would like to share some possible problems on the bulk data/.json files:
1:
The .json file seem to be incomplete sometimes, when multiple sub-terms are summed up to one term. For instance, in the last PG 10-K Filing the sub-terms “Buildings”, “Machinery and equipment”, “Land” and “Construction in progress” are having to by current knowledge all the US-GAAP tag “us-gaap:PropertyPlantAndEquipmentGross”. This US-GAAP tag is also used for the summary for these sub-terms named “TOTAL PROPERTY, PLANT AND EQUIPMENT”. In the final .json file I could only find the summed up value, consequently under the US-GAAP tag “PropertyPlantAndEquipmentGross”. The values of the sub-terms are not provided, probably because they share the US-GAAP tag.
2:
It might be furthermore a problem, that the filling classification (10-K, 10-Q) is sometimes not correct. I have seen data points, where between the start and end date was only 3 Months, but the filling was classified as a 10-K filing. That was a problem when no start date was reported and only an end date. I solved this issue, partly, by iterating through every US-GAAP tag and searching for filings where a start and end date was provided. When the time delta between the start and end was than >335 days and smaller than 395 days, then the month of the end date was considered as the fiscal year end of that specific year. But as you can imagine, some companies didn’t report a start and end date at any of the us-gaap tags in the .json file, and consequently no fiscal year-end month could be calculated.All in all, I am extremely disappointed that some sub-terms, which are used to calculate a data point, are not reported in the .json files. If someone has a solution for that, please share.
Regarding your last posting:
1:
The tables at the links you have provided are not providing any information in which entity the numbers are displayed. So, for instance, in some tables the numbers are provided in millions and in others in thousands.
2:
These links aren’t available for all the fillings in the past, which are in the .json files.In pandas 1.30 the read_xml function was included. I have no experience with .xml files, but maybe this function can be used to extract us-gaap-tag, common name and entity of all numbers included in the official XBRL file.
-
Saturday, September 25, 2021 at 1:05 PM #194331Peter W ReedParticipant
Thanks Peter for your observations. Your frustration with consistency is why I dropped getting data from Yahoo and other such sites. I thought if I went to the source I’d be better off. I also tried the two Python modules for Yahoo financials, but found they sometimes displayed bad information too.
Couple of items:
– Yes, the SEC reports I pointed to have a problem. The problem is they display “Segment” information as well as the desired “not segmented” information. The way I’ve dropped that information is naïve keeping only the first instance. The XBRL “htm.xml” file identifies segmented and not segment data. I hope we don’t have to use that file too, thanks for the warning.
-I’ve had the best success with xml parsing by using the xmlschema module. Using the .xsd file as the schema definition file.
– I’ve always assumed that shares are in 000’s and money is in 000000’s. In the reports I mentioned in my last post the “Data Type” for each tag is defined with “xbrli:SharesItemType” or “xbrli:monetaryItemType”. I was thinking that would identify the units too. I’ll look out for the exceptions.
– I’ve only been using a small subset of companies to test with so I haven’t seen the odd behaviors you mention. Mostly I’m only interested in % change. I’m hoping those oddities smooth out by normalizing with %. -
Saturday, October 16, 2021 at 6:13 PM #194878Peter W ReedParticipant
Apologize my last reply to you was in error. I said that I hadn’t seen a change in units for money or shares. The excel based html I’m using from the edgar site indeed has such unit changes. The good news is the units are documented in by the index column. Here are the two variations I have run across – (1) “Condensed Consolidated Balance Sheets – USD ($) $ in Millions”, and (2) “Condensed Consolidated Balance Sheets – USD ($)”. My logic needs to use the absence of a “millions” unit to mean there will be six zeros that need to be stripped. The units for shares follows the same logic.
It has taken me a long time to get the excel-based Edgar data to work well. A lot of data cleansing was required. Here is a sample page –
url =”https://sec.gov/Archives/edgar/data/0000909832/000090983221000008/R{x}.htm”
where {x} takes on values from 1 to 8. -
Thursday, October 7, 2021 at 10:26 PM #194679Peter W ReedParticipant
I decided to drop the parsing of the XBRL file and instead concentrate upon the scraping of the reports I mentioned in another post. To answer your question now I want to show you the way the context is shown in those reports. Below is a table (DataFrame) that comes from the report. One table for each XBRL element within the report. The 4th row is the same information as in the “contextRef”. This attribute isn’t needed for parsing the report. The value of the PeriodType is in processing. The 10-K report’s ‘duration’ parameters are a summation of all four quarters. If you want to know the 4th quarter value for that parameter, subtract the first three quarters. The 10-K ‘instant’ parameters are a snapshot of the company as of the end date, no processing should be made. Hope this helps!
<p>
0 1
0 Name: us-gaap_CommonStockDividendsPerShareDeclared
1 Namespace Prefix: us-gaap_
2 Data Type: num:perShareItemType
3 Balance Type: na
4 Period Type: duration
</p>
-
-
Tuesday, September 14, 2021 at 4:57 PM #194074Peter W ReedParticipant
I thought about your json task. I haven’t tried this command before but I like Pandas as a data structure – pandas.io.json.read_json. This command should load your file into a DataFrame. It is a large file. I don’t know if that will be an issue for you.
On contextRef – In 10-k/Q filings on Edgar there is included a file titled “Extracted XBRL instance document”. Below is a link to a Costco 10-K XBRL file. Below that link is an extract from the XBRL file showing the Document Type (“10-K”). You can see the attributes including the contextRef element. There is a second contextRef in the file, see the second example. The difference between the two is identified by the timeframe – ‘D20190902-20200830’ and ‘I20200929’. The duration is for gaap tags like net sales which accumulates over time 09/02/2019 to 08/30/2020. The instant gaap tag is for attributes like “Assets”. It indicates a snapshot in time of the company’s asset, etc..
https://www.sec.gov/Archives/edgar/data/909832/000090983220000017/cost-20200830_htm.xml
<dei:DocumentType contextRef=”i66090f23f2724bb78bc8f6b7c7081361_D20190902-20200830″ id=”id3VybDovL2R…-0507782e341b”>10-K</dei:DocumentType>
<dei:EntityCommonStockSharesOutstanding contextRef=”i270a3479d6ec49b29cb9fa2ac0b615bb_I20200929″ decimals=”INF” id=”id3VybDovL2R…-97e8-f700a530d6e0″ unitRef=”shares”>441228027</dei:EntityCommonStockSharesOutstanding>
-
Tuesday, September 14, 2021 at 6:56 PM #194075Peter W ReedParticipant
I had forgotten that I looked at the data set file before. Here is a code snippet that could help you. dicAcc is a python dictionary.
Note the parameter “headers”. Python url libraries put Python as the user-agent. You need to overwrite that, else Edgar will reject your request. I also remember having problems due to the file sizes. I broke the Edgar data file into five segmented files. At the bottom is the code I used to parse the five files (A,B,C,D). compress[“AAPL”] returns Apple’s cik.Get file-
# Get the accession number to Ticker mapping file from Edgar
targetUrl = “https://www.sec.gov/files/company_tickers.json”
r = requests.get(targetUrl, headers=headers)
dicAcc = r.json()
print(r.status_code)headers = {
“User-Agent”: “Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0”
}Parse file A:
cik = str(compress[‘AAPL’])
aaplLst = []
#
with open(filePathA, “r”) as f:
for line in f:
if re.search(cik, line):
aaplLst.append([line.rstrip().replace(‘\t’,’,’)]) -
Monday, October 11, 2021 at 11:11 AM #194763David TaurielloKeymaster
Hi Tim and all on this thread – thanks for a very healthy discussion on this topic. I hope you’ll join our free webinar later this month, Working with SEC Data for Analysts for tips & strategies in working to create normalized views of US GAAP data.
-
Saturday, January 22, 2022 at 8:20 PM #197110Sammeer S RaawatParticipant
Hi Tim Bui (initiator of the thread) and all,
Following is the list of all TAGS, LAbels and their occurrences in 20210930 for revenues
In my humble opinion, it is best to download the data available in flat files and then work on it. All the relationships and description of all elements is very well defined and documented.
TAG_NUM PLABEL COUNT(*)
RevenueFromContractWithCustomerExcludingAssessedTax Revenue 774
Revenues Total revenues 742
Revenues Revenues 480
RevenueFromContractWithCustomerExcludingAssessedTax Revenues 464
RevenueFromContractWithCustomerExcludingAssessedTax Total revenue 431
Revenues Revenue 406
Revenues Total revenue 394
RevenueFromContractWithCustomerExcludingAssessedTax Total revenues 355
RevenueFromContractWithCustomerIncludingAssessedTax Revenue 132
RevenueFromContractWithCustomerExcludingAssessedTax Net revenues 125
Revenues Total Revenues 115
RevenueFromContractWithCustomerExcludingAssessedTax Net revenue 108
RevenueFromContractWithCustomerExcludingAssessedTax Sales 102
Revenues Sales 85
RevenueFromContractWithCustomerIncludingAssessedTax Revenues 83
Revenues REVENUES 66
Revenues Net revenues 64
Revenues REVENUE 62
Revenues Net revenue 57
RevenueFromContractWithCustomerIncludingAssessedTax Total revenue 57
RevenueFromContractWithCustomerIncludingAssessedTax Total revenues 57
Revenues Total Revenue 46
Revenues TOTAL REVENUES 43
RevenueFromContractWithCustomerExcludingAssessedTax REVENUE 41
ProceedsFromSaleOfAvailableForSaleSecuritiesDebt Sales 38
RevenueFromContractWithCustomerExcludingAssessedTax REVENUES 36
RevenueFromContractWithCustomerExcludingAssessedTax Total Revenues 32
RevenueFromContractWithCustomerExcludingAssessedTax Total Revenue 29
Revenues TOTAL REVENUE 23
RevenueFromContractWithCustomerIncludingAssessedTax Net revenues 14
Revenues NET REVENUES 14
RevenuesNetOfInterestExpense Net revenue 12
RevenuesNetOfInterestExpense Net revenues 12
RevenueFromContractWithCustomerIncludingAssessedTax Net revenue 11
Sales Sales 11
RevenueFromContractWithCustomerIncludingAssessedTax REVENUE 10
RevenueFromContractWithCustomerIncludingAssessedTax Sales 10
Revenues SALES 10
Revenues Net Revenue 9
RevenueFromContractWithCustomerExcludingAssessedTax NET REVENUE 8
RevenueFromContractWithCustomerExcludingAssessedTax Net Revenues 8
Revenues Revenues: 8
RegulatedAndUnregulatedOperatingRevenue Total revenues 8
RevenueFromContractWithCustomerExcludingAssessedTax Net Revenue 6
Revenues NET REVENUE 6
RevenueFromContractWithCustomerExcludingAssessedTax TOTAL REVENUE 6
RevenueFromContractWithCustomerIncludingAssessedTax Net Revenue 5
RevenueFromContractWithCustomerIncludingAssessedTax Total Revenue 5
RevenueFromContractWithCustomerExcludingAssessedTax NET REVENUES 4
RevenueFromContractWithCustomerIncludingAssessedTax Net Revenues 4
Revenues Revenue: 4
RevenueFromContractWithCustomerIncludingAssessedTax REVENUES 4
SalesRevenue Sales 4
SellingExpense Sales 4
RevenueFromContractWithCustomerIncludingAssessedTax TOTAL REVENUES 4
Revenues REVENUE: 3
GrossProfit NET REVENUE 2
GrossProfit Net Revenue 2
GrossProfit Net revenue 2
NetRevenues Net revenue 2
RevenueFromContractWithCustomerIncludingAssessedTax NET REVENUE 2
GrossProfit Net Revenues 2
NetFinancingRevenuesExcludingOtherRevenues Net revenues 2
Revenues Net Revenues 2
RevenuesNetOfInterestExpense Net Revenues 2
SalesTypeLeaseRevenue Net revenues 2
InterestAndDividendIncomeOperating Revenue 2
PremiumsEarnedNet Revenue 2
RevenueFromCollaborativeArrangementExcludingRevenueFromContractWithCustomer Revenue 2
RevenueRevenueAdjustmentsFromContractWithCustomerExcludingAssessedTax Revenue 2
RevenuesNetOfInterestExpense Revenue 2
RevenueFromContractWithCustomerExcludingAssessedTax Revenue: 2
InvestmentBankingRevenue Revenues 2
OperatingLeaseLeaseIncome Revenues 2
OperatingLeaseLeaseIncomeLeasePayments Revenues 2
RevenuesfromOtherParties Revenues 2
RevenuesIncludingFederalExciseTax Revenues 2
RevenuesNetOfRealizedAndUnrealizedGainsLossesOnInvestments Revenues 2
RevenueFromContractWithCustomerExcludingAssessedTax Revenues: 2
ProceedsFromCustomers Sales 2
ProceedsFromSaleOfMortgageBackedSecuritiesMbsCategorizedAsTrading Sales 2
RevenueNotFromContractWithCustomerOther Sales 2
SaleRevenueNet Sales 2
SellingAndMarketingExpense Sales 2
RevenueFromContractWithCustomerExcludingAssessedTax Sales revenue 2
Revenues Sales revenue 2
GrossProfit TOTAL REVENUE 2
RevenueFromCollaborativeArrangementExcludingRevenueFromContractWithCustomer Total revenue 2
RevenuesNetOfInterestExpense Total revenue 2
TotalRevenue Total Revenue 2
TotalRevenue Total revenue 2
TotalRevenuesIncludingRevenueGeneratedByVariableInterestEntities Total revenue 2
CostOfRevenue Total revenues 2
InterestAndFeeIncomeLoansAndLeasesHeldInPortfolio Total revenues 2
RevenueFromContractWithCustomerExcludingAssessedTax TOTAL REVENUES 2
RevenueFromContractWithCustomerIncludingAssessedTax Total Revenues 2
RevenueNotFromContractWithCustomerOther TOTAL REVENUES 2
RevenuesExcludingInterestAndDividends Total revenues 2
RevenueFromContractWithCustomerExcludingAssessedTax REVENUES: 1
ProceedsFromBankLoanParticipations Sales 1
ProceedsFromLimitedPartnershipInvestments Sales 1
ProceedsFromSaleAndMaturityOfAvailableForSaleSecurities Sales 1
ProceedsFromSaleOfEquitySecuritiesFvNi Sales 1
ProceedsfromSaleofEquitySecuritiesFVNI Sales 1
ProceedsFromSaleOfEquitySecuritiesWithoutReadilyDeterminableFairValues Sales 1
ProceedsFromSaleOfEquitySecuritiesWithReadilyDeterminableFairValues Sales 1
ProceedsFromSaleOfShortTermInvestments Sales 1
ProceedsFromSaleOfTradingSecuritiesHeldforinvestment Sales 1
RevenueFromContractWithCustomerIncludingAssessedTax SALES 15746
Regards
Sammeer
-
Sunday, January 23, 2022 at 1:09 PM #197128Tim BuiParticipant
Hi Sammeer, Thank you for sharing the list of tags for the revenue category. I have been trying off and on to create a list of standardized financial items without success because there are soo many variations in the names of the tags, as your list shows 110 ways of describing revenues. I think the accounting authorities allow companies too much flexibility in naming their tags to describe their particular financial situations. The liberal descriptions are useful in describing the uniqueness of items of individual companies , it is extremely difficult to do aggregation for peer comparisons. Individual investors like me who lack sophisticated programming skills still are not able to take advantage of the digitization of financial data.
-
Sunday, January 23, 2022 at 8:03 PM #197135Peter W ReedParticipant
I’m in alignment with Tim. I think it has more to do with experience in accounting than programming skills. The material Charles Hoffman has written on use of XBRL is a couple of skill levels above me. After two years of trying I gave up and decided to use the freely available normalized financials available from Yahoo, Morningstar, Macrotrends, Alpha Advantage, etc.. Their information has some flaws for example the reporting dates can be off by days – they tend to use end of month instead of the true date. I believe these outlets use the same source since I see the same flaw show up in multiple providers.
Good luck!
-
Monday, January 24, 2022 at 3:59 AM #197148Mikko OlkkonenParticipant
Hi, I guess that I am more in alignment with Sammeer’s approach. I am interested in only twenty or so of the main items (Revenue, cashflow from operations, cash, equity, …). Therefore, standardisation in my case is doable with some combination of helper scripts and manual work. My approach leads to information that has smaller scope and lower quality than the information you can find at Yahoo, Morningstar etc. However, licences by Yahoo etc tend to prevent liberal usage of information extracted from their sites. Therefore, crawling the information from their sites is not applicable in my case even if automated extraction from their sites could be technically doable.
-
Monday, January 24, 2022 at 12:46 PM #197153Peter W ReedParticipant
Mikko thanks for your thoughts. The biggest difference is our use of the information. I can appreciated your need for unfettered open-source property. Frankly I’m a little disappointed the SEC hasn’t already done this. My use case is private individual use.
I don’t know if I shared this. I know I can “easily” get the mapping from gaap terminology to ‘normal’ terminology. Using this mapping to my advantage is another thing. I realized what I created was no better than subscribing to Yahoo. Except Yahoo is less complicated to maintain and cleanse.
Edgar produces excel-like reports for every 10-K/10-Q. The url below is the base (part 1 + part2). The parameters in the link are easily gained. Where {report} is {R1, R2, R3, …,} R1 contains all of the reports details like report type {10-K, 10-Q}. Give the link a try. Here is a sample for AAPL part2 – 0000320193/000032019321000056/R1.htm
part 1: sec.gov/Archives/edgar/data/
part 2: {cik}/{accNoHyph}/{report}.htm
-
-
Monday, January 24, 2022 at 4:46 PM #197160Tim BuiParticipant
Hi Peter, could you please share the steps that you do to map the gaap tags to your terminology? I need to learn to do that in Postgresql.
The level or depth of standardized or classified financial items is very much depending on what one tries to seek. For example. I would like to see further breakdown of total debt into corporate debt, mortgages, or capitalizes debt tied to leased assets. While all debts are obligations, the nature of these debt says a lot about the company’s financial risks and financing strategy.
Thank you. Peter
-
Wednesday, January 26, 2022 at 1:02 PM #197257Peter W ReedParticipant
I believe you will find this reference of value. Alphavantage is who I decided to use for my personal purposes. I found of particular interest the GAAP Taxonomy Mappings.
Similar concepts reference material -
Wednesday, January 26, 2022 at 1:12 PM #197258Tim BuiParticipant
Hi Peter, when I click on that “Similar concepts reference material” nothing happens. Is it a website?
I have to come up with my own standardized list, however, I need to learn how to groups the tags so that I can assign to the standardized items in Postgresql. I am just looking for a methodology on how to do mapping of tags to standardized item. Thank you, Peter -
Wednesday, January 26, 2022 at 1:14 PM #197259Tim BuiParticipant
I found this Peter ‘http://www.xbrlsite.com/2014/Reference/Mapping.pdf’
-
Wednesday, January 26, 2022 at 1:22 PM #197260Peter W ReedParticipant
Thanks Tim. An excellent reference too.
To respond to your reply earlier this week. Here is a short explanation for my approach to getting terminology mapping and 10-Q/K results. I hope to get the code clean enough to post on a public site, github.
The first part of all the URLs below is “www.sec.gov/”. My approach requires 4 steps. Once programmed it isn’t as ugly as the description reads I hope.
1) “cgi-bin/browse-edgar?action=getcompany&CIK={ticker}&type={Report-Type}”
2) Scrape the web page from step #1 for the CIK and Accession Number(s). Replace the hyphens in the Accession Number with null, “”. The result I call “accNoHyph”
3) The general syntax of the url for the company’s reports is: “Archives/edgar/data/{cik}/{accNoHyph}/{report}.htm”.
a) For example, Victory Secret 10-Q Statement of Equity is r = “Archives/edgar/data/1856437/000185643721000018/R6.htm”4) Everything else is accomplished using BeautifulSoup. I assume you have familiarity with that module.
a) first action after downloading page in (3) is: td_find = r.find_all(“td”, {“class”: “pl”})b) Next search and match with (i) re.search(re.compile(r’defref_us-gaap_(\w*)’),str(x)):. Then (ii) match = re.search(re.compile(r’defref_us-gaap_(\w*)’),str(x)). The group() function for re is used to get the “us-gaap” tag.
exception to (b) The attributes on the cover page (R1) in XBRL terminology is “dei”. All other reports {R2-R7} contain “us-gaap” attributes.
c) A sample result for (4b) symbol VSCO is below, note both concept terminology and XBRL terminology for this element is present:
Total Comprehensive Income (Loss)
A child tag has the result for “Total Comprehensive Income (Loss)” which is (358).
Notes for each numbered item above. Remember “www.sec.gov/” is the prefix:
1) Victoria Secret URL is “/cgi-bin/browse-edgar?action=getcompany&CIK=VSCO&type=10-Q”
2) Scrape the web page from step #1 for the CIK and Accession Number(s). Replace the hyphens in the Accession Number with null, “”. The result I call “accNoHyph”. The Accession numbers reference a specific quarter or annual report.
3) Victory Secret report on Statement of Equity is “Archives/edgar/data/1856437/000185643721000018/R6.htm”
The {report} attribute is R1 to R7. R1 is the 10Q/K cover page. The remaining reports {R2-R7} are the cash flow, balance, and income statements. When you process R1 you need to modify the search in step 4 from “us-gaap” to “us-dei”.4) When you process R1 you need to modify the search in step 4 from “us-gaap” to “dei”. “dei” is the XBRL standard tag prefix for company info, e.g. “dei_DocumentPeriodEndDate”.
5) The header row returned in step 4 contains the name of the report, for example “… Income Statement…”. And very importantly the phrase ‘in millions’. Two styles of numbers are used those with and those without the millions trailing six zeros.
-
Wednesday, January 26, 2022 at 1:30 PM #197261Peter W ReedParticipant
-
Wednesday, January 26, 2022 at 1:30 PM #197262Tim BuiParticipant
Thank you for the detailed explanation, Peter! I am going through the steps now so learn now. I am grateful!
-
Thursday, November 10, 2022 at 5:01 PM #205327AnonymousInactive
I haven’t been succesfull with my standarization attempts but made this basic infrastructure to make quick queries and try different approaches, maybe someone finds it useful: https://github.com/gmzi/edgarQ
-
Friday, November 11, 2022 at 12:36 PM #205332Peter W ReedParticipant
gaston thank you for the GitHub program it looks like a very good tool. I wish I had seen it 2.5 years ago. I wrote a set of modules to scrape the same information from Edgar. If my program becomes too unwieldy to maintain, I’ll switch over to your downloading style.
One comment on standardization. It is impossible IMHO if it is also desired to work over all time. I’ve found instances where the GAAP concept changed over time (not certain if “concept” is the right term). This is not to say we can’t apply NLP/ML to bridge the terms.
A site I’ve found useful in addition to the XBRL taxonomy files (Excel version) is CalcBench – “https://www.calcbench.com/home/standardizedmetrics.” Try it with any gaap tag. Note – I have no conflict of interest with the company.
-
Tuesday, November 15, 2022 at 12:32 PM #205423AnonymousInactive
Thanks Peter!! I see calcbench provides a lot of guidance about which sections to look at. I’ll check how far can I get in the free tier, but the search function seems really powerful
-
-
Tuesday, January 3, 2023 at 2:45 PM #206908Peter MillerParticipant
Hi guys,
is it somehow possible that we create a chat group (like Discord or something else)?
There we could discuss approaches to standardize financial data. I feel like everyone has tried something different and we have never pooled our knowledge. In any case, I have invested over 500 hours in this topic and some failures could have been avoided if I had discussed my approach beforehand.
That being said, I’m sure the “ultimate approach” will require manual assignment of “tag keys” (which will have to be created first). This alone is a team effort.
Also, I think we can support each other from a technical point of view. I’ve only had Python experience in pandas for the last three years. I am not very good in reading XBRL documents.
I’m not up to date on what the most secure and convenient chat platform is these days.
Does anyone have any suggestion?
-
Tuesday, January 3, 2023 at 9:49 PM #206930Peter W ReedParticipant
Hi Peter,
I think a chat room for this discussion topic is a good idea. If our goal is to build software that automates the process, Github comes to mind. Github has a discussion feature (I’ve not used).
With that said, I’ve abandoned the Esperanto approach. I’ve decided to leverage gaap terminology. The indexes of my DataFrames are gaap tags – e.g., us-gaap:TreasuryStockSharesAcquired. Instead of traditional concept grouping such as “Balance Sheet” aggregation, I’m using the “calculation” sheet in the XBRL taxonomy file (.xls). Here is a link for the 2021 2021_xbrl_taxonomy. It has a column for GAAP tag and its associated common name.
-
Wednesday, January 4, 2023 at 1:03 PM #207001Peter MillerParticipant
Okay guys, I have started a GitHub discussion. I am new to GitHub but please join and tell me your thoughts.
https://github.com/ThePythonDude/XBRL-financial-statement-standardization/discussions/1
-
Thursday, January 5, 2023 at 10:01 PM #207102Peter W ReedParticipant
Hi Peter, I commented on your github post. As I scrolled through the posts here I recognized lost opportunities. So many ideas have been shared. It is time we capture them in an open-source environment like github.
-
Thursday, December 21, 2023 at 5:14 PM #215190Nathan SudermanParticipant
Hi I wanted to see if anyone has completed the standardization/grouping of similar financial concepts? I have built out a Python parser and loaded a MySQL DB, but am having difficulty analyzing the data due to the dissimilarity in names being used.
-
Thursday, December 21, 2023 at 6:08 PM #215193Peter W ReedParticipant
Unfortunately, the GAAP taxonomy is maintained to hone concepts, not similarity over the years (period of focus). As improvements are made GAAP tags are deprecated and new tags are created. The common labels (e.g. ‘Total Revenue’) are fluid, plus I’ve seen up to four variations for a given gaap tag in a single 10-K.
The GAAP Taxonomy is published yearly. Here is a link to the current year’s – https://xbrl.us/home/filers/sec-reporting/taxonomies/
The common labels are published in an XML file with each company submission. I use BeautifulSoup, not an XML parser.
-
Saturday, December 23, 2023 at 2:58 PM #215206Peter W ReedParticipant
I was in a rush when I wrote my last post on 12/21/23. I should have mentioned an approach to a universal tag to extract all of the common name labels for a gaapTag. There are a countable number – no more than four that I have found.
For Intel, a common label for 2018 and 2022 is “Net Revenue”. The gaap tags for those years are different SalesRevenueNet and RevenueFromContractWithCustomerExcludingAssessedTax.
The underlying definitions for the two gaap tags are different. I don’t have an accounting background to understand the nuances. If you were to look at Yahoo Finance all you would see is the common label “Net Revenue” for both years. Python’s NLP might help too. The first sheet in the SEC Taxonomy has a definition for each GAAP tag. Perhaps NLP could help unmangle things.
-
-
AuthorPosts
- You must be logged in to reply to this topic.
Search Forums
Recent Topics
Recent Replies
-
Spreadsheet Commands: Query Worksheet and Query Workbook Query Workbook 1 month, 3 weeks ago
-
Query for multiple dimensions 3 months, 3 weeks ago
-
Query for multiple dimensions 4 months ago
-
Showdata function 6 months, 2 weeks ago
-
Showdata function 6 months, 2 weeks ago
Documentation & Discussion
- Get started with Google Sheets OR
- Get started with Microsoft Excel OR
- Get access to as-filed data from us for other tools or your own app
- XBRL API Interactive Documentation
- Ten Tips & Techniques
- The XBRL API
- XBRL Data Community
- 2024 US GAAP Taxonomy Viewer
- Live support - Monday, 3:30 - 4:30 PM ET
Who's using this free data?
API Use 2024 || API Use 2023 || API Use 2022 || API Use 2021 || API Use 2020 || API Use 2019 || API Use 2018
Join XBRL US
- Individual Options - Basic, Power User & Sole Practitioner
- For Your Team - Startup, Non-Profit, Academic & Corporate options
- Member Benefits Comparison Table
Using the XBRL API with the Public Filings Database
Unless otherwise agreed to in writing, any and all use of the XBRL API to authenticate and retrieve data from the XBRL US Database of Public Filings implies user consent and agreement with the XBRL US API Agreement. If you are unable to agree to these terms, do not use the XBRL API.
To use the XBRL API outside of Google Sheets, your account needs to be provisioned for OAuth2 access.